-----------------------------------------------------------------------------
This file contains a concatenation of the PCRE2 man pages, converted to plain
text format for ease of searching with a text editor, or for use on systems
that do not have a man page processor. The small individual files that give
synopses of each function in the library have not been included. Neither has
the pcre2demo program. There are separate text files for the pcre2grep and
pcre2test commands.
-----------------------------------------------------------------------------


PCRE2(3)                   Library Functions Manual                   PCRE2(3)



NAME
       PCRE2 - Perl-compatible regular expressions (revised API)

INTRODUCTION

       PCRE2 is the name used for a revised API for the PCRE library, which is
       a set of functions, written in C,  that  implement  regular  expression
       pattern matching using the same syntax and semantics as Perl, with just
       a few differences. Some features that appeared in Python and the origi-
       nal  PCRE  before  they  appeared  in Perl are also available using the
       Python syntax. There is also some support for one or two .NET and Onig-
       uruma  syntax  items,  and  there are options for requesting some minor
       changes that give better ECMAScript (aka JavaScript) compatibility.

       The source code for PCRE2 can be compiled to support 8-bit, 16-bit,  or
       32-bit  code units, which means that up to three separate libraries may
       be installed.  The original work to extend PCRE to  16-bit  and  32-bit
       code  units  was  done  by Zoltan Herczeg and Christian Persch, respec-
       tively. In all three cases, strings can be interpreted  either  as  one
       character  per  code  unit, or as UTF-encoded Unicode, with support for
       Unicode general category properties. Unicode  support  is  optional  at
       build  time  (but  is  the default). However, processing strings as UTF
       code units must be enabled explicitly at run time. The version of  Uni-
       code in use can be discovered by running

         pcre2test -C

       The  three  libraries  contain  identical sets of functions, with names
       ending in _8,  _16,  or  _32,  respectively  (for  example,  pcre2_com-
       pile_8()).  However,  by defining PCRE2_CODE_UNIT_WIDTH to be 8, 16, or
       32, a program that uses just one code unit width can be  written  using
       generic names such as pcre2_compile(), and the documentation is written
       assuming that this is the case.

       In addition to the Perl-compatible matching function, PCRE2 contains an
       alternative  function that matches the same compiled patterns in a dif-
       ferent way. In certain circumstances, the alternative function has some
       advantages.   For  a discussion of the two matching algorithms, see the
       pcre2matching page.

       Details of exactly which Perl regular expression features are  and  are
       not  supported  by  PCRE2  are  given  in  separate  documents. See the
       pcre2pattern and pcre2compat pages. There is a syntax  summary  in  the
       pcre2syntax page.

       Some  features  of PCRE2 can be included, excluded, or changed when the
       library is built. The pcre2_config() function makes it possible  for  a
       client  to  discover  which  features are available. The features them-
       selves are described in the pcre2build page. Documentation about build-
       ing  PCRE2 for various operating systems can be found in the README and
       NON-AUTOTOOLS_BUILD files in the source distribution.

       The libraries contain a number of undocumented internal  functions  and
       data  tables  that  are  used by more than one of the exported external
       functions, but which are not intended  for  use  by  external  callers.
       Their  names  all begin with "_pcre2", which hopefully will not provoke
       any name clashes. In some environments, it is possible to control which
       external  symbols  are  exported when a shared library is built, and in
       these cases the undocumented symbols are not exported.


SECURITY CONSIDERATIONS

       If you are using PCRE2 in a non-UTF application that permits  users  to
       supply  arbitrary  patterns  for  compilation, you should be aware of a
       feature that allows users to turn on UTF support from within a pattern.
       For  example, an 8-bit pattern that begins with "(*UTF)" turns on UTF-8
       mode, which interprets patterns and subjects as strings of  UTF-8  code
       units instead of individual 8-bit characters. This causes both the pat-
       tern and any data against which it is matched to be checked  for  UTF-8
       validity.  If the data string is very long, such a check might use suf-
       ficiently many resources as to cause your application to  lose  perfor-
       mance.

       One  way  of guarding against this possibility is to use the pcre2_pat-
       tern_info() function  to  check  the  compiled  pattern's  options  for
       PCRE2_UTF.  Alternatively,  you can set the PCRE2_NEVER_UTF option when
       calling pcre2_compile(). This causes a compile-time error if a  pattern
       contains a UTF-setting sequence.

       The  use  of Unicode properties for character types such as \d can also
       be enabled from within the pattern, by specifying "(*UCP)".  This  fea-
       ture can be disallowed by setting the PCRE2_NEVER_UCP option.

       If  your  application  is one that supports UTF, be aware that validity
       checking can take time. If the same data string is to be  matched  many
       times,  you  can  use  the PCRE2_NO_UTF_CHECK option for the second and
       subsequent matches to avoid running redundant checks.

       The use of the \C escape sequence in a UTF-8 or UTF-16 pattern can lead
       to  problems,  because  it  may leave the current matching point in the
       middle of  a  multi-code-unit  character.  The  PCRE2_NEVER_BACKSLASH_C
       option can be used by an application to lock out the use of \C, causing
       a compile-time error if it is encountered. It is also possible to build
       PCRE2 with the use of \C permanently disabled.

       Another  way  that  performance can be hit is by running a pattern that
       has a very large search tree against a string that  will  never  match.
       Nested  unlimited repeats in a pattern are a common example. PCRE2 pro-
       vides some protection against  this:  see  the  pcre2_set_match_limit()
       function in the pcre2api page.


USER DOCUMENTATION

       The  user  documentation for PCRE2 comprises a number of different sec-
       tions. In the "man" format, each of these is a separate "man page".  In
       the  HTML  format, each is a separate page, linked from the index page.
       In the plain  text  format,  the  descriptions  of  the  pcre2grep  and
       pcre2test programs are in files called pcre2grep.txt and pcre2test.txt,
       respectively. The remaining sections, except for the pcre2demo  section
       (which  is a program listing), and the short pages for individual func-
       tions, are concatenated in pcre2.txt, for ease of searching.  The  sec-
       tions are as follows:

         pcre2              this document
         pcre2-config       show PCRE2 installation configuration information
         pcre2api           details of PCRE2's native C API
         pcre2build         building PCRE2
         pcre2callout       details of the callout feature
         pcre2compat        discussion of Perl compatibility
         pcre2demo          a demonstration C program that uses PCRE2
         pcre2grep          description of the pcre2grep command (8-bit only)
         pcre2jit           discussion of just-in-time optimization support
         pcre2limits        details of size and other limits
         pcre2matching      discussion of the two matching algorithms
         pcre2partial       details of the partial matching facility
         pcre2pattern       syntax and semantics of supported regular
                              expression patterns
         pcre2perform       discussion of performance issues
         pcre2posix         the POSIX-compatible C API for the 8-bit library
         pcre2sample        discussion of the pcre2demo program
         pcre2stack         discussion of stack usage
         pcre2syntax        quick syntax reference
         pcre2test          description of the pcre2test command
         pcre2unicode       discussion of Unicode and UTF support

       In  the  "man"  and HTML formats, there is also a short page for each C
       library function, listing its arguments and results.


AUTHOR

       Philip Hazel
       University Computing Service
       Cambridge, England.

       Putting an actual email address here is a spam magnet. If you  want  to
       email  me,  use  my two initials, followed by the two digits 10, at the
       domain cam.ac.uk.


REVISION

       Last updated: 16 October 2015
       Copyright (c) 1997-2015 University of Cambridge.
------------------------------------------------------------------------------


PCRE2API(3)                Library Functions Manual                PCRE2API(3)



NAME
       PCRE2 - Perl-compatible regular expressions (revised API)

       #include <pcre2.h>

       PCRE2  is  a  new API for PCRE. This document contains a description of
       all its functions. See the pcre2 document for an overview  of  all  the
       PCRE2 documentation.


PCRE2 NATIVE API BASIC FUNCTIONS

       pcre2_code *pcre2_compile(PCRE2_SPTR pattern, PCRE2_SIZE length,
         uint32_t options, int *errorcode, PCRE2_SIZE *erroroffset,
         pcre2_compile_context *ccontext);

       void pcre2_code_free(pcre2_code *code);

       pcre2_match_data *pcre2_match_data_create(uint32_t ovecsize,
         pcre2_general_context *gcontext);

       pcre2_match_data *pcre2_match_data_create_from_pattern(
         const pcre2_code *code, pcre2_general_context *gcontext);

       int pcre2_match(const pcre2_code *code, PCRE2_SPTR subject,
         PCRE2_SIZE length, PCRE2_SIZE startoffset,
         uint32_t options, pcre2_match_data *match_data,
         pcre2_match_context *mcontext);

       int pcre2_dfa_match(const pcre2_code *code, PCRE2_SPTR subject,
         PCRE2_SIZE length, PCRE2_SIZE startoffset,
         uint32_t options, pcre2_match_data *match_data,
         pcre2_match_context *mcontext,
         int *workspace, PCRE2_SIZE wscount);

       void pcre2_match_data_free(pcre2_match_data *match_data);


PCRE2 NATIVE API AUXILIARY MATCH FUNCTIONS

       PCRE2_SPTR pcre2_get_mark(pcre2_match_data *match_data);

       uint32_t pcre2_get_ovector_count(pcre2_match_data *match_data);

       PCRE2_SIZE *pcre2_get_ovector_pointer(pcre2_match_data *match_data);

       PCRE2_SIZE pcre2_get_startchar(pcre2_match_data *match_data);


PCRE2 NATIVE API GENERAL CONTEXT FUNCTIONS

       pcre2_general_context *pcre2_general_context_create(
         void *(*private_malloc)(PCRE2_SIZE, void *),
         void (*private_free)(void *, void *), void *memory_data);

       pcre2_general_context *pcre2_general_context_copy(
         pcre2_general_context *gcontext);

       void pcre2_general_context_free(pcre2_general_context *gcontext);


PCRE2 NATIVE API COMPILE CONTEXT FUNCTIONS

       pcre2_compile_context *pcre2_compile_context_create(
         pcre2_general_context *gcontext);

       pcre2_compile_context *pcre2_compile_context_copy(
         pcre2_compile_context *ccontext);

       void pcre2_compile_context_free(pcre2_compile_context *ccontext);

       int pcre2_set_bsr(pcre2_compile_context *ccontext,
         uint32_t value);

       int pcre2_set_character_tables(pcre2_compile_context *ccontext,
         const unsigned char *tables);

       int pcre2_set_max_pattern_length(pcre2_compile_context *ccontext,
         PCRE2_SIZE value);

       int pcre2_set_newline(pcre2_compile_context *ccontext,
         uint32_t value);

       int pcre2_set_parens_nest_limit(pcre2_compile_context *ccontext,
         uint32_t value);

       int pcre2_set_compile_recursion_guard(pcre2_compile_context *ccontext,
         int (*guard_function)(uint32_t, void *), void *user_data);


PCRE2 NATIVE API MATCH CONTEXT FUNCTIONS

       pcre2_match_context *pcre2_match_context_create(
         pcre2_general_context *gcontext);

       pcre2_match_context *pcre2_match_context_copy(
         pcre2_match_context *mcontext);

       void pcre2_match_context_free(pcre2_match_context *mcontext);

       int pcre2_set_callout(pcre2_match_context *mcontext,
         int (*callout_function)(pcre2_callout_block *, void *),
         void *callout_data);

       int pcre2_set_match_limit(pcre2_match_context *mcontext,
         uint32_t value);

       int pcre2_set_offset_limit(pcre2_match_context *mcontext,
         PCRE2_SIZE value);

       int pcre2_set_recursion_limit(pcre2_match_context *mcontext,
         uint32_t value);

       int pcre2_set_recursion_memory_management(
         pcre2_match_context *mcontext,
         void *(*private_malloc)(PCRE2_SIZE, void *),
         void (*private_free)(void *, void *), void *memory_data);


PCRE2 NATIVE API STRING EXTRACTION FUNCTIONS

       int pcre2_substring_copy_byname(pcre2_match_data *match_data,
         PCRE2_SPTR name, PCRE2_UCHAR *buffer, PCRE2_SIZE *bufflen);

       int pcre2_substring_copy_bynumber(pcre2_match_data *match_data,
         uint32_t number, PCRE2_UCHAR *buffer,
         PCRE2_SIZE *bufflen);

       void pcre2_substring_free(PCRE2_UCHAR *buffer);

       int pcre2_substring_get_byname(pcre2_match_data *match_data,
         PCRE2_SPTR name, PCRE2_UCHAR **bufferptr, PCRE2_SIZE *bufflen);

       int pcre2_substring_get_bynumber(pcre2_match_data *match_data,
         uint32_t number, PCRE2_UCHAR **bufferptr,
         PCRE2_SIZE *bufflen);

       int pcre2_substring_length_byname(pcre2_match_data *match_data,
         PCRE2_SPTR name, PCRE2_SIZE *length);

       int pcre2_substring_length_bynumber(pcre2_match_data *match_data,
         uint32_t number, PCRE2_SIZE *length);

       int pcre2_substring_nametable_scan(const pcre2_code *code,
         PCRE2_SPTR name, PCRE2_SPTR *first, PCRE2_SPTR *last);

       int pcre2_substring_number_from_name(const pcre2_code *code,
         PCRE2_SPTR name);

       void pcre2_substring_list_free(PCRE2_SPTR *list);

       int pcre2_substring_list_get(pcre2_match_data *match_data,
         PCRE2_UCHAR ***listptr, PCRE2_SIZE **lengthsptr);


PCRE2 NATIVE API STRING SUBSTITUTION FUNCTION

       int pcre2_substitute(const pcre2_code *code, PCRE2_SPTR subject,
         PCRE2_SIZE length, PCRE2_SIZE startoffset,
         uint32_t options, pcre2_match_data *match_data,
         pcre2_match_context *mcontext, PCRE2_SPTR replacement,
         PCRE2_SIZE rlength, PCRE2_UCHAR *outputbuffer,
         PCRE2_SIZE *outlengthptr);


PCRE2 NATIVE API JIT FUNCTIONS

       int pcre2_jit_compile(pcre2_code *code, uint32_t options);

       int pcre2_jit_match(const pcre2_code *code, PCRE2_SPTR subject,
         PCRE2_SIZE length, PCRE2_SIZE startoffset,
         uint32_t options, pcre2_match_data *match_data,
         pcre2_match_context *mcontext);

       void pcre2_jit_free_unused_memory(pcre2_general_context *gcontext);

       pcre2_jit_stack *pcre2_jit_stack_create(PCRE2_SIZE startsize,
         PCRE2_SIZE maxsize, pcre2_general_context *gcontext);

       void pcre2_jit_stack_assign(pcre2_match_context *mcontext,
         pcre2_jit_callback callback_function, void *callback_data);

       void pcre2_jit_stack_free(pcre2_jit_stack *jit_stack);


PCRE2 NATIVE API SERIALIZATION FUNCTIONS

       int32_t pcre2_serialize_decode(pcre2_code **codes,
         int32_t number_of_codes, const uint8_t *bytes,
         pcre2_general_context *gcontext);

       int32_t pcre2_serialize_encode(const pcre2_code **codes,
         int32_t number_of_codes, uint8_t **serialized_bytes,
         PCRE2_SIZE *serialized_size, pcre2_general_context *gcontext);

       void pcre2_serialize_free(uint8_t *bytes);

       int32_t pcre2_serialize_get_number_of_codes(const uint8_t *bytes);


PCRE2 NATIVE API AUXILIARY FUNCTIONS

       pcre2_code *pcre2_code_copy(const pcre2_code *code);

       pcre2_code *pcre2_code_copy_with_tables(const pcre2_code *code);

       int pcre2_get_error_message(int errorcode, PCRE2_UCHAR *buffer,
         PCRE2_SIZE bufflen);

       const unsigned char *pcre2_maketables(pcre2_general_context *gcontext);

       int pcre2_pattern_info(const pcre2_code *code, uint32_t what,
         void *where);

       int pcre2_callout_enumerate(const pcre2_code *code,
         int (*callback)(pcre2_callout_enumerate_block *, void *),
         void *user_data);

       int pcre2_config(uint32_t what, void *where);


PCRE2 8-BIT, 16-BIT, AND 32-BIT LIBRARIES

       There  are  three PCRE2 libraries, supporting 8-bit, 16-bit, and 32-bit
       code units, respectively. However,  there  is  just  one  header  file,
       pcre2.h.   This  contains the function prototypes and other definitions
       for all three libraries. One, two, or all three can be installed simul-
       taneously.  On  Unix-like  systems the libraries are called libpcre2-8,
       libpcre2-16, and libpcre2-32, and they can also co-exist with the orig-
       inal PCRE libraries.

       Character  strings are passed to and from a PCRE2 library as a sequence
       of unsigned integers in code units  of  the  appropriate  width.  Every
       PCRE2  function  comes  in three different forms, one for each library,
       for example:

         pcre2_compile_8()
         pcre2_compile_16()
         pcre2_compile_32()

       There are also three different sets of data types:

         PCRE2_UCHAR8, PCRE2_UCHAR16, PCRE2_UCHAR32
         PCRE2_SPTR8,  PCRE2_SPTR16,  PCRE2_SPTR32

       The UCHAR types define unsigned code units of the  appropriate  widths.
       For  example,  PCRE2_UCHAR16 is usually defined as `uint16_t'. The SPTR
       types are constant pointers to the equivalent  UCHAR  types,  that  is,
       they are pointers to vectors of unsigned code units.

       Many  applications use only one code unit width. For their convenience,
       macros are defined whose names are the generic forms such as pcre2_com-
       pile()  and  PCRE2_SPTR.  These  macros  use  the  value  of  the macro
       PCRE2_CODE_UNIT_WIDTH to generate the appropriate width-specific  func-
       tion and macro names.  PCRE2_CODE_UNIT_WIDTH is not defined by default.
       An application must define it to be  8,  16,  or  32  before  including
       pcre2.h in order to make use of the generic names.

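       The way a width macro can select a width-specific name may be easier to
       see in a small stand-alone sketch. The code below does not use pcre2.h,
       and every name in it (the demo_ functions and the DEMO_ macros) is
       hypothetical; it only shows the general token-pasting technique, under
       the assumption that a generic name is built by gluing the width on to a
       common prefix.

```c
/* Stand-ins for three width-specific functions. In a real library each
   would live in a separate library file; all names are hypothetical. */
int demo_width_8(void)  { return 8; }
int demo_width_16(void) { return 16; }
int demo_width_32(void) { return 32; }

/* The application chooses the width before the generic names are set up. */
#define DEMO_CODE_UNIT_WIDTH 16

/* Two levels of macro are needed so that DEMO_CODE_UNIT_WIDTH is expanded
   to its value before it is pasted on to the name. */
#define DEMO_JOIN(a, b)   a ## b
#define DEMO_GLUE(a, b)   DEMO_JOIN(a, b)
#define DEMO_SUFFIX(name) DEMO_GLUE(name, DEMO_CODE_UNIT_WIDTH)

/* Generic name: with the setting above this expands to demo_width_16. */
#define demo_width DEMO_SUFFIX(demo_width_)
```

       With the width set to 16 as above, a call to demo_width() is compiled
       as a call to demo_width_16(). Because the choice is made by the
       preprocessor, the width macro has to be defined before the generic
       names are used.
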
       Applications  that use more than one code unit width can be linked with
       more than one PCRE2 library, but must define  PCRE2_CODE_UNIT_WIDTH  to
       be  0  before  including pcre2.h, and then use the real function names.
       Any code that is to be included in an environment where  the  value  of
       PCRE2_CODE_UNIT_WIDTH  is  unknown  should  also  use the real function
       names. (Unfortunately, it is not possible in C code to save and restore
       the value of a macro.)

       If  PCRE2_CODE_UNIT_WIDTH  is  not  defined before including pcre2.h, a
       compiler error occurs.

       When using multiple libraries in an application,  you  must  take  care
       when  processing  any  particular  pattern to use only functions from a
       single library.  For example, if you want to run a match using  a  pat-
       tern  that  was  compiled  with pcre2_compile_16(), you must do so with
       pcre2_match_16(), not pcre2_match_8().

       In the function summaries above, and in the rest of this  document  and
       other  PCRE2  documents,  functions  and data types are described using
       their generic names, without the 8, 16, or 32 suffix.


PCRE2 API OVERVIEW

       PCRE2 has its own native API, which  is  described  in  this  document.
       There are also some wrapper functions for the 8-bit library that corre-
       spond to the POSIX regular expression API, but they do not give  access
       to all the functionality. They are described in the pcre2posix documen-
       tation. Both these APIs define a set of C function calls.

       The native API C data types, function prototypes,  option  values,  and
       error codes are defined in the header file pcre2.h, which contains def-
       initions of PCRE2_MAJOR and PCRE2_MINOR, the major  and  minor  release
       numbers  for the library. Applications can use these to include support
       for different releases of PCRE2.

       In a Windows environment, if you want to statically link an application
       program  against  a non-dll PCRE2 library, you must define PCRE2_STATIC
       before including pcre2.h.

       The functions pcre2_compile() and pcre2_match() are used for  compiling
       and  matching regular expressions in a Perl-compatible manner. A sample
       program that demonstrates the simplest way of using them is provided in
       the file called pcre2demo.c in the PCRE2 source distribution. A listing
       of this program is  given  in  the  pcre2demo  documentation,  and  the
       pcre2sample documentation describes how to compile and run it.

       Just-in-time  compiler support is an optional feature of PCRE2 that can
       be built in appropriate hardware environments. It greatly speeds up the
       matching  performance of many patterns. Programs can request that it be
       used if available, by calling pcre2_jit_compile() after a  pattern  has
       been successfully compiled by pcre2_compile(). This does nothing if JIT
       support is not available.

       More complicated programs might need to  make  use  of  the  specialist
       functions    pcre2_jit_stack_create(),    pcre2_jit_stack_free(),   and
       pcre2_jit_stack_assign() in order to  control  the  JIT  code's  memory
       usage.

       JIT matching is automatically used by pcre2_match() if it is available,
       unless the PCRE2_NO_JIT option is set. There is also a direct interface
       for  JIT  matching,  which gives improved performance. The JIT-specific
       functions are discussed in the pcre2jit documentation.

       A second matching function, pcre2_dfa_match(), which is  not  Perl-com-
       patible,  is  also  provided.  This  uses a different algorithm for the
       matching. The alternative algorithm finds all possible  matches  (at  a
       given  point  in  the subject), and scans the subject just once (unless
       there are lookbehind assertions).  However,  this  algorithm  does  not
       return  captured  substrings.  A  description of the two matching algo-
       rithms  and  their  advantages  and  disadvantages  is  given  in   the
       pcre2matching    documentation.   There   is   no   JIT   support   for
       pcre2_dfa_match().

       In addition to the main compiling and  matching  functions,  there  are
       convenience functions for extracting captured substrings from a subject
       string that has been matched by pcre2_match(). They are:

         pcre2_substring_copy_byname()
         pcre2_substring_copy_bynumber()
         pcre2_substring_get_byname()
         pcre2_substring_get_bynumber()
         pcre2_substring_list_get()
         pcre2_substring_length_byname()
         pcre2_substring_length_bynumber()
         pcre2_substring_nametable_scan()
         pcre2_substring_number_from_name()

       pcre2_substring_free() and pcre2_substring_list_free()  are  also  pro-
       vided, to free the memory used for extracted strings.

       The  function  pcre2_substitute()  can be called to match a pattern and
       return a copy of the subject string with substitutions for  parts  that
       were matched.

       Functions  whose  names begin with pcre2_serialize_ are used for saving
       compiled patterns on disc or elsewhere, and reloading them later.

       Finally, there are functions for finding out information about  a  com-
       piled  pattern  (pcre2_pattern_info()) and about the configuration with
       which PCRE2 was built (pcre2_config()).

       Functions with names ending with _free() are used  for  freeing  memory
       blocks  of  various  sorts.  In all cases, if one of these functions is
       called with a NULL argument, it does nothing.


STRING LENGTHS AND OFFSETS

       The PCRE2 API uses string lengths and  offsets  into  strings  of  code
       units  in  several  places. These values are always of type PCRE2_SIZE,
       which is an unsigned integer type, currently always defined as  size_t.
       The  largest  value  that  can  be  stored  in  such  a  type  (that is
       ~(PCRE2_SIZE)0) is reserved as a special indicator for  zero-terminated
       strings  and  unset offsets.  Therefore, the longest string that can be
       handled is one less than this maximum.

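       The reserved maximum value can be pictured with a short stand-alone
       sketch. It does not include pcre2.h; demo_size, DEMO_ZERO_TERMINATED,
       and demo_resolve_length() are hypothetical stand-ins that only mirror
       the idea of one all-ones value acting as a "measure the string
       yourself" marker.

```c
#include <stddef.h>
#include <string.h>

/* Stand-in for PCRE2_SIZE, assumed here to be size_t. */
typedef size_t demo_size;

/* Stand-in for the reserved maximum value, written as in the text above. */
#define DEMO_ZERO_TERMINATED (~(demo_size)0)

/* Resolve a length argument: the sentinel means "measure the string up to
   its terminating zero"; any other value is taken as the real length. */
demo_size demo_resolve_length(const char *s, demo_size length)
{
    return (length == DEMO_ZERO_TERMINATED) ? strlen(s) : length;
}
```

       Because the all-ones value is taken by the marker, it can never be
       passed as a genuine length, which is why the longest usable string is
       one code unit shorter than the type would otherwise allow.
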
NEWLINES

       PCRE2 supports five different conventions for indicating line breaks in
       strings:  a  single  CR (carriage return) character, a single LF (line-
       feed) character, the two-character sequence CRLF, any of the three pre-
       ceding,  or any Unicode newline sequence. The Unicode newline sequences
       are the three just mentioned, plus the single characters  VT  (vertical
       tab, U+000B), FF (form feed, U+000C), NEL (next line, U+0085), LS (line
       separator, U+2028), and PS (paragraph separator, U+2029).

       Each of the first three conventions is used by at least  one  operating
       system as its standard newline sequence. When PCRE2 is built, a default
       can be specified.  If none is specified, the default is  LF,  which  is
       the  Unix  standard.  However, the newline convention can be changed by
       an application when calling pcre2_compile(), or it can be specified  by
       special  text at the start of the pattern itself; this overrides any
       other settings. See the pcre2pattern page for details  of  the  special
       character sequences.

       In the PCRE2 documentation the word "newline"  is  used  to  mean  "the
       character or pair of characters that indicate a line break". The choice
       of newline convention affects the handling of the dot, circumflex,  and
       dollar metacharacters, the handling of #-comments in /x mode, and, when
       CRLF is a recognized line ending sequence, the match position  advance-
       ment for a non-anchored pattern. There is more detail about this in the
       section on pcre2_match() options below.

       The choice of newline convention does not affect the interpretation  of
       the \n or \r escape sequences, nor does it affect what \R matches; this
       has its own separate convention.


MULTITHREADING
 | 
						|
 | 
						|
       In a multithreaded application it is important to keep  thread-specific
 | 
						|
       data  separate  from data that can be shared between threads. The PCRE2
 | 
						|
       library code itself is thread-safe: it contains  no  static  or  global
 | 
						|
       variables.  The  API  is  designed to be fairly simple for non-threaded
 | 
						|
       applications while at the same time ensuring that multithreaded  appli-
 | 
						|
       cations can use it.
 | 
						|
 | 
						|
       There are several different blocks of data that are used to pass infor-
 | 
						|
       mation between the application and the PCRE2 libraries.
 | 
						|
 | 
						|
   The compiled pattern
 | 
						|
 | 
						|
       A pointer to the compiled form of a pattern is  returned  to  the  user
 | 
						|
       when pcre2_compile() is successful. The data in the compiled pattern is
 | 
						|
       fixed, and does not change when the pattern is matched.  Therefore,  it
 | 
						|
       is  thread-safe, that is, the same compiled pattern can be used by more
 | 
						|
       than one thread simultaneously. For example, an application can compile
 | 
						|
       all its patterns at the start, before forking off multiple threads that
 | 
						|
       use them. However, if the just-in-time optimization  feature  is  being
 | 
						|
       used,  it  needs  separate  memory stack areas for each thread. See the
 | 
						|
       pcre2jit documentation for more details.
 | 
						|
 | 
						|
       In a more complicated situation, where patterns are compiled only  when
 | 
						|
       they  are  first needed, but are still shared between threads, pointers
 | 
						|
       to compiled patterns must be protected  from  simultaneous  writing  by
 | 
						|
       multiple threads, at least until a pattern has been compiled. The logic
 | 
						|
       can be something like this:
 | 
						|
 | 
						|
         Get a read-only (shared) lock (mutex) for pointer
 | 
						|
         if (pointer == NULL)
 | 
						|
           {
 | 
						|
           Get a write (unique) lock for pointer
 | 
						|
           pointer = pcre2_compile(...
 | 
						|
           }
 | 
						|
         Release the lock
 | 
						|
         Use pointer in pcre2_match()
 | 
						|
 | 
						|
       Of course, testing for compilation errors should also  be  included  in
 | 
						|
       the code.

       If JIT is being used, but the JIT compilation is not being done
       immediately (perhaps waiting to see if the pattern is used often
       enough), similar logic is required. JIT compilation updates a pointer
       within the compiled code block, so a thread must gain unique write
       access to the pointer before calling pcre2_jit_compile(). Alternatively,
       pcre2_code_copy() or pcre2_code_copy_with_tables() can be used to
       obtain a private copy of the compiled code.

   Context blocks

       The  next main section below introduces the idea of "contexts" in which
       PCRE2 functions are called. A context is nothing more than a collection
       of parameters that control the way PCRE2 operates. Grouping a number of
       parameters together in a context is a convenient way of passing them to
       a  PCRE2  function without using lots of arguments. The parameters that
       are stored in contexts are in some sense  "advanced  features"  of  the
       API. Many straightforward applications will not need to use contexts.

       In a multithreaded application, if the parameters in a context are val-
       ues that are never changed, the same context can be  used  by  all  the
       threads. However, if any thread needs to change any value in a context,
       it must make its own thread-specific copy.

   Match blocks

       The matching functions need a block of memory for working space and for
       storing  the  results  of  a  match.  This includes details of what was
       matched, as well as additional  information  such  as  the  name  of  a
       (*MARK) setting. Each thread must provide its own copy of this memory.


PCRE2 CONTEXTS

       Some  PCRE2  functions have a lot of parameters, many of which are used
       only by specialist applications, for example,  those  that  use  custom
       memory  management  or  non-standard character tables. To keep function
       argument lists at a reasonable size, and at the same time to  keep  the
       API  extensible,  "uncommon" parameters are passed to certain functions
       in a context instead of directly. A context is just a block  of  memory
       that  holds  the  parameter  values.   Applications that do not need to
       adjust any of the context parameters  can  pass  NULL  when  a  context
       pointer is required.

       There  are  three different types of context: a general context that is
       relevant for several PCRE2 operations, a compile-time  context,  and  a
       match-time context.

   The general context

       At  present,  this  context  just  contains  pointers to (and data for)
       external memory management  functions  that  are  called  from  several
       places in the PCRE2 library. The context is named "general" rather than
       specifically "memory" because in future other fields may be  added.  If
       you  do not want to supply your own custom memory management functions,
       you do not need to bother with a general context. A general context  is
       created by:

       pcre2_general_context *pcre2_general_context_create(
         void *(*private_malloc)(PCRE2_SIZE, void *),
         void (*private_free)(void *, void *), void *memory_data);

       The  two  function pointers specify custom memory management functions,
       whose prototypes are:

         void *private_malloc(PCRE2_SIZE, void *);
         void  private_free(void *, void *);

       Whenever code in PCRE2 calls these functions, the final argument is the
       value of memory_data. Either of the first two arguments of the creation
       function may be NULL, in which case the system memory management  func-
       tions  malloc()  and free() are used. (This is not currently useful, as
       there are no other fields in a general context,  but  in  future  there
       might  be.)   The  private_malloc()  function  is used (if supplied) to
       obtain memory for storing the context, and all three values  are  saved
       as part of the context.

       Whenever  PCRE2  creates a data block of any kind, the block contains a
       pointer to the free() function that matches the malloc() function  that
       was  used.  When  the  time  comes  to free the block, this function is
       called.

       A general context can be copied by calling:

       pcre2_general_context *pcre2_general_context_copy(
         pcre2_general_context *gcontext);

       The memory used for a general context should be freed by calling:

       void pcre2_general_context_free(pcre2_general_context *gcontext);


   The compile context

       A compile context is required if you want to change the default  values
       of any of the following compile-time parameters:

         What \R matches (Unicode newlines or CR, LF, CRLF only)
         PCRE2's character tables
         The newline character sequence
         The compile time nested parentheses limit
         The maximum length of the pattern string
         An external function for stack checking

       A  compile context is also required if you are using custom memory man-
       agement.  If none of these apply, just pass NULL as the  context  argu-
       ment of pcre2_compile().

       A  compile context is created, copied, and freed by the following func-
       tions:

       pcre2_compile_context *pcre2_compile_context_create(
         pcre2_general_context *gcontext);

       pcre2_compile_context *pcre2_compile_context_copy(
         pcre2_compile_context *ccontext);

       void pcre2_compile_context_free(pcre2_compile_context *ccontext);

       A compile context is created with default values  for  its  parameters.
       These can be changed by calling the following functions, which return 0
       on success, or PCRE2_ERROR_BADDATA if invalid data is detected.

       int pcre2_set_bsr(pcre2_compile_context *ccontext,
         uint32_t value);

       The value must be PCRE2_BSR_ANYCRLF, to specify that  \R  matches  only
       CR,  LF,  or CRLF, or PCRE2_BSR_UNICODE, to specify that \R matches any
       Unicode line ending sequence. The value is used by the JIT compiler and
       by   the   two   interpreted   matching  functions,  pcre2_match()  and
       pcre2_dfa_match().

       int pcre2_set_character_tables(pcre2_compile_context *ccontext,
         const unsigned char *tables);

       The value must be the result of a  call  to  pcre2_maketables(),  whose
       only argument is a general context. This function builds a set of char-
       acter tables in the current locale.

       int pcre2_set_max_pattern_length(pcre2_compile_context *ccontext,
         PCRE2_SIZE value);

       This sets a maximum length, in code units, for the pattern string  that
       is  to  be  compiled.  If the pattern is longer, an error is generated.
       This facility is provided so that  applications  that  accept  patterns
       from  external sources can limit their size. The default is the largest
       number that a PCRE2_SIZE variable can hold, which is effectively unlim-
       ited.

       int pcre2_set_newline(pcre2_compile_context *ccontext,
         uint32_t value);

       This specifies which characters or character sequences are to be recog-
       nized as newlines. The value must be one of PCRE2_NEWLINE_CR  (carriage
       return only), PCRE2_NEWLINE_LF (linefeed only), PCRE2_NEWLINE_CRLF (the
       two-character sequence CR followed by LF),  PCRE2_NEWLINE_ANYCRLF  (any
       of the above), or PCRE2_NEWLINE_ANY (any Unicode newline sequence).

       When a pattern is compiled with the PCRE2_EXTENDED option, the value of
       this parameter affects the recognition of white space and  the  end  of
       internal comments starting with #. The value is saved with the compiled
       pattern for subsequent use by the JIT compiler and by  the  two  inter-
       preted matching functions, pcre2_match() and pcre2_dfa_match().

       int pcre2_set_parens_nest_limit(pcre2_compile_context *ccontext,
         uint32_t value);

       This parameter adjusts the limit, set when PCRE2 is built (default 250),
       on the depth of parenthesis nesting in  a  pattern.  This  limit  stops
       rogue  patterns using up too much system stack when being compiled. The
       limit applies to parentheses of all kinds, not just capturing parenthe-
       ses.

       int pcre2_set_compile_recursion_guard(pcre2_compile_context *ccontext,
         int (*guard_function)(uint32_t, void *), void *user_data);

       There  is at least one application that runs PCRE2 in threads with very
       limited system stack, where running out of stack is to  be  avoided  at
       all  costs. The parenthesis limit above cannot take account of how much
       stack is actually available. For a finer  control,  you  can  supply  a
       function  that  is  called whenever pcre2_compile() starts to compile a
       parenthesized part of a pattern. This function  can  check  the  actual
       stack size (or anything else that it wants to, of course).

       The first argument to the guard function gives the current depth of
       nesting, and the second is user data that is set up by the last
       argument of pcre2_set_compile_recursion_guard(). The guard function
       should return zero if all is well, or non-zero to force an error.

   The match context

       A match context is required if you want to change the default values of
       any of the following match-time parameters:

         A callout function
         The offset limit for matching an unanchored pattern
         The limit for calling match() (see below)
         The limit for calling match() recursively

       A match context is also required if you are using custom memory manage-
       ment.  If none of these apply, just pass NULL as the  context  argument
       of pcre2_match(), pcre2_dfa_match(), or pcre2_jit_match().

       A  match  context  is created, copied, and freed by the following func-
       tions:

       pcre2_match_context *pcre2_match_context_create(
         pcre2_general_context *gcontext);

       pcre2_match_context *pcre2_match_context_copy(
         pcre2_match_context *mcontext);

       void pcre2_match_context_free(pcre2_match_context *mcontext);

       A match context is created with  default  values  for  its  parameters.
       These can be changed by calling the following functions, which return 0
       on success, or PCRE2_ERROR_BADDATA if invalid data is detected.

       int pcre2_set_callout(pcre2_match_context *mcontext,
         int (*callout_function)(pcre2_callout_block *, void *),
         void *callout_data);

       This sets up a "callout" function, which PCRE2 will call  at  specified
       points during a matching operation. Details are given in the pcre2call-
       out documentation.

       int pcre2_set_offset_limit(pcre2_match_context *mcontext,
         PCRE2_SIZE value);

       The offset_limit parameter limits how  far  an  unanchored  search  can
       advance  in  the  subject string. The default value is PCRE2_UNSET. The
       pcre2_match()     and      pcre2_dfa_match()      functions      return
       PCRE2_ERROR_NOMATCH  if  a match with a starting point before or at the
       given offset is not found. For example, if the pattern /abc/ is matched
       against  "123abc"  with  an  offset  limit  less  than 3, the result is
       PCRE2_ERROR_NOMATCH. A match can never be found if the startoffset
       argument of pcre2_match() or pcre2_dfa_match() is greater than the
       offset limit.

       When using this facility,  you  must  set  PCRE2_USE_OFFSET_LIMIT  when
       calling  pcre2_compile() so that when JIT is in use, different code can
       be compiled. If a match is started with a non-default offset limit when
       PCRE2_USE_OFFSET_LIMIT is not set, an error is generated.

       The  offset limit facility can be used to track progress when searching
       large subject strings.  See  also  the  PCRE2_FIRSTLINE  option,  which
       requires a match to start within the first line of the subject. If this
       is set with an offset limit, a match must occur in the first  line  and
       also  within  the  offset limit.  In other words, whichever limit comes
       first is used.

       int pcre2_set_match_limit(pcre2_match_context *mcontext,
         uint32_t value);

       The match_limit parameter provides a means  of  preventing  PCRE2  from
       using up too many resources when processing patterns that are not going
       to match, but which have a very large number of possibilities in  their
       search  trees. The classic example is a pattern that uses nested unlim-
       ited repeats.

       Internally, pcre2_match() uses a  function  called  match(),  which  it
       calls  repeatedly (sometimes recursively). The limit set by match_limit
       is imposed on the number of times this  function  is  called  during  a
       match, which has the effect of limiting the amount of backtracking that
       can take place. For patterns that are not anchored, the count  restarts
       from  zero  for  each position in the subject string. This limit is not
       relevant to pcre2_dfa_match(), which ignores it.

       When pcre2_match() is called with a pattern that was successfully  pro-
       cessed by pcre2_jit_compile(), the way in which matching is executed is
       entirely different. However, there is still the possibility of  runaway
       matching  that  goes  on  for  a very long time, and so the match_limit
       value is also used in this case (but in a different way) to  limit  how
       long the matching can continue.

       The  default  value  for  the limit can be set when PCRE2 is built; the
       default default is 10 million, which handles all but the  most  extreme
       cases.    If    the    limit   is   exceeded,   pcre2_match()   returns
       PCRE2_ERROR_MATCHLIMIT. A value for the match limit may  also  be  sup-
       plied by an item at the start of a pattern of the form

         (*LIMIT_MATCH=ddd)

       where  ddd  is  a  decimal  number.  However, such a setting is ignored
       unless ddd is less than the limit set by the  caller  of  pcre2_match()
       or, if no such limit is set, less than the default.

       int pcre2_set_recursion_limit(pcre2_match_context *mcontext,
         uint32_t value);

       The recursion_limit parameter is similar to match_limit, but instead of
       limiting the total number of times that match() is  called,  it  limits
       the  depth  of  recursion. The recursion depth is a smaller number than
       the total number of calls, because not all calls to match() are  recur-
       sive.  This limit is of use only if it is set smaller than match_limit.

       Limiting the recursion depth limits the amount of system stack that can
       be used, or, when PCRE2 has been compiled to use  memory  on  the  heap
       instead  of the stack, the amount of heap memory that can be used. This
       limit is not relevant, and is ignored, when matching is done using  JIT
       compiled  code.  However,  it  is supported by pcre2_dfa_match(), which
       uses recursive function calls less frequently than  pcre2_match(),  but
       which  can  be caused to use a lot of stack by a recursive pattern such
       as /(.)(?1)/ matched to a very long string.

       The default value for recursion_limit can be set when PCRE2  is  built;
       the  default  default is the same value as the default for match_limit.
       If the limit is exceeded, pcre2_match()  and  pcre2_dfa_match()  return
       PCRE2_ERROR_RECURSIONLIMIT. A value for the recursion limit may also be
       supplied by an item at the start of a pattern of the form

         (*LIMIT_RECURSION=ddd)

       where ddd is a decimal number.  However,  such  a  setting  is  ignored
       unless ddd is less than the limit set by the caller of pcre2_match() or
       pcre2_dfa_match() or, if no such limit is set, less than the default.

       int pcre2_set_recursion_memory_management(
         pcre2_match_context *mcontext,
         void *(*private_malloc)(PCRE2_SIZE, void *),
         void (*private_free)(void *, void *), void *memory_data);

       This function sets up two additional custom memory management functions
       for  use  by  pcre2_match()  when PCRE2 is compiled to use the heap for
       remembering backtracking data, instead of recursive function calls that
       use  the  system stack. There is a discussion about PCRE2's stack usage
       in the pcre2stack documentation. See the pcre2build  documentation  for
       details of how to build PCRE2.

       Using  the  heap for recursion is a non-standard way of building PCRE2,
       for use in environments  that  have  limited  stacks.  Because  of  the
       greater use of memory management, pcre2_match() runs more slowly.
       Functions that are different from the general custom memory functions are
       provided  so  that  special-purpose  external code can be used for this
       case, because the memory blocks are all the same size. The  blocks  are
       retained by pcre2_match() until it is about to exit so that they can be
       re-used when possible during the match. In the absence of  these  func-
       tions,  the normal custom memory management functions are used, if sup-
       plied, otherwise the system functions.


CHECKING BUILD-TIME OPTIONS

       int pcre2_config(uint32_t what, void *where);

       The function pcre2_config() makes it possible for  a  PCRE2  client  to
       discover  which  optional  features  have  been compiled into the PCRE2
       library. The pcre2build documentation  has  more  details  about  these
       optional features.

       The  first  argument  for pcre2_config() specifies which information is
       required. The second argument is a pointer to  memory  into  which  the
       information  is  placed.  If  NULL  is passed, the function returns the
       amount of memory that is needed  for  the  requested  information.  For
       calls  that  return  numerical  values,  the  value  is  in bytes; when
       requesting these values, where should point  to  appropriately  aligned
       memory.  For calls that return strings, the required length is given in
       code units, not counting the terminating zero.

       When requesting information, the returned value from pcre2_config()  is
       non-negative  on success, or the negative error code PCRE2_ERROR_BADOP-
       TION if the value in the first argument is not recognized. The  follow-
       ing information is available:

         PCRE2_CONFIG_BSR

       The  output  is a uint32_t integer whose value indicates what character
       sequences the \R  escape  sequence  matches  by  default.  A  value  of
       PCRE2_BSR_UNICODE  means  that  \R  matches  any  Unicode  line  ending
       sequence; a value of PCRE2_BSR_ANYCRLF means that \R matches  only  CR,
       LF, or CRLF. The default can be overridden when a pattern is compiled.

         PCRE2_CONFIG_JIT

       The  output  is  a  uint32_t  integer that is set to one if support for
       just-in-time compiling is available; otherwise it is set to zero.
 | 
						|
 | 
						|
         PCRE2_CONFIG_JITTARGET
 | 
						|
 | 
						|
       The where argument should point to a buffer that is at  least  48  code
 | 
						|
       units  long.  (The  exact  length  required  can  be  found  by calling
 | 
						|
       pcre2_config() with where set to NULL.) The buffer  is  filled  with  a
 | 
						|
       string  that  contains  the  name of the architecture for which the JIT
 | 
						|
       compiler is  configured,  for  example  "x86  32bit  (little  endian  +
 | 
						|
       unaligned)".  If JIT support is not available, PCRE2_ERROR_BADOPTION is
 | 
						|
       returned, otherwise the number of code units used is returned. This  is
 | 
						|
       the length of the string, plus one unit for the terminating zero.
 | 
						|
 | 
						|
         PCRE2_CONFIG_LINKSIZE
 | 
						|
 | 
						|
       The output is a uint32_t integer that contains the number of bytes used
 | 
						|
       for internal linkage in compiled regular  expressions.  When  PCRE2  is
 | 
						|
       configured,  the value can be set to 2, 3, or 4, with the default being
 | 
						|
       2. This is the value that is returned by pcre2_config(). However,  when
 | 
						|
       the  16-bit  library  is compiled, a value of 3 is rounded up to 4, and
 | 
						|
       when the 32-bit library is compiled, internal  linkages  always  use  4
 | 
						|
       bytes, so the configured value is not relevant.
 | 
						|
 | 
						|
       The default value of 2 for the 8-bit and 16-bit libraries is sufficient
 | 
						|
       for all but the most massive patterns, since it allows the size of  the
 | 
						|
       compiled pattern to be up to 64K code units. Larger values allow larger
 | 
						|
       regular expressions to be compiled by those two libraries, but  at  the
 | 
						|
       expense of slower matching.
 | 
						|
 | 
						|
         PCRE2_CONFIG_MATCHLIMIT
 | 
						|
 | 
						|
       The  output  is a uint32_t integer that gives the default limit for the
 | 
						|
       number of internal matching function calls in  a  pcre2_match()  execu-
 | 
						|
       tion. Further details are given with pcre2_match() below.
 | 
						|
 | 
						|
         PCRE2_CONFIG_NEWLINE
 | 
						|
 | 
						|
       The  output  is  a  uint32_t  integer whose value specifies the default
 | 
						|
       character sequence that is recognized as meaning "newline". The  values
 | 
						|
       are:
 | 
						|
 | 
						|
         PCRE2_NEWLINE_CR       Carriage return (CR)
 | 
						|
         PCRE2_NEWLINE_LF       Linefeed (LF)
 | 
						|
         PCRE2_NEWLINE_CRLF     Carriage return, linefeed (CRLF)
 | 
						|
         PCRE2_NEWLINE_ANY      Any Unicode line ending
 | 
						|
         PCRE2_NEWLINE_ANYCRLF  Any of CR, LF, or CRLF
 | 
						|
 | 
						|
       The  default  should  normally  correspond to the standard sequence for
 | 
						|
       your operating system.
 | 
						|
 | 
						|
         PCRE2_CONFIG_PARENSLIMIT
 | 
						|
 | 
						|
       The output is a uint32_t integer that gives the maximum depth of  nest-
 | 
						|
       ing of parentheses (of any kind) in a pattern. This limit is imposed to
 | 
						|
       cap the amount of system stack used when a pattern is compiled.  It  is
 | 
						|
       specified  when PCRE2 is built; the default is 250. This limit does not
 | 
						|
       take into account the stack that may already be  used  by  the  calling
 | 
						|
       application.  For  finer  control  over  compilation  stack  usage, see
 | 
						|
       pcre2_set_compile_recursion_guard().
 | 
						|
 | 
						|
         PCRE2_CONFIG_RECURSIONLIMIT

       The output is a uint32_t integer that gives the default limit for the
       depth of recursion when calling the internal matching function in a
       pcre2_match() execution. Further details are given with pcre2_match()
       below.

         PCRE2_CONFIG_STACKRECURSE

       The output is a uint32_t integer that is set to one if internal
       recursion when running pcre2_match() is implemented by recursive
       function calls that use the system stack to remember their state. This
       is the usual way that PCRE2 is compiled. The output is zero if PCRE2
       was compiled to use blocks of data on the heap instead of recursive
       function calls.

         PCRE2_CONFIG_UNICODE_VERSION

       The where argument should point to a buffer that is at least 24 code
       units long. (The exact length required can be found by calling
       pcre2_config() with where set to NULL.) If PCRE2 has been compiled
       without Unicode support, the buffer is filled with the text "Unicode
       not supported". Otherwise, the Unicode version string (for example,
       "8.0.0") is inserted. The number of code units used is returned. This
       is the length of the string plus one unit for the terminating zero.

         PCRE2_CONFIG_UNICODE

       The output is a uint32_t integer that is set to one if Unicode support
       is available; otherwise it is set to zero. Unicode support implies UTF
       support.

         PCRE2_CONFIG_VERSION

       The where argument should point to a buffer that is at least 12 code
       units long. (The exact length required can be found by calling
       pcre2_config() with where set to NULL.) The buffer is filled with the
       PCRE2 version string, zero-terminated. The number of code units used is
       returned. This is the length of the string plus one unit for the
       terminating zero.


COMPILING A PATTERN

       pcre2_code *pcre2_compile(PCRE2_SPTR pattern, PCRE2_SIZE length,
         uint32_t options, int *errorcode, PCRE2_SIZE *erroroffset,
         pcre2_compile_context *ccontext);

       void pcre2_code_free(pcre2_code *code);

       pcre2_code *pcre2_code_copy(const pcre2_code *code);

       pcre2_code *pcre2_code_copy_with_tables(const pcre2_code *code);

       The pcre2_compile() function compiles a pattern into an internal form.
       The pattern is defined by a pointer to a string of code units and a
       length. If the pattern is zero-terminated, the length can be specified
       as PCRE2_ZERO_TERMINATED. The function returns a pointer to a block of
       memory that contains the compiled pattern and related data, or NULL if
       an error occurred.

       If the compile context argument ccontext is NULL, memory for the
       compiled pattern is obtained by calling malloc(). Otherwise, it is
       obtained from the same memory function that was used for the compile
       context. The caller must free the memory by calling pcre2_code_free()
       when it is no longer needed.

       The function pcre2_code_copy() makes a copy of the compiled code in new
       memory, using the same memory allocator as was used for the original.
       However, if the code has been processed by the JIT compiler (see
       below), the JIT information cannot be copied (because it is
       position-dependent). The new copy can initially be used only for
       non-JIT matching, though it can be passed to pcre2_jit_compile() if
       required.

       The pcre2_code_copy() function provides a way for individual threads in
       a multithreaded application to acquire a private copy of shared
       compiled code. However, it does not make a copy of the character tables
       used by the compiled pattern; the new pattern code points to the same
       tables as the original code. (See "Locale Support" below for details of
       these character tables.) In many applications the same tables are used
       throughout, so this behaviour is appropriate. Nevertheless, there are
       occasions when a copy of a compiled pattern and the relevant tables are
       needed. The pcre2_code_copy_with_tables() function provides this
       facility. Copies of both the code and the tables are made, with the new
       code pointing to the new tables. The memory for the new tables is
       automatically freed when pcre2_code_free() is called for the new copy
       of the compiled code.

       NOTE: When one of the matching functions is called, pointers to the
       compiled pattern and the subject string are set in the match data block
       so that they can be referenced by the substring extraction functions.
       After running a match, you must not free a compiled pattern (or a
       subject string) until after all operations on the match data block have
       taken place.

       The options argument for pcre2_compile() contains various bit settings
       that affect the compilation. It should be zero if no options are
       required. The available options are described below. Some of them (in
       particular, those that are compatible with Perl, but some others as
       well) can also be set and unset from within the pattern (see the
       detailed description in the pcre2pattern documentation).

       For those options that can be different in different parts of the
       pattern, the contents of the options argument specify their settings at
       the start of compilation. The PCRE2_ANCHORED and PCRE2_NO_UTF_CHECK
       options can be set at the time of matching as well as at compile time.

       Other, less frequently required compile-time parameters (for example,
       the newline setting) can be provided in a compile context (as described
       above).

       If errorcode or erroroffset is NULL, pcre2_compile() returns NULL
       immediately. Otherwise, the variables to which these point are set to
       an error code and an offset (number of code units) within the pattern,
       respectively, when pcre2_compile() returns NULL because a compilation
       error has occurred. The values are not defined when compilation is
       successful and pcre2_compile() returns a non-NULL value.

       The value returned in erroroffset is an indication of where in the
       pattern the error occurred. It is not necessarily the furthest point in
       the pattern that was read. For example, after the error "lookbehind
       assertion is not fixed length", the error offset points to the start of
       the failing assertion.

       The pcre2_get_error_message() function (see "Obtaining a textual error
       message" below) provides a textual message for each error code.
       Compilation errors have positive error codes; UTF formatting error
       codes are negative. For an invalid UTF-8 or UTF-16 string, the offset
       is that of the first code unit of the failing character.

       Some errors are not detected until the whole pattern has been scanned;
       in these cases, the offset passed back is the length of the pattern.
       Note that the offset is in code units, not characters, even in a UTF
       mode. It may sometimes point into the middle of a UTF-8 or UTF-16
       character.

       This code fragment shows a typical straightforward call to
       pcre2_compile():

         pcre2_code *re;
         PCRE2_SIZE erroffset;
         int errorcode;
         re = pcre2_compile(
           (PCRE2_SPTR)"^A.*Z",    /* the pattern, cast to code units */
           PCRE2_ZERO_TERMINATED,  /* the pattern is zero-terminated */
           0,                      /* default options */
           &errorcode,             /* for error code */
           &erroffset,             /* for error offset */
           NULL);                  /* no compile context */

       The following names for option bits are defined in the pcre2.h header
       file:

         PCRE2_ANCHORED

       If this bit is set, the pattern is forced to be "anchored", that is, it
       is constrained to match only at the first matching point in the string
       that is being searched (the "subject string"). This effect can also be
       achieved by appropriate constructs in the pattern itself, which is the
       only way to do it in Perl.

         PCRE2_ALLOW_EMPTY_CLASS

       By default, for compatibility with Perl, a closing square bracket that
       immediately follows an opening one is treated as a data character for
       the class. When PCRE2_ALLOW_EMPTY_CLASS is set, it terminates the
       class, which therefore contains no characters and so can never match.

         PCRE2_ALT_BSUX

       This option requests alternative handling of three escape sequences,
       which makes PCRE2's behaviour more like ECMAScript (aka JavaScript).
       When it is set:

       (1) \U matches an upper case "U" character; by default \U causes a
       compile time error (Perl uses \U to upper case subsequent characters).

       (2) \u matches a lower case "u" character unless it is followed by four
       hexadecimal digits, in which case the hexadecimal number defines the
       code point to match. By default, \u causes a compile time error (Perl
       uses it to upper case the following character).

       (3) \x matches a lower case "x" character unless it is followed by two
       hexadecimal digits, in which case the hexadecimal number defines the
       code point to match. By default, as in Perl, a hexadecimal number is
       always expected after \x, but it may have zero, one, or two digits (so,
       for example, \xz matches a binary zero character followed by z).

         PCRE2_ALT_CIRCUMFLEX

       In multiline mode (when PCRE2_MULTILINE is set), the circumflex
       metacharacter matches at the start of the subject (unless PCRE2_NOTBOL
       is set), and also after any internal newline. However, it does not
       match after a newline at the end of the subject, for compatibility with
       Perl. If you want a multiline circumflex also to match after a
       terminating newline, you must set PCRE2_ALT_CIRCUMFLEX.

         PCRE2_ALT_VERBNAMES

       By default, for compatibility with Perl, the name in any verb sequence
       such as (*MARK:NAME) is any sequence of characters that does not
       include a closing parenthesis. The name is not processed in any way,
       and it is not possible to include a closing parenthesis in the name.
       However, if the PCRE2_ALT_VERBNAMES option is set, normal backslash
       processing is applied to verb names and only an unescaped closing
       parenthesis terminates the name. A closing parenthesis can be included
       in a name either as \) or between \Q and \E. If the PCRE2_EXTENDED
       option is set, unescaped whitespace in verb names is skipped and
       #-comments are recognized, exactly as in the rest of the pattern.

         PCRE2_AUTO_CALLOUT

       If this bit is set, pcre2_compile() automatically inserts callout
       items, all with number 255, before each pattern item, except
       immediately before or after a callout in the pattern. For discussion of
       the callout facility, see the pcre2callout documentation.

         PCRE2_CASELESS

       If this bit is set, letters in the pattern match both upper and lower
       case letters in the subject. It is equivalent to Perl's /i option, and
       it can be changed within a pattern by a (?i) option setting.

         PCRE2_DOLLAR_ENDONLY

       If this bit is set, a dollar metacharacter in the pattern matches only
       at the end of the subject string. Without this option, a dollar also
       matches immediately before a newline at the end of the string (but not
       before any other newlines). The PCRE2_DOLLAR_ENDONLY option is ignored
       if PCRE2_MULTILINE is set. There is no equivalent to this option in
       Perl, and no way to set it within a pattern.

         PCRE2_DOTALL

       If this bit is set, a dot metacharacter in the pattern matches any
       character, including one that indicates a newline. However, it only
       ever matches one character, even if newlines are coded as CRLF. Without
       this option, a dot does not match when the current position in the
       subject is at a newline. This option is equivalent to Perl's /s option,
       and it can be changed within a pattern by a (?s) option setting. A
       negative class such as [^a] always matches newline characters,
       independent of the setting of this option.

         PCRE2_DUPNAMES

       If this bit is set, names used to identify capturing subpatterns need
       not be unique. This can be helpful for certain types of pattern when it
       is known that only one instance of the named subpattern can ever be
       matched. There are more details of named subpatterns below; see also
       the pcre2pattern documentation.

         PCRE2_EXTENDED

       If this bit is set, most white space characters in the pattern are
       totally ignored except when escaped or inside a character class.
       However, white space is not allowed within sequences such as (?> that
       introduce various parenthesized subpatterns, nor within numerical
       quantifiers such as {1,3}. Ignorable white space is permitted between
       an item and a following quantifier and between a quantifier and a
       following + that indicates possessiveness.

       PCRE2_EXTENDED also causes characters between an unescaped # outside a
       character class and the next newline, inclusive, to be ignored, which
       makes it possible to include comments inside complicated patterns. Note
       that the end of this type of comment is a literal newline sequence in
       the pattern; escape sequences that happen to represent a newline do not
       count. PCRE2_EXTENDED is equivalent to Perl's /x option, and it can be
       changed within a pattern by a (?x) option setting.

       Which characters are interpreted as newlines can be specified by a
       setting in the compile context that is passed to pcre2_compile() or by
       a special sequence at the start of the pattern, as described in the
       section entitled "Newline conventions" in the pcre2pattern
       documentation. A default is defined when PCRE2 is built.

         PCRE2_FIRSTLINE

       If this option is set, an unanchored pattern is required to match
       before or at the first newline in the subject string, though the
       matched text may continue over the newline. See also
       PCRE2_USE_OFFSET_LIMIT, which provides a more general limiting
       facility. If PCRE2_FIRSTLINE is set with an offset limit, a match must
       occur in the first line and also within the offset limit. In other
       words, whichever limit comes first is used.

         PCRE2_MATCH_UNSET_BACKREF

       If this option is set, a back reference to an unset subpattern group
       matches an empty string (by default this causes the current matching
       alternative to fail). A pattern such as (\1)(a) succeeds when this
       option is set (assuming it can find an "a" in the subject), whereas it
       fails by default, for Perl compatibility. Setting this option makes
       PCRE2 behave more like ECMAScript (aka JavaScript).

         PCRE2_MULTILINE

       By default, for the purposes of matching "start of line" and "end of
       line", PCRE2 treats the subject string as consisting of a single line
       of characters, even if it actually contains newlines. The "start of
       line" metacharacter (^) matches only at the start of the string, and
       the "end of line" metacharacter ($) matches only at the end of the
       string, or before a terminating newline (except when
       PCRE2_DOLLAR_ENDONLY is set). Note, however, that unless PCRE2_DOTALL
       is set, the "any character" metacharacter (.) does not match at a
       newline. This behaviour (for ^, $, and dot) is the same as Perl.

       When PCRE2_MULTILINE is set, the "start of line" and "end of line"
       constructs match immediately following or immediately before internal
       newlines in the subject string, respectively, as well as at the very
       start and end. This is equivalent to Perl's /m option, and it can be
       changed within a pattern by a (?m) option setting. Note that the "start
       of line" metacharacter does not match after a newline at the end of the
       subject, for compatibility with Perl. However, you can change this by
       setting the PCRE2_ALT_CIRCUMFLEX option. If there are no newlines in a
       subject string, or no occurrences of ^ or $ in a pattern, setting
       PCRE2_MULTILINE has no effect.

         PCRE2_NEVER_BACKSLASH_C

       This option locks out the use of \C in the pattern that is being
       compiled. This escape can cause unpredictable behaviour in UTF-8 or
       UTF-16 modes, because it may leave the current matching point in the
       middle of a multi-code-unit character. This option may be useful in
       applications that process patterns from external sources. Note that
       there is also a build-time option that permanently locks out the use of
       \C.

         PCRE2_NEVER_UCP

       This option locks out the use of Unicode properties for handling \B,
       \b, \D, \d, \S, \s, \W, \w, and some of the POSIX character classes, as
       described for the PCRE2_UCP option below. In particular, it prevents
       the creator of the pattern from enabling this facility by starting the
       pattern with (*UCP). This option may be useful in applications that
       process patterns from external sources. The option combination
       PCRE2_UCP and PCRE2_NEVER_UCP causes an error.

         PCRE2_NEVER_UTF

       This option locks out interpretation of the pattern as UTF-8, UTF-16,
       or UTF-32, depending on which library is in use. In particular, it
       prevents the creator of the pattern from switching to UTF
       interpretation by starting the pattern with (*UTF). This option may be
       useful in applications that process patterns from external sources. The
       combination of PCRE2_UTF and PCRE2_NEVER_UTF causes an error.

         PCRE2_NO_AUTO_CAPTURE

       If this option is set, it disables the use of numbered capturing
       parentheses in the pattern. Any opening parenthesis that is not
       followed by ? behaves as if it were followed by ?: but named
       parentheses can still be used for capturing (and they acquire numbers
       in the usual way). There is no equivalent of this option in Perl. Note
       that, if this option is set, references to capturing groups (back
       references or recursion/subroutine calls) may only refer to named
       groups, though the reference can be by name or by number.

         PCRE2_NO_AUTO_POSSESS

       If this option is set, it disables "auto-possessification", which is an
       optimization that, for example, turns a+b into a++b in order to avoid
       backtracks into a+ that can never be successful. However, if callouts
       are in use, auto-possessification means that some callouts are never
       taken. You can set this option if you want the matching functions to do
       a full unoptimized search and run all the callouts, but it is mainly
       provided for testing purposes.

         PCRE2_NO_DOTSTAR_ANCHOR

       If this option is set, it disables an optimization that is applied when
       .* is the first significant item in a top-level branch of a pattern,
       and all the other branches also start with .* or with \A or \G or ^.
       The optimization is automatically disabled for .* if it is inside an
       atomic group or a capturing group that is the subject of a back
       reference, or if the pattern contains (*PRUNE) or (*SKIP). When the
       optimization is not disabled, such a pattern is automatically anchored
       if PCRE2_DOTALL is set for all the .* items and PCRE2_MULTILINE is not
       set for any ^ items. Otherwise, the fact that any match must start
       either at the start of the subject or following a newline is
       remembered. Like other optimizations, this can cause callouts to be
       skipped.

         PCRE2_NO_START_OPTIMIZE

       This is an option whose main effect is at matching time. It does not
       change what pcre2_compile() generates, but it does affect the output of
       the JIT compiler.

       There are a number of optimizations that may occur at the start of a
       match, in order to speed up the process. For example, if it is known
       that an unanchored match must start with a specific character, the
       matching code searches the subject for that character, and fails
       immediately if it cannot find it, without actually running the main
       matching function. This means that a special item such as (*COMMIT) at
       the start of a pattern is not considered until after a suitable
       starting point for the match has been found. Also, when callouts or
       (*MARK) items are in use, these "start-up" optimizations can cause them
       to be skipped if the pattern is never actually used. The start-up
       optimizations are in effect a pre-scan of the subject that takes place
       before the pattern is run.

       The PCRE2_NO_START_OPTIMIZE option disables the start-up optimizations,
       possibly causing performance to suffer, but ensuring that in cases
       where the result is "no match", the callouts do occur, and that items
       such as (*COMMIT) and (*MARK) are considered at every possible starting
       position in the subject string.

       Setting PCRE2_NO_START_OPTIMIZE may change the outcome of a matching
       operation. Consider the pattern

         (*COMMIT)ABC

       When this is compiled, PCRE2 records the fact that a match must start
       with the character "A". Suppose the subject string is "DEFABC". The
       start-up optimization scans along the subject, finds "A" and runs the
       first match attempt from there. The (*COMMIT) item means that the
       pattern must match the current starting position, which in this case,
       it does. However, if the same match is run with PCRE2_NO_START_OPTIMIZE
       set, the initial scan along the subject string does not happen. The
       first match attempt is run starting from "D" and when this fails,
       (*COMMIT) prevents any further matches being tried, so the overall
       result is "no match". There are also other start-up optimizations. For
       example, a minimum length for the subject may be recorded. Consider the
       pattern

         (*MARK:A)(X|Y)

       The minimum length for a match is one character. If the subject is
       "ABC", there will be attempts to match "ABC", "BC", and "C". An attempt
       to match an empty string at the end of the subject does not take place,
       because PCRE2 knows that the subject is now too short, and so the
       (*MARK) is never encountered. In this case, the optimization does not
       affect the overall match result, which is still "no match", but it does
       affect the auxiliary information that is returned.

         PCRE2_NO_UTF_CHECK

       When PCRE2_UTF is set, the validity of the pattern as a UTF string is
       automatically checked. There are discussions about the validity of
       UTF-8 strings, UTF-16 strings, and UTF-32 strings in the pcre2unicode
       document. If an invalid UTF sequence is found, pcre2_compile() fails
       and passes back a negative error code via errorcode.

       If you know that your pattern is valid, and you want to skip this check
       for performance reasons, you can set the PCRE2_NO_UTF_CHECK option.
       When it is set, the effect of passing an invalid UTF string as a
       pattern is undefined. It may cause your program to crash or loop. Note
       that this option can also be passed to pcre2_match() and
       pcre2_dfa_match(), to suppress validity checking of the subject string.

         PCRE2_UCP

       This option changes the way PCRE2 processes \B, \b, \D, \d, \S, \s, \W,
       \w, and some of the POSIX character classes. By default, only ASCII
       characters are recognized, but if PCRE2_UCP is set, Unicode properties
       are used instead to classify characters. More details are given in the
       section on generic character types in the pcre2pattern page. If you set
       PCRE2_UCP, matching one of the items it affects takes much longer. The
       option is available only if PCRE2 has been compiled with Unicode
       support.

         PCRE2_UNGREEDY

       This option inverts the "greediness" of the quantifiers so that they
       are not greedy by default, but become greedy if followed by "?". It is
       not compatible with Perl. It can also be set by a (?U) option setting
       within the pattern.

         PCRE2_USE_OFFSET_LIMIT

       This option must be set for pcre2_compile() if pcre2_set_offset_limit()
       is going to be used to set a non-default offset limit in a match
       context for matches that use this pattern. An error is generated if an
       offset limit is set without this option. For more details, see the
       description of pcre2_set_offset_limit() in the section that describes
       match contexts. See also the PCRE2_FIRSTLINE option above.

         PCRE2_UTF

       This option causes PCRE2 to regard both the pattern and the subject
       strings that are subsequently processed as strings of UTF characters
       instead of single-code-unit strings. It is available when PCRE2 is
       built to include Unicode support (which is the default). If Unicode
       support is not available, the use of this option provokes an error.
       Details of how this option changes the behaviour of PCRE2 are given in
       the pcre2unicode page.


COMPILATION ERROR CODES
 | 
						|
 | 
						|
       There are over 80 positive error codes that pcre2_compile() may  return
 | 
						|
       (via  errorcode)  if  it  finds an error in the pattern. There are also
 | 
						|
       some negative error codes that are used for invalid UTF strings.  These
 | 
						|
       are  the  same as given by pcre2_match() and pcre2_dfa_match(), and are
 | 
						|
       described in the pcre2unicode page. The pcre2_get_error_message() func-
 | 
						|
       tion  (see  "Obtaining a textual error message" below) can be called to
 | 
						|
       obtain a textual error message from any error code.
 | 
						|
 | 
						|
 | 
						|
JUST-IN-TIME (JIT) COMPILATION
 | 
						|
 | 
						|
       int pcre2_jit_compile(pcre2_code *code, uint32_t options);
 | 
						|
 | 
						|
       int pcre2_jit_match(const pcre2_code *code, PCRE2_SPTR subject,
 | 
						|
         PCRE2_SIZE length, PCRE2_SIZE startoffset,
 | 
						|
         uint32_t options, pcre2_match_data *match_data,
 | 
						|
         pcre2_match_context *mcontext);
 | 
						|
 | 
						|
       void pcre2_jit_free_unused_memory(pcre2_general_context *gcontext);
 | 
						|
 | 
						|
       pcre2_jit_stack *pcre2_jit_stack_create(PCRE2_SIZE startsize,
 | 
						|
         PCRE2_SIZE maxsize, pcre2_general_context *gcontext);
 | 
						|
 | 
						|
       void pcre2_jit_stack_assign(pcre2_match_context *mcontext,
 | 
						|
         pcre2_jit_callback callback_function, void *callback_data);
 | 
						|
 | 
						|
       void pcre2_jit_stack_free(pcre2_jit_stack *jit_stack);
 | 
						|
 | 
						|
       These functions provide support for  JIT  compilation,  which,  if  the
 | 
						|
       just-in-time  compiler  is available, further processes a compiled pat-
 | 
						|
       tern into machine code that executes much faster than the pcre2_match()
 | 
						|
       interpretive  matching function. Full details are given in the pcre2jit
 | 
						|
       documentation.
 | 
						|
 | 
						|
       JIT compilation is a heavyweight optimization. It can  take  some  time
 | 
						|
       for  patterns  to  be analyzed, and for one-off matches and simple pat-
 | 
						|
       terns the benefit of faster execution might be offset by a much  slower
 | 
						|
       compilation time. Most, but not all, patterns can be optimized by the
 | 
						|
       JIT compiler.
 | 
						|
 | 
						|
 | 
						|
LOCALE SUPPORT
 | 
						|
 | 
						|
       PCRE2 handles caseless matching, and determines whether characters  are
 | 
						|
       letters,  digits, or whatever, by reference to a set of tables, indexed
 | 
						|
       by character code point. This applies only  to  characters  whose  code
 | 
						|
       points  are  less than 256. By default, higher-valued code points never
 | 
						|
       match escapes such as \w or \d.  However, if PCRE2 is  built  with  UTF
 | 
						|
       support,  all  characters  can  be  tested with \p and \P, or, alterna-
 | 
						|
       tively, the PCRE2_UCP option can be set when  a  pattern  is  compiled;
 | 
						|
       this  causes  \w and friends to use Unicode property support instead of
 | 
						|
       the built-in tables.
 | 
						|
 | 
						|
       The use of locales with Unicode is discouraged.  If  you  are  handling
 | 
						|
       characters  with  code  points  greater than 127, you should either use
 | 
						|
       Unicode support, or use locales, but not try to mix the two.
 | 
						|
 | 
						|
       PCRE2 contains an internal set of character tables  that  are  used  by
 | 
						|
       default.   These  are  sufficient  for many applications. Normally, the
 | 
						|
       internal tables recognize only ASCII characters. However, when PCRE2 is
 | 
						|
       built, it is possible to cause the internal tables to be rebuilt in the
 | 
						|
       default "C" locale of the local system, which may cause them to be dif-
 | 
						|
       ferent.
 | 
						|
 | 
						|
       The  internal tables can be overridden by tables supplied by the appli-
 | 
						|
       cation that calls PCRE2. These may be created  in  a  different  locale
 | 
						|
       from  the  default.  As more and more applications change to using Uni-
 | 
						|
       code, the need for this locale support is expected to die away.
 | 
						|
 | 
						|
       External tables are built by calling the  pcre2_maketables()  function,
 | 
						|
       in  the relevant locale. The result can be passed to pcre2_compile() as
 | 
						|
       often  as  necessary,  by  creating  a  compile  context  and   calling
 | 
						|
       pcre2_set_character_tables()  to  set  the  tables pointer therein. For
 | 
						|
       example, to build and use tables that are appropriate  for  the  French
 | 
						|
       locale  (where  accented  characters  with  values greater than 127 are
 | 
						|
       treated as letters), the following code could be used:
 | 
						|
 | 
						|
         setlocale(LC_CTYPE, "fr_FR");
 | 
						|
         tables = pcre2_maketables(NULL);
 | 
						|
         ccontext = pcre2_compile_context_create(NULL);
 | 
						|
         pcre2_set_character_tables(ccontext, tables);
 | 
						|
         re = pcre2_compile(..., ccontext);
 | 
						|
 | 
						|
       The locale name "fr_FR" is used on Linux and other  Unix-like  systems;
 | 
						|
       if  you  are using Windows, the name for the French locale is "french".
 | 
						|
       It is the caller's responsibility to ensure that the memory  containing
 | 
						|
       the tables remains available for as long as it is needed.
 | 
						|
 | 
						|
       The pointer that is passed (via the compile context) to pcre2_compile()
 | 
						|
       is saved with the compiled pattern, and the same  tables  are  used  by
 | 
						|
       pcre2_match() and pcre2_dfa_match(). Thus, for any single pattern,
       compilation and matching both happen in the same locale, but different
 | 
						|
       patterns can be processed in different locales.
 | 
						|
 | 
						|
 | 
						|
INFORMATION ABOUT A COMPILED PATTERN
 | 
						|
 | 
						|
       int pcre2_pattern_info(const pcre2_code *code, uint32_t what, void *where);
 | 
						|
 | 
						|
       The  pcre2_pattern_info()  function returns general information about a
 | 
						|
       compiled pattern. For information about callouts, see the next section.
 | 
						|
       The  first  argument  for pcre2_pattern_info() is a pointer to the com-
 | 
						|
       piled pattern. The second argument specifies which piece of information
 | 
						|
       is  required,  and  the  third  argument  is a pointer to a variable to
 | 
						|
       receive the data. If the third argument is NULL, the first argument  is
 | 
						|
       ignored,  and  the  function  returns the size in bytes of the variable
 | 
						|
       that is required for the information requested. Otherwise, the yield of
 | 
						|
       the function is zero for success, or one of the following negative num-
 | 
						|
       bers:
 | 
						|
 | 
						|
         PCRE2_ERROR_NULL           the argument code was NULL
 | 
						|
         PCRE2_ERROR_BADMAGIC       the "magic number" was not found
 | 
						|
         PCRE2_ERROR_BADOPTION      the value of what was invalid
 | 
						|
         PCRE2_ERROR_UNSET          the requested field is not set
 | 
						|
 | 
						|
       The "magic number" is placed at the start of each compiled  pattern  as
 | 
						|
       a simple check against passing an arbitrary memory pointer. Here  is  a
 | 
						|
       typical call of pcre2_pattern_info(), to obtain the length of the  com-
 | 
						|
       piled pattern:
 | 
						|
 | 
						|
         int rc;
 | 
						|
         size_t length;
 | 
						|
         rc = pcre2_pattern_info(
 | 
						|
           re,               /* result of pcre2_compile() */
 | 
						|
           PCRE2_INFO_SIZE,  /* what is required */
 | 
						|
           &length);         /* where to put the data */
 | 
						|
 | 
						|
       The possible values for the second argument are defined in pcre2.h, and
 | 
						|
       are as follows:
 | 
						|
 | 
						|
         PCRE2_INFO_ALLOPTIONS
 | 
						|
         PCRE2_INFO_ARGOPTIONS
 | 
						|
 | 
						|
       Return a copy of the pattern's options. The third argument should point
 | 
						|
       to  a  uint32_t  variable.  PCRE2_INFO_ARGOPTIONS  returns  exactly the
 | 
						|
       options that were passed to pcre2_compile(), whereas  PCRE2_INFO_ALLOP-
 | 
						|
       TIONS  returns  the compile options as modified by any top-level (*XXX)
 | 
						|
       option settings such as (*UTF) at the start of the pattern itself.
 | 
						|
 | 
						|
       For  example,  if  the  pattern  /(*UTF)abc/  is  compiled   with   the
 | 
						|
       PCRE2_EXTENDED   option,   the   result  for  PCRE2_INFO_ALLOPTIONS  is
 | 
						|
       PCRE2_EXTENDED and PCRE2_UTF.  Option settings such as  (?i)  that  can
 | 
						|
       change  within  a pattern do not affect the result of PCRE2_INFO_ALLOP-
 | 
						|
       TIONS, even if they appear right at the start of the pattern. (This was
 | 
						|
       different in some earlier releases.)
 | 
						|
 | 
						|
       A  pattern compiled without PCRE2_ANCHORED is automatically anchored by
 | 
						|
       PCRE2 if the first significant item in every top-level branch is one of
 | 
						|
       the following:
 | 
						|
 | 
						|
         ^     unless PCRE2_MULTILINE is set
 | 
						|
         \A    always
 | 
						|
         \G    always
 | 
						|
         .*    sometimes - see below
 | 
						|
 | 
						|
       When  .* is the first significant item, anchoring is possible only when
 | 
						|
       all the following are true:
 | 
						|
 | 
						|
         .* is not in an atomic group
         .* is not in a capturing group that is the subject
              of a back reference
         PCRE2_DOTALL is in force for .*
         Neither (*PRUNE) nor (*SKIP) appears in the pattern
         PCRE2_NO_DOTSTAR_ANCHOR is not set
 | 
						|
 | 
						|
       For patterns that are auto-anchored, the PCRE2_ANCHORED bit is  set  in
 | 
						|
       the options returned for PCRE2_INFO_ALLOPTIONS.
 | 
						|
 | 
						|
         PCRE2_INFO_BACKREFMAX
 | 
						|
 | 
						|
       Return  the  number  of  the highest back reference in the pattern. The
 | 
						|
       third argument should point to a uint32_t variable. Named  subpatterns
 | 
						|
       acquire  numbers  as well as names, and these count towards the highest
 | 
						|
       back reference.  Back references such as \4 or \g{12}  match  the  cap-
 | 
						|
       tured  characters of the given group, but in addition, the check that a
 | 
						|
       capturing group is set in a conditional subpattern such as (?(3)a|b) is
 | 
						|
       also  a  back  reference.  Zero is returned if there are no back refer-
 | 
						|
       ences.
 | 
						|
 | 
						|
         PCRE2_INFO_BSR
 | 
						|
 | 
						|
       The output is a uint32_t whose value indicates what character sequences
 | 
						|
       the \R escape sequence matches. A value of PCRE2_BSR_UNICODE means that
 | 
						|
       \R matches any Unicode line ending sequence; a value of  PCRE2_BSR_ANY-
 | 
						|
       CRLF means that \R matches only CR, LF, or CRLF.
 | 
						|
 | 
						|
         PCRE2_INFO_CAPTURECOUNT
 | 
						|
 | 
						|
       Return  the highest capturing subpattern number in the pattern. In pat-
 | 
						|
       terns where (?| is not used, this is also the total number of capturing
 | 
						|
       subpatterns.  The third argument should point to a uint32_t variable.
 | 
						|
 | 
						|
         PCRE2_INFO_FIRSTBITMAP
 | 
						|
 | 
						|
       In  the absence of a single first code unit for a non-anchored pattern,
 | 
						|
       pcre2_compile() may construct a 256-bit table that defines a fixed  set
 | 
						|
       of  values for the first code unit in any match. For example, a pattern
 | 
						|
       that starts with [abc] results in a table with  three  bits  set.  When
 | 
						|
       code  unit  values greater than 255 are supported, the flag bit for 255
 | 
						|
       means "any code unit of value 255 or above". If such a table  was  con-
 | 
						|
       structed,  a pointer to it is returned. Otherwise NULL is returned. The
 | 
						|
       third argument should point to a const uint8_t * variable.
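
       A table of this kind can be consulted with a few lines of C. The sketch
       below is self-contained and does not call PCRE2. The bit layout it uses
       (code unit c is bit c & 7 of byte c / 8) is an assumption that is not
       stated on this page, and both function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if a match could start with code_unit according to the 256-bit
   table, 0 if it could not. A NULL table rules nothing out.
   ASSUMED layout: code unit c is bit (c & 7) of byte (c / 8). */
int first_unit_possible(const uint8_t *bitmap, uint32_t code_unit)
{
  if (bitmap == NULL) return 1;
  if (code_unit > 255) code_unit = 255;   /* bit 255 means "255 or above" */
  return (bitmap[code_unit / 8] >> (code_unit & 7)) & 1;
}

/* Sets the bit for code_unit; useful for building a table by hand. */
void mark_first_unit(uint8_t *bitmap, uint32_t code_unit)
{
  if (code_unit > 255) code_unit = 255;
  bitmap[code_unit / 8] |= (uint8_t)(1u << (code_unit & 7));
}
```

       In real use the table pointer would come from a call to
       pcre2_pattern_info() with PCRE2_INFO_FIRSTBITMAP, and the table must
       not be modified.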
 | 
						|
 | 
						|
         PCRE2_INFO_FIRSTCODETYPE
 | 
						|
 | 
						|
       Return information about the first code unit of any matched string, for
 | 
						|
       a  non-anchored pattern. The third argument should point to a  uint32_t
 | 
						|
       variable. If there is a fixed first value, for example, the letter  "c"
 | 
						|
       from a pattern such as (cat|cow|coyote), 1 is returned, and the charac-
 | 
						|
       ter value can be retrieved using PCRE2_INFO_FIRSTCODEUNIT. If there  is
 | 
						|
       no  fixed  first  value, but it is known that a match can occur only at
 | 
						|
       the start of the subject or following a newline in the  subject,  2  is
 | 
						|
       returned. Otherwise, and for anchored patterns, 0 is returned.
 | 
						|
 | 
						|
         PCRE2_INFO_FIRSTCODEUNIT
 | 
						|
 | 
						|
       Return  the  value  of the first code unit of any matched string in the
 | 
						|
       situation where PCRE2_INFO_FIRSTCODETYPE returns 1; otherwise return 0.
 | 
						|
       The  third  argument  should point to a uint32_t variable. In the 8-bit
 | 
						|
       library, the value is always less than 256. In the 16-bit  library  the
 | 
						|
       value  can  be  up  to 0xffff. In the 32-bit library in UTF-32 mode the
 | 
						|
       value can be up to 0x10ffff, and up to 0xffffffff when not using UTF-32
 | 
						|
       mode.
 | 
						|
 | 
						|
         PCRE2_INFO_HASBACKSLASHC
 | 
						|
 | 
						|
       Return  1 if the pattern contains any instances of \C, otherwise 0. The
 | 
						|
       third argument should point to a uint32_t variable.
 | 
						|
 | 
						|
         PCRE2_INFO_HASCRORLF
 | 
						|
 | 
						|
       Return 1 if the pattern contains any explicit  matches  for  CR  or  LF
 | 
						|
       characters, otherwise 0. The third argument should point to a  uint32_t
 | 
						|
       variable. An explicit match is either a literal CR or LF character,  or
 | 
						|
       \r or \n.
 | 
						|
 | 
						|
         PCRE2_INFO_JCHANGED
 | 
						|
 | 
						|
       Return  1  if  the (?J) or (?-J) option setting is used in the pattern,
 | 
						|
       otherwise  0.  The third argument should point to a uint32_t  variable.
 | 
						|
       (?J)  and  (?-J) set and unset the local PCRE2_DUPNAMES option, respec-
 | 
						|
       tively.
 | 
						|
 | 
						|
         PCRE2_INFO_JITSIZE
 | 
						|
 | 
						|
       If the compiled pattern was successfully  processed  by  pcre2_jit_com-
 | 
						|
       pile(),  return  the  size  of  the JIT compiled code, otherwise return
 | 
						|
       zero. The third argument should point to a size_t variable.
 | 
						|
 | 
						|
         PCRE2_INFO_LASTCODETYPE
 | 
						|
 | 
						|
       Return  1 if there is a rightmost literal code unit that must exist  in
 | 
						|
       any  matched string, other than at its start. The third argument should
 | 
						|
       point  to a uint32_t  variable.  If  there  is  no  such  value,  0  is
 | 
						|
       returned.  When  1  is  returned,  the  code  unit  value itself can be
 | 
						|
       retrieved using PCRE2_INFO_LASTCODEUNIT. For anchored patterns, a  last
 | 
						|
       literal  value  is  recorded  only  if it follows something of variable
 | 
						|
       length. For example, for the pattern /^a\d+z\d+/ the returned value  is
 | 
						|
       1  (with  "z" returned from PCRE2_INFO_LASTCODEUNIT), but for /^a\dz\d/
 | 
						|
       the returned value is 0.
 | 
						|
 | 
						|
         PCRE2_INFO_LASTCODEUNIT
 | 
						|
 | 
						|
       Return the value of the rightmost literal code unit that must exist  in
 | 
						|
       any  matched  string, other than at its start, if such a value has been
 | 
						|
       recorded.  The third argument should point to a uint32_t  variable.  If
 | 
						|
       there is no such value, 0 is returned.
 | 
						|
 | 
						|
         PCRE2_INFO_MATCHEMPTY
 | 
						|
 | 
						|
       Return  1  if the pattern might match an empty string, otherwise 0. The
 | 
						|
       third  argument should point to a uint32_t  variable.  When  a  pattern
 | 
						|
       contains recursive subroutine calls it is not always possible to deter-
 | 
						|
       mine whether or not it can match an empty string. PCRE2  takes  a  cau-
 | 
						|
       tious approach and returns 1 in such cases.
 | 
						|
 | 
						|
         PCRE2_INFO_MATCHLIMIT
 | 
						|
 | 
						|
       If  the  pattern  set  a  match  limit by including an item of the form
 | 
						|
       (*LIMIT_MATCH=nnnn) at the start, the  value  is  returned.  The  third
 | 
						|
       argument  should  point to an unsigned 32-bit integer. If no such value
 | 
						|
       has been set,  the  call  to  pcre2_pattern_info()  returns  the  error
 | 
						|
       PCRE2_ERROR_UNSET.
 | 
						|
 | 
						|
         PCRE2_INFO_MAXLOOKBEHIND
 | 
						|
 | 
						|
       Return the number of characters (not code units) in the longest lookbe-
 | 
						|
       hind assertion in the pattern. The third argument should  point  to  an
 | 
						|
       unsigned  32-bit  integer. This information is useful when doing multi-
 | 
						|
       segment matching using the partial matching facilities. Note  that  the
 | 
						|
       simple assertions \b and \B require a one-character lookbehind. \A also
 | 
						|
       registers a one-character  lookbehind,  though  it  does  not  actually
 | 
						|
       inspect  the  previous  character.  This is to ensure that at least one
 | 
						|
       character from the old segment is retained when a new segment  is  pro-
 | 
						|
       cessed. Otherwise, if there are no lookbehinds in the pattern, \A might
 | 
						|
       match incorrectly at the start of a new segment.
 | 
						|
 | 
						|
         PCRE2_INFO_MINLENGTH
 | 
						|
 | 
						|
       If a minimum length for matching  subject  strings  was  computed,  its
 | 
						|
       value  is  returned.  Otherwise the returned value is 0. The value is a
 | 
						|
       number of characters, which in UTF mode may be different from the
       number of code units. The third argument should point to a uint32_t
 | 
						|
       variable. The value is a lower bound to  the  length  of  any  matching
 | 
						|
       string.  There  may  not be any strings of that length that do actually
 | 
						|
       match, but every string that does match is at least that long.
 | 
						|
 | 
						|
         PCRE2_INFO_NAMECOUNT
 | 
						|
         PCRE2_INFO_NAMEENTRYSIZE
 | 
						|
         PCRE2_INFO_NAMETABLE
 | 
						|
 | 
						|
       PCRE2 supports the use of named as well as numbered capturing parenthe-
 | 
						|
       ses.  The names are just an additional way of identifying the parenthe-
 | 
						|
       ses, which still acquire numbers. Several convenience functions such as
 | 
						|
       pcre2_substring_get_byname()  are provided for extracting captured sub-
 | 
						|
       strings by name. It is also possible to extract the data  directly,  by
 | 
						|
       first  converting  the  name to a number in order to access the correct
 | 
						|
       pointers in the output vector (described with pcre2_match() below).  To
 | 
						|
       do  the  conversion,  you  need to use the name-to-number map, which is
 | 
						|
       described by these three values.
 | 
						|
 | 
						|
       The map consists of a number of  fixed-size  entries.  PCRE2_INFO_NAME-
 | 
						|
       COUNT  gives  the number of entries, and PCRE2_INFO_NAMEENTRYSIZE gives
 | 
						|
       the size of each entry in code units; both of these return  a  uint32_t
 | 
						|
       value. The entry size depends on the length of the longest name.
 | 
						|
 | 
						|
       PCRE2_INFO_NAMETABLE returns a pointer to the first entry of the table.
 | 
						|
       This is a PCRE2_SPTR pointer to a block of code  units.  In  the  8-bit
 | 
						|
       library,  the  first two bytes of each entry are the number of the cap-
 | 
						|
       turing parenthesis, most significant byte first. In the 16-bit library,
 | 
						|
       the  pointer  points  to 16-bit code units, the first of which contains
 | 
						|
       the parenthesis number. In the 32-bit library, the  pointer  points  to
 | 
						|
       32-bit  code units, the first of which contains the parenthesis number.
 | 
						|
       The rest of the entry is the corresponding name, zero terminated.
 | 
						|
 | 
						|
       The names are in alphabetical order. If (?| is used to create  multiple
 | 
						|
       groups  with  the same number, as described in the section on duplicate
 | 
						|
       subpattern numbers in the pcre2pattern page, the groups  may  be  given
 | 
						|
       the  same  name,  but  there  is only one entry in the table. Different
 | 
						|
       names for groups of the same number are not permitted.
 | 
						|
 | 
						|
       Duplicate names for subpatterns with different numbers  are  permitted,
 | 
						|
       but  only  if  PCRE2_DUPNAMES  is  set. They appear in the table in the
 | 
						|
       order in which they were found in the pattern. In the  absence  of  (?|
 | 
						|
       this  is  the  order of increasing number; when (?| is used this is not
 | 
						|
       necessarily the case because later subpatterns may have lower numbers.
 | 
						|
 | 
						|
       As a simple example of the name/number table,  consider  the  following
 | 
						|
       pattern  after  compilation by the 8-bit library (assume PCRE2_EXTENDED
 | 
						|
       is set, so white space - including newlines - is ignored):
 | 
						|
 | 
						|
         (?<date> (?<year>(\d\d)?\d\d) -
 | 
						|
         (?<month>\d\d) - (?<day>\d\d) )
 | 
						|
 | 
						|
       There are four named subpatterns, so the table has  four  entries,  and
 | 
						|
       each  entry  in the table is eight bytes long. The table is as follows,
 | 
						|
       with non-printing bytes shown in hexadecimal, and undefined bytes shown
 | 
						|
       as ??:
 | 
						|
 | 
						|
         00 01 d  a  t  e  00 ??
 | 
						|
         00 05 d  a  y  00 ?? ??
 | 
						|
         00 04 m  o  n  t  h  00
 | 
						|
         00 02 y  e  a  r  00 ??
 | 
						|
 | 
						|
       When  writing  code  to  extract  data from named subpatterns using the
 | 
						|
       name-to-number map, remember that the length of the entries  is  likely
 | 
						|
       to be different for each compiled pattern.
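
       A lookup in an 8-bit name table can be sketched as follows. The code is
       self-contained and does not call PCRE2: the table is the example shown
       above, written out by hand with 0xFF standing in for the undefined
       bytes. The function name is hypothetical. In real use the table
       pointer, the entry count, and the entry size would come from
       pcre2_pattern_info().

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* The example table above: four entries of eight bytes each. */
const unsigned char example_table[4 * 8] = {
  0x00, 0x01, 'd', 'a', 't', 'e', 0x00, 0xFF,
  0x00, 0x05, 'd', 'a', 'y', 0x00, 0xFF, 0xFF,
  0x00, 0x04, 'm', 'o', 'n', 't', 'h', 0x00,
  0x00, 0x02, 'y', 'e', 'a', 'r', 0x00, 0xFF
};

/* Hypothetical helper: returns the group number of the first entry whose
   name equals name, or -1 if there is no such entry. */
int group_number_for_name(const unsigned char *table, uint32_t name_count,
                          uint32_t entry_size, const char *name)
{
  uint32_t i;

  /* Two bytes of group number plus at least a terminating zero. */
  assert(entry_size >= 3);

  for (i = 0; i < name_count; i++) {
    const unsigned char *entry = table + (size_t)i * entry_size;
    if (strcmp((const char *)(entry + 2), name) == 0)
      return (entry[0] << 8) | entry[1];   /* most significant byte first */
  }
  return -1;
}
```

       A linear scan is used rather than a binary search so that, when
       PCRE2_DUPNAMES allows several entries with the same name, the first
       one in table order is returned.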
 | 
						|
 | 
						|
         PCRE2_INFO_NEWLINE
 | 
						|
 | 
						|
       The output is a uint32_t with one of the following values:
 | 
						|
 | 
						|
         PCRE2_NEWLINE_CR       Carriage return (CR)
 | 
						|
         PCRE2_NEWLINE_LF       Linefeed (LF)
 | 
						|
         PCRE2_NEWLINE_CRLF     Carriage return, linefeed (CRLF)
 | 
						|
         PCRE2_NEWLINE_ANY      Any Unicode line ending
 | 
						|
         PCRE2_NEWLINE_ANYCRLF  Any of CR, LF, or CRLF
 | 
						|
 | 
						|
       This  specifies  the default character sequence that will be recognized
 | 
						|
       as meaning "newline" while matching.
 | 
						|
 | 
						|
         PCRE2_INFO_RECURSIONLIMIT
 | 
						|
 | 
						|
       If the pattern set a recursion limit by including an item of  the  form
 | 
						|
       (*LIMIT_RECURSION=nnnn)  at the start, the value is returned. The third
 | 
						|
       argument should point to an unsigned 32-bit integer. If no  such  value
 | 
						|
       has  been  set,  the  call  to  pcre2_pattern_info()  returns the error
 | 
						|
       PCRE2_ERROR_UNSET.
 | 
						|
 | 
						|
         PCRE2_INFO_SIZE
 | 
						|
 | 
						|
       Return the size of  the  compiled  pattern  in  bytes  (for  all  three
 | 
						|
       libraries).  The third argument should point to a size_t variable. This
 | 
						|
       value includes the size of the general data  block  that  precedes  the
 | 
						|
       code  units of the compiled pattern itself. The value that is used when
 | 
						|
       pcre2_compile() is getting memory in which to place the  compiled  pat-
 | 
						|
       tern  may  be  slightly  larger than the value returned by this option,
 | 
						|
       because there are cases where the code that calculates the size has  to
 | 
						|
       over-estimate.  Processing  a  pattern  with  the JIT compiler does not
 | 
						|
       alter the value returned by this option.
 | 
						|
 | 
						|
 | 
						|
INFORMATION ABOUT A PATTERN'S CALLOUTS
 | 
						|
 | 
						|
       int pcre2_callout_enumerate(const pcre2_code *code,
 | 
						|
         int (*callback)(pcre2_callout_enumerate_block *, void *),
 | 
						|
         void *user_data);
 | 
						|
 | 
						|
       A script language that supports the use of string arguments in callouts
 | 
						|
       might  like  to  scan  all the callouts in a pattern before running the
 | 
						|
       match. This can be done by calling pcre2_callout_enumerate(). The first
 | 
						|
       argument  is  a  pointer  to a compiled pattern, the second points to a
 | 
						|
       callback function, and the third is arbitrary user data.  The  callback
 | 
						|
       function  is  called  for  every callout in the pattern in the order in
 | 
						|
       which they appear. Its first argument is a pointer to a callout enumer-
 | 
						|
       ation  block,  and  its second argument is the user_data value that was
 | 
						|
       passed to pcre2_callout_enumerate(). The contents of the  callout  enu-
 | 
						|
       meration  block  are described in the pcre2callout documentation, which
 | 
						|
       also gives further details about callouts.
 | 
						|
 | 
						|
 | 
						|
SERIALIZATION AND PRECOMPILING
 | 
						|
 | 
						|
       It is possible to save compiled patterns  on  disc  or  elsewhere,  and
 | 
						|
       reload  them  later, subject to a number of restrictions. The functions
 | 
						|
       whose names begin with pcre2_serialize_ are used for this purpose. They
 | 
						|
       are described in the pcre2serialize documentation.
 | 
						|
 | 
						|
 | 
						|
THE MATCH DATA BLOCK
 | 
						|
 | 
						|
       pcre2_match_data *pcre2_match_data_create(uint32_t ovecsize,
 | 
						|
         pcre2_general_context *gcontext);
 | 
						|
 | 
						|
       pcre2_match_data *pcre2_match_data_create_from_pattern(
 | 
						|
         const pcre2_code *code, pcre2_general_context *gcontext);
 | 
						|
 | 
						|
       void pcre2_match_data_free(pcre2_match_data *match_data);
 | 
						|
 | 
						|
       Information  about  a  successful  or unsuccessful match is placed in a
 | 
						|
       match data block, which is an opaque  structure  that  is  accessed  by
 | 
						|
       function  calls.  In particular, the match data block contains a vector
 | 
						|
       of offsets into the subject string that define the matched part of  the
 | 
						|
       subject  and  any  substrings  that were captured. This is known as the
 | 
						|
       ovector.
 | 
						|
 | 
						|
       Before calling pcre2_match(), pcre2_dfa_match(),  or  pcre2_jit_match()
 | 
						|
       you must create a match data block by calling one of the creation func-
 | 
						|
       tions above. For pcre2_match_data_create(), the first argument  is  the
 | 
						|
       number  of  pairs  of  offsets  in  the ovector. One pair of offsets is
 | 
						|
       required to identify the string that matched the  whole  pattern,  with
 | 
						|
       another  pair  for  each  captured substring. For example, a value of 4
 | 
						|
       creates enough space to record the matched portion of the subject  plus
 | 
						|
       three  captured  substrings.  A  minimum  of  1  pair  is  imposed   by
 | 
						|
       pcre2_match_data_create(), so it is always possible to return the over-
 | 
						|
       all matched string.
 | 
						|
 | 
						|
       The second argument of pcre2_match_data_create() is a pointer to a gen-
 | 
						|
       eral context, which can specify custom memory management for  obtaining
 | 
						|
       the memory for the match data block. If you are not using custom memory
 | 
						|
       management, pass NULL, which causes malloc() to be used.
 | 
						|
 | 
						|
       For pcre2_match_data_create_from_pattern(), the  first  argument  is  a
 | 
						|
       pointer to a compiled pattern. The ovector is created to be exactly the
 | 
						|
       right size to hold all the substrings a pattern might capture. The sec-
 | 
						|
       ond  argument is again a pointer to a general context, but in this case
 | 
						|
       if NULL is passed, the memory is obtained using the same allocator that
 | 
						|
       was used for the compiled pattern (custom or default).
 | 
						|
 | 
						|
       A  match  data block can be used many times, with the same or different
 | 
						|
       compiled patterns. You can extract information from a match data  block
 | 
						|
       after  a  match  operation  has  finished,  using  functions  that  are
 | 
						|
       described in the sections on  matched  strings  and  other  match  data
 | 
						|
       below.
 | 
						|
 | 
						|
       When  a  call  of  pcre2_match()  fails, valid data is available in the
 | 
						|
       match   block   only   when   the   error    is    PCRE2_ERROR_NOMATCH,
 | 
						|
       PCRE2_ERROR_PARTIAL,  or  one  of  the  error  codes for an invalid UTF
 | 
						|
       string. Exactly what is available depends on the error, and is detailed
 | 
						|
       below.
 | 
						|
 | 
						|
       When  one of the matching functions is called, pointers to the compiled
 | 
						|
       pattern and the subject string are set in the match data block so  that
 | 
						|
       they  can  be  referenced  by the extraction functions. After running a
 | 
						|
       match, you must not free a compiled pattern or a subject  string  until
 | 
						|
       after  all  operations  on  the  match data block (for that match) have
 | 
						|
       taken place.
 | 
						|
 | 
						|
       When a match data block itself is no longer needed, it should be  freed
 | 
						|
       by calling pcre2_match_data_free().
 | 
						|
 | 
						|
 | 
						|
MATCHING A PATTERN: THE TRADITIONAL FUNCTION
 | 
						|
 | 
						|
       int pcre2_match(const pcre2_code *code, PCRE2_SPTR subject,
 | 
						|
         PCRE2_SIZE length, PCRE2_SIZE startoffset,
 | 
						|
         uint32_t options, pcre2_match_data *match_data,
 | 
						|
         pcre2_match_context *mcontext);
 | 
						|
 | 
						|
       The  function pcre2_match() is called to match a subject string against
 | 
						|
       a compiled pattern, which is passed in the code argument. You can  call
 | 
						|
       pcre2_match() with the same code argument as many times as you like, in
 | 
						|
       order to find multiple matches in the subject string or to  match  dif-
 | 
						|
       ferent subject strings with the same pattern.
 | 
						|
 | 
						|
       This  function  is  the  main  matching facility of the library, and it
 | 
						|
       operates in a Perl-like manner. For specialist use  there  is  also  an
 | 
						|
       alternative  matching function, which is described below in the section
 | 
						|
       about the pcre2_dfa_match() function.
 | 
						|
 | 
						|
       Here is an example of a simple call to pcre2_match():
 | 
						|
 | 
						|
         pcre2_match_data *match_data = pcre2_match_data_create(4, NULL);
 | 
						|
         int rc = pcre2_match(
 | 
						|
           re,             /* result of pcre2_compile() */
 | 
						|
           "some string",  /* the subject string */
 | 
						|
           11,             /* the length of the subject string */
 | 
						|
           0,              /* start at offset 0 in the subject */
 | 
						|
           0,              /* default options */
 | 
						|
           match_data,     /* the match data block */
 | 
						|
           NULL);          /* a match context; NULL means use defaults */
 | 
						|
 | 
						|
       If the subject string is zero-terminated, the length can  be  given  as
 | 
						|
       PCRE2_ZERO_TERMINATED. A match context must be provided if certain less
 | 
						|
       common matching parameters are to be changed. For details, see the sec-
 | 
						|
       tion on the match context above.
 | 
						|
 | 
						|
   The string to be matched by pcre2_match()
 | 
						|
 | 
						|
       The  subject string is passed to pcre2_match() as a pointer in subject,
 | 
						|
       a length in length, and a starting offset in  startoffset.  The  length
 | 
						|
       and  offset  are  in  code units, not characters.  That is, they are in
 | 
						|
       bytes for the 8-bit library, 16-bit code units for the 16-bit  library,
 | 
						|
       and  32-bit  code units for the 32-bit library, whether or not UTF pro-
 | 
						|
       cessing is enabled.
 | 
						|
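       As an illustration of the difference between code units and characters,
       the sketch below counts the characters in a UTF-8 subject, whereas the
       value that pcre2_match() expects in length is the byte count. The
       helper utf8_char_count() is hypothetical, not a PCRE2 function, and it
       assumes that the subject is already valid UTF-8.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of PCRE2: count the characters in a
   UTF-8 string of `length` bytes by counting every byte that is not a
   continuation byte (bit pattern 10xxxxxx). Assumes valid UTF-8. */
size_t utf8_char_count(const unsigned char *subject, size_t length)
{
    assert(subject != NULL);
    size_t chars = 0;
    for (size_t i = 0; i < length; i++) {
        if ((subject[i] & 0xC0) != 0x80)
            chars++;
    }
    return chars;
}
```

       For the 6-byte string "h\xC3\xA9llo" the helper reports 5 characters,
       but the length to pass to the 8-bit library is still 6.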
 | 
						|
       If startoffset is greater than the length of the subject, pcre2_match()
 | 
						|
       returns  PCRE2_ERROR_BADOFFSET.  When  the starting offset is zero, the
 | 
						|
       search for a match starts at the beginning of the subject, and this  is
 | 
						|
       by far the most common case. In UTF-8 or UTF-16 mode, the starting off-
 | 
						|
       set must point to the start of a character, or to the end of  the  sub-
 | 
						|
       ject  (in  UTF-32 mode, one code unit equals one character, so all off-
 | 
						|
       sets are valid). Like the  pattern  string,  the  subject  may  contain
 | 
						|
       binary zeroes.
 | 
						|
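       The rule for starting offsets in UTF-8 mode can be sketched as a check
       that a caller might make before calling pcre2_match(). The function and
       enumeration names below are hypothetical; they only mirror the cases
       described in this document and are not PCRE2's own error codes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result codes for this sketch only; NOT PCRE2's values. */
enum offset_check { OFFSET_OK, OFFSET_PAST_END, OFFSET_MID_CHARACTER };

/* Check a starting offset against a UTF-8 subject of `length` bytes.
   An offset is acceptable if it is the end of the subject or if the
   byte found there is not a continuation byte (10xxxxxx). */
enum offset_check check_utf8_start_offset(const unsigned char *subject,
                                          size_t length, size_t startoffset)
{
    assert(subject != NULL);
    if (startoffset > length)
        return OFFSET_PAST_END;        /* cf. PCRE2_ERROR_BADOFFSET */
    if (startoffset == length)
        return OFFSET_OK;              /* the end of the subject is allowed */
    if ((subject[startoffset] & 0xC0) == 0x80)
        return OFFSET_MID_CHARACTER;   /* cf. PCRE2_ERROR_BADUTFOFFSET */
    return OFFSET_OK;
}
```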
 | 
						|
       A  non-zero  starting offset is useful when searching for another match
 | 
						|
       in the same subject by calling pcre2_match()  again  after  a  previous
 | 
						|
       success.   Setting  startoffset  differs  from passing over a shortened
 | 
						|
       string and setting PCRE2_NOTBOL in the case of a  pattern  that  begins
 | 
						|
       with any kind of lookbehind. For example, consider the pattern
 | 
						|
 | 
						|
         \Biss\B
 | 
						|
 | 
						|
       which  finds  occurrences  of "iss" in the middle of words. (\B matches
 | 
						|
       only if the current position in the subject is not  a  word  boundary.)
 | 
						|
       When applied to the string "Mississippi" the first call to pcre2_match()
       finds the first occurrence. If pcre2_match() is called again with  just
       the  remainder  of  the subject, namely "issippi", it does not match,
 | 
						|
       because \B is false at the start of this subject, which is deemed
 | 
						|
       to  be  a word boundary. However, if pcre2_match() is passed the entire
 | 
						|
       string again, but with startoffset set to 4, it finds the second occur-
 | 
						|
       rence  of "iss" because it is able to look behind the starting point to
 | 
						|
       discover that it is preceded by a letter.
 | 
						|
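       The effect can be seen without the library by testing the \B condition
       by hand. The sketch below uses hypothetical helpers (not PCRE2
       functions) and assumes that word characters are ASCII letters, digits,
       and underscore.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: ASCII-only notion of a word character. */
int is_word_char(unsigned char c)
{
    return isalnum(c) || c == '_';
}

/* Return 1 if \B would hold at `pos`: the characters on both sides are
   of the same kind. Anything outside the subject counts as non-word. */
int not_word_boundary(const char *subject, size_t length, size_t pos)
{
    assert(subject != NULL && pos <= length);
    int before = pos > 0 && is_word_char((unsigned char)subject[pos - 1]);
    int after = pos < length && is_word_char((unsigned char)subject[pos]);
    return before == after;
}
```

       At offset 0 of "issippi" the helper returns 0, because nothing precedes
       the "i". At offset 4 of "Mississippi" it returns 1, because the "i" is
       preceded by "s".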
 | 
						|
       Finding all the matches in a subject is tricky  when  the  pattern  can
 | 
						|
       match an empty string. It is possible to emulate Perl's /g behaviour by
 | 
						|
       first  trying  the  match  again  at  the   same   offset,   with   the
 | 
						|
       PCRE2_NOTEMPTY_ATSTART  and  PCRE2_ANCHORED  options,  and then if that
 | 
						|
       fails, advancing the starting  offset  and  trying  an  ordinary  match
 | 
						|
       again.  There  is  some  code  that  demonstrates how to do this in the
 | 
						|
       pcre2demo sample program. In the most general case, you have  to  check
 | 
						|
       to  see  if the newline convention recognizes CRLF as a newline, and if
 | 
						|
       so, and the current character is CR followed by LF, advance the  start-
 | 
						|
       ing offset by two characters instead of one.
 | 
						|
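       The advancing step described above might look like the following
       sketch for the 8-bit library. The helper name and its two flag
       arguments are assumptions made for illustration; a caller would have to
       work them out from the compiled pattern. Because a starting offset must
       point to the start of a character in UTF mode, the sketch also steps
       over UTF-8 continuation bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of PCRE2. Given that the retry at
   `offset` failed, return the offset for the next ordinary match.
   The caller must not call this once offset has reached length. */
size_t advance_start_offset(const unsigned char *subject, size_t length,
                            size_t offset, int crlf_is_newline, int utf8)
{
    assert(subject != NULL && offset < length);
    if (crlf_is_newline && offset + 1 < length &&
        subject[offset] == '\r' && subject[offset + 1] == '\n')
        return offset + 2;             /* step over CR and LF together */
    offset++;                          /* step over one code unit */
    if (utf8) {
        /* step over the rest of a multi-byte character */
        while (offset < length && (subject[offset] & 0xC0) == 0x80)
            offset++;
    }
    return offset;
}
```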
 | 
						|
       If  a  non-zero starting offset is passed when the pattern is anchored,
 | 
						|
       one attempt to match at the given offset is made. This can only succeed
 | 
						|
       if  the  pattern  does  not require the match to be at the start of the
 | 
						|
       subject.
 | 
						|
 | 
						|
   Option bits for pcre2_match()
 | 
						|
 | 
						|
       The unused bits of the options argument for pcre2_match() must be zero.
 | 
						|
       The  only  bits  that  may  be  set  are  PCRE2_ANCHORED, PCRE2_NOTBOL,
 | 
						|
       PCRE2_NOTEOL,  PCRE2_NOTEMPTY,  PCRE2_NOTEMPTY_ATSTART,   PCRE2_NO_JIT,
 | 
						|
       PCRE2_NO_UTF_CHECK,  PCRE2_PARTIAL_HARD,  and PCRE2_PARTIAL_SOFT. Their
 | 
						|
       action is described below.
 | 
						|
 | 
						|
       Setting PCRE2_ANCHORED at match time is not supported by  the  just-in-
 | 
						|
       time  (JIT)  compiler.  If  it is set, JIT matching is disabled and the
 | 
						|
       normal  interpretive  code  in  pcre2_match()  is   run.   Apart   from
 | 
						|
       PCRE2_NO_JIT  (obviously),  the remaining options are supported for JIT
 | 
						|
       matching.
 | 
						|
 | 
						|
         PCRE2_ANCHORED
 | 
						|
 | 
						|
       The PCRE2_ANCHORED option limits pcre2_match() to matching at the first
 | 
						|
       matching  position.  If  a pattern was compiled with PCRE2_ANCHORED, or
 | 
						|
       turned out to be anchored by virtue of its contents, it cannot be  made
 | 
						|
       unanchored at matching time. Note that setting the option at match time
 | 
						|
       disables JIT matching.
 | 
						|
 | 
						|
         PCRE2_NOTBOL
 | 
						|
 | 
						|
       This option specifies that the first character of the subject string is
       not the beginning of a line, so the circumflex metacharacter should not
       match before it. Setting this without having set PCRE2_MULTILINE at
       compile time causes circumflex never to match. This option affects only
       the behaviour of the circumflex metacharacter. It does not affect \A.
 | 
						|
 | 
						|
         PCRE2_NOTEOL
 | 
						|
 | 
						|
       This option specifies that the end of the subject string is not the end
 | 
						|
       of  a line, so the dollar metacharacter should not match it nor (except
 | 
						|
       in multiline mode) a newline immediately before it. Setting this  with-
 | 
						|
       out  having  set PCRE2_MULTILINE at compile time causes dollar never to
 | 
						|
       match. This option affects only the behaviour of the dollar metacharac-
 | 
						|
       ter. It does not affect \Z or \z.
 | 
						|
 | 
						|
         PCRE2_NOTEMPTY
 | 
						|
 | 
						|
       An empty string is not considered to be a valid match if this option is
 | 
						|
       set. If there are alternatives in the pattern, they are tried.  If  all
 | 
						|
       the  alternatives  match  the empty string, the entire match fails. For
 | 
						|
       example, if the pattern
 | 
						|
 | 
						|
         a?b?
 | 
						|
 | 
						|
       is applied to a string not beginning with "a" or  "b",  it  matches  an
 | 
						|
       empty string at the start of the subject. With PCRE2_NOTEMPTY set, this
 | 
						|
       match is not valid, so pcre2_match() searches further into  the  string
 | 
						|
       for occurrences of "a" or "b".
 | 
						|
 | 
						|
         PCRE2_NOTEMPTY_ATSTART
 | 
						|
 | 
						|
       This  is  like PCRE2_NOTEMPTY, except that it locks out an empty string
 | 
						|
       match only at the first matching position, that is, at the start of the
 | 
						|
       subject  plus  the  starting offset. An empty string match later in the
 | 
						|
       subject is permitted.  If the pattern is anchored,  such  a  match  can
 | 
						|
       occur only if the pattern contains \K.
 | 
						|
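       The difference between PCRE2_NOTEMPTY and PCRE2_NOTEMPTY_ATSTART can be
       summarized as a predicate over a candidate match. This is a sketch
       only: the flag values are placeholders, not those defined in pcre2.h,
       and the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Placeholder flag bits for this sketch only; NOT the values of
   PCRE2_NOTEMPTY and PCRE2_NOTEMPTY_ATSTART in pcre2.h. */
#define SKETCH_NOTEMPTY          0x1u
#define SKETCH_NOTEMPTY_ATSTART  0x2u

/* Return 1 if a candidate match from `start` to `end` is acceptable
   under `flags`, or 0 if it must be rejected as an empty match.
   `startoffset` is the offset at which matching was started. */
int empty_match_acceptable(size_t start, size_t end, size_t startoffset,
                           unsigned flags)
{
    /* as with pcre2_match(), unused option bits must be zero */
    assert((flags & ~(SKETCH_NOTEMPTY | SKETCH_NOTEMPTY_ATSTART)) == 0);
    if (start != end)
        return 1;                      /* not an empty string */
    if (flags & SKETCH_NOTEMPTY)
        return 0;                      /* empty matches never allowed */
    if ((flags & SKETCH_NOTEMPTY_ATSTART) && start == startoffset)
        return 0;                      /* empty only at the first position */
    return 1;
}
```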
 | 
						|
         PCRE2_NO_JIT
 | 
						|
 | 
						|
       By   default,   if   a  pattern  has  been  successfully  processed  by
 | 
						|
       pcre2_jit_compile(), JIT is automatically used  when  pcre2_match()  is
 | 
						|
       called  with  options  that JIT supports. Setting PCRE2_NO_JIT disables
 | 
						|
       the use of JIT; it forces matching to be done by the interpreter.
 | 
						|
 | 
						|
         PCRE2_NO_UTF_CHECK
 | 
						|
 | 
						|
       When PCRE2_UTF is set at compile time, the validity of the subject as a
 | 
						|
       UTF  string  is  checked  by default when pcre2_match() is subsequently
 | 
						|
       called.  If a non-zero starting offset is given, the check  is  applied
 | 
						|
       only  to that part of the subject that could be inspected during match-
 | 
						|
       ing, and there is a check that the starting offset points to the  first
 | 
						|
       code  unit of a character or to the end of the subject. If there are no
 | 
						|
       lookbehind assertions in the pattern, the check starts at the  starting
 | 
						|
       offset.  Otherwise,  it  starts at the length of the longest lookbehind
 | 
						|
       before the starting offset, or at the start of the subject if there are
 | 
						|
       not  that  many  characters  before  the starting offset. Note that the
 | 
						|
       sequences \b and \B are one-character lookbehinds.
 | 
						|
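       For the 8-bit library, the position at which such a check would begin
       can be sketched as follows. The function name is hypothetical, and
       max_lookbehind stands for the length in characters of the longest
       lookbehind, which the caller is assumed to know. The sketch illustrates
       the rule stated above and is not taken from the PCRE2 source.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: step back up to `max_lookbehind` characters
   from `startoffset` in a UTF-8 subject, stopping at the start of the
   subject. Each step skips back over continuation bytes (10xxxxxx). */
size_t utf_check_start(const unsigned char *subject, size_t startoffset,
                       size_t max_lookbehind)
{
    assert(subject != NULL);
    size_t pos = startoffset;
    while (max_lookbehind > 0 && pos > 0) {
        pos--;
        while (pos > 0 && (subject[pos] & 0xC0) == 0x80)
            pos--;
        max_lookbehind--;
    }
    return pos;
}
```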
 | 
						|
       The check is carried out before any other processing takes place, and a
 | 
						|
       negative  error  code is returned if the check fails. There are several
 | 
						|
       UTF error codes for each code unit width,  corresponding  to  different
 | 
						|
       problems  with  the code unit sequence. There are discussions about the
 | 
						|
       validity of UTF-8 strings, UTF-16 strings, and UTF-32  strings  in  the
 | 
						|
       pcre2unicode page.
 | 
						|
 | 
						|
       If  you  know  that  your  subject is valid, and you want to skip these
 | 
						|
       checks for performance reasons,  you  can  set  the  PCRE2_NO_UTF_CHECK
 | 
						|
       option  when  calling  pcre2_match(). You might want to do this for the
 | 
						|
       second and subsequent calls to pcre2_match() if you are making repeated
 | 
						|
       calls to find all the matches in a single subject string.
 | 
						|
 | 
						|
       NOTE:  When PCRE2_NO_UTF_CHECK is set, the effect of passing an invalid
 | 
						|
       string as a subject, or an invalid value of startoffset, is  undefined.
 | 
						|
       Your program may crash or loop indefinitely.
 | 
						|
 | 
						|
         PCRE2_PARTIAL_HARD
 | 
						|
         PCRE2_PARTIAL_SOFT
 | 
						|
 | 
						|
       These  options  turn  on  the partial matching feature. A partial match
 | 
						|
       occurs if the end of the subject string is  reached  successfully,  but
 | 
						|
       there  are not enough subject characters to complete the match. If this
 | 
						|
       happens when PCRE2_PARTIAL_SOFT (but not  PCRE2_PARTIAL_HARD)  is  set,
 | 
						|
       matching  continues  by  testing any remaining alternatives. Only if no
 | 
						|
       complete match can be found is PCRE2_ERROR_PARTIAL returned instead  of
 | 
						|
       PCRE2_ERROR_NOMATCH.  In other words, PCRE2_PARTIAL_SOFT specifies that
 | 
						|
       the caller is prepared to handle a partial match, but only if  no  com-
 | 
						|
       plete match can be found.
 | 
						|
 | 
						|
       If  PCRE2_PARTIAL_HARD is set, it overrides PCRE2_PARTIAL_SOFT. In this
 | 
						|
       case, if a partial match is found,  pcre2_match()  immediately  returns
 | 
						|
       PCRE2_ERROR_PARTIAL,  without  considering  any  other alternatives. In
 | 
						|
       other words, when PCRE2_PARTIAL_HARD is set, a partial match is
       considered to be more important than an alternative complete match.
 | 
						|
 | 
						|
       There is a more detailed discussion of partial and multi-segment match-
 | 
						|
       ing, with examples, in the pcre2partial documentation.
 | 
						|
 | 
						|
 | 
						|
NEWLINE HANDLING WHEN MATCHING
 | 
						|
 | 
						|
       When PCRE2 is built, a default newline convention is set; this is  usu-
 | 
						|
       ally  the standard convention for the operating system. The default can
 | 
						|
       be overridden in a compile context by calling  pcre2_set_newline().  It
 | 
						|
       can  also be overridden by starting a pattern string with, for example,
 | 
						|
       (*CRLF), as described in the section  on  newline  conventions  in  the
 | 
						|
       pcre2pattern  page. During matching, the newline choice affects the be-
 | 
						|
       haviour of the dot, circumflex, and dollar metacharacters. It may  also
 | 
						|
       alter  the  way  the  match starting position is advanced after a match
 | 
						|
       failure for an unanchored pattern.
 | 
						|
 | 
						|
       When PCRE2_NEWLINE_CRLF, PCRE2_NEWLINE_ANYCRLF, or PCRE2_NEWLINE_ANY is
 | 
						|
       set  as  the  newline convention, and a match attempt for an unanchored
 | 
						|
       pattern fails when the current starting position is at a CRLF sequence,
 | 
						|
       and  the  pattern contains no explicit matches for CR or LF characters,
 | 
						|
       the match position is advanced by two characters  instead  of  one,  in
 | 
						|
       other words, to after the CRLF.
 | 
						|
 | 
						|
       The above rule is a compromise that makes the most common cases work as
 | 
						|
       expected. For example, if the pattern  is  .+A  (and  the  PCRE2_DOTALL
 | 
						|
       option is not set), it does not match the string "\r\nA" because, after
 | 
						|
       failing at the start, it skips both the CR and the LF before  retrying.
 | 
						|
       However,  the  pattern  [\r\n]A does match that string, because it con-
 | 
						|
       tains an explicit CR or LF reference, and so advances only by one char-
 | 
						|
       acter after the first failure.
 | 
						|
 | 
						|
       An explicit match for CR or LF is either a literal appearance of one of
 | 
						|
       those characters in the  pattern,  or  one  of  the  \r  or  \n  escape
 | 
						|
       sequences.  Implicit  matches  such  as [^X] do not count, nor does \s,
 | 
						|
       even though it includes CR and LF in the characters that it matches.
 | 
						|
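       A naive scan for such explicit matches is sketched below. It follows
       only the definition given above (a literal CR or LF, or a \r or \n
       escape) and ignores \Q...\E quoting and all other pattern syntax. The
       function name is hypothetical, and the sketch makes no claim to match
       the analysis that PCRE2 itself performs.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: return 1 if the pattern text contains a
   literal CR or LF, or a \r or \n escape sequence; otherwise 0. */
int pattern_has_explicit_cr_or_lf(const char *pattern, size_t length)
{
    assert(pattern != NULL);
    for (size_t i = 0; i < length; i++) {
        char c = pattern[i];
        if (c == '\r' || c == '\n')
            return 1;                  /* literal CR or LF */
        if (c == '\\' && i + 1 < length) {
            char next = pattern[i + 1];
            if (next == 'r' || next == 'n')
                return 1;              /* \r or \n escape */
            i++;                       /* skip the escaped character */
        }
    }
    return 0;
}
```

       Under this definition [\r\n]A counts as containing an explicit match,
       whereas .+A, [^X], and \s do not.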
 | 
						|
       Notwithstanding the above, anomalous effects may still occur when  CRLF
 | 
						|
       is a valid newline sequence and explicit \r or \n escapes appear in the
 | 
						|
       pattern.
 | 
						|
 | 
						|
 | 
						|
HOW PCRE2_MATCH() RETURNS A STRING AND CAPTURED SUBSTRINGS
 | 
						|
 | 
						|
       uint32_t pcre2_get_ovector_count(pcre2_match_data *match_data);
 | 
						|
 | 
						|
       PCRE2_SIZE *pcre2_get_ovector_pointer(pcre2_match_data *match_data);
 | 
						|
 | 
						|
       In general, a pattern matches a certain portion of the subject, and  in
 | 
						|
       addition,  further  substrings  from  the  subject may be picked out by
 | 
						|
       parenthesized parts of the pattern.  Following  the  usage  in  Jeffrey
 | 
						|
       Friedl's  book,  this  is  called  "capturing" in what follows, and the
 | 
						|
       phrase "capturing subpattern" or "capturing group" is used for a  frag-
 | 
						|
       ment  of  a  pattern that picks out a substring. PCRE2 supports several
 | 
						|
       other kinds of parenthesized subpattern that do not cause substrings to
 | 
						|
       be  captured. The pcre2_pattern_info() function can be used to find out
 | 
						|
       how many capturing subpatterns there are in a compiled pattern.
 | 
						|
 | 
						|
       You can use auxiliary functions for accessing  captured  substrings  by
 | 
						|
       number or by name, as described in sections below.
 | 
						|
 | 
						|
       Alternatively, you can make direct use of the vector of PCRE2_SIZE val-
 | 
						|
       ues, called  the  ovector,  which  contains  the  offsets  of  captured
 | 
						|
       strings.   It   is   part  of  the  match  data  block.   The  function
 | 
						|
       pcre2_get_ovector_pointer() returns the address  of  the  ovector,  and
 | 
						|
       pcre2_get_ovector_count() returns the number of pairs of values it con-
 | 
						|
       tains.
 | 
						|
 | 
						|
       Within the ovector, the first in each pair of values is set to the off-
 | 
						|
       set of the first code unit of a substring, and the second is set to the
 | 
						|
       offset of the first code unit after the end of a substring. These  val-
 | 
						|
       ues  are always code unit offsets, not character offsets. That is, they
 | 
						|
       are byte offsets in the 8-bit library, 16-bit  offsets  in  the  16-bit
 | 
						|
       library, and 32-bit offsets in the 32-bit library.
 | 
						|
 | 
						|
       After  a  partial  match  (error  return PCRE2_ERROR_PARTIAL), only the
 | 
						|
       first pair of offsets (that is, ovector[0]  and  ovector[1])  are  set.
 | 
						|
       They  identify  the part of the subject that was partially matched. See
 | 
						|
       the pcre2partial documentation for details of partial matching.
 | 
						|
 | 
						|
       After a successful match, the first pair of offsets identifies the por-
 | 
						|
       tion  of the subject string that was matched by the entire pattern. The
 | 
						|
       next pair is used for the first capturing subpattern, and  so  on.  The
 | 
						|
       value  returned  by pcre2_match() is one more than the highest numbered
 | 
						|
       pair that has been set. For example, if two substrings have  been  cap-
 | 
						|
       tured,  the returned value is 3. If there are no capturing subpatterns,
 | 
						|
       the return value from a successful match is 1, indicating that just the
 | 
						|
       first pair of offsets has been set.
 | 
						|
 | 
						|
       If  a  pattern uses the \K escape sequence within a positive assertion,
 | 
						|
       the reported start of a successful match can be greater than the end of
 | 
						|
       the  match.   For  example,  if the pattern (?=ab\K) is matched against
 | 
						|
       "ab", the start and end offset values for the match are 2 and 0.
 | 
						|
 | 
						|
       If a capturing subpattern group is matched repeatedly within  a  single
 | 
						|
       match  operation, it is the last portion of the subject that it matched
 | 
						|
       that is returned.
 | 
						|
 | 
						|
       If the ovector is too small to hold all the captured substring offsets,
 | 
						|
       as  much  as possible is filled in, and the function returns a value of
 | 
						|
       zero. If captured substrings are not of interest, pcre2_match() may  be
 | 
						|
       called with a match data block whose ovector is of minimum length (that
 | 
						|
       is, one pair). However, if the pattern contains back references and the
 | 
						|
       ovector is not big enough to remember the related substrings, PCRE2 has
 | 
						|
       to get additional memory for use during matching. Thus  it  is  usually
 | 
						|
       advisable to set up a match data block containing an ovector of reason-
 | 
						|
       able size.
 | 
						|
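       One way of turning the return value into a count of usable offset pairs
       is sketched below. The function name is hypothetical; ovector_count is
       assumed to be the value returned by pcre2_get_ovector_count().

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: number of ovector pairs a caller may read
   after pcre2_match() returned `rc`. Negative values of rc (errors,
   including partial matches) are left for the caller to handle. */
uint32_t usable_pairs(int rc, uint32_t ovector_count)
{
    assert(ovector_count >= 1);        /* an ovector has at least one pair */
    if (rc < 0)
        return 0;                      /* not a successful match */
    if (rc == 0)
        return ovector_count;          /* too small: filled to capacity */
    return (uint32_t)rc;               /* highest numbered pair set, plus 1 */
}
```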
 | 
						|
       It is possible for capturing subpattern number n+1 to match  some  part
 | 
						|
       of the subject when subpattern n has not been used at all. For example,
 | 
						|
       if the string "abc" is matched  against  the  pattern  (a|(z))(bc)  the
 | 
						|
       return from the function is 4, and subpatterns 1 and 3 are matched, but
 | 
						|
       2 is not. When this happens, both values in  the  offset  pairs  corre-
 | 
						|
       sponding to unused subpatterns are set to PCRE2_UNSET.
 | 
						|
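       A caller that reads the ovector directly therefore needs to test for
       PCRE2_UNSET before using a pair. The sketch below uses plain size_t in
       place of PCRE2_SIZE and a local stand-in for PCRE2_UNSET so that it
       compiles without pcre2.h; the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local stand-in for PCRE2_UNSET, which pcre2.h defines as
   (~(PCRE2_SIZE)0); PCRE2_SIZE is size_t. */
#define SKETCH_UNSET (~(size_t)0)

/* Hypothetical helper: if group `n` took part in the match, store its
   length in code units in *length and return 1; otherwise return 0.
   The caller must ensure that pair n exists in the ovector. */
int group_length(const size_t *ovector, uint32_t n, size_t *length)
{
    assert(ovector != NULL && length != NULL);
    size_t start = ovector[2 * n];
    size_t end = ovector[2 * n + 1];
    if (start == SKETCH_UNSET || end == SKETCH_UNSET)
        return 0;                      /* group was not used */
    if (start > end)
        return 0;                      /* possible with \K in an assertion */
    *length = end - start;
    return 1;
}
```

       For the (a|(z))(bc) example, the ovector holds the pairs (0,3), (0,1),
       (unset,unset), and (1,3), so the helper reports a length of 1 for group
       1, nothing for group 2, and a length of 2 for group 3.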
 | 
						|
       Offset  values  that correspond to unused subpatterns at the end of the
 | 
						|
       expression are also set to PCRE2_UNSET.  For  example,  if  the  string
 | 
						|
       "abc" is matched against the pattern (abc)(x(yz)?)? subpatterns 2 and 3
 | 
						|
       are not matched. The return from the function is 2, because the
       highest used capturing subpattern number is 1. The offsets for the
       second and third capturing subpatterns (assuming the vector is large
 | 
						|
       enough, of course) are set to PCRE2_UNSET.
 | 
						|
 | 
						|
       Elements in the ovector that do not correspond to capturing parentheses
 | 
						|
       in the pattern are never changed. That is, if a pattern contains n cap-
 | 
						|
       turing parentheses, no more than ovector[0] to ovector[2n+1] are set by
 | 
						|
       pcre2_match(). The other elements retain whatever  values  they  previ-
 | 
						|
       ously had.
 | 
						|
 | 
						|
 | 
						|
OTHER INFORMATION ABOUT A MATCH
 | 
						|
 | 
						|
       PCRE2_SPTR pcre2_get_mark(pcre2_match_data *match_data);
 | 
						|
 | 
						|
       PCRE2_SIZE pcre2_get_startchar(pcre2_match_data *match_data);
 | 
						|
 | 
						|
       As  well as the offsets in the ovector, other information about a match
 | 
						|
       is retained in the match data block and can be retrieved by  the  above
 | 
						|
       functions  in  appropriate  circumstances.  If they are called at other
 | 
						|
       times, the result is undefined.
 | 
						|
 | 
						|
       After a successful match, a partial match (PCRE2_ERROR_PARTIAL),  or  a
 | 
						|
       failure  to  match  (PCRE2_ERROR_NOMATCH), a (*MARK) name may be avail-
 | 
						|
       able, and pcre2_get_mark() can be called. It returns a pointer  to  the
 | 
						|
       zero-terminated  name,  which is within the compiled pattern. Otherwise
 | 
						|
       NULL is returned. The length of the (*MARK) name (excluding the
       terminating zero) is stored in the code unit that precedes the name. You
 | 
						|
       should use this instead of relying  on  the  terminating  zero  if  the
 | 
						|
       (*MARK) name might contain a binary zero.
 | 
						|
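       In the 8-bit library, reading that length might look like the sketch
       below. The function name is hypothetical, and mark is assumed to be a
       non-NULL pointer returned by pcre2_get_mark().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper for the 8-bit library: the code unit just
   before a (*MARK) name holds the length of the name. */
size_t mark_name_length(const unsigned char *mark)
{
    assert(mark != NULL);
    return (size_t)mark[-1];
}
```

       Unlike strlen(), this gives the full length even when the name contains
       a binary zero.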
 | 
						|
       After a successful match, the (*MARK) name that is returned is the last
 | 
						|
       one encountered on the matching path through the pattern. After  a  "no
 | 
						|
       match"  or  a  partial  match,  the  last  encountered  (*MARK) name is
 | 
						|
       returned. For example, consider this pattern:
 | 
						|
 | 
						|
         ^(*MARK:A)((*MARK:B)a|b)c
 | 
						|
 | 
						|
       When it matches "bc", the returned mark is A. The B mark is  "seen"  in
 | 
						|
       the  first  branch of the group, but it is not on the matching path. On
 | 
						|
       the other hand, when this pattern fails to  match  "bx",  the  returned
 | 
						|
       mark is B.
 | 
						|
 | 
						|
       After  a  successful  match, a partial match, or one of the invalid UTF
 | 
						|
       errors (for example, PCRE2_ERROR_UTF8_ERR5), pcre2_get_startchar()  can
 | 
						|
       be called. After a successful or partial match it returns the code unit
 | 
						|
       offset of the character at which the match started. For  a  non-partial
 | 
						|
       match,  this can be different to the value of ovector[0] if the pattern
 | 
						|
       contains the \K escape sequence. After a partial match,  however,  this
 | 
						|
       value  is  always the same as ovector[0] because \K does not affect the
 | 
						|
       result of a partial match.
 | 
						|
 | 
						|
       After a UTF check failure, pcre2_get_startchar() can be used to  obtain
 | 
						|
       the code unit offset of the invalid UTF character. Details are given in
 | 
						|
       the pcre2unicode page.
 | 
						|
 | 
						|
 | 
						|
ERROR RETURNS FROM pcre2_match()
 | 
						|
 | 
						|
       If pcre2_match() fails, it returns a negative number. This can be  con-
 | 
						|
       verted  to a text string by calling the pcre2_get_error_message() func-
 | 
						|
       tion (see "Obtaining a textual error message" below).   Negative  error
 | 
						|
       codes  are  also  returned  by other functions, and are documented with
 | 
						|
       them. The codes are given names in the header file. If UTF checking  is
 | 
						|
       in force and an invalid UTF subject string is detected, one of a number
 | 
						|
       of UTF-specific negative error codes is returned. Details are given  in
 | 
						|
       the  pcre2unicode  page. The following are the other errors that may be
 | 
						|
       returned by pcre2_match():
 | 
						|
 | 
						|
         PCRE2_ERROR_NOMATCH
 | 
						|
 | 
						|
       The subject string did not match the pattern.
 | 
						|
 | 
						|
         PCRE2_ERROR_PARTIAL
 | 
						|
 | 
						|
       The subject string did not match, but it did match partially.  See  the
 | 
						|
       pcre2partial documentation for details of partial matching.
 | 
						|
 | 
						|
         PCRE2_ERROR_BADMAGIC
 | 
						|
 | 
						|
       PCRE2 stores a 4-byte "magic number" at the start of the compiled code,
 | 
						|
       to catch the case when it is passed a junk pointer. This is  the  error
 | 
						|
       that is returned when the magic number is not present.
 | 
						|
 | 
						|
         PCRE2_ERROR_BADMODE
 | 
						|
 | 
						|
       This  error  is  given  when  a  pattern that was compiled by the 8-bit
 | 
						|
       library is passed to a 16-bit  or  32-bit  library  function,  or  vice
 | 
						|
       versa.
 | 
						|
 | 
						|
         PCRE2_ERROR_BADOFFSET
 | 
						|
 | 
						|
       The value of startoffset was greater than the length of the subject.
 | 
						|
 | 
						|
         PCRE2_ERROR_BADOPTION
 | 
						|
 | 
						|
       An unrecognized bit was set in the options argument.
 | 
						|
 | 
						|
         PCRE2_ERROR_BADUTFOFFSET
 | 
						|
 | 
						|
       The UTF code unit sequence that was passed as a subject was checked and
 | 
						|
       found to be valid (the PCRE2_NO_UTF_CHECK option was not set), but  the
 | 
						|
       value  of startoffset did not point to the beginning of a UTF character
 | 
						|
       or the end of the subject.
 | 
						|
 | 
						|
         PCRE2_ERROR_CALLOUT
 | 
						|
 | 
						|
       This error is never generated by pcre2_match() itself. It  is  provided
 | 
						|
       for  use  by  callout  functions  that  want  to cause pcre2_match() or
 | 
						|
       pcre2_callout_enumerate() to return a distinctive error code.  See  the
 | 
						|
       pcre2callout documentation for details.
 | 
						|
 | 
						|
         PCRE2_ERROR_INTERNAL
 | 
						|
 | 
						|
       An  unexpected  internal error has occurred. This error could be caused
 | 
						|
       by a bug in PCRE2 or by overwriting of the compiled pattern.
 | 
						|
 | 
						|
         PCRE2_ERROR_JIT_BADOPTION
 | 
						|
 | 
						|
       This error is returned when a pattern that was successfully processed
       by pcre2_jit_compile() is being matched, but the matching mode
       (partial or complete match) does not correspond to any JIT compilation
       mode. When the JIT fast path function is used, this error may also be
       given for invalid options. See the pcre2jit documentation for more
       details.
 | 
						|
 | 
						|
         PCRE2_ERROR_JIT_STACKLIMIT
 | 
						|
 | 
						|
       This error is returned when a pattern that was successfully processed
       by pcre2_jit_compile() is being matched, but the memory available for
       the just-in-time processing stack is not large enough. See the
       pcre2jit documentation for more details.
 | 
						|
 | 
						|
         PCRE2_ERROR_MATCHLIMIT
 | 
						|
 | 
						|
       The backtracking limit was reached.
 | 
						|
 | 
						|
         PCRE2_ERROR_NOMEMORY
 | 
						|
 | 
						|
       If  a  pattern  contains  back  references,  but the ovector is not big
 | 
						|
       enough to remember the referenced substrings, PCRE2  gets  a  block  of
 | 
						|
       memory at the start of matching to use for this purpose. There are some
 | 
						|
       other special cases where extra memory is needed during matching.  This
 | 
						|
       error is given when memory cannot be obtained.
 | 
						|
 | 
						|
         PCRE2_ERROR_NULL
 | 
						|
 | 
						|
       Either the code, subject, or match_data argument was passed as NULL.
 | 
						|
 | 
						|
         PCRE2_ERROR_RECURSELOOP
 | 
						|
 | 
						|
       This  error  is  returned  when  pcre2_match() detects a recursion loop
 | 
						|
       within the pattern. Specifically, it means that either the  whole  pat-
 | 
						|
       tern or a subpattern has been called recursively for the second time at
 | 
						|
       the same position in the subject  string.  Some  simple  patterns  that
 | 
						|
       might  do  this are detected and faulted at compile time, but more com-
 | 
						|
       plicated cases, in particular mutual recursions between  two  different
 | 
						|
       subpatterns, cannot be detected until matching is attempted.
 | 
						|
 | 
						|
         PCRE2_ERROR_RECURSIONLIMIT
 | 
						|
 | 
						|
       The internal recursion limit was reached.
 | 
						|
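       A caller typically separates these results into a few broad cases. The
       sketch below mirrors the values of PCRE2_ERROR_NOMATCH (-1) and
       PCRE2_ERROR_PARTIAL (-2) from pcre2.h as local constants so that it
       compiles on its own; the enumeration and function names are
       hypothetical. All other negative values, including the UTF-specific
       ones, are treated alike here.

```c
#include <assert.h>
#include <stddef.h>

/* Local copies of two pcre2.h values, for this sketch only. */
#define SKETCH_ERROR_NOMATCH  (-1)
#define SKETCH_ERROR_PARTIAL  (-2)

enum match_outcome {
    OUTCOME_MATCH, OUTCOME_NOMATCH, OUTCOME_PARTIAL, OUTCOME_ERROR
};

/* Hypothetical helper: sort a pcre2_match() return value into one of
   four broad cases. */
enum match_outcome classify_match_result(int rc)
{
    if (rc >= 0)
        return OUTCOME_MATCH;          /* 0 means the ovector was too small */
    if (rc == SKETCH_ERROR_NOMATCH)
        return OUTCOME_NOMATCH;
    if (rc == SKETCH_ERROR_PARTIAL)
        return OUTCOME_PARTIAL;
    return OUTCOME_ERROR;              /* every other negative code */
}

/* Hypothetical helper: a short label for each case. */
const char *outcome_name(enum match_outcome outcome)
{
    static const char *const names[] = {
        "match", "no match", "partial match", "error"
    };
    assert((size_t)outcome < sizeof names / sizeof names[0]);
    return names[outcome];
}
```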
 | 
						|
 | 
						|
OBTAINING A TEXTUAL ERROR MESSAGE
 | 
						|
 | 
						|
       int pcre2_get_error_message(int errorcode, PCRE2_UCHAR *buffer,
 | 
						|
         PCRE2_SIZE bufflen);
 | 
						|
 | 
						|
       A  text  message  for  an  error code from any PCRE2 function (compile,
 | 
						|
       match, or auxiliary) can be obtained  by  calling  pcre2_get_error_mes-
 | 
						|
       sage().  The  code  is passed as the first argument, with the remaining
 | 
						|
       two arguments specifying a code unit buffer and its length, into  which
 | 
						|
       the  text  message is placed. Note that the message is returned in code
 | 
						|
       units of the appropriate width for the library that is being used.
 | 
						|
 | 
						|
       The returned message is terminated with a trailing zero, and the  func-
 | 
						|
       tion  returns  the  number  of  code units used, excluding the trailing
 | 
						|
       zero.  If  the  error  number  is  unknown,  the  negative  error  code
 | 
						|
       PCRE2_ERROR_BADDATA  is  returned. If the buffer is too small, the mes-
 | 
						|
       sage is truncated (but still with a trailing zero),  and  the  negative
 | 
						|
       error  code PCRE2_ERROR_NOMEMORY is returned.  None of the messages are
 | 
						|
       very long; a buffer size of 120 code units is ample.
 | 
						|
 | 
						|
 | 
						|
EXTRACTING CAPTURED SUBSTRINGS BY NUMBER

       int pcre2_substring_length_bynumber(pcre2_match_data *match_data,
         uint32_t number, PCRE2_SIZE *length);

       int pcre2_substring_copy_bynumber(pcre2_match_data *match_data,
         uint32_t number, PCRE2_UCHAR *buffer,
         PCRE2_SIZE *bufflen);

       int pcre2_substring_get_bynumber(pcre2_match_data *match_data,
         uint32_t number, PCRE2_UCHAR **bufferptr,
         PCRE2_SIZE *bufflen);

       void pcre2_substring_free(PCRE2_UCHAR *buffer);

       Captured substrings can be accessed directly by using  the  ovector  as
       described above.  For convenience, auxiliary functions are provided for
       extracting  captured  substrings  as  new,  separate,   zero-terminated
       strings. A substring that contains a binary zero is correctly extracted
       and has a further zero added on the end, but  the  result  is  not,  of
       course, a C string.

       The functions in this section identify substrings by number. The number
       zero refers to the entire matched substring, with higher numbers refer-
       ring  to  substrings  captured by parenthesized groups. After a partial
       match, only substring zero is available.  An  attempt  to  extract  any
       other  substring  gives the error PCRE2_ERROR_PARTIAL. The next section
       describes similar functions for extracting captured substrings by name.

       If a pattern uses the \K escape sequence within a  positive  assertion,
       the reported start of a successful match can be greater than the end of
       the match.  For example, if the pattern  (?=ab\K)  is  matched  against
       "ab",  the  start  and  end offset values for the match are 2 and 0. In
       this situation, calling these functions with a  zero  substring  number
       extracts a zero-length empty string.

       You  can  find the length in code units of a captured substring without
       extracting it by calling pcre2_substring_length_bynumber().  The  first
       argument  is a pointer to the match data block, the second is the group
       number, and the third is a pointer to a variable into which the  length
       is  placed.  If  you just want to know whether or not the substring has
       been captured, you can pass the third argument as NULL.

       The pcre2_substring_copy_bynumber() function  copies  a  captured  sub-
       string  into  a supplied buffer, whereas pcre2_substring_get_bynumber()
       copies it into new memory, obtained using the  same  memory  allocation
       function  that  was  used for the match data block. The first two argu-
       ments of these functions are a pointer to the match data  block  and  a
       capturing group number.

       The final arguments of pcre2_substring_copy_bynumber() are a pointer to
       the buffer and a pointer to a variable that contains its length in code
       units.  This is updated to contain the actual number of code units used
       for the extracted substring, excluding the terminating zero.

       For pcre2_substring_get_bynumber() the third and fourth arguments point
       to  variables that are updated with a pointer to the new memory and the
       number of code units that comprise the substring, again  excluding  the
       terminating  zero.  When  the substring is no longer needed, the memory
       should be freed by calling pcre2_substring_free().

       The return value from all these functions is zero  for  success,  or  a
       negative  error  code.  If  the pattern match failed, the match failure
       code is returned.  If a substring number  greater  than  zero  is  used
       after  a partial match, PCRE2_ERROR_PARTIAL is returned. Other possible
       error codes are:

         PCRE2_ERROR_NOMEMORY

       The buffer was too small for  pcre2_substring_copy_bynumber(),  or  the
       attempt to get memory failed for pcre2_substring_get_bynumber().

         PCRE2_ERROR_NOSUBSTRING

       There  is  no  substring  with that number in the pattern, that is, the
       number is greater than the number of capturing parentheses.

         PCRE2_ERROR_UNAVAILABLE

       The substring number, though not greater than the number of captures in
       the pattern, is greater than the number of slots in the ovector, so the
       substring could not be captured.

         PCRE2_ERROR_UNSET

       The substring did not participate in the match.  For  example,  if  the
       pattern  is  (abc)|(def) and the subject is "def", and the ovector con-
       tains at least two capturing slots, substring number 1 is unset.
 | 
						|
 | 
						|
 | 
						|
EXTRACTING A LIST OF ALL CAPTURED SUBSTRINGS

       int pcre2_substring_list_get(pcre2_match_data *match_data,
         PCRE2_UCHAR ***listptr, PCRE2_SIZE **lengthsptr);

       void pcre2_substring_list_free(PCRE2_SPTR *list);

       The pcre2_substring_list_get() function  extracts  all  available  sub-
       strings  and  builds  a  list of pointers to them. It also (optionally)
       builds a second list that  contains  their  lengths  (in  code  units),
       excluding a terminating zero that is added to each of them. All this is
       done in a single block of memory that is obtained using the same memory
       allocation function that was used to get the match data block.

       This  function  must be called only after a successful match. If called
       after a partial match, the error code PCRE2_ERROR_PARTIAL is returned.

       The address of the memory block is returned via listptr, which is  also
       the start of the list of string pointers. The end of the list is marked
       by a NULL pointer. The address of the list of lengths is  returned  via
       lengthsptr.  If your strings do not contain binary zeros and you do not
       therefore need the lengths, you may supply NULL as the lengthsptr argu-
       ment  to  disable  the  creation of a list of lengths. The yield of the
       function is zero if all went well, or PCRE2_ERROR_NOMEMORY if the  mem-
       ory  block could not be obtained. When the list is no longer needed, it
       should be freed by calling pcre2_substring_list_free().

       If this function encounters a substring that is unset, which can happen
       when  capturing subpattern number n+1 matches some part of the subject,
       but subpattern n has not been used at all, it returns an empty  string.
       This  can  be  distinguished  from  a  genuine zero-length substring by
       inspecting the appropriate offset in the ovector, which contains
       PCRE2_UNSET for unset substrings, or by calling
       pcre2_substring_length_bynumber().
 | 
						|
 | 
						|
 | 
						|
EXTRACTING CAPTURED SUBSTRINGS BY NAME

       int pcre2_substring_number_from_name(const pcre2_code *code,
         PCRE2_SPTR name);

       int pcre2_substring_length_byname(pcre2_match_data *match_data,
         PCRE2_SPTR name, PCRE2_SIZE *length);

       int pcre2_substring_copy_byname(pcre2_match_data *match_data,
         PCRE2_SPTR name, PCRE2_UCHAR *buffer, PCRE2_SIZE *bufflen);

       int pcre2_substring_get_byname(pcre2_match_data *match_data,
         PCRE2_SPTR name, PCRE2_UCHAR **bufferptr, PCRE2_SIZE *bufflen);

       void pcre2_substring_free(PCRE2_UCHAR *buffer);

       To extract a substring by name, you first have to find  the  associated
       number.  For example, for this pattern:

         (a+)b(?<xxx>\d+)...

       the number of the subpattern called "xxx" is 2. If the name is known to
       be unique (PCRE2_DUPNAMES was not set), you can find  the  number  from
       the name by calling pcre2_substring_number_from_name(). The first argu-
       ment is the compiled pattern, and the second is the name. The yield  of
       the function is the subpattern number, PCRE2_ERROR_NOSUBSTRING if there
       is no subpattern of  that  name,  or  PCRE2_ERROR_NOUNIQUESUBSTRING  if
       there  is  more than one subpattern of that name. Given the number, you
       can extract the  substring  directly,  or  use  one  of  the  functions
       described above.

       For  convenience,  there are also "byname" functions that correspond to
       the "bynumber" functions, the only difference  being  that  the  second
       argument  is  a  name instead of a number. If PCRE2_DUPNAMES is set and
       there are duplicate names, these functions scan all the groups with the
       given name, and return the first named string that is set.

       If  there are no groups with the given name, PCRE2_ERROR_NOSUBSTRING is
       returned. If all groups with the name have  numbers  that  are  greater
       than  the  number  of  slots in the ovector, PCRE2_ERROR_UNAVAILABLE is
       returned. If there is at least one group with a slot  in  the  ovector,
       but no group is found to be set, PCRE2_ERROR_UNSET is returned.

       Warning: If the pattern uses the (?| feature to set up multiple subpat-
       terns with the same number, as described in the  section  on  duplicate
       subpattern  numbers  in  the pcre2pattern page, you cannot use names to
       distinguish the different subpatterns, because names are  not  included
       in  the compiled code. The matching process uses only numbers. For this
       reason, the use of different names for subpatterns of the  same  number
       causes an error at compile time.
 | 
						|
 | 
						|
 | 
						|
CREATING A NEW STRING WITH SUBSTITUTIONS

       int pcre2_substitute(const pcre2_code *code, PCRE2_SPTR subject,
         PCRE2_SIZE length, PCRE2_SIZE startoffset,
         uint32_t options, pcre2_match_data *match_data,
         pcre2_match_context *mcontext, PCRE2_SPTR replacement,
         PCRE2_SIZE rlength, PCRE2_UCHAR *outputbuffer,
         PCRE2_SIZE *outlengthptr);

       This  function calls pcre2_match() and then makes a copy of the subject
       string in outputbuffer, replacing the part that was  matched  with  the
       replacement  string,  whose  length is supplied in rlength. This can be
       given as PCRE2_ZERO_TERMINATED for a zero-terminated string. Matches in
       which  a  \K item in a lookahead in the pattern causes the match to end
       before it starts are not supported, and give rise to an error return.

       The first seven arguments of pcre2_substitute() are  the  same  as  for
       pcre2_match(), except that the partial matching options are not permit-
       ted, and match_data may be passed as NULL, in which case a  match  data
       block  is obtained and freed within this function, using memory manage-
       ment functions from the match context, if provided, or else those  that
       were used to allocate memory for the compiled code.

       The  outlengthptr  argument  must point to a variable that contains the
       length, in code units, of the output buffer. If the  function  is  suc-
       cessful,  the value is updated to contain the length of the new string,
       excluding the trailing zero that is automatically added.

       If the function is not  successful,  the  value  set  via  outlengthptr
       depends  on  the  type  of  error. For syntax errors in the replacement
       string, the value is the offset in the  replacement  string  where  the
       error  was  detected.  For  other  errors,  the value is PCRE2_UNSET by
       default. This includes the case of the output buffer being  too  small,
       unless  PCRE2_SUBSTITUTE_OVERFLOW_LENGTH  is  set (see below), in which
       case the value is the minimum length needed, including  space  for  the
       trailing  zero.  Note  that  in  order  to compute the required length,
       pcre2_substitute() has  to  simulate  all  the  matching  and  copying,
       instead of giving an error return as soon as the buffer overflows. Note
       also that the length is in code units, not bytes.

       In the replacement string, which is interpreted as a UTF string in  UTF
       mode,  and  is  checked  for UTF validity unless the PCRE2_NO_UTF_CHECK
       option is set, a dollar character is an escape character that can spec-
       ify  the insertion of characters from capturing groups or (*MARK) items
       in the pattern. The following forms are always recognized:

         $$                  insert a dollar character
         $<n> or ${<n>}      insert the contents of group <n>
         $*MARK or ${*MARK}  insert the name of the last (*MARK) encountered

       Either a group number or a group name  can  be  given  for  <n>.  Curly
       brackets  are  required only if the following character would be inter-
       preted as part of the number or name. The number may be zero to include
       the  entire  matched  string.   For  example,  if  the pattern a(b)c is
       matched with "=abc=" and the replacement string "+$1$0$1+", the  result
       is "=+babcb+=".

       The facility for inserting a (*MARK) name can be used to perform simple
       simultaneous substitutions, as this pcre2test example shows:

         /(*:pear)apple|(*:orange)lemon/g,replace=${*MARK}
             apple lemon
          2: pear orange
 | 
						|
 | 
						|
       As well as the usual options for pcre2_match(), a number of  additional
       options can be set in the options argument.

       PCRE2_SUBSTITUTE_GLOBAL causes the function to iterate over the subject
       string, replacing every matching substring. If this is  not  set,  only
       the  first matching substring is replaced. If any matched substring has
       zero length, after the substitution has happened, an attempt to find  a
       non-empty  match at the same position is performed. If this is not suc-
       cessful, the current position is advanced by one character except  when
       CRLF  is  a  valid newline sequence and the next two characters are CR,
       LF. In this case, the current position is advanced by two characters.

       PCRE2_SUBSTITUTE_OVERFLOW_LENGTH changes what happens when  the  output
       buffer is too small. The default action is to return PCRE2_ERROR_NOMEM-
       ORY immediately. If this option  is  set,  however,  pcre2_substitute()
       continues to go through the motions of matching and substituting (with-
       out, of course, writing anything) in order to compute the size of  buf-
       fer  that  is  needed.  This  value is passed back via the outlengthptr
       variable,   with   the   result   of   the   function    still    being
       PCRE2_ERROR_NOMEMORY.

       Passing  a  buffer  size  of zero is a permitted way of finding out how
       much memory is needed for a given substitution. However, this does mean
       that the entire operation is carried out twice. Depending on the appli-
       cation, it may be more efficient to allocate a large  buffer  and  free
       the   excess   afterwards,   instead  of  using  PCRE2_SUBSTITUTE_OVER-
       FLOW_LENGTH.
 | 
						|
 | 
						|
       PCRE2_SUBSTITUTE_UNKNOWN_UNSET causes references  to  capturing  groups
       that  do  not appear in the pattern to be treated as unset groups. This
       option should be used with care, because it means  that  a  typo  in  a
       group  name  or  number  no  longer  causes the PCRE2_ERROR_NOSUBSTRING
       error.

       PCRE2_SUBSTITUTE_UNSET_EMPTY causes unset capturing  groups  (including
       unknown  groups  when  PCRE2_SUBSTITUTE_UNKNOWN_UNSET  is  set)  to  be
       treated as empty strings when inserted  as  described  above.  If  this
       option  is  not  set,  an  attempt  to insert an unset group causes the
       PCRE2_ERROR_UNSET error. This option does not  influence  the  extended
       substitution syntax described below.

       PCRE2_SUBSTITUTE_EXTENDED  causes extra processing to be applied to the
       replacement string. Without this option, only the dollar  character  is
       special,  and  only  the  group insertion forms listed above are valid.
       When PCRE2_SUBSTITUTE_EXTENDED is set, two things change:

       Firstly, backslash in a replacement string is interpreted as an  escape
       character. The usual forms such as \n or \x{ddd} can be used to specify
       particular character codes, and backslash followed by any  non-alphanu-
       meric  character  quotes  that character. Extended quoting can be coded
       using \Q...\E, exactly as in pattern strings.

       There are also four escape sequences for forcing the case  of  inserted
       letters.   The  insertion  mechanism has three states: no case forcing,
       force upper case, and force lower case. The escape sequences change the
       current state: \U and \L change to upper or lower case forcing, respec-
       tively, and \E (when not terminating a \Q quoted sequence)  reverts  to
       no  case  forcing. The sequences \u and \l force the next character (if
       it is a letter) to upper or lower  case,  respectively,  and  then  the
       state automatically reverts to no case forcing. Case forcing applies to
       all inserted  characters, including those from captured groups and let-
       ters within \Q...\E quoted sequences.

       Note that case forcing sequences such as \U...\E do not nest. For exam-
       ple, the result of processing "\Uaa\LBB\Ecc\E" is "AAbbcc";  the  final
       \E has no effect.

       The  second  effect of setting PCRE2_SUBSTITUTE_EXTENDED is to add more
       flexibility to group substitution. The syntax is similar to  that  used
       by Bash:

         ${<n>:-<string>}
         ${<n>:+<string1>:<string2>}

       As  before,  <n> may be a group number or a name. The first form speci-
       fies a default value. If group <n> is set, its value  is  inserted;  if
       not,  <string>  is  expanded  and  the result inserted. The second form
       specifies strings that are expanded and inserted when group <n> is  set
       or  unset,  respectively. The first form is just a convenient shorthand
       for

         ${<n>:+${<n>}:<string>}

       Backslash can be used to escape colons and closing  curly  brackets  in
       the  replacement  strings.  A change of the case forcing state within a
       replacement string remains  in  force  afterwards,  as  shown  in  this
       pcre2test example:

         /(some)?(body)/substitute_extended,replace=${1:+\U:\L}HeLLo
             body
          1: hello
             somebody
          1: HELLO

       The  PCRE2_SUBSTITUTE_UNSET_EMPTY option does not affect these extended
       substitutions.  However,  PCRE2_SUBSTITUTE_UNKNOWN_UNSET   does   cause
       unknown groups in the extended syntax forms to be treated as unset.
 | 
						|
 | 
						|
       If  successful,  pcre2_substitute()  returns the number of replacements
       that were made. This may be zero if no matches were found, and is never
       greater than 1 unless PCRE2_SUBSTITUTE_GLOBAL is set.

       In the event of an error, a negative error code is returned. Except for
       PCRE2_ERROR_NOMATCH   (which   is   never   returned),   errors    from
       pcre2_match() are passed straight back.

       PCRE2_ERROR_NOSUBSTRING is returned for a non-existent substring inser-
       tion, unless PCRE2_SUBSTITUTE_UNKNOWN_UNSET is set.

       PCRE2_ERROR_UNSET is returned for an unset substring insertion (includ-
       ing  an  unknown  substring when PCRE2_SUBSTITUTE_UNKNOWN_UNSET is set)
       when  the  simple  (non-extended)  syntax  is  used  and  PCRE2_SUBSTI-
       TUTE_UNSET_EMPTY is not set.

       PCRE2_ERROR_NOMEMORY  is  returned  if  the  output  buffer  is not big
       enough. If the PCRE2_SUBSTITUTE_OVERFLOW_LENGTH option is set, the size
       of  buffer  that is needed is returned via outlengthptr. Note that this
       does not happen by default.

       PCRE2_ERROR_BADREPLACEMENT is used for miscellaneous syntax  errors  in
       the   replacement   string,   with   more   particular   errors   being
       PCRE2_ERROR_BADREPESCAPE     (invalid      escape      sequence),
       PCRE2_ERROR_REPMISSINGBRACE   (closing  curly  bracket  not  found),
       PCRE2_ERROR_BADSUBSTITUTION (syntax error in extended group substitu-
       tion), and PCRE2_ERROR_BADSUBSPATTERN (the pattern match  ended  before
       it started, which can happen if \K is used in an assertion).

       As for all PCRE2 errors, a text message that describes the error can be
       obtained   by   calling  the  pcre2_get_error_message()  function  (see
       "Obtaining a textual error message" above).


DUPLICATE SUBPATTERN NAMES

       int pcre2_substring_nametable_scan(const pcre2_code *code,
         PCRE2_SPTR name, PCRE2_SPTR *first, PCRE2_SPTR *last);

       When a pattern is compiled with the PCRE2_DUPNAMES  option,  names  for
       subpatterns  are  not required to be unique. Duplicate names are always
       allowed for subpatterns with the same number, created by using the  (?|
       feature.  Indeed,  if  such subpatterns are named, they are required to
       use the same names.

       Normally, patterns with duplicate names are such that in any one match,
       only  one of the named subpatterns participates. An example is shown in
       the pcre2pattern documentation.

       When  duplicates   are   present,   pcre2_substring_copy_byname()   and
       pcre2_substring_get_byname()  return  the first substring corresponding
       to the given name that is set. Only if none are set is
       PCRE2_ERROR_UNSET returned. The pcre2_substring_number_from_name()
       function returns the error PCRE2_ERROR_NOUNIQUESUBSTRING when there are
       duplicate names.

       If  you want to get full details of all captured substrings for a given
       name, you must use the pcre2_substring_nametable_scan()  function.  The
       first  argument is the compiled pattern, and the second is the name. If
       the third and fourth arguments are NULL, the function returns  a  group
       number for a unique name, or PCRE2_ERROR_NOUNIQUESUBSTRING otherwise.

       When the third and fourth arguments are not NULL, they must be pointers
       to variables that are updated by the function. After it has  run,  they
       point to the first and last entries in the name-to-number table for the
       given name, and the function returns the length of each entry  in  code
       units.  In both cases, PCRE2_ERROR_NOSUBSTRING is returned if there are
       no entries for the given name.

       The format of the name table is described above in the section entitled
       Information  about  a  pattern.  Given all the relevant entries for the
       name, you can extract each of their numbers,  and  hence  the  captured
       data.
 | 
						|
 | 
						|
 | 
						|
FINDING ALL POSSIBLE MATCHES AT ONE POSITION

       The  traditional  matching  function  uses a similar algorithm to Perl,
       which stops when it finds the first match at a given point in the  sub-
       ject. If you want to find all possible matches, or the longest possible
       match at a given position,  consider  using  the  alternative  matching
       function  (see  below) instead. If you cannot use the alternative func-
       tion, you can kludge it up by making use of the callout facility, which
       is described in the pcre2callout documentation.

       What you have to do is to insert a callout right at the end of the pat-
       tern.  When your callout function is called, extract and save the  cur-
       rent  matched  substring.  Then return 1, which forces pcre2_match() to
       backtrack and try other alternatives. Ultimately, when it runs  out  of
       matches, pcre2_match() will yield PCRE2_ERROR_NOMATCH.
 | 
						|
 | 
						|
 | 
						|
MATCHING A PATTERN: THE ALTERNATIVE FUNCTION

       int pcre2_dfa_match(const pcre2_code *code, PCRE2_SPTR subject,
         PCRE2_SIZE length, PCRE2_SIZE startoffset,
         uint32_t options, pcre2_match_data *match_data,
         pcre2_match_context *mcontext,
         int *workspace, PCRE2_SIZE wscount);

       The  function  pcre2_dfa_match()  is  called  to match a subject string
       against a compiled pattern, using a matching algorithm that  scans  the
       subject  string  just  once, and does not backtrack. This has different
       characteristics to the normal algorithm, and  is  not  compatible  with
       Perl.  Some of the features of PCRE2 patterns are not supported. Never-
       theless, there are times when this kind of matching can be useful.  For
       a  discussion  of  the  two matching algorithms, and a list of features
       that pcre2_dfa_match() does not support, see the pcre2matching documen-
       tation.

       The  arguments  for  the pcre2_dfa_match() function are the same as for
       pcre2_match(), plus two extras. The ovector within the match data block
       is used in a different way, and this is described below. The other com-
       mon arguments are used in the same way as for pcre2_match(),  so  their
       description is not repeated here.

       The  two  additional  arguments provide workspace for the function. The
       workspace vector should contain at least 20 elements. It  is  used  for
       keeping  track  of  multiple  paths  through  the  pattern  tree.  More
       workspace is needed for patterns and subjects where there are a lot  of
       potential matches.

       Here is an example of a simple call to pcre2_dfa_match():

         int wspace[20];
         pcre2_match_data *md = pcre2_match_data_create(4, NULL);
         int rc = pcre2_dfa_match(
           re,             /* result of pcre2_compile() */
           (PCRE2_SPTR)"some string",  /* the subject string */
           11,             /* the length of the subject string */
           0,              /* start at offset 0 in the subject */
           0,              /* default options */
           md,             /* the match data block */
           NULL,           /* a match context; NULL means use defaults */
           wspace,         /* working space vector */
           20);            /* number of elements (NOT size in bytes) */
 | 
						|
 | 
						|
   Option bits for pcre_dfa_match()
 | 
						|
 | 
						|
       The  unused  bits of the options argument for pcre2_dfa_match() must be
 | 
						|
       zero. The only bits that may be set are  PCRE2_ANCHORED,  PCRE2_NOTBOL,
 | 
						|
       PCRE2_NOTEOL,          PCRE2_NOTEMPTY,          PCRE2_NOTEMPTY_ATSTART,
 | 
						|
       PCRE2_NO_UTF_CHECK,       PCRE2_PARTIAL_HARD,       PCRE2_PARTIAL_SOFT,
 | 
						|
       PCRE2_DFA_SHORTEST,  and  PCRE2_DFA_RESTART.  All  but the last four of
 | 
						|
       these are exactly the same as for pcre2_match(), so  their  description
 | 
						|
       is not repeated here.
 | 
						|
 | 
						|
         PCRE2_PARTIAL_HARD
 | 
						|
         PCRE2_PARTIAL_SOFT
 | 
						|
 | 
						|
       These  have  the  same general effect as they do for pcre2_match(), but
 | 
						|
       the details are slightly different. When PCRE2_PARTIAL_HARD is set  for
 | 
						|
       pcre2_dfa_match(),  it  returns  PCRE2_ERROR_PARTIAL  if the end of the
 | 
						|
       subject is reached and there is still at least one matching possibility
 | 
						|
       that requires additional characters. This happens even if some complete
 | 
						|
       matches have already been found. When PCRE2_PARTIAL_SOFT  is  set,  the
 | 
						|
       return  code  PCRE2_ERROR_NOMATCH is converted into PCRE2_ERROR_PARTIAL
 | 
						|
       if the end of the subject is  reached,  there  have  been  no  complete
 | 
						|
       matches, but there is still at least one matching possibility. The por-
 | 
						|
       tion of the string that was inspected when the  longest  partial  match
 | 
						|
       was found is set as the first matching string in both cases. There is a
 | 
						|
       more detailed discussion of partial and  multi-segment  matching,  with
 | 
						|
       examples, in the pcre2partial documentation.
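
       The difference between the two partial-matching options can be  sum-
       marized  as a small decision function. This is only a sketch of the
       rules described above, not code from the library: the type and func-
       tion  names  (partial_mode,  dfa_outcome,  dfa_partial_outcome) are
       hypothetical, and the two inputs stand for facts  that  the  matcher
       knows internally at the end of a scan.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names for this sketch; they are not PCRE2 identifiers. */
typedef enum { PARTIAL_NONE, PARTIAL_SOFT, PARTIAL_HARD } partial_mode;
typedef enum { OUTCOME_MATCH, OUTCOME_NOMATCH, OUTCOME_PARTIAL } dfa_outcome;

/*
 * complete_matches: number of complete matches found during the scan.
 * pending_at_end:   true if the end of the subject was reached while at
 *                   least one possibility still needed more characters.
 */
static dfa_outcome dfa_partial_outcome(partial_mode mode,
                                       int complete_matches,
                                       bool pending_at_end)
{
    assert(complete_matches >= 0);

    /* Hard: a pending possibility wins even over complete matches. */
    if (mode == PARTIAL_HARD && pending_at_end)
        return OUTCOME_PARTIAL;

    if (complete_matches > 0)
        return OUTCOME_MATCH;

    /* Soft: only a would-be "no match" is converted to "partial". */
    if (mode == PARTIAL_SOFT && pending_at_end)
        return OUTCOME_PARTIAL;

    return OUTCOME_NOMATCH;
}
```

       For  example,  with  two  complete matches and a pending possibility,
       the sketch yields OUTCOME_PARTIAL for PARTIAL_HARD but  OUTCOME_MATCH
       for PARTIAL_SOFT.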
 | 
						|
 | 
						|
         PCRE2_DFA_SHORTEST
 | 
						|
 | 
						|
       Setting  the PCRE2_DFA_SHORTEST option causes the matching algorithm to
 | 
						|
       stop as soon as it has found one match. Because of the way the alterna-
 | 
						|
       tive  algorithm  works, this is necessarily the shortest possible match
 | 
						|
       at the first possible matching point in the subject string.
 | 
						|
 | 
						|
         PCRE2_DFA_RESTART
 | 
						|
 | 
						|
       When pcre2_dfa_match() returns a partial match, it is possible to  call
 | 
						|
       it again, with additional subject characters, and have it continue with
 | 
						|
       the same match. The PCRE2_DFA_RESTART option requests this action; when
 | 
						|
       it is set,  the workspace and wscount arguments must reference the same
 | 
						|
       vector as before because data about the match so far is  left  in  them
 | 
						|
       after a partial match. There is more discussion of this facility in the
 | 
						|
       pcre2partial documentation.
 | 
						|
 | 
						|
   Successful returns from pcre2_dfa_match()
 | 
						|
 | 
						|
       When pcre2_dfa_match() succeeds, it may have matched more than one sub-
 | 
						|
       string in the subject. Note, however, that all the matches from one run
 | 
						|
       of the function start at the same point in  the  subject.  The  shorter
 | 
						|
       matches  are all initial substrings of the longer matches. For example,
 | 
						|
       if the pattern
 | 
						|
 | 
						|
         <.*>
 | 
						|
 | 
						|
       is matched against the string
 | 
						|
 | 
						|
         This is <something> <something else> <something further> no more
 | 
						|
 | 
						|
       the three matched strings are
 | 
						|
 | 
						|
         <something> <something else> <something further>
 | 
						|
         <something> <something else>
 | 
						|
         <something>
 | 
						|
 | 
						|
       On success, the yield of the function is a number  greater  than  zero,
 | 
						|
       which  is  the  number  of  matched substrings. The offsets of the sub-
 | 
						|
       strings are returned in the ovector, and can be extracted by number  in
 | 
						|
       the  same way as for pcre2_match(), but the numbers bear no relation to
 | 
						|
       any capturing groups that may exist in the pattern, because DFA  match-
 | 
						|
       ing does not support group capture.
 | 
						|
 | 
						|
       Calls  to  the  convenience  functions  that extract substrings by name
 | 
						|
       return the error PCRE2_ERROR_DFA_UFUNC (unsupported function)  if  used
 | 
						|
       after a DFA match. The convenience functions that extract substrings by
 | 
						|
       number never return PCRE2_ERROR_NOSUBSTRING, and the meanings  of  some
 | 
						|
       other errors are slightly different:
 | 
						|
 | 
						|
         PCRE2_ERROR_UNAVAILABLE
 | 
						|
 | 
						|
       The ovector is not big enough to include a slot for the given substring
 | 
						|
       number.
 | 
						|
 | 
						|
         PCRE2_ERROR_UNSET
 | 
						|
 | 
						|
       There is a slot in the ovector  for  this  substring,  but  there  were
 | 
						|
       insufficient matches to fill it.
 | 
						|
 | 
						|
       The  matched  strings  are  stored  in  the ovector in reverse order of
 | 
						|
       length; that is, the longest matching string is first.  If  there  were
 | 
						|
       too  many matches to fit into the ovector, the yield of the function is
 | 
						|
       zero, and the vector is filled with the longest matches.
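
       The following sketch models how the  results  described  above  might
       be  read.  It  does not call PCRE2; it assumes only that the ovector
       holds pairs of start and end offsets, longest  match  first,  as  for
       pcre2_match().  The names slot_status and dfa_slot are hypothetical,
       and the status values stand in for  PCRE2_ERROR_UNAVAILABLE  and
       PCRE2_ERROR_UNSET.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical status codes for this sketch. */
typedef enum { SLOT_OK, SLOT_UNAVAILABLE, SLOT_UNSET } slot_status;

/*
 * ovector: offset pairs (start, end), longest match first.
 * pairs:   number of pairs the ovector can hold.
 * rc:      non-negative return value of the DFA match; zero means the
 *          ovector was too small and every slot holds a match.
 * n:       number of the wanted match, counting from zero.
 */
static slot_status dfa_slot(const size_t *ovector, size_t pairs, int rc,
                            size_t n, size_t *start, size_t *length)
{
    assert(rc >= 0);
    size_t filled = (rc == 0) ? pairs : (size_t)rc;

    if (n >= pairs)
        return SLOT_UNAVAILABLE;   /* no slot for this number */
    if (n >= filled)
        return SLOT_UNSET;         /* slot exists, but too few matches */

    *start = ovector[2 * n];
    *length = ovector[2 * n + 1] - ovector[2 * n];
    return SLOT_OK;
}
```

       For the <.*> example above, the three matches all start at  offset  8
       and  end at offsets 56, 36, and 19, so slot 0 has length 48 and slot
       2 has length 11. With a four-pair ovector, slot 3 is unset and  slot
       4 is unavailable.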
 | 
						|
 | 
						|
       NOTE: PCRE2's "auto-possessification" optimization usually  applies  to
 | 
						|
       character  repeats at the end of a pattern (as well as internally). For
 | 
						|
       example, the pattern "a\d+" is compiled as if it were "a\d++". For  DFA
 | 
						|
       matching,  this  means  that  only  one possible match is found. If you
 | 
						|
       really do want multiple matches in such cases, either use  an  ungreedy
 | 
						|
       repeat  such  as  "a\d+?"  or set the PCRE2_NO_AUTO_POSSESS option when
 | 
						|
       compiling.
 | 
						|
 | 
						|
   Error returns from pcre2_dfa_match()
 | 
						|
 | 
						|
       The pcre2_dfa_match() function returns a negative number when it fails.
 | 
						|
       Many  of  the  errors  are  the same as for pcre2_match(), as described
 | 
						|
       above.  There are in addition the following errors that are specific to
 | 
						|
       pcre2_dfa_match():
 | 
						|
 | 
						|
         PCRE2_ERROR_DFA_UITEM
 | 
						|
 | 
						|
       This  return  is  given  if pcre2_dfa_match() encounters an item in the
 | 
						|
       pattern that it does not support, for instance, the use of \C in a  UTF
 | 
						|
       mode or a back reference.
 | 
						|
 | 
						|
         PCRE2_ERROR_DFA_UCOND
 | 
						|
 | 
						|
       This  return  is given if pcre2_dfa_match() encounters a condition item
 | 
						|
       that uses a back reference for the condition, or a test  for  recursion
 | 
						|
       in a specific group. These are not supported.
 | 
						|
 | 
						|
         PCRE2_ERROR_DFA_WSSIZE
 | 
						|
 | 
						|
       This  return  is  given  if  pcre2_dfa_match() runs out of space in the
 | 
						|
       workspace vector.
 | 
						|
 | 
						|
         PCRE2_ERROR_DFA_RECURSE
 | 
						|
 | 
						|
       When a recursive subpattern is processed, the matching  function  calls
 | 
						|
       itself recursively, using private memory for the ovector and workspace.
 | 
						|
       This error is given if the internal ovector is not large  enough.  This
 | 
						|
       should be extremely rare, as a vector of size 1000 is used.
 | 
						|
 | 
						|
         PCRE2_ERROR_DFA_BADRESTART
 | 
						|
 | 
						|
       When  pcre2_dfa_match()  is  called  with the PCRE2_DFA_RESTART option,
 | 
						|
       some plausibility checks are made on the  contents  of  the  workspace,
 | 
						|
       which  should  contain data about the previous partial match. If any of
 | 
						|
       these checks fail, this error is given.
 | 
						|
 | 
						|
 | 
						|
SEE ALSO
 | 
						|
 | 
						|
       pcre2build(3),   pcre2callout(3),    pcre2demo(3),    pcre2matching(3),
 | 
						|
       pcre2partial(3),    pcre2posix(3),    pcre2sample(3),    pcre2stack(3),
 | 
						|
       pcre2unicode(3).
 | 
						|
 | 
						|
 | 
						|
AUTHOR
 | 
						|
 | 
						|
       Philip Hazel
 | 
						|
       University Computing Service
 | 
						|
       Cambridge, England.
 | 
						|
 | 
						|
 | 
						|
REVISION
 | 
						|
 | 
						|
       Last updated: 23 December 2016
 | 
						|
       Copyright (c) 1997-2016 University of Cambridge.
 | 
						|
------------------------------------------------------------------------------
 | 
						|
 | 
						|
 | 
						|
PCRE2BUILD(3)              Library Functions Manual              PCRE2BUILD(3)
 | 
						|
 | 
						|
 | 
						|
 | 
						|
NAME
 | 
						|
       PCRE2 - Perl-compatible regular expressions (revised API)
 | 
						|
 | 
						|
BUILDING PCRE2
 | 
						|
 | 
						|
       PCRE2  is distributed with a configure script that can be used to build
 | 
						|
       the library in Unix-like environments using the applications  known  as
 | 
						|
       Autotools. Also in the distribution are files to support building using
 | 
						|
       CMake instead of configure.  The  text  file  README  contains  general
 | 
						|
       information  about  building  with Autotools (some of which is repeated
 | 
						|
       below), and also has some comments about building on various  operating
 | 
						|
       systems.  There  is a lot more information about building PCRE2 without
 | 
						|
       using Autotools (including information about using CMake  and  building
 | 
						|
       "by  hand")  in  the  text file called NON-AUTOTOOLS-BUILD.  You should
 | 
						|
       consult this file as well as the README file if you are building  in  a
 | 
						|
       non-Unix-like environment.
 | 
						|
 | 
						|
 | 
						|
PCRE2 BUILD-TIME OPTIONS
 | 
						|
 | 
						|
       The rest of this document describes the optional features of PCRE2 that
 | 
						|
       can be selected when the library is compiled. It  assumes  use  of  the
 | 
						|
       configure  script,  where  the  optional features are selected or dese-
 | 
						|
       lected by providing options to configure before running the  make  com-
 | 
						|
       mand.  However,  the same options can be selected in both Unix-like and
 | 
						|
       non-Unix-like environments if you are using CMake instead of  configure
 | 
						|
       to build PCRE2.
 | 
						|
 | 
						|
       If  you  are not using Autotools or CMake, option selection can be done
 | 
						|
       by editing the config.h file, or by passing parameter settings  to  the
 | 
						|
       compiler, as described in NON-AUTOTOOLS-BUILD.
 | 
						|
 | 
						|
       The complete list of options for configure (which includes the standard
 | 
						|
       ones such as the  selection  of  the  installation  directory)  can  be
 | 
						|
       obtained by running
 | 
						|
 | 
						|
         ./configure --help
 | 
						|
 | 
						|
       The  following  sections  include  descriptions  of options whose names
 | 
						|
       begin with --enable or --disable. These settings specify changes to the
 | 
						|
       defaults  for  the configure command. Because of the way that configure
 | 
						|
       works, --enable and --disable always come in pairs, so  the  complemen-
 | 
						|
       tary  option always exists as well, but as it specifies the default, it
 | 
						|
       is not described.
 | 
						|
 | 
						|
 | 
						|
BUILDING 8-BIT, 16-BIT AND 32-BIT LIBRARIES
 | 
						|
 | 
						|
       By default, a library called libpcre2-8 is built, containing  functions
 | 
						|
       that  take  string arguments contained in vectors of bytes, interpreted
 | 
						|
       either as single-byte characters, or UTF-8 strings. You can also  build
 | 
						|
       two  other libraries, called libpcre2-16 and libpcre2-32, which process
 | 
						|
       strings that are contained in vectors of 16-bit and 32-bit code  units,
 | 
						|
       respectively. These can be interpreted either as single-unit characters
 | 
						|
       or UTF-16/UTF-32 strings. To build these additional libraries, add  one
 | 
						|
       or both of the following to the configure command:
 | 
						|
 | 
						|
         --enable-pcre2-16
 | 
						|
         --enable-pcre2-32
 | 
						|
 | 
						|
       If you do not want the 8-bit library, add
 | 
						|
 | 
						|
         --disable-pcre2-8
 | 
						|
 | 
						|
       as  well.  At least one of the three libraries must be built. Note that
 | 
						|
       the POSIX wrapper is for the 8-bit library only, and that pcre2grep  is
 | 
						|
       an  8-bit  program.  Neither  of  these is built if you select only the
 | 
						|
       16-bit or 32-bit libraries.
 | 
						|
 | 
						|
 | 
						|
BUILDING SHARED AND STATIC LIBRARIES
 | 
						|
 | 
						|
       The Autotools PCRE2 building process uses libtool to build both  shared
 | 
						|
       and  static  libraries by default. You can suppress an unwanted library
 | 
						|
       by adding one of
 | 
						|
 | 
						|
         --disable-shared
 | 
						|
         --disable-static
 | 
						|
 | 
						|
       to the configure command.
 | 
						|
 | 
						|
 | 
						|
UNICODE AND UTF SUPPORT
 | 
						|
 | 
						|
       By default, PCRE2 is built with support for Unicode and  UTF  character
 | 
						|
       strings.  To build it without Unicode support, add
 | 
						|
 | 
						|
         --disable-unicode
 | 
						|
 | 
						|
       to  the configure command. This setting applies to all three libraries.
 | 
						|
       It is not possible to build  one  library  with  Unicode  support,  and
 | 
						|
       another without, in the same configuration.
 | 
						|
 | 
						|
       Of  itself, Unicode support does not make PCRE2 treat strings as UTF-8,
 | 
						|
       UTF-16 or UTF-32. To do that, applications that use the library can set
 | 
						|
       the  PCRE2_UTF  option when they call pcre2_compile() to compile a pat-
 | 
						|
       tern.  Alternatively, patterns may be started with  (*UTF)  unless  the
 | 
						|
       application has locked this out by setting PCRE2_NEVER_UTF.
 | 
						|
 | 
						|
       UTF support allows the libraries to process character code points up to
 | 
						|
       0x10ffff in the strings that they handle. It also provides support  for
 | 
						|
       accessing  the  Unicode  properties  of  such characters, using pattern
 | 
						|
       escapes such as \P, \p, and \X. Only the  general  category  properties
 | 
						|
       such  as Lu and Nd are supported. Details are given in the pcre2pattern
 | 
						|
       documentation.
 | 
						|
 | 
						|
       Pattern escapes such as \d and \w do not by default make use of Unicode
 | 
						|
       properties.  The  application  can  request that they do by setting the
 | 
						|
       PCRE2_UCP option. Unless the application  has  set  PCRE2_NEVER_UCP,  a
 | 
						|
       pattern may also request this by starting with (*UCP).
 | 
						|
 | 
						|
 | 
						|
DISABLING THE USE OF \C
 | 
						|
 | 
						|
       The \C escape sequence, which matches a single code unit, even in a UTF
 | 
						|
       mode, can cause unpredictable behaviour because it may leave  the  cur-
 | 
						|
       rent  matching  point in the middle of a multi-code-unit character. The
 | 
						|
       application can lock it  out  by  setting  the  PCRE2_NEVER_BACKSLASH_C
 | 
						|
       option when calling pcre2_compile(). There is also a build-time option
 | 
						|
 | 
						|
         --enable-never-backslash-C
 | 
						|
 | 
						|
       (note the upper case C) which locks out the use of \C entirely.
 | 
						|
 | 
						|
 | 
						|
JUST-IN-TIME COMPILER SUPPORT
 | 
						|
 | 
						|
       Just-in-time compiler support is included in the build by specifying
 | 
						|
 | 
						|
         --enable-jit
 | 
						|
 | 
						|
       This  support  is available only for certain hardware architectures. If
 | 
						|
       this option is set for an unsupported architecture,  a  building  error
 | 
						|
       occurs.   See the pcre2jit documentation for a discussion of JIT usage.
 | 
						|
       When JIT support is enabled, pcre2grep automatically makes use  of  it,
 | 
						|
       unless you add
 | 
						|
 | 
						|
         --disable-pcre2grep-jit
 | 
						|
 | 
						|
       to the "configure" command.
 | 
						|
 | 
						|
 | 
						|
NEWLINE RECOGNITION
 | 
						|
 | 
						|
       By  default, PCRE2 interprets the linefeed (LF) character as indicating
 | 
						|
       the end of a line. This is the normal newline  character  on  Unix-like
 | 
						|
       systems.  You can compile PCRE2 to use carriage return (CR) instead, by
 | 
						|
       adding
 | 
						|
 | 
						|
         --enable-newline-is-cr
 | 
						|
 | 
						|
       to the configure  command.  There  is  also  an  --enable-newline-is-lf
 | 
						|
       option, which explicitly specifies linefeed as the newline character.
 | 
						|
 | 
						|
       Alternatively, you can specify that line endings are to be indicated by
 | 
						|
       the two-character sequence CRLF (CR immediately followed by LF). If you
 | 
						|
       want this, add
 | 
						|
 | 
						|
         --enable-newline-is-crlf
 | 
						|
 | 
						|
       to the configure command. There is a fourth option, specified by
 | 
						|
 | 
						|
         --enable-newline-is-anycrlf
 | 
						|
 | 
						|
       which  causes  PCRE2 to recognize any of the three sequences CR, LF, or
 | 
						|
       CRLF as indicating a line ending. Finally, a fifth option, specified by
 | 
						|
 | 
						|
         --enable-newline-is-any
 | 
						|
 | 
						|
       causes PCRE2 to recognize any Unicode  newline  sequence.  The  Unicode
 | 
						|
       newline sequences are the three just mentioned, plus the single charac-
 | 
						|
       ters VT (vertical tab, U+000B), FF (form feed, U+000C), NEL (next line,
 | 
						|
       U+0085),  LS  (line  separator,  U+2028),  and PS (paragraph separator,
 | 
						|
       U+2029).
 | 
						|
 | 
						|
       Whatever default line ending convention is selected when PCRE2 is built
 | 
						|
       can  be  overridden by applications that use the library. At build time
 | 
						|
       it is conventional to use the standard for your operating system.
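
       The first four conventions can be illustrated  by  a  small  function
       that  reports  the  length of the newline sequence at a given posi-
       tion. This is a sketch for 8-bit, non-UTF subjects only; it is  not
       the  library's  implementation,  the  names are hypothetical, and the
       "any Unicode newline" convention is deliberately left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names; they mirror the configure options, not PCRE2 code. */
typedef enum { NL_CR, NL_LF, NL_CRLF, NL_ANYCRLF } nl_convention;

/*
 * Return the length in bytes (0, 1, or 2) of the newline sequence that
 * starts at s[pos] under the given convention; 0 means "not a newline".
 */
static size_t newline_length(const char *s, size_t len, size_t pos,
                             nl_convention conv)
{
    assert(pos <= len);
    if (pos >= len)
        return 0;

    int cr = (s[pos] == '\r');
    int lf = (s[pos] == '\n');
    int crlf = cr && pos + 1 < len && s[pos + 1] == '\n';

    switch (conv) {
    case NL_CR:      return cr ? 1 : 0;
    case NL_LF:      return lf ? 1 : 0;
    case NL_CRLF:    return crlf ? 2 : 0;
    case NL_ANYCRLF: return crlf ? 2 : (cr || lf) ? 1 : 0;
    }
    return 0;
}
```

       Note that under the CRLF convention a lone CR or a lone LF is not  a
       line ending, whereas under ANYCRLF each of them is.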
 | 
						|
 | 
						|
 | 
						|
WHAT \R MATCHES
 | 
						|
 | 
						|
       By default, the sequence \R in a pattern matches  any  Unicode  newline
 | 
						|
       sequence,  independently  of  what has been selected as the line ending
 | 
						|
       sequence. If you specify
 | 
						|
 | 
						|
         --enable-bsr-anycrlf
 | 
						|
 | 
						|
       the default is changed so that \R matches only CR, LF, or  CRLF.  What-
 | 
						|
       ever  is selected when PCRE2 is built can be overridden by applications
 | 
						|
       that use the library.
 | 
						|
 | 
						|
 | 
						|
HANDLING VERY LARGE PATTERNS
 | 
						|
 | 
						|
       Within a compiled pattern, offset values are used  to  point  from  one
 | 
						|
       part  to another (for example, from an opening parenthesis to an alter-
 | 
						|
       nation metacharacter). By default, in the 8-bit and  16-bit  libraries,
 | 
						|
       two-byte  values  are used for these offsets, leading to a maximum size
 | 
						|
       for a compiled pattern of around 64K code units. This is sufficient  to
 | 
						|
       handle all but the most gigantic patterns. Nevertheless, some people do
 | 
						|
       want to process truly enormous patterns, so it is possible  to  compile
 | 
						|
       PCRE2  to  use three-byte or four-byte offsets by adding a setting such
 | 
						|
       as
 | 
						|
 | 
						|
         --with-link-size=3
 | 
						|
 | 
						|
       to the configure command. The value given must be 2, 3, or 4.  For  the
 | 
						|
       16-bit  library,  a  value of 3 is rounded up to 4. In these libraries,
 | 
						|
       using longer offsets slows down the operation of PCRE2 because  it  has
 | 
						|
       to  load additional data when handling them. For the 32-bit library the
 | 
						|
       value is always 4 and cannot be overridden; the value  of  --with-link-
 | 
						|
       size is ignored.
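
       The rounding rules above can be written down as follows. This  is  a
       sketch  of  the  rules  as  stated in this section, not code from the
       build system; the function names are hypothetical. The second  func-
       tion  is  plain  arithmetic  on the width of an offset field and is
       shown only to illustrate where the figure of about 64K comes from.

```c
#include <assert.h>

/*
 * Return the link size actually used, given the library's code unit
 * width (8, 16, or 32) and the value passed to --with-link-size.
 */
static int effective_link_size(int code_unit_width, int requested)
{
    assert(code_unit_width == 8 || code_unit_width == 16 ||
           code_unit_width == 32);
    assert(requested >= 2 && requested <= 4);

    if (code_unit_width == 32)
        return 4;                  /* always 4; the setting is ignored */
    if (code_unit_width == 16 && requested == 3)
        return 4;                  /* 3 is rounded up to 4 */
    return requested;
}

/* Largest value that an offset field of the given byte width can hold. */
static unsigned long long max_offset(int link_size_bytes)
{
    assert(link_size_bytes >= 1 && link_size_bytes <= 4);
    return (1ULL << (8 * link_size_bytes)) - 1;
}
```

       For  example,  effective_link_size(16, 3) gives 4, and max_offset(2)
       gives 65535, which corresponds to the default limit of around 64K.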
 | 
						|
 | 
						|
 | 
						|
AVOIDING EXCESSIVE STACK USAGE
 | 
						|
 | 
						|
       When  matching  with the pcre2_match() function, PCRE2 implements back-
 | 
						|
       tracking by making recursive  calls  to  an  internal  function  called
 | 
						|
       match().  In  environments where the size of the stack is limited, this
 | 
						|
       can severely limit PCRE2's operation. (The Unix  environment  does  not
 | 
						|
       usually  suffer from this problem, but it may sometimes be necessary to
 | 
						|
       increase  the  maximum  stack  size.  There  is  a  discussion  in  the
 | 
						|
       pcre2stack  documentation.)  An  alternative approach to recursion that
 | 
						|
       uses memory from the heap to remember data, instead of using  recursive
 | 
						|
       function  calls, has been implemented to work round the problem of lim-
 | 
						|
       ited stack size. If you want to build a version  of  PCRE2  that  works
 | 
						|
       this way, add
 | 
						|
 | 
						|
         --disable-stack-for-recursion
 | 
						|
 | 
						|
       to the configure command. By default, the system functions malloc() and
 | 
						|
       free() are called to manage the heap memory that is required, but  cus-
 | 
						|
       tom  memory  management  functions  can  be  called instead. PCRE2 runs
 | 
						|
       noticeably more slowly when built in this way. This option affects only
 | 
						|
       the pcre2_match() function; it is not relevant for pcre2_dfa_match().
 | 
						|
 | 
						|
 | 
						|
LIMITING PCRE2 RESOURCE USAGE
 | 
						|
 | 
						|
       Internally, PCRE2 has a function called match(), which it calls repeat-
 | 
						|
       edly  (sometimes  recursively)  when  matching  a  pattern   with   the
 | 
						|
       pcre2_match() function. By controlling the maximum number of times this
 | 
						|
       function may be called during a single matching operation, a limit  can
 | 
						|
       be  placed on the resources used by a single call to pcre2_match(). The
 | 
						|
       limit can be changed at run time, as described in the pcre2api documen-
 | 
						|
       tation.  The default is 10 million, but this can be changed by adding a
 | 
						|
       setting such as
 | 
						|
 | 
						|
         --with-match-limit=500000
 | 
						|
 | 
						|
       to  the  configure  command.  This  setting  has  no  effect   on   the
 | 
						|
       pcre2_dfa_match() matching function.
 | 
						|
 | 
						|
       In  some  environments  it is desirable to limit the depth of recursive
 | 
						|
       calls of match() more strictly than the total number of calls, in order
 | 
						|
       to  restrict  the maximum amount of stack (or heap, if --disable-stack-
 | 
						|
       for-recursion is specified) that is used. A second limit controls this;
 | 
						|
       it  defaults  to  the  value  that is set for --with-match-limit, which
 | 
						|
       imposes no additional constraints. However, you can set a  lower  limit
 | 
						|
       by adding, for example,
 | 
						|
 | 
						|
         --with-match-limit-recursion=10000
 | 
						|
 | 
						|
       to  the  configure  command.  This  value can also be overridden at run
 | 
						|
       time.
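
       To illustrate why two separate limits are useful, here  is  a  toy
       backtracking  matcher  that  counts its own calls and tracks its own
       recursion depth. It is emphatically not PCRE2's match() function: it
       understands  only  literal  characters,  ".",  and  "*",  it matches
       anchored at the start of the text, and all  names  in  it  are  hypo-
       thetical.  It  shows  only  how a call-count limit and a depth limit
       bound different resources.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result codes for this toy matcher. */
enum {
    TOY_MATCH = 1,
    TOY_NOMATCH = 0,
    TOY_ERROR_MATCHLIMIT = -1,
    TOY_ERROR_DEPTHLIMIT = -2
};

typedef struct {
    unsigned long match_limit;   /* maximum total number of calls */
    unsigned long depth_limit;   /* maximum recursion depth */
    unsigned long calls;         /* calls made so far; start at 0 */
} toy_limits;

static int toy_match(const char *pat, const char *text,
                     unsigned long depth, toy_limits *lim)
{
    if (++lim->calls > lim->match_limit)
        return TOY_ERROR_MATCHLIMIT;
    if (depth > lim->depth_limit)
        return TOY_ERROR_DEPTHLIMIT;

    if (pat[0] == '\0')
        return TOY_MATCH;

    if (pat[1] == '*') {
        /* Greedy repeat: take as much as possible, then back off. */
        const char *t = text;
        while (*t != '\0' && (pat[0] == '.' || *t == pat[0]))
            t++;
        for (;;) {
            int rc = toy_match(pat + 2, t, depth + 1, lim);
            if (rc != TOY_NOMATCH)
                return rc;       /* a match, or a limit was hit */
            if (t == text)
                return TOY_NOMATCH;
            t--;
        }
    }

    if (*text != '\0' && (pat[0] == '.' || pat[0] == *text))
        return toy_match(pat + 1, text + 1, depth + 1, lim);

    return TOY_NOMATCH;
}
```

       With  the  pattern "a*a*a*a*b" and a subject of twenty "a" characters,
       the toy matcher needs thousands of calls to conclude  that  there  is
       no  match,  although  its  recursion  never gets deeper than a handful
       of levels. A small match_limit therefore stops it early, while a small
       depth_limit restricts stack use independently of the total work.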
 | 
						|
 | 
						|
 | 
						|
CREATING CHARACTER TABLES AT BUILD TIME
 | 
						|
 | 
						|
       PCRE2 uses fixed tables for processing characters whose code points are
 | 
						|
       less than 256. By default, PCRE2 is built with a set of tables that are
 | 
						|
       distributed in the file src/pcre2_chartables.c.dist. These  tables  are
 | 
						|
       for ASCII codes only. If you add
 | 
						|
 | 
						|
         --enable-rebuild-chartables
 | 
						|
 | 
						|
       to  the  configure  command, the distributed tables are no longer used.
 | 
						|
       Instead, a program called dftables is compiled and  run.  This  outputs
 | 
						|
       the source for a new set of tables, created in the default locale of your
 | 
						|
       C run-time system. (This method of replacing the tables does  not  work
 | 
						|
       if  you are cross compiling, because dftables is run on the local host.
 | 
						|
       If you need to create alternative tables when cross compiling, you will
 | 
						|
       have to do so "by hand".)
 | 
						|
 | 
						|
 | 
						|
USING EBCDIC CODE
 | 
						|
 | 
						|
       PCRE2  assumes  by default that it will run in an environment where the
 | 
						|
       character code is ASCII or Unicode, which is a superset of ASCII.  This
 | 
						|
       is the case for most computer operating systems. PCRE2 can, however, be
 | 
						|
       compiled to run in an 8-bit EBCDIC environment by adding
 | 
						|
 | 
						|
         --enable-ebcdic --disable-unicode
 | 
						|
 | 
						|
       to the configure command. This setting implies --enable-rebuild-charta-
 | 
						|
       bles.  You  should  only  use  it if you know that you are in an EBCDIC
 | 
						|
       environment (for example, an IBM mainframe operating system).
 | 
						|
 | 
						|
       It is not possible to support both EBCDIC and UTF-8 codes in  the  same
 | 
						|
       version  of  the  library. Consequently, --enable-unicode and --enable-
 | 
						|
       ebcdic are mutually exclusive.
 | 
						|
 | 
						|
       The EBCDIC character that corresponds to an ASCII LF is assumed to have
 | 
						|
       the  value  0x15 by default. However, in some EBCDIC environments, 0x25
 | 
						|
       is used. In such an environment you should use
 | 
						|
 | 
						|
         --enable-ebcdic-nl25
 | 
						|
 | 
						|
       as well as, or instead of, --enable-ebcdic. The EBCDIC character for CR
 | 
						|
       has  the  same  value  as in ASCII, namely, 0x0d. Whichever of 0x15 and
 | 
						|
       0x25 is not chosen as LF is made to correspond to the Unicode NEL char-
 | 
						|
       acter (which, in Unicode, is 0x85).
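
       The mapping just described can be summarized in  code.  This  is  a
       sketch  of the stated values only; the structure and function names
       are hypothetical and do not appear in the PCRE2 sources.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical container for the three EBCDIC code values. */
typedef struct {
    unsigned char cr;    /* carriage return */
    unsigned char lf;    /* character treated as linefeed */
    unsigned char nel;   /* character treated as Unicode NEL */
} ebcdic_newlines;

/* nl25 is true when --enable-ebcdic-nl25 was given to configure. */
static ebcdic_newlines ebcdic_newline_codes(bool nl25)
{
    ebcdic_newlines n;
    n.cr = 0x0d;                    /* same value as in ASCII */
    n.lf = nl25 ? 0x25 : 0x15;
    n.nel = nl25 ? 0x15 : 0x25;     /* whichever value LF did not take */
    assert(n.lf != n.nel);
    return n;
}
```

       By  default  the  function  yields  LF = 0x15 and NEL = 0x25; with the
       nl25 setting the two values are swapped.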
 | 
						|
 | 
						|
       The options that select newline behaviour, such as --enable-newline-is-
 | 
						|
       cr, and equivalent run-time options, refer to these character values in
 | 
						|
       an EBCDIC environment.
 | 
						|
 | 
						|
 | 
						|
PCRE2GREP SUPPORT FOR EXTERNAL SCRIPTS
 | 
						|
 | 
						|
       By default, on non-Windows systems, pcre2grep supports the use of call-
 | 
						|
       outs with string arguments within the patterns it is matching, in order
 | 
						|
       to  run external scripts. For details, see the pcre2grep documentation.
 | 
						|
       This support can be disabled by adding  --disable-pcre2grep-callout  to
 | 
						|
       the configure command.
 | 
						|
 | 
						|
 | 
						|
PCRE2GREP OPTIONS FOR COMPRESSED FILE SUPPORT
 | 
						|
 | 
						|
       By  default,  pcre2grep reads all files as plain text. You can build it
 | 
						|
       so that it recognizes files whose names end in .gz or .bz2,  and  reads
 | 
						|
       them with libz or libbz2, respectively, by adding one or both of
 | 
						|
 | 
						|
         --enable-pcre2grep-libz
 | 
						|
         --enable-pcre2grep-libbz2
 | 
						|
 | 
						|
       to the configure command. These options naturally require that the rel-
 | 
						|
       evant libraries are installed on your system. Configuration  will  fail
 | 
						|
       if they are not.
 | 
						|
 | 
						|
 | 
						|
PCRE2GREP BUFFER SIZE
 | 
						|
 | 
						|
       pcre2grep  uses an internal buffer to hold a "window" on the file it is
 | 
						|
       scanning, in order to be able to output "before" and "after" lines when
 | 
						|
       it  finds  a  match. The starting size of the buffer is controlled by a
 | 
						|
       parameter whose default value is 20K. The buffer itself is three  times
 | 
						|
       this  size,  but  because  of  the  way it is used for holding "before"
 | 
						|
       lines, the longest line that is guaranteed to  be  processable  is  the
 | 
						|
       parameter  size.  If  a longer line is encountered, pcre2grep automati-
 | 
						|
       cally expands the buffer, up to a specified maximum size, whose default
 | 
						|
       is 1M or the starting size, whichever is the larger. You can change the
 | 
						|
       default parameter values by adding, for example,
 | 
						|
 | 
						|
         --with-pcre2grep-bufsize=51200
 | 
						|
         --with-pcre2grep-max-bufsize=2097152
 | 
						|
 | 
						|
       to the configure command. The caller of pcre2grep  can  override  these
 | 
						|
       values  by  using  --buffer-size  and  --max-buffer-size on the command
 | 
						|
       line.
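
       The relationship between the two parameters and the  actual  buffer
       can  be  expressed  as  simple  arithmetic. The sketch below is not
       taken from pcre2grep; the names are hypothetical, and it assumes that
       "20K" and "1M" mean 20*1024 and 1024*1024 bytes respectively.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed byte values for the documented defaults of 20K and 1M. */
#define DEFAULT_BUFSIZE     (20u * 1024u)
#define DEFAULT_MAX_BUFSIZE (1024u * 1024u)

/* The buffer itself is three times the size parameter. */
static size_t buffer_bytes(size_t bufsize)
{
    assert(bufsize > 0);
    return 3 * bufsize;
}

/* Default maximum: 1M or the starting size, whichever is the larger. */
static size_t default_max_bufsize(size_t bufsize)
{
    return bufsize > DEFAULT_MAX_BUFSIZE ? bufsize : DEFAULT_MAX_BUFSIZE;
}
```

       With  --with-pcre2grep-bufsize=51200,  for  example, the buffer would
       occupy 153600 bytes, while the longest line guaranteed to be process-
       able without expansion would still be 51200 bytes.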
 | 
						|
 | 
						|
 | 
						|
PCRE2TEST OPTION FOR LIBREADLINE SUPPORT
 | 
						|
 | 
						|
       If you add one of
 | 
						|
 | 
						|
         --enable-pcre2test-libreadline
 | 
						|
         --enable-pcre2test-libedit
 | 
						|
 | 
						|
       to the configure command, pcre2test  is  linked  with  the  libreadline
 | 
						|
       or libedit library, respectively, and when its input is from a terminal,
 | 
						|
       it reads it using the readline() function. This  provides  line-editing
 | 
						|
       and  history  facilities.  Note that libreadline is GPL-licensed, so if
 | 
						|
       you distribute a binary of pcre2test linked in this way, there  may  be
 | 
						|
       licensing issues. These can be avoided by linking instead with libedit,
 | 
						|
       which has a BSD licence.
 | 
						|
 | 
						|
       Setting --enable-pcre2test-libreadline causes the -lreadline option  to
 | 
						|
       be  added to the pcre2test build. In many operating environments with a
 | 
						|
       system-installed readline library this is sufficient. However, in  some
 | 
						|
       environments (e.g. if an unmodified distribution version of readline is
 | 
						|
       in use), some extra configuration may be necessary.  The  INSTALL  file
 | 
						|
       for libreadline says this:
 | 
						|
 | 
						|
         "Readline uses the termcap functions, but does not link with
 | 
						|
         the termcap or curses library itself, allowing applications
 | 
						|
         which link with readline the to choose an appropriate library."
 | 
						|
 | 
						|
       If  your environment has not been set up so that an appropriate library
 | 
						|
       is automatically included, you may need to add something like
 | 
						|
 | 
						|
         LIBS="-lncurses"
 | 
						|
 | 
						|
       immediately before the configure command.
 | 
						|
 | 
						|
 | 
						|
INCLUDING DEBUGGING CODE
 | 
						|
 | 
						|
       If you add
 | 
						|
 | 
						|
         --enable-debug
 | 
						|
 | 
						|
       to the configure command, additional debugging code is included in  the
 | 
						|
       build. This feature is intended for use by the PCRE2 maintainers.
 | 
						|
 | 
						|
 | 
						|
DEBUGGING WITH VALGRIND SUPPORT
 | 
						|
 | 
						|
       If you add
 | 
						|
 | 
						|
         --enable-valgrind
 | 
						|
 | 
						|
       to  the  configure command, PCRE2 will use valgrind annotations to mark
 | 
						|
       certain memory regions as  unaddressable.  This  allows  it  to  detect
 | 
						|
       invalid  memory  accesses,  and  is  mostly  useful for debugging PCRE2
 | 
						|
       itself.
 | 
						|
 | 
						|
 | 
						|
CODE COVERAGE REPORTING
 | 
						|
 | 
						|
       If your C compiler is gcc, you can build a version of  PCRE2  that  can
 | 
						|
       generate a code coverage report for its test suite. To enable this, you
 | 
						|
       must install lcov version 1.6 or above. Then specify
 | 
						|
 | 
						|
         --enable-coverage
 | 
						|
 | 
						|
       to the configure command and build PCRE2 in the usual way.
 | 
						|
 | 
						|
       Note that using ccache (a caching C compiler) is incompatible with code
 | 
						|
       coverage  reporting. If you have configured ccache to run automatically
 | 
						|
       on your system, you must set the environment variable
 | 
						|
 | 
						|
         CCACHE_DISABLE=1
 | 
						|
 | 
						|
       before running make to build PCRE2, so that ccache is not used.
 | 
						|
 | 
						|
       When --enable-coverage is used, the following additional targets are
 | 
						|
       added to the Makefile:
 | 
						|
 | 
						|
         make coverage
 | 
						|
 | 
						|
       This  creates  a  fresh coverage report for the PCRE2 test suite. It is
 | 
						|
       equivalent to running "make coverage-reset", "make  coverage-baseline",
 | 
						|
       "make check", and then "make coverage-report".
 | 
						|
 | 
						|
         make coverage-reset
 | 
						|
 | 
						|
       This zeroes the coverage counters, but does nothing else.
 | 
						|
 | 
						|
         make coverage-baseline
 | 
						|
 | 
						|
       This captures baseline coverage information.
 | 
						|
 | 
						|
         make coverage-report
 | 
						|
 | 
						|
       This creates the coverage report.
 | 
						|
 | 
						|
         make coverage-clean-report
 | 
						|
 | 
						|
       This  removes the generated coverage report without cleaning the cover-
 | 
						|
       age data itself.
 | 
						|
 | 
						|
         make coverage-clean-data
 | 
						|
 | 
						|
       This removes the captured coverage data without removing  the  coverage
 | 
						|
       files created at compile time (*.gcno).
 | 
						|
 | 
						|
         make coverage-clean
 | 
						|
 | 
						|
       This  cleans all coverage data including the generated coverage report.
 | 
						|
       For more information about code coverage, see the gcov and  lcov  docu-
 | 
						|
       mentation.
 | 
						|
 | 
						|
 | 
						|
SUPPORT FOR FUZZERS
 | 
						|
 | 
						|
       There  is  a  special  option for use by people who want to run fuzzing
 | 
						|
       tests on PCRE2:
 | 
						|
 | 
						|
         --enable-fuzz-support
 | 
						|
 | 
						|
       At present this applies only to the 8-bit library. If set, it causes an
 | 
						|
       extra  library  called  libpcre2-fuzzsupport.a  to  be  built,  but not
 | 
						|
       installed. This contains a single function, LLVMFuzzerTestOneInput(),
       whose arguments are a pointer to a string and the length of the
 | 
						|
       string. When called, this function tries to compile  the  string  as  a
 | 
						|
       pattern,  and if that succeeds, to match it.  This is done both with no
 | 
						|
       options and with some random option bits that are generated  from  the
 | 
						|
       string.  Setting  --enable-fuzz-support  also  causes  a  binary called
 | 
						|
       pcre2fuzzcheck to be created. This is normally run  under  valgrind  or
 | 
						|
       used  when  PCRE2 is compiled with address sanitizing enabled. It calls
 | 
						|
       the fuzzing function and outputs information about what it is doing. The
 | 
						|
       input  strings  are  specified by arguments: if an argument starts with
 | 
						|
       "=" the rest of it is a literal input string. Otherwise, it is  assumed
 | 
						|
       to be a file name, and the contents of the file are the test string.
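
       The argument convention described above can be sketched in C as
       follows. This is a minimal illustration only, not the pcre2fuzzcheck
       source: the function name load_fuzz_input() and its buffer handling
       are assumptions made for the example.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not part of PCRE2): turn one command-line argument
   into a test string. An argument starting with "=" is a literal string;
   anything else is treated as a file name whose contents are the string.
   Returns a malloc'd, zero-terminated buffer and stores its length in
   *length, or returns NULL on failure. */
static unsigned char *load_fuzz_input(const char *arg, size_t *length)
{
  unsigned char *buffer;
  FILE *f;
  long size;

  assert(arg != NULL && length != NULL);

  if (arg[0] == '=')
    {
    *length = strlen(arg + 1);
    buffer = malloc(*length + 1);
    if (buffer != NULL) memcpy(buffer, arg + 1, *length + 1);
    return buffer;
    }

  f = fopen(arg, "rb");
  if (f == NULL) return NULL;
  if (fseek(f, 0, SEEK_END) != 0) { fclose(f); return NULL; }
  size = ftell(f);
  if (size < 0) { fclose(f); return NULL; }
  rewind(f);

  buffer = malloc((size_t)size + 1);
  if (buffer != NULL)
    {
    *length = fread(buffer, 1, (size_t)size, f);
    buffer[*length] = 0;
    }
  fclose(f);
  return buffer;
}
```

       The buffer and length obtained in this way are the two values that
       a caller would then hand to the fuzzing function.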
 | 
						|
 | 
						|
 | 
						|
SEE ALSO
 | 
						|
 | 
						|
       pcre2api(3), pcre2-config(3).
 | 
						|
 | 
						|
 | 
						|
AUTHOR
 | 
						|
 | 
						|
       Philip Hazel
 | 
						|
       University Computing Service
 | 
						|
       Cambridge, England.
 | 
						|
 | 
						|
 | 
						|
REVISION
 | 
						|
 | 
						|
       Last updated: 01 November 2016
 | 
						|
       Copyright (c) 1997-2016 University of Cambridge.
 | 
						|
------------------------------------------------------------------------------
 | 
						|
 | 
						|
 | 
						|
PCRE2CALLOUT(3)            Library Functions Manual            PCRE2CALLOUT(3)
 | 
						|
 | 
						|
 | 
						|
 | 
						|
NAME
 | 
						|
       PCRE2 - Perl-compatible regular expressions (revised API)
 | 
						|
 | 
						|
SYNOPSIS
 | 
						|
 | 
						|
       #include <pcre2.h>
 | 
						|
 | 
						|
       int (*pcre2_callout)(pcre2_callout_block *, void *);
 | 
						|
 | 
						|
       int pcre2_callout_enumerate(const pcre2_code *code,
 | 
						|
         int (*callback)(pcre2_callout_enumerate_block *, void *),
 | 
						|
         void *user_data);
 | 
						|
 | 
						|
 | 
						|
DESCRIPTION
 | 
						|
 | 
						|
       PCRE2  provides  a feature called "callout", which is a means of tempo-
 | 
						|
       rarily passing control to the caller of PCRE2 in the middle of  pattern
 | 
						|
       matching.  The caller of PCRE2 provides an external function by putting
 | 
						|
       its entry point in a match  context  (see  pcre2_set_callout()  in  the
 | 
						|
       pcre2api documentation).
 | 
						|
 | 
						|
       Within  a  regular expression, (?C<arg>) indicates a point at which the
 | 
						|
       external function is to be called.  Different  callout  points  can  be
 | 
						|
       identified  by  putting  a number less than 256 after the letter C. The
 | 
						|
       default value is zero.  Alternatively, the argument may be a  delimited
 | 
						|
       string.  The  starting delimiter must be one of ` ' " ^ % # $ { and the
 | 
						|
       ending delimiter is the same as the start, except for {, where the end-
 | 
						|
       ing  delimiter  is  }.  If  the  ending  delimiter is needed within the
 | 
						|
       string, it must be doubled. For example, this pattern has  two  callout
 | 
						|
       points:
 | 
						|
 | 
						|
         (?C1)abc(?C"some ""arbitrary"" text")def
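
       The delimiter rules can be illustrated with a small stand-alone C
       function. This is a sketch for explanation only and is not PCRE2's
       own parser; the name read_callout_string() and its return convention
       are assumptions. It takes text beginning at the starting delimiter
       and copies the argument, turning each doubled ending delimiter into
       a single character.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Illustrative only. Copies a callout string argument into out (of size
   outsize), un-doubling ending delimiters. Returns the argument length,
   or -1 if the starting delimiter is not valid, no ending delimiter is
   found, or out is too small. */
static long read_callout_string(const char *text, char *out, size_t outsize)
{
  const char *starters = "`'\"^%#${";
  const char *p;
  char open, close;
  size_t n = 0;

  assert(text != NULL && out != NULL);

  open = text[0];
  if (open == '\0' || strchr(starters, open) == NULL) return -1;
  close = (open == '{') ? '}' : open;   /* only { has a different ender */

  for (p = text + 1; *p != '\0'; p++)
    {
    if (*p == close)
      {
      if (p[1] != close)                /* single delimiter: end of string */
        {
        if (n >= outsize) return -1;
        out[n] = '\0';
        return (long)n;
        }
      p++;                              /* doubled delimiter: keep one copy */
      }
    if (n + 1 >= outsize) return -1;
    out[n++] = *p;
    }
  return -1;                            /* no ending delimiter */
}
```

       Applied to the text that follows (?C in the second callout above,
       the function yields the string: some "arbitrary" text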
 | 
						|
 | 
						|
       If the PCRE2_AUTO_CALLOUT option bit is set when a pattern is compiled,
 | 
						|
       PCRE2 automatically inserts callouts, all with number 255, before  each
 | 
						|
       item  in  the  pattern except for immediately before or after a callout
 | 
						|
       item in the pattern.  For example, if PCRE2_AUTO_CALLOUT is  used  with
 | 
						|
       the pattern
 | 
						|
 | 
						|
         A(?C3)B
 | 
						|
 | 
						|
       it is processed as if it were
 | 
						|
 | 
						|
         (?C255)A(?C3)B(?C255)
 | 
						|
 | 
						|
       Here is a more complicated example:
 | 
						|
 | 
						|
         A(\d{2}|--)
 | 
						|
 | 
						|
       With PCRE2_AUTO_CALLOUT, this pattern is processed as if it were
 | 
						|
 | 
						|
       (?C255)A(?C255)((?C255)\d{2}(?C255)|(?C255)-(?C255)-(?C255))(?C255)
 | 
						|
 | 
						|
       Notice  that  there  is a callout before and after each parenthesis and
 | 
						|
       alternation bar. If the pattern contains a conditional group whose con-
 | 
						|
       dition  is  an  assertion, an automatic callout is inserted immediately
 | 
						|
       before the condition. Such a callout may also be  inserted  explicitly,
 | 
						|
       for example:
 | 
						|
 | 
						|
         (?(?C9)(?=a)ab|de)  (?(?C%text%)(?!=d)ab|de)
 | 
						|
 | 
						|
       This  applies only to assertion conditions (because they are themselves
 | 
						|
       independent groups).
 | 
						|
 | 
						|
       Callouts can be useful for tracking the progress of  pattern  matching.
 | 
						|
       The pcre2test program has a pattern qualifier (/auto_callout) that sets
 | 
						|
       automatic callouts.  When any callouts are  present,  the  output  from
 | 
						|
       pcre2test  indicates  how  the pattern is being matched. This is useful
 | 
						|
       information when you are trying to optimize the performance of  a  par-
 | 
						|
       ticular pattern.
 | 
						|
 | 
						|
 | 
						|
MISSING CALLOUTS
 | 
						|
 | 
						|
       You  should  be  aware  that, because of optimizations in the way PCRE2
 | 
						|
       compiles and matches patterns, callouts sometimes do not happen exactly
 | 
						|
       as you might expect.
 | 
						|
 | 
						|
   Auto-possessification
 | 
						|
 | 
						|
       At compile time, PCRE2 "auto-possessifies" repeated items when it knows
 | 
						|
       that what follows cannot be part of the repeat. For example, a+[bc]  is
 | 
						|
       compiled  as if it were a++[bc]. The pcre2test output when this pattern
 | 
						|
       is compiled with PCRE2_ANCHORED and PCRE2_AUTO_CALLOUT and then applied
 | 
						|
       to the string "aaaa" is:
 | 
						|
 | 
						|
         --->aaaa
 | 
						|
          +0 ^        a+
 | 
						|
          +2 ^   ^    [bc]
 | 
						|
         No match
 | 
						|
 | 
						|
       This  indicates that when matching [bc] fails, there is no backtracking
 | 
						|
       into a+ (because it is being treated as a++) and therefore the callouts
 | 
						|
       that  would  be  taken for the backtracks do not occur. You can disable
 | 
						|
       the  auto-possessify  feature  by  passing   PCRE2_NO_AUTO_POSSESS   to
 | 
						|
       pcre2_compile(),  or  starting  the pattern with (*NO_AUTO_POSSESS). In
 | 
						|
       this case, the output changes to this:
 | 
						|
 | 
						|
         --->aaaa
 | 
						|
          +0 ^        a+
 | 
						|
          +2 ^   ^    [bc]
 | 
						|
          +2 ^  ^     [bc]
 | 
						|
          +2 ^ ^      [bc]
 | 
						|
          +2 ^^       [bc]
 | 
						|
         No match
 | 
						|
 | 
						|
       This time, when matching [bc] fails, the matcher backtracks into a+ and
 | 
						|
       tries again, repeatedly, until a+ itself fails.
 | 
						|
 | 
						|
   Automatic .* anchoring
 | 
						|
 | 
						|
       By default, an optimization is applied when .* is the first significant
 | 
						|
       item in a pattern. If PCRE2_DOTALL is set, so that the  dot  can  match
 | 
						|
       any  character,  the pattern is automatically anchored. If PCRE2_DOTALL
 | 
						|
       is not set, a match can start only after an internal newline or at  the
 | 
						|
       beginning  of  the  subject,  and  pcre2_compile() remembers this. This
 | 
						|
       optimization is disabled, however, if .* is in an atomic  group  or  if
 | 
						|
       there  is  a back reference to the capturing group in which it appears.
 | 
						|
       It is also disabled if the pattern contains (*PRUNE) or  (*SKIP).  How-
 | 
						|
       ever, the presence of callouts does not affect it.
 | 
						|
 | 
						|
       For  example,  if  the pattern .*\d is compiled with PCRE2_AUTO_CALLOUT
 | 
						|
       and applied to the string "aa", the pcre2test output is:
 | 
						|
 | 
						|
         --->aa
 | 
						|
          +0 ^      .*
 | 
						|
          +2 ^ ^    \d
 | 
						|
          +2 ^^     \d
 | 
						|
          +2 ^      \d
 | 
						|
         No match
 | 
						|
 | 
						|
       This shows that all match attempts start at the beginning of  the  sub-
 | 
						|
       ject.  In  other  words,  the pattern is anchored. You can disable this
 | 
						|
       optimization by passing PCRE2_NO_DOTSTAR_ANCHOR to pcre2_compile(),  or
 | 
						|
       starting  the pattern with (*NO_DOTSTAR_ANCHOR). In this case, the out-
 | 
						|
       put changes to:
 | 
						|
 | 
						|
         --->aa
 | 
						|
          +0 ^      .*
 | 
						|
          +2 ^ ^    \d
 | 
						|
          +2 ^^     \d
 | 
						|
          +2 ^      \d
 | 
						|
          +0  ^     .*
 | 
						|
          +2  ^^    \d
 | 
						|
          +2  ^     \d
 | 
						|
         No match
 | 
						|
 | 
						|
       This shows more match attempts, starting at the second subject  charac-
 | 
						|
       ter.   Another  optimization, described in the next section, means that
 | 
						|
       there is no subsequent attempt to match with an empty subject.
 | 
						|
 | 
						|
       If a pattern has more than one top-level  branch,  automatic  anchoring
 | 
						|
       occurs if all branches are anchorable.
 | 
						|
 | 
						|
   Other optimizations
 | 
						|
 | 
						|
       Other  optimizations  that  provide fast "no match" results also affect
 | 
						|
       callouts.  For example, if the pattern is
 | 
						|
 | 
						|
         ab(?C4)cd
 | 
						|
 | 
						|
       PCRE2 knows that any matching string must contain the  letter  "d".  If
 | 
						|
       the  subject  string  is  "abyz",  the  lack of "d" means that matching
 | 
						|
       doesn't ever start, and the callout is  never  reached.  However,  with
 | 
						|
       "abyd", though the result is still no match, the callout is obeyed.
 | 
						|
 | 
						|
       PCRE2  also  knows  the  minimum  length of a matching string, and will
 | 
						|
       immediately give a "no match" return without actually running  a  match
 | 
						|
       if  the  subject is not long enough, or, for unanchored patterns, if it
 | 
						|
       has been scanned far enough.
 | 
						|
 | 
						|
       You can disable these optimizations by passing the
       PCRE2_NO_START_OPTIMIZE option to pcre2_compile(), or by starting the
       pattern with (*NO_START_OPT). This slows down the matching process, but
       does ensure that callouts such as the example above are obeyed.
 | 
						|
 | 
						|
 | 
						|
THE CALLOUT INTERFACE
 | 
						|
 | 
						|
       During  matching,  when  PCRE2  reaches a callout point, if an external
 | 
						|
       function is set in the match context, it is  called.  This  applies  to
 | 
						|
       both  normal  and DFA matching. The first argument to the callout func-
 | 
						|
       tion is a pointer to a pcre2_callout block. The second argument is  the
 | 
						|
       void  *  callout  data that was supplied when the callout was set up by
 | 
						|
       calling pcre2_set_callout() (see the pcre2api documentation). The call-
 | 
						|
       out block structure contains the following fields:
 | 
						|
 | 
						|
         uint32_t      version;
 | 
						|
         uint32_t      callout_number;
 | 
						|
         uint32_t      capture_top;
 | 
						|
         uint32_t      capture_last;
 | 
						|
         PCRE2_SIZE   *offset_vector;
 | 
						|
         PCRE2_SPTR    mark;
 | 
						|
         PCRE2_SPTR    subject;
 | 
						|
         PCRE2_SIZE    subject_length;
 | 
						|
         PCRE2_SIZE    start_match;
 | 
						|
         PCRE2_SIZE    current_position;
 | 
						|
         PCRE2_SIZE    pattern_position;
 | 
						|
         PCRE2_SIZE    next_item_length;
 | 
						|
         PCRE2_SIZE    callout_string_offset;
 | 
						|
         PCRE2_SIZE    callout_string_length;
 | 
						|
         PCRE2_SPTR    callout_string;
 | 
						|
 | 
						|
       The  version field contains the version number of the block format. The
 | 
						|
       current version is 1; the three callout string fields  were  added  for
 | 
						|
       this  version. If you are writing an application that might use an ear-
 | 
						|
       lier release of PCRE2, you  should  check  the  version  number  before
 | 
						|
       accessing  any  of  these  fields.  The version number will increase in
 | 
						|
       future if more fields are added, but the intention is never  to  remove
 | 
						|
       any of the existing fields.
 | 
						|
 | 
						|
   Fields for numerical callouts
 | 
						|
 | 
						|
       For  a  numerical  callout,  callout_string is NULL, and callout_number
 | 
						|
       contains the number of the callout, in the range  0-255.  This  is  the
 | 
						|
       number that follows (?C for callouts that are part of the pattern; it is
 | 
						|
       255 for automatically generated callouts.
 | 
						|
 | 
						|
   Fields for string callouts
 | 
						|
 | 
						|
       For callouts with string arguments, callout_number is always zero,  and
 | 
						|
       callout_string  points  to the string that is contained within the com-
 | 
						|
       piled pattern. Its length is given by callout_string_length. Duplicated
 | 
						|
       ending delimiters that were present in the original pattern string have
 | 
						|
       been turned into single characters, but there is no other processing of
 | 
						|
       the  callout string argument. An additional code unit containing binary
 | 
						|
       zero is present after the string, but is not included  in  the  length.
 | 
						|
       The  delimiter  that was used to start the string is also stored within
 | 
						|
       the pattern, immediately before the string itself. You can access  this
 | 
						|
       delimiter as callout_string[-1] if you need it.
 | 
						|
 | 
						|
       The callout_string_offset field is the code unit offset to the start of
 | 
						|
       the callout argument string within the original pattern string. This is
 | 
						|
       provided  for the benefit of applications such as script languages that
 | 
						|
       might need to report errors in the callout string within the pattern.
 | 
						|
 | 
						|
   Fields for all callouts
 | 
						|
 | 
						|
       The remaining fields in the callout block are the same for  both  kinds
 | 
						|
       of callout.
 | 
						|
 | 
						|
       The offset_vector field is a pointer to the vector of capturing offsets
 | 
						|
       (the "ovector") that was passed to the matching function in  the  match
 | 
						|
       data  block.  When pcre2_match() is used, the contents can be inspected
 | 
						|
       in order to extract substrings that have been matched so  far,  in  the
 | 
						|
       same  way as for extracting substrings after a match has completed. For
 | 
						|
       the DFA matching function, this field is not useful.
 | 
						|
 | 
						|
       The subject and subject_length fields contain copies of the values that
 | 
						|
       were passed to the matching function.
 | 
						|
 | 
						|
       The  start_match  field normally contains the offset within the subject
 | 
						|
       at which the current match attempt  started.  However,  if  the  escape
 | 
						|
       sequence  \K has been encountered, this value is changed to reflect the
 | 
						|
       modified starting point. If the pattern is not  anchored,  the  callout
 | 
						|
       function may be called several times from the same point in the pattern
 | 
						|
       for different starting points in the subject.
 | 
						|
 | 
						|
       The current_position field contains the offset within  the  subject  of
 | 
						|
       the current match pointer.
 | 
						|
 | 
						|
       When pcre2_match() is used, the capture_top field  contains  one  more
 | 
						|
       than the number of the highest numbered captured substring so  far.  If
 | 
						|
       no substrings have been captured, the value of capture_top is one. This
 | 
						|
       is always the case when the DFA functions are used, because they do not
 | 
						|
       support captured substrings.
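
       As a sketch of how offset_vector and capture_top might be used
       together inside a callout, the helper below looks up one capturing
       group. It is written without pcre2.h so that it stands alone: the
       name captured_so_far() is hypothetical, and the code assumes that
       PCRE2_SIZE is size_t and that unset offsets hold the all-ones value
       that pcre2.h calls PCRE2_UNSET (mirrored here as UNSET_OFFSET).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define UNSET_OFFSET (~(size_t)0)   /* assumed to mirror PCRE2_UNSET */

/* Hypothetical helper: look up capturing group n in an ovector laid out as
   pairs of (start, end) offsets. Returns 1 and sets *start and *length if
   the group has been captured so far, and 0 otherwise. */
static int captured_so_far(const size_t *ovector, uint32_t capture_top,
  uint32_t n, size_t *start, size_t *length)
{
  assert(ovector != NULL && start != NULL && length != NULL);

  /* Group 0 (the whole match) is not complete during a callout, and
     groups numbered capture_top or higher have not been captured. */
  if (n == 0 || n >= capture_top) return 0;

  /* A lower-numbered group may still be unset, for example when it is in
     an alternative that did not take part in the match. */
  if (ovector[2*n] == UNSET_OFFSET) return 0;

  *start = ovector[2*n];
  *length = ovector[2*n + 1] - ovector[2*n];
  return 1;
}
```

       In a callout function whose first argument is named cb, the call
       would be captured_so_far(cb->offset_vector, cb->capture_top, n,
       &start, &length), with the substring then found at cb->subject +
       start.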
 | 
						|
 | 
						|
       The  capture_last  field  contains the number of the most recently cap-
 | 
						|
       tured substring. However, when a recursion exits, the value reverts  to
 | 
						|
       what  it  was  outside  the recursion, as do the values of all captured
 | 
						|
       substrings. If no substrings have been captured, the value of
       capture_last is 0. This is always the case for the DFA matching
       functions.
 | 
						|
 | 
						|
       The pattern_position field contains the offset in the pattern string to
 | 
						|
       the next item to be matched.
 | 
						|
 | 
						|
       The next_item_length field contains the length of the next item  to  be
 | 
						|
       processed  in the pattern string. When the callout is at the end of the
 | 
						|
       pattern, the length is zero.  When  the  callout  precedes  an  opening
 | 
						|
       parenthesis, the length includes meta characters that follow the paren-
 | 
						|
       thesis. For example, in a callout before an assertion  such  as  (?=ab)
 | 
						|
       the  length  is  3. For an alternation bar or a  closing  parenthesis,
 | 
						|
       the length is one, unless a closing parenthesis is followed by a  quan-
 | 
						|
       tifier, in which case its length is included.  (This changed in release
 | 
						|
       10.23. In earlier releases, before an opening  parenthesis  the  length
 | 
						|
       was  that  of the entire subpattern, and before an alternation bar or a
 | 
						|
       closing parenthesis the length was zero.)
 | 
						|
 | 
						|
       The pattern_position and next_item_length fields are intended  to  help
 | 
						|
       in  distinguishing between different automatic callouts, which all have
 | 
						|
       the same callout number. However, they are set for  all  callouts,  and
 | 
						|
       are used by pcre2test to show the next item to be matched when display-
 | 
						|
       ing callout information.
 | 
						|
 | 
						|
       In callouts from pcre2_match() the mark field contains a pointer to the
 | 
						|
       zero-terminated  name of the most recently passed (*MARK), (*PRUNE), or
 | 
						|
       (*THEN) item in the match, or NULL if no such items have  been  passed.
 | 
						|
       Instances  of  (*PRUNE)  or  (*THEN) without a name do not obliterate a
 | 
						|
       previous (*MARK). In callouts from the DFA matching function this field
 | 
						|
       always contains NULL.
 | 
						|
 | 
						|
 | 
						|
RETURN VALUES FROM CALLOUTS
 | 
						|
 | 
						|
       The external callout function returns an integer to PCRE2. If the value
 | 
						|
       is zero, matching proceeds as normal. If  the  value  is  greater  than
 | 
						|
       zero,  matching  fails  at  the current point, but the testing of other
 | 
						|
       matching possibilities goes ahead, just as if a lookahead assertion had
 | 
						|
       failed. If the value is less than zero, the match is abandoned, and the
 | 
						|
       matching function returns the negative value.
 | 
						|
 | 
						|
       Negative  values  should  normally  be   chosen   from   the   set   of
 | 
						|
       PCRE2_ERROR_xxx  values.  In  particular,  PCRE2_ERROR_NOMATCH forces a
 | 
						|
       standard "no match" failure. The error  number  PCRE2_ERROR_CALLOUT  is
 | 
						|
       reserved  for  use by callout functions; it will never be used by PCRE2
 | 
						|
       itself.
 | 
						|
 | 
						|
 | 
						|
CALLOUT ENUMERATION
 | 
						|
 | 
						|
       int pcre2_callout_enumerate(const pcre2_code *code,
 | 
						|
         int (*callback)(pcre2_callout_enumerate_block *, void *),
 | 
						|
         void *user_data);
 | 
						|
 | 
						|
       A script language that supports the use of string arguments in callouts
 | 
						|
       might  like  to  scan  all the callouts in a pattern before running the
 | 
						|
       match. This can be done by calling pcre2_callout_enumerate(). The first
 | 
						|
       argument  is  a  pointer  to a compiled pattern, the second points to a
 | 
						|
       callback function, and the third is arbitrary user data.  The  callback
 | 
						|
       function  is  called  for  every callout in the pattern in the order in
 | 
						|
       which they appear. Its first argument is a pointer to a callout enumer-
 | 
						|
       ation  block,  and  its second argument is the user_data value that was
 | 
						|
       passed to pcre2_callout_enumerate(). The data block contains  the  fol-
 | 
						|
       lowing fields:
 | 
						|
 | 
						|
         version                Block version number
 | 
						|
         pattern_position       Offset to next item in pattern
 | 
						|
         next_item_length       Length of next item in pattern
 | 
						|
         callout_number         Number for numbered callouts
 | 
						|
         callout_string_offset  Offset to string within pattern
 | 
						|
         callout_string_length  Length of callout string
 | 
						|
         callout_string         Points to callout string or is NULL
 | 
						|
 | 
						|
       The  version  number is currently 0. It will increase if new fields are
 | 
						|
       ever added to the block. The remaining fields are  the  same  as  their
 | 
						|
       namesakes  in  the pcre2_callout block that is used for callouts during
 | 
						|
       matching, as described above.
 | 
						|
 | 
						|
       Note that the value of pattern_position is  unique  for  each  callout.
 | 
						|
       However,  if  a callout occurs inside a group that is quantified with a
 | 
						|
       non-zero minimum or a fixed maximum, the group is replicated inside the
 | 
						|
       compiled  pattern.  For example, a pattern such as /(a){2}/ is compiled
 | 
						|
       as if it were /(a)(a)/. This means that the callout will be  enumerated
 | 
						|
       more  than  once,  but with the same value for pattern_position in each
 | 
						|
       case.
 | 
						|
 | 
						|
       The callback function should normally return zero. If it returns a non-
 | 
						|
       zero value, scanning the pattern stops, and that value is returned from
 | 
						|
       pcre2_callout_enumerate().
 | 
						|
 | 
						|
 | 
						|
AUTHOR
 | 
						|
 | 
						|
       Philip Hazel
 | 
						|
       University Computing Service
 | 
						|
       Cambridge, England.
 | 
						|
 | 
						|
 | 
						|
REVISION
 | 
						|
 | 
						|
       Last updated: 29 September 2016
 | 
						|
       Copyright (c) 1997-2016 University of Cambridge.
 | 
						|
------------------------------------------------------------------------------
 | 
						|
 | 
						|
 | 
						|
PCRE2COMPAT(3)             Library Functions Manual             PCRE2COMPAT(3)
 | 
						|
 | 
						|
 | 
						|
 | 
						|
NAME
 | 
						|
       PCRE2 - Perl-compatible regular expressions (revised API)
 | 
						|
 | 
						|
DIFFERENCES BETWEEN PCRE2 AND PERL
 | 
						|
 | 
						|
       This document describes the differences in the ways that PCRE2 and Perl
 | 
						|
       handle regular expressions. The differences  described  here  are  with
 | 
						|
       respect to Perl versions 5.10 and above.
 | 
						|
 | 
						|
       1.  PCRE2  has only a subset of Perl's Unicode support. Details of what
 | 
						|
       it does have are given in the pcre2unicode page.
 | 
						|
 | 
						|
       2. PCRE2 allows repeat quantifiers only  on  parenthesized  assertions,
 | 
						|
       but  they  do not mean what you might think. For example, (?!a){3} does
 | 
						|
       not assert that the next three characters are not "a". It just  asserts
 | 
						|
       that  the  next  character  is not "a" three times (in principle: PCRE2
 | 
						|
       optimizes this to run the assertion  just  once).  Perl  allows  repeat
 | 
						|
       quantifiers  on  other  assertions such as \b, but these do not seem to
 | 
						|
       have any use.
 | 
						|
 | 
						|
       3. Capturing subpatterns that occur inside  negative  lookahead  asser-
 | 
						|
       tions  are  counted,  but their entries in the offsets vector are never
 | 
						|
       set. Perl sometimes (but not always) sets its numerical variables  from
 | 
						|
       inside negative assertions.
 | 
						|
 | 
						|
       4.  The  following Perl escape sequences are not supported: \l, \u, \L,
 | 
						|
       \U, and \N when followed by a character name or Unicode value.  (\N  on
 | 
						|
       its own, matching a non-newline character, is supported.) In fact these
 | 
						|
       are implemented by Perl's general string-handling and are not  part  of
 | 
						|
       its  pattern matching engine. If any of these are encountered by PCRE2,
 | 
						|
       an error is generated by default. However, if the PCRE2_ALT_BSUX option
 | 
						|
       is set, \U and \u are interpreted as ECMAScript interprets them.
 | 
						|
 | 
						|
       5. The Perl escape sequences \p, \P, and \X are supported only if PCRE2
 | 
						|
       is built with Unicode support. The properties that can be  tested  with
 | 
						|
       \p and \P are limited to the general category properties such as Lu and
 | 
						|
       Nd, script names such as Greek or Han, and the derived  properties  Any
 | 
						|
       and L&. PCRE2 does support the Cs (surrogate) property, which Perl does
 | 
						|
       not; the Perl documentation says "Because Perl hides the need  for  the
 | 
						|
       user  to  understand the internal representation of Unicode characters,
 | 
						|
       there is no need to implement the  somewhat  messy  concept  of  surro-
 | 
						|
       gates."
 | 
						|
 | 
						|
       6.  PCRE2 does support the \Q...\E escape for quoting substrings. Char-
 | 
						|
       acters in between are treated as literals. This is  slightly  different
 | 
						|
       from  Perl  in  that  $  and  @ are also handled as literals inside the
 | 
						|
       quotes. In Perl, they cause variable interpolation (but of course PCRE2
 | 
						|
       does not have variables).  Note the following examples:
 | 
						|
 | 
						|
           Pattern            PCRE2 matches      Perl matches
 | 
						|
 | 
						|
           \Qabc$xyz\E        abc$xyz           abc followed by the
 | 
						|
                                                  contents of $xyz
 | 
						|
           \Qabc\$xyz\E       abc\$xyz          abc\$xyz
 | 
						|
           \Qabc\E\$\Qxyz\E   abc$xyz           abc$xyz
 | 
						|
 | 
						|
       The  \Q...\E  sequence  is recognized both inside and outside character
 | 
						|
       classes.
 | 
						|
 | 
						|
       7.  Fairly  obviously,  PCRE2  does  not  support  the  (?{code})   and
 | 
						|
       (??{code})  constructions. However, there is support for recursive pat-
 | 
						|
       terns. This is not available in Perl 5.8, but it is in Perl 5.10. Also,
 | 
						|
       the  PCRE2  "callout"  feature allows an external function to be called
 | 
						|
       during  pattern  matching.  See  the  pcre2callout  documentation   for
 | 
						|
       details.
 | 
						|
 | 
						|
       8.  Subroutine  calls  (whether recursive or not) are treated as atomic
 | 
						|
       groups.  Atomic recursion is like Python,  but  unlike  Perl.  Captured
 | 
						|
       values  that  are  set outside a subroutine call can be referenced from
 | 
						|
       inside in PCRE2, but not in Perl. There is a discussion  that  explains
 | 
						|
       these  differences  in  more detail in the section on recursion differ-
 | 
						|
       ences from Perl in the pcre2pattern page.
 | 
						|
 | 
						|
       9. If any of the backtracking control verbs are used  in  a  subpattern
 | 
						|
       that  is  called  as  a  subroutine (whether or not recursively), their
 | 
						|
       effect is confined to that subpattern; it does not extend to  the  sur-
 | 
						|
       rounding  pattern.  This is not always the case in Perl. In particular,
 | 
						|
       if (*THEN) is present in a group that is called as  a  subroutine,  its
 | 
						|
       action is limited to that group, even if the group does not contain any
 | 
						|
       | characters. Note that such subpatterns are processed as  anchored  at
 | 
						|
       the point where they are tested.
 | 
						|
 | 
						|
       10.  If a pattern contains more than one backtracking control verb, the
 | 
						|
       first one that is backtracked onto acts. For example,  in  the  pattern
 | 
						|
       A(*COMMIT)B(*PRUNE)C  a  failure in B triggers (*COMMIT), but a failure
 | 
						|
       in C triggers (*PRUNE). Perl's behaviour is more complex; in many cases
 | 
						|
       it is the same as PCRE2, but there are cases where it differs.
 | 
						|
 | 
						|
       11.  Most  backtracking  verbs in assertions have their normal actions.
 | 
						|
       They are not confined to the assertion.
 | 
						|
 | 
						|
       12. There are some differences that are concerned with the settings  of
 | 
						|
       captured  strings  when  part  of  a  pattern is repeated. For example,
 | 
						|
       matching "aba" against the  pattern  /^(a(b)?)+$/  in  Perl  leaves  $2
 | 
						|
       unset, but in PCRE2 it is set to "b".
 | 
						|
 | 
						|
       13. PCRE2's handling of duplicate subpattern numbers and duplicate sub-
 | 
						|
       pattern names is not as general as Perl's. This is a consequence of the
 | 
						|
       fact that PCRE2 works internally just with numbers, using an external
 | 
						|
       table to translate between numbers and names. In particular, a  pattern
 | 
						|
       such as (?|(?<a>A)|(?<b>B)), where the two capturing parentheses have
 | 
						|
       the same number but different names, is not supported,  and  causes  an
 | 
						|
       error  at compile time. If it were allowed, it would not be possible to
 | 
						|
       distinguish which parentheses matched, because both names map  to  cap-
 | 
						|
       turing subpattern number 1. To avoid this confusing situation, an error
 | 
						|
       is given at compile time.
 | 
						|
 | 
						|
       14. Perl used to recognize comments in some places that PCRE2 does not,
 | 
						|
       for  example,  between the ( and ? at the start of a subpattern. If the
 | 
						|
       /x modifier is set, Perl allowed white space between ( and ? though the
 | 
						|
       latest  Perls give an error (for a while it was just deprecated). There
 | 
						|
       may still be some cases where Perl behaves differently.
 | 
						|
 | 
						|
       15. Perl, when in warning mode, gives warnings  for  character  classes
 | 
						|
       such  as  [A-\d] or [a-[:digit:]]. It then treats the hyphens as liter-
 | 
						|
       als. PCRE2 has no warning features, so it gives an error in these cases
 | 
						|
       because they are almost certainly user mistakes.
 | 
						|
 | 
						|
       16.  In  PCRE2, the upper/lower case character properties Lu and Ll are
 | 
						|
       not affected when case-independent matching is specified. For  example,
 | 
						|
       \p{Lu} always matches an upper case letter. I think Perl has changed in
 | 
						|
       this respect; in the release at the time of writing (5.16), \p{Lu}  and
 | 
						|
       \p{Ll} match all letters, regardless of case, when case independence is
 | 
						|
       specified.
 | 
						|
 | 
						|
       17. PCRE2 provides some  extensions  to  the  Perl  regular  expression
 | 
						|
       facilities.   Perl  5.10  includes new features that are not in earlier
 | 
						|
       versions of Perl, some of which (such as named parentheses)  have  been
 | 
						|
       in PCRE2 for some time. This list is with respect to Perl 5.10:
 | 
						|
 | 
						|
       (a)  Although  lookbehind  assertions  in PCRE2 must match fixed length
 | 
						|
       strings, each alternative branch of a lookbehind assertion can match  a
 | 
						|
       different  length  of  string.  Perl requires them all to have the same
 | 
						|
       length.
 | 
						|
 | 
						|
       (b) From PCRE2 10.23, back references to groups  of  fixed  length  are
 | 
						|
       supported in lookbehinds, provided that there is no possibility of ref-
 | 
						|
       erencing a non-unique number or name. Perl does not support  backrefer-
 | 
						|
       ences in lookbehinds.
 | 
						|
 | 
						|
       (c)  If PCRE2_DOLLAR_ENDONLY is set and PCRE2_MULTILINE is not set, the
 | 
						|
       $ meta-character matches only at the very end of the string.
 | 
						|
 | 
						|
       (d) A backslash followed  by  a  letter  with  no  special  meaning  is
 | 
						|
       faulted. (Perl can be made to issue a warning.)
 | 
						|
 | 
						|
       (e)  If PCRE2_UNGREEDY is set, the greediness of the repetition quanti-
 | 
						|
       fiers is inverted, that is, by default they are not greedy, but if fol-
 | 
						|
       lowed by a question mark they are.
 | 
						|
 | 
						|
       (f)  PCRE2_ANCHORED  can be used at matching time to force a pattern to
 | 
						|
       be tried only at the first matching position in the subject string.
 | 
						|
 | 
						|
       (g)      The      PCRE2_NOTBOL,      PCRE2_NOTEOL,      PCRE2_NOTEMPTY,
 | 
						|
       PCRE2_NOTEMPTY_ATSTART,  and PCRE2_NO_AUTO_CAPTURE options have no Perl
 | 
						|
       equivalents.
 | 
						|
 | 
						|
       (h) The \R escape sequence can be restricted to match only CR,  LF,  or
 | 
						|
       CRLF by the PCRE2_BSR_ANYCRLF option.
 | 
						|
 | 
						|
       (i) The callout facility is PCRE2-specific.
 | 
						|
 | 
						|
       (j) The partial matching facility is PCRE2-specific.
 | 
						|
 | 
						|
       (k) The alternative matching function (pcre2_dfa_match()) matches in a
 | 
						|
       different way and is not Perl-compatible.
 | 
						|
 | 
						|
       (l) PCRE2 recognizes some special sequences such as (*CR) at the  start
 | 
						|
       of a pattern that set overall options that cannot be changed within the
 | 
						|
       pattern.
 | 
						|
 | 
						|
 | 
						|
AUTHOR
 | 
						|
 | 
						|
       Philip Hazel
 | 
						|
       University Computing Service
 | 
						|
       Cambridge, England.
 | 
						|
 | 
						|
 | 
						|
REVISION
 | 
						|
 | 
						|
       Last updated: 18 October 2016
 | 
						|
       Copyright (c) 1997-2016 University of Cambridge.
 | 
						|
------------------------------------------------------------------------------
 | 
						|
 | 
						|
 | 
						|
PCRE2JIT(3)                Library Functions Manual                PCRE2JIT(3)
 | 
						|
 | 
						|
 | 
						|
 | 
						|
NAME
 | 
						|
       PCRE2 - Perl-compatible regular expressions (revised API)
 | 
						|
 | 
						|
PCRE2 JUST-IN-TIME COMPILER SUPPORT
 | 
						|
 | 
						|
       Just-in-time  compiling  is a heavyweight optimization that can greatly
 | 
						|
       speed up pattern matching. However, it comes at the cost of extra  pro-
 | 
						|
       cessing  before  the  match is performed, so it is of most benefit when
 | 
						|
       the same pattern is going to be matched many times. This does not  nec-
 | 
						|
       essarily  mean many calls of a matching function; if the pattern is not
 | 
						|
       anchored, matching attempts may take place many times at various  posi-
 | 
						|
       tions in the subject, even for a single call. Therefore, if the subject
 | 
						|
       string is very long, it may still pay  to  use  JIT  even  for  one-off
 | 
						|
       matches.  JIT  support  is  available  for all of the 8-bit, 16-bit and
 | 
						|
       32-bit PCRE2 libraries.
 | 
						|
 | 
						|
       JIT support applies only to the  traditional  Perl-compatible  matching
 | 
						|
       function.   It  does  not apply when the DFA matching function is being
 | 
						|
       used. The code for this support was written by Zoltan Herczeg.
 | 
						|
 | 
						|
 | 
						|
AVAILABILITY OF JIT SUPPORT
 | 
						|
 | 
						|
       JIT support is an optional feature of  PCRE2.  The  "configure"  option
 | 
						|
       --enable-jit  (or  equivalent  CMake  option) must be set when PCRE2 is
 | 
						|
       built if you want to use JIT. The support is limited to  the  following
 | 
						|
       hardware platforms:
 | 
						|
 | 
						|
         ARM 32-bit (v5, v7, and Thumb2)
 | 
						|
         ARM 64-bit
 | 
						|
         Intel x86 32-bit and 64-bit
 | 
						|
         MIPS 32-bit and 64-bit
 | 
						|
         Power PC 32-bit and 64-bit
 | 
						|
         SPARC 32-bit
 | 
						|
 | 
						|
       If --enable-jit is set on an unsupported platform, compilation fails.
 | 
						|
 | 
						|
       A  program  can  tell if JIT support is available by calling pcre2_con-
 | 
						|
       fig() with the PCRE2_CONFIG_JIT option. The result is  1  when  JIT  is
 | 
						|
       available,  and 0 otherwise. However, a simple program does not need to
 | 
						|
       check this in order to use JIT. The API is implemented in  a  way  that
 | 
						|
       falls  back  to the interpretive code if JIT is not available. For pro-
 | 
						|
       grams that need the best possible performance, there is  also  a  "fast
 | 
						|
       path" API that is JIT-specific.
 | 
						|
 | 
						|
 | 
						|
SIMPLE USE OF JIT
 | 
						|
 | 
						|
       To  make use of the JIT support in the simplest way, all you have to do
 | 
						|
       is to call pcre2_jit_compile() after successfully compiling  a  pattern
 | 
						|
       with pcre2_compile(). This function has two arguments: the first is the
 | 
						|
       compiled pattern pointer that was returned by pcre2_compile(), and  the
 | 
						|
       second  is  zero  or  more of the following option bits: PCRE2_JIT_COM-
 | 
						|
       PLETE, PCRE2_JIT_PARTIAL_HARD, or PCRE2_JIT_PARTIAL_SOFT.
 | 
						|
 | 
						|
       If JIT support is not available, a  call  to  pcre2_jit_compile()  does
 | 
						|
       nothing  and returns PCRE2_ERROR_JIT_BADOPTION. Otherwise, the compiled
 | 
						|
       pattern is passed to the JIT compiler, which turns it into machine code
 | 
						|
       that executes much faster than the normal interpretive code, but yields
 | 
						|
       exactly the same results. The returned value  from  pcre2_jit_compile()
 | 
						|
       is zero on success, or a negative error code.
 | 
						|
 | 
						|
       There  is  a limit to the size of pattern that JIT supports, imposed by
 | 
						|
       the size of machine stack that it uses. The exact rules are  not  docu-
 | 
						|
       mented  because  they  may  change at any time, in particular, when new
 | 
						|
       optimizations are introduced.  If a pattern  is  too  big,  a  call  to
 | 
						|
       pcre2_jit_compile() returns PCRE2_ERROR_NOMEMORY.
 | 
						|
 | 
						|
       PCRE2_JIT_COMPLETE  requests the JIT compiler to generate code for com-
 | 
						|
       plete matches. If you want to run partial matches using the  PCRE2_PAR-
 | 
						|
       TIAL_HARD  or  PCRE2_PARTIAL_SOFT  options of pcre2_match(), you should
 | 
						|
       set one or both of  the  other  options  as  well  as,  or  instead  of
 | 
						|
       PCRE2_JIT_COMPLETE. The JIT compiler generates different optimized code
 | 
						|
       for each of the three modes (normal, soft partial, hard partial).  When
 | 
						|
       pcre2_match()  is  called,  the appropriate code is run if it is avail-
 | 
						|
       able. Otherwise, the pattern is matched using interpretive code.
 | 
						|
 | 
						|
       You can call pcre2_jit_compile() multiple times for the  same  compiled
 | 
						|
       pattern.  It does nothing if it has previously compiled code for any of
 | 
						|
       the option bits. For example, you can call it once with  PCRE2_JIT_COM-
 | 
						|
       PLETE  and  (perhaps  later,  when  you find you need partial matching)
 | 
						|
       again with PCRE2_JIT_COMPLETE and PCRE2_JIT_PARTIAL_HARD. This time  it
 | 
						|
       will ignore PCRE2_JIT_COMPLETE and just compile code for partial match-
 | 
						|
       ing. If pcre2_jit_compile() is called with no option bits set, it imme-
 | 
						|
       diately returns zero. This is an alternative way of testing whether JIT
 | 
						|
       is available.
 | 
						|
 | 
						|
       At present, it is not possible to free JIT compiled  code  except  when
 | 
						|
       the entire compiled pattern is freed by calling pcre2_code_free().
 | 
						|
 | 
						|
       In  some circumstances you may need to call additional functions. These
 | 
						|
       are described in the  section  entitled  "Controlling  the  JIT  stack"
 | 
						|
       below.
 | 
						|
 | 
						|
       There are some pcre2_match() options that are not supported by JIT, and
 | 
						|
       there are also some pattern items that JIT cannot handle.  Details  are
 | 
						|
       given  below.  In  both cases, matching automatically falls back to the
 | 
						|
       interpretive code. If you want to know whether JIT  was  actually  used
 | 
						|
       for  a particular match, you should arrange for a JIT callback function
 | 
						|
       to be set up as described in the section entitled "Controlling the  JIT
 | 
						|
       stack"  below,  even  if  you  do  not need to supply a non-default JIT
 | 
						|
       stack. Such a callback function is called whenever JIT code is about to
 | 
						|
       be  obeyed.  If the match-time options are not right for JIT execution,
 | 
						|
       the callback function is not obeyed.
 | 
						|
 | 
						|
       If the JIT compiler finds an unsupported item, no JIT  data  is  gener-
 | 
						|
       ated.  You  can find out if JIT matching is available after compiling a
 | 
						|
       pattern by calling  pcre2_pattern_info()  with  the  PCRE2_INFO_JITSIZE
 | 
						|
       option.  A non-zero result means that JIT compilation was successful. A
 | 
						|
       result of 0 means that JIT support is not available, or the pattern was
 | 
						|
       not  processed by pcre2_jit_compile(), or the JIT compiler was not able
 | 
						|
       to handle the pattern.
 | 
						|
 | 
						|
 | 
						|
UNSUPPORTED OPTIONS AND PATTERN ITEMS
 | 
						|
 | 
						|
       The pcre2_match() options that  are  supported  for  JIT  matching  are
 | 
						|
       PCRE2_NOTBOL,   PCRE2_NOTEOL,  PCRE2_NOTEMPTY,  PCRE2_NOTEMPTY_ATSTART,
 | 
						|
       PCRE2_NO_UTF_CHECK,  PCRE2_PARTIAL_HARD,  and  PCRE2_PARTIAL_SOFT.  The
 | 
						|
       PCRE2_ANCHORED option is not supported at match time.
 | 
						|
 | 
						|
       If  the  PCRE2_NO_JIT option is passed to pcre2_match() it disables the
 | 
						|
       use of JIT, forcing matching by the interpreter code.
 | 
						|
 | 
						|
       The only unsupported pattern items are \C (match a  single  data  unit)
 | 
						|
       when  running in a UTF mode, and a callout immediately before an asser-
 | 
						|
       tion condition in a conditional group.
 | 
						|
 | 
						|
 | 
						|
RETURN VALUES FROM JIT MATCHING
 | 
						|
 | 
						|
       When a pattern is matched using JIT matching, the return values are the
 | 
						|
       same  as  those  given by the interpretive pcre2_match() code, with the
 | 
						|
       addition of one new error code: PCRE2_ERROR_JIT_STACKLIMIT. This  means
 | 
						|
       that  the memory used for the JIT stack was insufficient. See "Control-
 | 
						|
       ling the JIT stack" below for a discussion of JIT stack usage.
 | 
						|
 | 
						|
       The error code PCRE2_ERROR_MATCHLIMIT is returned by the  JIT  code  if
 | 
						|
       searching  a  very large pattern tree goes on for too long, as it is in
 | 
						|
       the same circumstance when JIT is not used, but the details of  exactly
 | 
						|
       what  is counted are not the same. The PCRE2_ERROR_RECURSIONLIMIT error
 | 
						|
       code is never returned when JIT matching is used.
 | 
						|
 | 
						|
 | 
						|
CONTROLLING THE JIT STACK
 | 
						|
 | 
						|
       When the compiled JIT code runs, it needs a block of memory to use as a
 | 
						|
       stack.   By  default,  it  uses 32K on the machine stack. However, some
 | 
						|
       large  or  complicated  patterns  need  more  than  this.   The   error
 | 
						|
       PCRE2_ERROR_JIT_STACKLIMIT  is  given  when  there is not enough stack.
 | 
						|
       Three functions are provided for managing blocks of memory for  use  as
 | 
						|
       JIT  stacks. There is further discussion about the use of JIT stacks in
 | 
						|
       the section entitled "JIT stack FAQ" below.
 | 
						|
 | 
						|
       The pcre2_jit_stack_create() function creates a JIT  stack.  Its  argu-
 | 
						|
       ments  are  a starting size, a maximum size, and a general context (for
 | 
						|
       memory allocation functions, or NULL for standard  memory  allocation).
 | 
						|
       It returns a pointer to an opaque structure of type pcre2_jit_stack, or
 | 
						|
       NULL if there is an error. The pcre2_jit_stack_free() function is  used
 | 
						|
       to  free a stack that is no longer needed. (For the technically minded:
 | 
						|
       the address space is allocated by mmap or VirtualAlloc.)
 | 
						|
 | 
						|
       JIT uses far less memory for recursion than the interpretive code,  and
 | 
						|
       a  maximum  stack size of 512K to 1M should be more than enough for any
 | 
						|
       pattern.
 | 
						|
 | 
						|
       The pcre2_jit_stack_assign() function specifies which  stack  JIT  code
 | 
						|
       should use. Its arguments are as follows:
 | 
						|
 | 
						|
         pcre2_match_context  *mcontext
 | 
						|
         pcre2_jit_callback    callback
 | 
						|
         void                 *data
 | 
						|
 | 
						|
       The first argument is a pointer to a match context. When this is subse-
 | 
						|
       quently passed to a matching function, its information determines which
 | 
						|
       JIT  stack  is  used. There are three cases for the values of the other
 | 
						|
       two arguments:
 | 
						|
 | 
						|
         (1) If callback is NULL and data is NULL, an internal 32K block
 | 
						|
             on the machine stack is used. This is the default when a match
 | 
						|
             context is created.
 | 
						|
 | 
						|
         (2) If callback is NULL and data is not NULL, data must be
 | 
						|
             a pointer to a valid JIT stack, the result of calling
 | 
						|
             pcre2_jit_stack_create().
 | 
						|
 | 
						|
         (3) If callback is not NULL, it must point to a function that is
 | 
						|
             called with data as an argument at the start of matching, in
 | 
						|
             order to set up a JIT stack. If the return from the callback
 | 
						|
             function is NULL, the internal 32K stack is used; otherwise the
 | 
						|
             return value must be a valid JIT stack, the result of calling
 | 
						|
             pcre2_jit_stack_create().
 | 
						|
 | 
						|
       A callback function is obeyed whenever JIT code is about to be run;  it
 | 
						|
       is not obeyed when pcre2_match() is called with options that are
       incompatible with JIT matching. A callback function can therefore be
       used to
 | 
						|
       determine  whether  a  match  operation  was  executed by JIT or by the
 | 
						|
       interpreter.
 | 
						|
 | 
						|
       You may safely use the same JIT stack for more than one pattern (either
 | 
						|
       by  assigning  directly  or  by  callback), as long as the patterns are
 | 
						|
       matched sequentially in the same thread. Currently, the only way to set
 | 
						|
       up  non-sequential matches in one thread is to use callouts: if a call-
 | 
						|
       out function starts another match, that match must use a different  JIT
 | 
						|
       stack to the one used for currently suspended match(es).
 | 
						|
 | 
						|
       In  a multithread application, if you do not specify a JIT stack, or if
 | 
						|
       you assign or pass back NULL from  a  callback,  that  is  thread-safe,
 | 
						|
       because  each  thread has its own machine stack. However, if you assign
 | 
						|
       or pass back a non-NULL JIT stack, this must be a different  stack  for
 | 
						|
       each thread so that the application is thread-safe.
 | 
						|
 | 
						|
       Strictly  speaking,  even more is allowed. You can assign the same non-
 | 
						|
       NULL stack to a match context that is used by any number  of  patterns,
 | 
						|
       as  long  as  they are not used for matching by multiple threads at the
 | 
						|
       same time. For example, you could use the same stack  in  all  compiled
 | 
						|
       patterns,  with  a global mutex in the callback to wait until the stack
 | 
						|
       is available for use. However, this is an inefficient solution, and not
 | 
						|
       recommended.
 | 
						|
 | 
						|
       This  is a suggestion for how a multithreaded program that needs to set
 | 
						|
       up non-default JIT stacks might operate:
 | 
						|
 | 
						|
         During thread initialization
 | 
						|
           thread_local_var = pcre2_jit_stack_create(...)
 | 
						|
 | 
						|
         During thread exit
 | 
						|
           pcre2_jit_stack_free(thread_local_var)
 | 
						|
 | 
						|
         Use a one-line callback function
 | 
						|
           return thread_local_var
 | 
						|
 | 
						|
       All the functions described in this section do nothing if  JIT  is  not
 | 
						|
       available.
 | 
						|
 | 
						|
 | 
						|
JIT STACK FAQ
 | 
						|
 | 
						|
       (1) Why do we need JIT stacks?
 | 
						|
 | 
						|
       PCRE2 (and JIT) is a recursive, depth-first engine, so it needs a stack
 | 
						|
       where the local data of the current node is pushed before checking  its
 | 
						|
       child nodes. Allocating real machine stack on some platforms is
       difficult. For example, on PowerPC the stack chain needs to be updated
       every time the stack is extended. Although this is possible, the
       overhead of updating it decreases performance. So we do the recursion
       in memory.
 | 
						|
 | 
						|
       (2) Why don't we simply allocate blocks of memory with malloc()?
 | 
						|
 | 
						|
       Modern operating systems have a  nice  feature:  they  can  reserve  an
 | 
						|
       address space instead of allocating memory. We can safely allocate mem-
 | 
						|
       ory pages inside this address space, so the stack  could  grow  without
 | 
						|
       moving memory data (this is important because of pointers). Thus we can
 | 
						|
       allocate 1M address space, and use only a single memory  page  (usually
 | 
						|
       4K) if that is enough. However, it can grow up to 1M at any time if
 | 
						|
       needed.
 | 
						|
 | 
						|
       (3) Who "owns" a JIT stack?
 | 
						|
 | 
						|
       The owner of the stack is the user program, not the JIT-compiled pattern
 | 
						|
       or anything else. The user program must ensure that if a stack is being
 | 
						|
       used by pcre2_match() (that is, it is assigned to a match context that
 | 
						|
       is  passed  to  the  pattern currently running), that stack must not be
 | 
						|
       used by any other threads (to avoid overwriting the same memory  area).
 | 
						|
       The best practice for multithreaded programs is to allocate a stack for
 | 
						|
       each thread, and return this stack through the JIT callback function.
 | 
						|
 | 
						|
       (4) When should a JIT stack be freed?
 | 
						|
 | 
						|
       You can free a JIT stack at any time, as long as it will not be used by
 | 
						|
       pcre2_match() again. When you assign the stack to a match context, only
 | 
						|
       a pointer is set. There is no reference counting or  any  other  magic.
 | 
						|
       You can free compiled patterns, contexts, and stacks in any order, at
       any time. Just do not call pcre2_match() with a match context pointing
       to an already freed stack, as that will cause a segmentation fault.
       (Also, do not free
 | 
						|
       a stack currently used by pcre2_match() in  another  thread).  You  can
 | 
						|
       also  replace the stack in a context at any time when it is not in use.
 | 
						|
       You should free the previous stack before assigning a replacement.
 | 
						|
 | 
						|
       (5) Should I allocate/free a  stack  every  time  before/after  calling
 | 
						|
       pcre2_match()?
 | 
						|
 | 
						|
       No,  because  this  is  too  costly in terms of resources. However, you
 | 
						|
       could implement some clever scheme that releases the stack if it is not
 | 
						|
       used  in  let's  say  two minutes. The JIT callback can help to achieve
 | 
						|
       this without keeping a list of patterns.
 | 
						|
 | 
						|
       (6) OK, the stack is for long term memory allocation. But what  happens
 | 
						|
       if  a pattern causes stack overflow with a stack of 1M? Is that 1M kept
 | 
						|
       until the stack is freed?
 | 
						|
 | 
						|
       Especially on embedded systems, it might be a good idea to release
       memory sometimes without freeing the stack. There is no API for this at
 | 
						|
       the moment.  Probably a function call which returns with the  currently
 | 
						|
       allocated  memory for any stack and another which allows releasing mem-
 | 
						|
       ory (shrinking the stack) would be a good idea if someone needs this.
 | 
						|
 | 
						|
       (7) This is too much of a headache. Isn't there any better solution for
 | 
						|
       JIT stack handling?
 | 
						|
 | 
						|
       No,  thanks to Windows. If POSIX threads were used everywhere, we could
 | 
						|
       throw out this complicated API.
 | 
						|
 | 
						|
 | 
						|
FREEING JIT SPECULATIVE MEMORY
 | 
						|
 | 
						|
       void pcre2_jit_free_unused_memory(pcre2_general_context *gcontext);
 | 
						|
 | 
						|
       The JIT executable allocator does not free all memory when it is possi-
 | 
						|
       ble.   It expects new allocations, and keeps some free memory around to
 | 
						|
       improve allocation speed. However, in low memory conditions,  it  might
 | 
						|
       be  better to free all possible memory. You can cause this to happen by
 | 
						|
       calling pcre2_jit_free_unused_memory(). Its argument is a general  con-
 | 
						|
       text, for custom memory management, or NULL for standard memory manage-
 | 
						|
       ment.
 | 
						|
 | 
						|
 | 
						|
EXAMPLE CODE
 | 
						|
 | 
						|
       This is a single-threaded example that specifies a  JIT  stack  without
 | 
						|
       using  a  callback.  A real program should include error checking after
 | 
						|
       all the function calls.
 | 
						|
 | 
						|
         int rc;
         int errornumber;            /* set by pcre2_compile() on error */
         PCRE2_SIZE erroffset;       /* set by pcre2_compile() on error */
         pcre2_code *re;
         pcre2_match_data *match_data;
         pcre2_match_context *mcontext;
         pcre2_jit_stack *jit_stack;

         /* pattern, subject, and length are assumed to be set up elsewhere */
         re = pcre2_compile(pattern, PCRE2_ZERO_TERMINATED, 0,
           &errornumber, &erroffset, NULL);
         rc = pcre2_jit_compile(re, PCRE2_JIT_COMPLETE);
         mcontext = pcre2_match_context_create(NULL);
         /* Stack starts at 32K and may grow to 512K */
         jit_stack = pcre2_jit_stack_create(32*1024, 512*1024, NULL);
         /* NULL callback: the stack is assigned directly */
         pcre2_jit_stack_assign(mcontext, NULL, jit_stack);
         match_data = pcre2_match_data_create(re, 10);
         rc = pcre2_match(re, subject, length, 0, 0, match_data, mcontext);
         /* Process result */

         pcre2_code_free(re);
         pcre2_match_data_free(match_data);
         pcre2_match_context_free(mcontext);
         pcre2_jit_stack_free(jit_stack);
 | 
						|
 | 
						|
 | 
						|
JIT FAST PATH API
 | 
						|
 | 
						|
       Because the API described above falls back to interpreted matching when
 | 
						|
       JIT  is  not  available, it is convenient for programs that are written
 | 
						|
       for  general  use  in  many  environments.  However,  calling  JIT  via
 | 
						|
       pcre2_match() does have a performance impact. Programs that are written
 | 
						|
       for use where JIT is known to be available, and  which  need  the  best
 | 
						|
       possible  performance,  can  instead  use a "fast path" API to call JIT
 | 
						|
       matching directly instead of calling pcre2_match() (obviously only  for
 | 
						|
       patterns that have been successfully processed by pcre2_jit_compile()).
 | 
						|
 | 
						|
       The  fast  path  function  is  called  pcre2_jit_match(),  and it takes
 | 
						|
       exactly the same arguments as pcre2_match(). The return values are also
 | 
						|
       the same, plus PCRE2_ERROR_JIT_BADOPTION if a matching mode (partial or
 | 
						|
       complete) is requested that was not compiled. Unsupported  option  bits
 | 
						|
       (for  example,  PCRE2_ANCHORED)  are  ignored,  as  is the PCRE2_NO_JIT
 | 
						|
       option.
 | 
						|
 | 
						|
       When you call pcre2_match(), as well as testing for invalid options,  a
 | 
						|
       number of other sanity checks are performed on the arguments. For exam-
 | 
						|
       ple, if the subject pointer is NULL, an immediate error is given. Also,
 | 
						|
       unless  PCRE2_NO_UTF_CHECK  is  set, a UTF subject string is tested for
 | 
						|
       validity. In the interests of speed, these checks do not happen on  the
 | 
						|
       JIT fast path, and if invalid data is passed, the result is undefined.
 | 
						|
 | 
						|
       Bypassing  the  sanity  checks  and the pcre2_match() wrapping can give
 | 
						|
       speedups of more than 10%.
 | 
						|
 | 
						|
 | 
						|
SEE ALSO
 | 
						|
 | 
						|
       pcre2api(3)
 | 
						|
 | 
						|
 | 
						|
AUTHOR
 | 
						|
 | 
						|
       Philip Hazel (FAQ by Zoltan Herczeg)
 | 
						|
       University Computing Service
 | 
						|
       Cambridge, England.
 | 
						|
 | 
						|
 | 
						|
REVISION
 | 
						|
 | 
						|
       Last updated: 05 June 2016
 | 
						|
       Copyright (c) 1997-2016 University of Cambridge.
 | 
						|
------------------------------------------------------------------------------
 | 
						|
 | 
						|
 | 
						|
PCRE2LIMITS(3)             Library Functions Manual             PCRE2LIMITS(3)
 | 
						|
 | 
						|
 | 
						|
 | 
						|
NAME
 | 
						|
       PCRE2 - Perl-compatible regular expressions (revised API)
 | 
						|
 | 
						|
SIZE AND OTHER LIMITATIONS
 | 
						|
 | 
						|
       There are some size limitations in PCRE2 but it is hoped that they will
 | 
						|
       never in practice be relevant.
 | 
						|
 | 
						|
       The maximum size of a compiled pattern is approximately 64K code  units
 | 
						|
       for  the  8-bit  and  16-bit  libraries  if  PCRE2 is compiled with the
 | 
						|
       default internal linkage size, which is 2 bytes for these libraries. If
 | 
						|
       you  want  to  process regular expressions that are truly enormous, you
 | 
						|
       can compile PCRE2 with an internal linkage size of 3 or 4 (when  build-
 | 
						|
       ing  the  16-bit library, 3 is rounded up to 4). See the README file in
 | 
						|
       the source distribution and the pcre2build documentation  for  details.
 | 
						|
       In  these  cases the limit is substantially larger.  However, the speed
 | 
						|
       of execution is slower. In the 32-bit  library,  the  internal  linkage
 | 
						|
       size is always 4.
 | 
						|
 | 
						|
       The maximum length of a source pattern string is essentially unlimited;
 | 
						|
       it is the largest number a PCRE2_SIZE variable can hold.  However,  the
 | 
						|
       program that calls pcre2_compile() can specify a smaller limit.
 | 
						|
 | 
						|
       The maximum length (in code units) of a subject string is one less than
 | 
						|
       the largest number a PCRE2_SIZE variable can  hold.  PCRE2_SIZE  is  an
 | 
						|
       unsigned  integer  type,  usually  defined as size_t. Its maximum value
 | 
						|
       (that is ~(PCRE2_SIZE)0) is reserved as a special indicator  for  zero-
 | 
						|
       terminated strings and unset offsets.
 | 
						|
 | 
						|
       Note  that  when  using  the  traditional matching function, PCRE2 uses
 | 
						|
       recursion to handle subpatterns and indefinite repetition.  This  means
 | 
						|
       that  the  available stack space may limit the size of a subject string
 | 
						|
       that can be processed by certain patterns. For a  discussion  of  stack
 | 
						|
       issues, see the pcre2stack documentation.
 | 
						|
 | 
						|
       All values in repeating quantifiers must be less than 65536.
 | 
						|
 | 
						|
       The maximum length of a lookbehind assertion is 65535 characters.
 | 
						|
 | 
						|
       There is no limit to the number of parenthesized subpatterns, but there
 | 
						|
       can be no more than 65535 capturing subpatterns. There is,  however,  a
 | 
						|
       limit  to  the  depth  of  nesting  of parenthesized subpatterns of all
 | 
						|
       kinds. This is imposed in order to limit the  amount  of  system  stack
 | 
						|
       used  at compile time. The default limit can be specified when PCRE2 is
 | 
						|
       built; the built-in default is 250. An application can change this limit
 | 
						|
       by  calling pcre2_set_parens_nest_limit() to set the limit in a compile
 | 
						|
       context.
 | 
						|
 | 
						|
       The maximum length of a name for a named subpattern is 32 code units,
       and the maximum number of named subpatterns is 10000.
 | 
						|
 | 
						|
       The  maximum  length  of  a  name  in  a (*MARK), (*PRUNE), (*SKIP), or
 | 
						|
       (*THEN) verb is 255 code units for the 8-bit  library  and  65535  code
 | 
						|
       units for the 16-bit and 32-bit libraries.
 | 
						|
 | 
						|
       The  maximum  length  of  a string argument to a callout is the largest
 | 
						|
       number a 32-bit unsigned integer can hold.
 | 
						|
 | 
						|
 | 
						|
AUTHOR
 | 
						|
 | 
						|
       Philip Hazel
 | 
						|
       University Computing Service
 | 
						|
       Cambridge, England.
 | 
						|
 | 
						|
 | 
						|
REVISION
 | 
						|
 | 
						|
       Last updated: 26 October 2016
 | 
						|
       Copyright (c) 1997-2016 University of Cambridge.
 | 
						|
------------------------------------------------------------------------------
 | 
						|
 | 
						|
 | 
						|
PCRE2MATCHING(3)           Library Functions Manual           PCRE2MATCHING(3)
 | 
						|
 | 
						|
 | 
						|
 | 
						|
NAME
 | 
						|
       PCRE2 - Perl-compatible regular expressions (revised API)
 | 
						|
 | 
						|
PCRE2 MATCHING ALGORITHMS
 | 
						|
 | 
						|
       This document describes the two different algorithms that are available
 | 
						|
       in PCRE2 for matching a compiled regular  expression  against  a  given
 | 
						|
       subject  string.  The  "standard"  algorithm is the one provided by the
 | 
						|
       pcre2_match() function. This works in the same way as Perl's matching
       function, and provides a Perl-compatible matching operation. The
       just-in-time (JIT) optimization that is described in the pcre2jit
       documentation is compatible with this function.
 | 
						|
 | 
						|
       An alternative algorithm is provided by the pcre2_dfa_match() function;
 | 
						|
       it operates in a different way, and is not Perl-compatible. This alter-
 | 
						|
       native  has  advantages  and  disadvantages  compared with the standard
 | 
						|
       algorithm, and these are described below.
 | 
						|
 | 
						|
       When there is only one possible way in which a given subject string can
 | 
						|
       match  a pattern, the two algorithms give the same answer. A difference
 | 
						|
       arises, however, when there are multiple possibilities. For example, if
 | 
						|
       the pattern
 | 
						|
 | 
						|
         ^<.*>
 | 
						|
 | 
						|
       is matched against the string
 | 
						|
 | 
						|
         <something> <something else> <something further>
 | 
						|
 | 
						|
       there are three possible answers. The standard algorithm finds only one
 | 
						|
       of them, whereas the alternative algorithm finds all three.
 | 
						|
 | 
						|
 | 
						|
REGULAR EXPRESSIONS AS TREES
 | 
						|
 | 
						|
       The set of strings that are matched by a regular expression can be rep-
 | 
						|
       resented  as  a  tree structure. An unlimited repetition in the pattern
 | 
						|
       makes the tree of infinite size, but it is still a tree.  Matching  the
 | 
						|
       pattern  to a given subject string (from a given starting point) can be
 | 
						|
       thought of as a search of the tree.  There are two  ways  to  search  a
 | 
						|
       tree:  depth-first  and  breadth-first, and these correspond to the two
 | 
						|
       matching algorithms provided by PCRE2.
 | 
						|
 | 
						|
 | 
						|
THE STANDARD MATCHING ALGORITHM
 | 
						|
 | 
						|
       In the terminology of Jeffrey Friedl's book "Mastering Regular  Expres-
 | 
						|
       sions",  the  standard  algorithm  is an "NFA algorithm". It conducts a
 | 
						|
       depth-first search of the pattern tree. That is, it  proceeds  along  a
 | 
						|
       single path through the tree, checking that the subject matches what is
 | 
						|
       required. When there is a mismatch, the algorithm  tries  any  alterna-
 | 
						|
       tives  at  the  current point, and if they all fail, it backs up to the
 | 
						|
       previous branch point in the  tree,  and  tries  the  next  alternative
 | 
						|
       branch  at  that  level.  This often involves backing up (moving to the
 | 
						|
       left) in the subject string as well.  The  order  in  which  repetition
 | 
						|
       branches  are  tried  is controlled by the greedy or ungreedy nature of
 | 
						|
       the quantifier.
 | 
						|
 | 
						|
       If a leaf node is reached, a matching string has  been  found,  and  at
 | 
						|
       that  point the algorithm stops. Thus, if there is more than one possi-
 | 
						|
       ble match, this algorithm returns the first one that it finds.  Whether
 | 
						|
       this  is the shortest, the longest, or some intermediate length depends
 | 
						|
       on the way the greedy and ungreedy repetition quantifiers are specified
 | 
						|
       in the pattern.
 | 
						|
 | 
						|
       Because  it  ends  up  with a single path through the tree, it is rela-
 | 
						|
       tively straightforward for this algorithm to keep  track  of  the  sub-
 | 
						|
       strings  that  are  matched  by portions of the pattern in parentheses.
 | 
						|
       This provides support for capturing parentheses and back references.
 | 
						|
 | 
						|
 | 
						|
THE ALTERNATIVE MATCHING ALGORITHM
 | 
						|
 | 
						|
       This algorithm conducts a breadth-first search of  the  tree.  Starting
 | 
						|
       from  the  first  matching  point  in the subject, it scans the subject
 | 
						|
       string from left to right, once, character by character, and as it does
 | 
						|
       this,  it remembers all the paths through the tree that represent valid
 | 
						|
       matches. In Friedl's terminology, this is a kind  of  "DFA  algorithm",
 | 
						|
       though  it is not implemented as a traditional finite state machine (it
 | 
						|
       keeps multiple states active simultaneously).
 | 
						|
 | 
						|
       Although the general principle of this matching algorithm  is  that  it
 | 
						|
       scans  the subject string only once, without backtracking, there is one
 | 
						|
       exception: when a lookaround assertion is encountered,  the  characters
 | 
						|
       following  or  preceding  the  current  point  have to be independently
 | 
						|
       inspected.
 | 
						|
 | 
						|
       The scan continues until either the end of the subject is  reached,  or
 | 
						|
       there  are  no more unterminated paths. At this point, terminated paths
 | 
						|
       represent the different matching possibilities (if there are none,  the
 | 
						|
       match  has  failed).   Thus,  if there is more than one possible match,
 | 
						|
       this algorithm finds all of them, and in particular, it finds the long-
 | 
						|
       est.  The  matches are returned in decreasing order of length. There is
 | 
						|
       an option to stop the algorithm after the first match (which is  neces-
 | 
						|
       sarily the shortest) is found.
 | 
						|
 | 
						|
       Note that all the matches that are found start at the same point in the
 | 
						|
       subject. If the pattern
 | 
						|
 | 
						|
         cat(er(pillar)?)?
 | 
						|
 | 
						|
       is matched against the string "the caterpillar catchment",  the  result
 | 
						|
       is  the  three  strings "caterpillar", "cater", and "cat" that start at
 | 
						|
       the fifth character of the subject. The algorithm  does  not  automati-
 | 
						|
       cally move on to find matches that start at later positions.
 | 
						|
 | 
						|
       PCRE2's "auto-possessification" optimization usually applies to charac-
 | 
						|
       ter repeats at the end of a pattern (as well as internally). For  exam-
 | 
						|
       ple, the pattern "a\d+" is compiled as if it were "a\d++" because there
 | 
						|
       is no point even considering the possibility of backtracking  into  the
 | 
						|
       repeated  digits.  For  DFA matching, this means that only one possible
 | 
						|
       match is found. If you really do want multiple matches in  such  cases,
 | 
						|
       either  use  an ungreedy repeat ("a\d+?") or set the PCRE2_NO_AUTO_POS-
 | 
						|
       SESS option when compiling.
 | 
						|
 | 
						|
       There are a number of features of PCRE2 regular  expressions  that  are
 | 
						|
       not  supported  by the alternative matching algorithm. They are as fol-
 | 
						|
       lows:
 | 
						|
 | 
						|
       1. Because the algorithm finds all  possible  matches,  the  greedy  or
 | 
						|
       ungreedy  nature  of  repetition quantifiers is not relevant (though it
 | 
						|
       may affect auto-possessification, as just described). During  matching,
 | 
						|
       greedy  and  ungreedy  quantifiers are treated in exactly the same way.
 | 
						|
       However, possessive quantifiers can make a difference when what follows
 | 
						|
       could  also  match  what  is  quantified, for example in a pattern like
 | 
						|
       this:
 | 
						|
 | 
						|
         ^a++\w!
 | 
						|
 | 
						|
       This pattern matches "aaab!" but not "aaa!", which would be matched  by
 | 
						|
       a  non-possessive quantifier. Similarly, if an atomic group is present,
 | 
						|
       it is matched as if it were a standalone pattern at the current  point,
 | 
						|
       and  the  longest match is then "locked in" for the rest of the overall
 | 
						|
       pattern.
 | 
						|
 | 
						|
       2. When dealing with multiple paths through the tree simultaneously, it
 | 
						|
       is  not  straightforward  to  keep track of captured substrings for the
 | 
						|
       different matching possibilities, and PCRE2's  implementation  of  this
 | 
						|
       algorithm does not attempt to do this. This means that no captured sub-
 | 
						|
       strings are available.
 | 
						|
 | 
						|
       3. Because no substrings are captured, back references within the  pat-
 | 
						|
       tern are not supported, and cause errors if encountered.
 | 
						|
 | 
						|
       4.  For  the same reason, conditional expressions that use a backrefer-
 | 
						|
       ence as the condition or test for a specific group  recursion  are  not
 | 
						|
       supported.
 | 
						|
 | 
						|
       5.  Because  many  paths  through the tree may be active, the \K escape
 | 
						|
       sequence, which resets the start of the match when encountered (but may
 | 
						|
       be  on  some  paths  and not on others), is not supported. It causes an
 | 
						|
       error if encountered.
 | 
						|
 | 
						|
       6. Callouts are supported, but the value of the  capture_top  field  is
 | 
						|
       always 1, and the value of the capture_last field is always 0.
 | 
						|
 | 
						|
       7.  The  \C  escape  sequence, which (in the standard algorithm) always
 | 
						|
       matches a single code unit, even in a UTF mode,  is  not  supported  in
 | 
						|
       these  modes,  because the alternative algorithm moves through the sub-
 | 
						|
       ject string one character (not code unit) at a  time,  for  all  active
 | 
						|
       paths through the tree.
 | 
						|
 | 
						|
       8.  Except for (*FAIL), the backtracking control verbs such as (*PRUNE)
 | 
						|
       are not supported. (*FAIL) is supported, and  behaves  like  a  failing
 | 
						|
       negative assertion.
 | 
						|
 | 
						|
 | 
						|
ADVANTAGES OF THE ALTERNATIVE ALGORITHM
 | 
						|
 | 
						|
       Using  the alternative matching algorithm provides the following advan-
 | 
						|
       tages:
 | 
						|
 | 
						|
       1. All possible matches (at a single point in the subject) are automat-
 | 
						|
       ically  found,  and  in particular, the longest match is found. To find
 | 
						|
       more than one match using the standard algorithm, you have to do kludgy
 | 
						|
       things with callouts.
 | 
						|
 | 
						|
       2.  Because  the  alternative  algorithm  scans the subject string just
 | 
						|
       once, and never needs to backtrack (except for lookbehinds), it is pos-
 | 
						|
       sible  to  pass  very  long subject strings to the matching function in
 | 
						|
       several pieces, checking for partial matching each time. Although it is
 | 
						|
       also  possible  to  do  multi-segment matching using the standard algo-
 | 
						|
       rithm, by retaining partially matched substrings, it  is  more  compli-
 | 
						|
       cated. The pcre2partial documentation gives details of partial matching
 | 
						|
       and discusses multi-segment matching.
 | 
						|
 | 
						|
 | 
						|
DISADVANTAGES OF THE ALTERNATIVE ALGORITHM
 | 
						|
 | 
						|
       The alternative algorithm suffers from a number of disadvantages:
 | 
						|
 | 
						|
       1. It is substantially slower than  the  standard  algorithm.  This  is
 | 
						|
       partly  because  it has to search for all possible matches, but is also
 | 
						|
       because it is less susceptible to optimization.
 | 
						|
 | 
						|
       2. Capturing parentheses and back references are not supported.
 | 
						|
 | 
						|
       3. Although atomic groups are supported, their use does not provide the
 | 
						|
       performance advantage that it does for the standard algorithm.
 | 
						|
 | 
						|
 | 
						|
AUTHOR
 | 
						|
 | 
						|
       Philip Hazel
 | 
						|
       University Computing Service
 | 
						|
       Cambridge, England.
 | 
						|
 | 
						|
 | 
						|
REVISION
 | 
						|
 | 
						|
       Last updated: 29 September 2014
 | 
						|
       Copyright (c) 1997-2014 University of Cambridge.
 | 
						|
------------------------------------------------------------------------------
 | 
						|
 | 
						|
 | 
						|
PCRE2PARTIAL(3)            Library Functions Manual            PCRE2PARTIAL(3)
 | 
						|
 | 
						|
 | 
						|
 | 
						|
NAME
 | 
						|
       PCRE2 - Perl-compatible regular expressions
 | 
						|
 | 
						|
PARTIAL MATCHING IN PCRE2
 | 
						|
 | 
						|
       In  normal  use  of  PCRE2,  if  the subject string that is passed to a
 | 
						|
       matching function matches as far as it goes, but is too short to  match
 | 
						|
       the  entire pattern, PCRE2_ERROR_NOMATCH is returned. There are circum-
 | 
						|
       stances where it might be helpful to distinguish this case  from  other
 | 
						|
       cases in which there is no match.
 | 
						|
 | 
						|
       Consider, for example, an application where a human is required to type
 | 
						|
       in data for a field with specific formatting requirements.  An  example
 | 
						|
       might be a date in the form ddmmmyy, defined by this pattern:
 | 
						|
 | 
						|
         ^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$
 | 
						|
 | 
						|
       If the application sees the user's keystrokes one by one, and can check
 | 
						|
       that what has been typed so far is potentially valid,  it  is  able  to
 | 
						|
       raise  an  error  as  soon  as  a  mistake  is made, by beeping and not
 | 
						|
       reflecting the character that has been typed, for example. This immedi-
 | 
						|
       ate  feedback is likely to be a better user interface than a check that
 | 
						|
       is delayed until the entire string has been entered.  Partial  matching
 | 
						|
       can  also be useful when the subject string is very long and is not all
 | 
						|
       available at once.
 | 
						|
 | 
						|
       PCRE2 supports partial matching by means of the PCRE2_PARTIAL_SOFT  and
 | 
						|
       PCRE2_PARTIAL_HARD  options,  which  can be set when calling a matching
 | 
						|
       function.  The difference between the two options is whether or  not  a
 | 
						|
       partial match is preferred to an alternative complete match, though the
 | 
						|
       details differ between the two types  of  matching  function.  If  both
 | 
						|
       options are set, PCRE2_PARTIAL_HARD takes precedence.
 | 
						|
 | 
						|
       If  you  want to use partial matching with just-in-time optimized code,
 | 
						|
       you must call pcre2_jit_compile() with one or both of these options:
 | 
						|
 | 
						|
         PCRE2_JIT_PARTIAL_SOFT
 | 
						|
         PCRE2_JIT_PARTIAL_HARD
 | 
						|
 | 
						|
       PCRE2_JIT_COMPLETE should also be set if you are going to run  non-par-
 | 
						|
       tial  matches  on the same pattern. If the appropriate JIT mode has not
 | 
						|
       been compiled, interpretive matching code is used.
 | 
						|
 | 
						|
       Setting a partial matching option  disables  two  of  PCRE2's  standard
 | 
						|
       optimizations. PCRE2 remembers the last literal code unit in a pattern,
 | 
						|
       and abandons matching immediately if it is not present in  the  subject
 | 
						|
       string.  This  optimization  cannot  be  used for a subject string that
 | 
						|
       might match only partially. PCRE2 also knows the minimum  length  of  a
 | 
						|
       matching  string,  and  does not bother to run the matching function on
 | 
						|
       shorter strings. This optimization is also disabled for partial  match-
 | 
						|
       ing.
 | 
						|
 | 
						|
 | 
						|
PARTIAL MATCHING USING pcre2_match()
 | 
						|
 | 
						|
       A  partial  match occurs during a call to pcre2_match() when the end of
 | 
						|
       the subject string is reached successfully, but  matching  cannot  con-
 | 
						|
       tinue because more characters are needed. However, at least one charac-
 | 
						|
       ter in the subject must have been inspected. This  character  need  not
 | 
						|
       form part of the final matched string; lookbehind assertions and the \K
 | 
						|
       escape sequence provide ways of inspecting characters before the  start
 | 
						|
       of  a matched string. The requirement for inspecting at least one char-
 | 
						|
       acter exists because an empty string can  always  be  matched;  without
 | 
						|
       such  a  restriction  there would always be a partial match of an empty
 | 
						|
       string at the end of the subject.
 | 
						|
 | 
						|
       When a partial match is returned, the first two elements in the ovector
 | 
						|
       point to the portion of the subject that was matched, but the values in
 | 
						|
       the rest of the ovector are undefined. The appearance of \K in the pat-
 | 
						|
       tern has no effect for a partial match. Consider this pattern:
 | 
						|
 | 
						|
         /abc\K123/
 | 
						|
 | 
						|
       If it is matched against "456abc123xyz" the result is a complete match,
 | 
						|
       and the ovector defines the matched string as "123", because \K  resets
 | 
						|
       the  "start  of  match" point. However, if a partial match is requested
 | 
						|
       and the subject string is "456abc12", a partial match is found for  the
 | 
						|
       string  "abc12",  because  all these characters are needed for a subse-
 | 
						|
       quent re-match with additional characters.
 | 
						|
 | 
						|
       What happens when a partial match is identified depends on which of the
 | 
						|
       two partial matching options is set.
 | 
						|
 | 
						|
   PCRE2_PARTIAL_SOFT WITH pcre2_match()
 | 
						|
 | 
						|
       If  PCRE2_PARTIAL_SOFT  is  set when pcre2_match() identifies a partial
 | 
						|
       match, the partial match is remembered, but matching continues as  nor-
 | 
						|
       mal,  and  other  alternatives in the pattern are tried. If no complete
 | 
						|
       match  can  be  found,  PCRE2_ERROR_PARTIAL  is  returned  instead   of
 | 
						|
       PCRE2_ERROR_NOMATCH.
 | 
						|
 | 
						|
       This  option  is "soft" because it prefers a complete match over a par-
 | 
						|
       tial match.  All the various matching items in a pattern behave  as  if
 | 
						|
       the  subject string is potentially complete. For example, \z, \Z, and $
 | 
						|
       match at the end of the subject, as normal, and for \b and \B  the  end
 | 
						|
       of the subject is treated as a non-alphanumeric.
 | 
						|
 | 
						|
       If  there  is more than one partial match, the first one that was found
 | 
						|
       provides the data that is returned. Consider this pattern:
 | 
						|
 | 
						|
         /123\w+X|dogY/
 | 
						|
 | 
						|
       If this is matched against the subject string "abc123dog", both  alter-
 | 
						|
       natives  fail  to  match,  but the end of the subject is reached during
 | 
						|
       matching, so PCRE2_ERROR_PARTIAL is returned. The offsets are set to  3
 | 
						|
       and  9, identifying "123dog" as the first partial match that was found.
 | 
						|
       (In this example, there are two partial matches, because "dog"  on  its
 | 
						|
       own partially matches the second alternative.)
 | 
						|
 | 
						|
   PCRE2_PARTIAL_HARD WITH pcre2_match()
 | 
						|
 | 
						|
       If  PCRE2_PARTIAL_HARD is set for pcre2_match(), PCRE2_ERROR_PARTIAL is
 | 
						|
       returned as soon as a partial match is  found,  without  continuing  to
 | 
						|
       search  for possible complete matches. This option is "hard" because it
 | 
						|
       prefers an earlier partial match over a later complete match. For  this
 | 
						|
       reason,  the  assumption  is  made that the end of the supplied subject
 | 
						|
       string may not be the true end of the available data, and  so,  if  \z,
 | 
						|
       \Z,  \b, \B, or $ are encountered at the end of the subject, the result
 | 
						|
       is PCRE2_ERROR_PARTIAL, provided that at least  one  character  in  the
 | 
						|
       subject has been inspected.
 | 
						|
 | 
						|
   Comparing hard and soft partial matching
 | 
						|
 | 
						|
       The  difference  between the two partial matching options can be illus-
 | 
						|
       trated by a pattern such as:
 | 
						|
 | 
						|
         /dog(sbody)?/
 | 
						|
 | 
						|
       This matches either "dog" or "dogsbody", greedily (that is, it  prefers
 | 
						|
       the  longer  string  if  possible). If it is matched against the string
 | 
						|
       "dog" with PCRE2_PARTIAL_SOFT, it yields a complete  match  for  "dog".
 | 
						|
       However,  if  PCRE2_PARTIAL_HARD is set, the result is PCRE2_ERROR_PAR-
 | 
						|
       TIAL. On the other hand, if the pattern is made ungreedy the result  is
 | 
						|
       different:
 | 
						|
 | 
						|
         /dog(sbody)??/
 | 
						|
 | 
						|
       In  this  case  the  result  is always a complete match because that is
 | 
						|
       found first, and matching never  continues  after  finding  a  complete
 | 
						|
       match. It might be easier to follow this explanation by thinking of the
 | 
						|
       two patterns like this:
 | 
						|
 | 
						|
         /dog(sbody)?/    is the same as  /dogsbody|dog/
 | 
						|
         /dog(sbody)??/   is the same as  /dog|dogsbody/
 | 
						|
 | 
						|
       The second pattern will never match "dogsbody", because it will  always
 | 
						|
       find the shorter match first.
 | 
						|
 | 
						|
 | 
						|
PARTIAL MATCHING USING pcre2_dfa_match()
 | 
						|
 | 
						|
       The DFA functions move along the subject string character by character,
 | 
						|
       without backtracking, searching for  all  possible  matches  simultane-
 | 
						|
       ously.  If the end of the subject is reached before the end of the pat-
 | 
						|
       tern, there is the possibility of a partial match, again provided  that
 | 
						|
       at least one character has been inspected.
 | 
						|
 | 
						|
       When PCRE2_PARTIAL_SOFT is set, PCRE2_ERROR_PARTIAL is returned only if
 | 
						|
       there have been no complete matches. Otherwise,  the  complete  matches
 | 
						|
       are  returned.   However, if PCRE2_PARTIAL_HARD is set, a partial match
 | 
						|
       takes precedence over any complete matches. The portion of  the  string
 | 
						|
       that was matched when the longest partial match was found is set as the
 | 
						|
       first matching string.
 | 
						|
 | 
						|
       Because the DFA functions always search for all possible  matches,  and
 | 
						|
       there  is  no  difference between greedy and ungreedy repetition, their
 | 
						|
       behaviour is different from  the  standard  functions  when  PCRE2_PAR-
 | 
						|
       TIAL_HARD  is  set.  Consider  the  string  "dog"  matched  against the
 | 
						|
       ungreedy pattern shown above:
 | 
						|
 | 
						|
         /dog(sbody)??/
 | 
						|
 | 
						|
       Whereas the standard function stops as soon as it  finds  the  complete
 | 
						|
       match  for  "dog",  the  DFA  function also finds the partial match for
 | 
						|
       "dogsbody", and so returns that when PCRE2_PARTIAL_HARD is set.
 | 
						|
 | 
						|
 | 
						|
PARTIAL MATCHING AND WORD BOUNDARIES
 | 
						|
 | 
						|
       If a pattern ends with one of the sequences \b or \B, which test for word
 | 
						|
       boundaries,  partial matching with PCRE2_PARTIAL_SOFT can give counter-
 | 
						|
       intuitive results. Consider this pattern:
 | 
						|
 | 
						|
         /\bcat\b/
 | 
						|
 | 
						|
       This matches "cat", provided there is a word boundary at either end. If
 | 
						|
       the subject string is "the cat", the comparison of the final "t" with a
 | 
						|
       following character cannot take place, so a  partial  match  is  found.
 | 
						|
       However,  normal  matching carries on, and \b matches at the end of the
 | 
						|
       subject when the last character is a letter, so  a  complete  match  is
 | 
						|
       found.   The  result,  therefore,  is  not  PCRE2_ERROR_PARTIAL.  Using
 | 
						|
       PCRE2_PARTIAL_HARD in this case does yield PCRE2_ERROR_PARTIAL, because
 | 
						|
       then the partial match takes precedence.
 | 
						|
 | 
						|
 | 
						|
EXAMPLE OF PARTIAL MATCHING USING PCRE2TEST
 | 
						|
 | 
						|
       If  the  partial_soft  (or  ps) modifier is present on a pcre2test data
 | 
						|
       line, the PCRE2_PARTIAL_SOFT option is used for the match.  Here  is  a
 | 
						|
       run of pcre2test that uses the date example quoted above:
 | 
						|
 | 
						|
           re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
 | 
						|
         data> 25jun04\=ps
 | 
						|
          0: 25jun04
 | 
						|
          1: jun
 | 
						|
         data> 25dec3\=ps
 | 
						|
         Partial match: 25dec3
 | 
						|
         data> 3ju\=ps
 | 
						|
         Partial match: 3ju
 | 
						|
         data> 3juj\=ps
 | 
						|
         No match
 | 
						|
         data> j\=ps
 | 
						|
         No match
 | 
						|
 | 
						|
       The  first  data  string  is matched completely, so pcre2test shows the
 | 
						|
       matched substrings. The remaining four strings do not  match  the  com-
 | 
						|
       plete pattern, but the first two are partial matches. Similar output is
 | 
						|
       obtained if DFA matching is used.
 | 
						|
 | 
						|
       If the partial_hard (or ph) modifier is present  on  a  pcre2test  data
 | 
						|
       line, the PCRE2_PARTIAL_HARD option is set for the match.
 | 
						|
 | 
						|
 | 
						|
MULTI-SEGMENT MATCHING WITH pcre2_dfa_match()

       When a partial match has been found using a DFA matching function, it
       is possible to continue the match by providing additional subject data
       and calling the function again with the same compiled regular
       expression, this time setting the PCRE2_DFA_RESTART option. You must
       pass the same working space as before, because this is where details
       of the previous partial match are stored. Here is an example using
       pcre2test:

           re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
         data> 23ja\=dfa,ps
         Partial match: 23ja
         data> n05\=dfa,dfa_restart
          0: n05

       The first call has "23ja" as the subject, and requests partial
       matching; the second call has "n05" as the subject for the continued
       (restarted) match. Notice that when the match is complete, only the
       last part is shown; PCRE2 does not retain the previously
       partially-matched string. It is up to the calling program to do that
       if it needs to.

       That means that, for an unanchored pattern, if a continued match fails,
       it is not possible to try again at a new starting point. All this
       facility is capable of doing is continuing with the previous match
       attempt. In the previous example, if the second set of data is "ug23"
       the result is no match, even though there would be a match for "aug23"
       if the entire string were given at once. Depending on the application,
       this may or may not be what you want. The only way to allow for
       starting again at the next character is to retain the matched part of
       the subject and try a new complete match.

       You can set the PCRE2_PARTIAL_SOFT or PCRE2_PARTIAL_HARD options with
       PCRE2_DFA_RESTART to continue partial matching over multiple segments.
       This facility can be used to pass very long subject strings to the DFA
       matching functions.


MULTI-SEGMENT MATCHING WITH pcre2_match()

       Unlike the DFA function, it is not possible to restart the previous
       match with a new segment of data when using pcre2_match(). Instead, new
       data must be added to the previous subject string, and the entire match
       re-run, starting from the point where the partial match occurred.
       Earlier data can be discarded.

       It is best to use PCRE2_PARTIAL_HARD in this situation, because it does
       not treat the end of a segment as the end of the subject when matching
       \z, \Z, \b, \B, and $. Consider an unanchored pattern that matches
       dates:

           re> /\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d/
         data> The date is 23ja\=ph
         Partial match: 23ja

       At this stage, an application could discard the text preceding "23ja",
       add on text from the next segment, and call the matching function
       again. Unlike the DFA matching function, the entire matching string
       must always be available, and the complete matching process occurs for
       each call, so more memory and more processing time are needed.

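       The buffer handling described above can be sketched in C as follows.
       This is only an illustration under stated assumptions, not code from
       the PCRE2 distribution: carry_over() is a hypothetical helper, the
       subject lives in a caller-owned byte buffer of fixed capacity, and the
       pattern is assumed to contain no lookbehind, so nothing before the
       start of the partial match has to be kept.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of PCRE2. After a partial match that began
   at offset keep_from in buf (the ovector[0] value), discard everything
   before that offset and append the next segment. Returns the new length of
   the data in buf, or (size_t)-1 if the arguments are inconsistent or the
   result would not fit in cap bytes. */
static size_t carry_over(char *buf, size_t len, size_t keep_from,
                         const char *seg, size_t seglen, size_t cap)
{
    size_t kept;

    if (len > cap || keep_from > len)
        return (size_t)-1;
    kept = len - keep_from;
    if (seglen > cap - kept)
        return (size_t)-1;
    memmove(buf, buf + keep_from, kept);   /* slide the retained tail down */
    memcpy(buf + kept, seg, seglen);       /* append the next segment */
    return kept + seglen;
}
```

       After such a call the retained text begins at offset 0 of the buffer,
       so the match can be re-run from the start of the buffer, again with
       PCRE2_PARTIAL_HARD set.
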

ISSUES WITH MULTI-SEGMENT MATCHING

       Certain types of pattern may give problems with multi-segment matching,
       whichever matching function is used.

       1. If the pattern contains a test for the beginning of a line, you need
       to pass the PCRE2_NOTBOL option when the subject string for any call
       does not start at the beginning of a line. There is also a PCRE2_NOTEOL
       option, but in practice when doing multi-segment matching you should be
       using PCRE2_PARTIAL_HARD, which includes the effect of PCRE2_NOTEOL.

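       One way to decide whether PCRE2_NOTBOL is wanted for a given call is
       sketched below. The helper name is hypothetical, and the sketch assumes
       that LF is the only newline sequence in use and that data with nothing
       before it begins at the start of a line; "before" is the text that
       immediately preceded the subject being passed to the matching function.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper, not a PCRE2 function. Returns true when the subject
   about to be matched does not begin at the start of a line, in which case
   the caller would add PCRE2_NOTBOL to the match options.
   Assumptions: LF is the only newline, and data with nothing before it
   (before_len == 0) starts a line. */
static bool subject_needs_notbol(const char *before, size_t before_len)
{
    if (before_len == 0)
        return false;                      /* very first data: line start */
    return before[before_len - 1] != '\n'; /* mid-line unless LF came last */
}
```

       An application using a different newline convention would need a
       correspondingly different test on the preceding text.
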
       2. If a pattern contains a lookbehind assertion, characters that
       precede the start of the partial match may have been inspected during
       the matching process. When using pcre2_match(), sufficient characters
       must be retained for the next match attempt. You can ensure that enough
       characters are retained by doing the following:

       Before doing any matching, find the length of the longest lookbehind in
       the pattern by calling pcre2_pattern_info() with the
       PCRE2_INFO_MAXLOOKBEHIND option. Note that the resulting count is in
       characters, not code units. After a partial match, moving back from the
       ovector[0] offset in the subject by the number of characters given for
       the maximum lookbehind gets you to the earliest character that must be
       retained. In a non-UTF or a 32-bit situation, moving back is just a
       subtraction, but in UTF-8 or UTF-16 you have to count characters while
       moving back through the code units.

       Characters before the point you have now reached can be discarded, and
       after the next segment has been added to what is retained, you should
       run the next match with the startoffset argument set so that the match
       begins at the same point as before.

       For example, if the pattern "(?<=123)abc" is partially matched against
       the string "xx123ab", the ovector offsets are 5 and 7 ("ab"). The
       maximum lookbehind count is 3, so all characters before offset 2 can be
       discarded. The value of startoffset for the next match should be 3.
       When pcre2test displays a partial match, it indicates the lookbehind
       characters with '<' characters:

           re> "(?<=123)abc"
         data> xx123ab\=ph
         Partial match: 123ab
                        <<<

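       The backward step through UTF-8 code units described in item 2 can be
       sketched as follows. This is an illustration only: retain_from_utf8()
       is a hypothetical name, the subject is assumed to be valid UTF-8 held
       as 8-bit code units, and nchars stands for the value obtained with
       PCRE2_INFO_MAXLOOKBEHIND.

```c
#include <stddef.h>

/* Hypothetical helper, not part of PCRE2. Starting at code-unit offset
   start (ovector[0] after a partial match) in a valid UTF-8 subject, step
   back over at most nchars characters and return the offset of the earliest
   code unit that must be retained. Stops at offset 0 if the subject holds
   fewer than nchars characters before start. */
static size_t retain_from_utf8(const unsigned char *subject, size_t start,
                               size_t nchars)
{
    size_t pos = start;

    while (nchars > 0 && pos > 0) {
        pos--;                                /* step back one code unit */
        while (pos > 0 && (subject[pos] & 0xC0) == 0x80)
            pos--;                            /* skip continuation bytes */
        nchars--;
    }
    return pos;
}
```

       For the example in the text ("xx123ab", start 5, count 3) the helper
       returns 2, and the startoffset for the next match is then 5 - 2 = 3.
       In a non-UTF subject the same result is obtained by plain subtraction.
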
       3. Because a partial match must always contain at least one character,
       what might be considered a partial match of an empty string actually
       gives a "no match" result. For example:

           re> /c(?<=abc)x/
         data> ab\=ps
         No match

       If the next segment begins "cx", a match should be found, but this will
       only happen if characters from the previous segment are retained. For
       this reason, a "no match" result should be interpreted as "partial
       match of an empty string" when the pattern contains lookbehinds.

       4. Matching a subject string that is split into multiple segments may
       not always produce exactly the same result as matching over one single
       long string, especially when PCRE2_PARTIAL_SOFT is used. The section
       "Partial Matching and Word Boundaries" above describes an issue that
       arises if the pattern ends with \b or \B. Another kind of difference
       may occur when there are multiple matching possibilities, because (for
       PCRE2_PARTIAL_SOFT) a partial match result is given only when there are
       no completed matches. This means that as soon as the shortest match has
       been found, continuation to a new subject segment is no longer
       possible. Consider this pcre2test example:

           re> /dog(sbody)?/
         data> dogsb\=ps
          0: dog
         data> do\=ps,dfa
         Partial match: do
         data> gsb\=ps,dfa,dfa_restart
          0: g
         data> dogsbody\=dfa
          0: dogsbody
          1: dog

       The first data line passes the string "dogsb" to a standard matching
       function, setting the PCRE2_PARTIAL_SOFT option. Although the string is
       a partial match for "dogsbody", the result is not PCRE2_ERROR_PARTIAL,
       because the shorter string "dog" is a complete match. Similarly, when
       the subject is presented to a DFA matching function in several parts
       ("do" and "gsb" being the first two) the match stops when "dog" has
       been found, and it is not possible to continue. On the other hand, if
       "dogsbody" is presented as a single string, a DFA matching function
       finds both matches.

       Because of these problems, it is best to use PCRE2_PARTIAL_HARD when
       matching multi-segment data. The example above then behaves
       differently:

           re> /dog(sbody)?/
         data> dogsb\=ph
         Partial match: dogsb
         data> do\=ps,dfa
         Partial match: do
         data> gsb\=ph,dfa,dfa_restart
         Partial match: gsb

       5. Patterns that contain alternatives at the top level which do not all
       start with the same pattern item may not work as expected when
       PCRE2_DFA_RESTART is used. For example, consider this pattern:

         1234|3789

       If the first part of the subject is "ABC123", a partial match of the
       first alternative is found at offset 3. There is no partial match for
       the second alternative, because such a match does not start at the same
       point in the subject string. Attempting to continue with the string
       "7890" does not yield a match because only those alternatives that
       match at one point in the subject are remembered. The problem arises
       because the start of the second alternative matches within the first
       alternative. There is no problem with anchored patterns or patterns
       such as:

         1234|ABCD

       where no string can be a partial match for both alternatives. This is
       not a problem if a standard matching function is used, because the
       entire match has to be rerun each time:

           re> /1234|3789/
         data> ABC123\=ph
         Partial match: 123
         data> 1237890
          0: 3789

       Of course, instead of using PCRE2_DFA_RESTART, the same technique of
       re-running the entire match can also be used with the DFA matching
       function. Another possibility is to work with two buffers. If a partial
       match at offset n in the first buffer is followed by "no match" when
       PCRE2_DFA_RESTART is used on the second buffer, you can then try a new
       match starting at offset n+1 in the first buffer.


AUTHOR

       Philip Hazel
       University Computing Service
       Cambridge, England.


REVISION

       Last updated: 22 December 2014
       Copyright (c) 1997-2014 University of Cambridge.
------------------------------------------------------------------------------


PCRE2PATTERN(3)            Library Functions Manual            PCRE2PATTERN(3)



NAME
       PCRE2 - Perl-compatible regular expressions (revised API)

PCRE2 REGULAR EXPRESSION DETAILS

       The syntax and semantics of the regular expressions that are supported
       by PCRE2 are described in detail below. There is a quick-reference
       syntax summary in the pcre2syntax page. PCRE2 tries to match Perl
       syntax and semantics as closely as it can. PCRE2 also supports some
       alternative regular expression syntax (which does not conflict with
       the Perl syntax) in order to provide some compatibility with regular
       expressions in Python, .NET, and Oniguruma.

       Perl's regular expressions are described in its own documentation, and
       regular expressions in general are covered in a number of books, some
       of which have copious examples. Jeffrey Friedl's "Mastering Regular
       Expressions", published by O'Reilly, covers regular expressions in
       great detail. This description of PCRE2's regular expressions is
       intended as reference material.

       This document discusses the patterns that are supported by PCRE2 when
       its main matching function, pcre2_match(), is used. PCRE2 also has an
       alternative matching function, pcre2_dfa_match(), which matches using a
       different algorithm that is not Perl-compatible. Some of the features
       discussed below are not available when DFA matching is used. The
       advantages and disadvantages of the alternative function, and how it
       differs from the normal function, are discussed in the pcre2matching
       page.


SPECIAL START-OF-PATTERN ITEMS

       A number of options that can be passed to pcre2_compile() can also be
       set by special items at the start of a pattern. These are not
       Perl-compatible, but are provided to make these options accessible to
       pattern writers who are not able to change the program that processes
       the pattern. Any number of these items may appear, but they must all be
       together right at the start of the pattern string, and the letters must
       be in upper case.

   UTF support

       In the 8-bit and 16-bit PCRE2 libraries, characters may be coded either
       as single code units, or as multiple UTF-8 or UTF-16 code units. UTF-32
       can be specified for the 32-bit library, in which case it constrains
       the character values to valid Unicode code points. To process UTF
       strings, PCRE2 must be built to include Unicode support (which is the
       default). When using UTF strings you must either call the compiling
       function with the PCRE2_UTF option, or the pattern must start with the
       special sequence (*UTF), which is equivalent to setting the relevant
       option. How setting a UTF mode affects pattern matching is mentioned in
       several places below. There is also a summary of features in the
       pcre2unicode page.

       Some applications that allow their users to supply patterns may wish to
       restrict them to non-UTF data for security reasons. If the
       PCRE2_NEVER_UTF option is passed to pcre2_compile(), (*UTF) is not
       allowed, and its appearance in a pattern causes an error.

   Unicode property support

       Another special sequence that may appear at the start of a pattern is
       (*UCP). This has the same effect as setting the PCRE2_UCP option: it
       causes sequences such as \d and \w to use Unicode properties to
       determine character types, instead of recognizing only characters with
       codes less than 256 via a lookup table.

       Some applications that allow their users to supply patterns may wish to
       restrict them for security reasons. If the PCRE2_NEVER_UCP option is
       passed to pcre2_compile(), (*UCP) is not allowed, and its appearance in
       a pattern causes an error.

   Locking out empty string matching

       Starting a pattern with (*NOTEMPTY) or (*NOTEMPTY_ATSTART) has the same
       effect as passing the PCRE2_NOTEMPTY or PCRE2_NOTEMPTY_ATSTART option
       to whichever matching function is subsequently called to match the
       pattern. These options lock out the matching of empty strings, either
       entirely, or only at the start of the subject.

   Disabling auto-possessification

       If a pattern starts with (*NO_AUTO_POSSESS), it has the same effect as
       setting the PCRE2_NO_AUTO_POSSESS option. This stops PCRE2 from making
       quantifiers possessive when what follows cannot match the repeated
       item. For example, by default a+b is treated as a++b. For more details,
       see the pcre2api documentation.

   Disabling start-up optimizations

       If a pattern starts with (*NO_START_OPT), it has the same effect as
       setting the PCRE2_NO_START_OPTIMIZE option. This disables several
       optimizations for quickly reaching "no match" results. For more
       details, see the pcre2api documentation.

   Disabling automatic anchoring

       If a pattern starts with (*NO_DOTSTAR_ANCHOR), it has the same effect
       as setting the PCRE2_NO_DOTSTAR_ANCHOR option. This disables
       optimizations that apply to patterns whose top-level branches all start
       with .* (match any number of arbitrary characters). For more details,
       see the pcre2api documentation.

   Disabling JIT compilation

       If a pattern that starts with (*NO_JIT) is successfully compiled, an
       attempt by the application to apply the JIT optimization by calling
       pcre2_jit_compile() is ignored.

   Setting match and recursion limits

       The caller of pcre2_match() can set a limit on the number of times the
       internal match() function is called and on the maximum depth of
       recursive calls. These facilities are provided to catch runaway matches
       that are provoked by patterns with huge matching trees (a typical
       example is a pattern with nested unlimited repeats) and to avoid
       running out of system stack by too much recursion. When one of these
       limits is reached, pcre2_match() gives an error return. The limits can
       also be set by items at the start of the pattern of the form

         (*LIMIT_MATCH=d)
         (*LIMIT_RECURSION=d)

       where d is any number of decimal digits. However, the value of the
       setting must be less than the value set (or defaulted) by the caller of
       pcre2_match() for it to have any effect. In other words, the pattern
       writer can lower the limits set by the programmer, but not raise them.
       If there is more than one setting of one of these limits, the lower
       value is used.

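       The way the caller's limit and any in-pattern settings combine can be
       modelled as taking a minimum. The sketch below illustrates the
       documented rule only; effective_limit() is a hypothetical helper and
       does not show how PCRE2 itself stores or applies the limits.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the documented rule: settings in the pattern can
   lower the caller's limit but never raise it, and when several settings
   are present the lowest one wins. */
static uint32_t effective_limit(uint32_t caller_limit,
                                const uint32_t *pattern_limits, size_t count)
{
    uint32_t limit = caller_limit;
    size_t i;

    for (i = 0; i < count; i++)
        if (pattern_limits[i] < limit)
            limit = pattern_limits[i];     /* only ever moves downwards */
    return limit;
}
```

       With a caller limit of 1000, pattern settings of 500, 2000, and 300
       give an effective limit of 300, while a single setting of 5000 leaves
       the limit at 1000.
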
       The match limit is used (but in a different way) when JIT is being
       used, but it is not relevant, and is ignored, when matching with
       pcre2_dfa_match(). However, the recursion limit is relevant for DFA
       matching, which does use some function recursion, in particular, for
       recursions within the pattern.

   Newline conventions

       PCRE2 supports five different conventions for indicating line breaks in
       strings: a single CR (carriage return) character, a single LF
       (linefeed) character, the two-character sequence CRLF, any of the three
       preceding, or any Unicode newline sequence. The pcre2api page has
       further discussion about newlines, and shows how to set the newline
       convention when calling pcre2_compile().

       It is also possible to specify a newline convention by starting a
       pattern string with one of the following five sequences:

         (*CR)        carriage return
         (*LF)        linefeed
         (*CRLF)      carriage return, followed by linefeed
         (*ANYCRLF)   any of the three above
         (*ANY)       all Unicode newline sequences

       These override the default and the options given to the compiling
       function. For example, on a Unix system where LF is the default newline
       sequence, the pattern

         (*CR)a.b

       changes the convention to CR. That pattern matches "a\nb" because LF is
       no longer a newline. If more than one of these settings is present, the
       last one is used.

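       The "last one is used" rule can be illustrated with a small scanner.
       This is a simplified sketch, not PCRE2's own parser: the enum and the
       function name are hypothetical, and only the five newline items are
       recognized, so a pattern that mixes them with other start-of-pattern
       items such as (*UTF) is not handled.

```c
#include <stddef.h>
#include <string.h>

enum newline_kind { NL_DEFAULT, NL_CR, NL_LF, NL_CRLF, NL_ANYCRLF, NL_ANY };

/* Hypothetical scanner: walk over consecutive newline items at the very
   start of the pattern and report the last one seen. NL_DEFAULT means that
   no newline item was found, so the compile-time setting applies. */
static enum newline_kind leading_newline_setting(const char *pattern)
{
    static const struct { const char *text; enum newline_kind kind; } items[] = {
        { "(*CR)", NL_CR }, { "(*LF)", NL_LF }, { "(*CRLF)", NL_CRLF },
        { "(*ANYCRLF)", NL_ANYCRLF }, { "(*ANY)", NL_ANY }
    };
    const size_t nitems = sizeof items / sizeof items[0];
    enum newline_kind result = NL_DEFAULT;
    const char *p = pattern;

    for (;;) {
        size_t i;
        for (i = 0; i < nitems; i++) {
            size_t n = strlen(items[i].text);
            if (strncmp(p, items[i].text, n) == 0) {
                result = items[i].kind;    /* a later item overrides */
                p += n;
                break;
            }
        }
        if (i == nitems)
            break;                         /* no newline item here: stop */
    }
    return result;
}
```

       The comparison is case-sensitive, in line with the requirement that
       the letters of these items be in upper case.
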
       The newline convention affects where the circumflex and dollar
       assertions are true. It also affects the interpretation of the dot
       metacharacter when PCRE2_DOTALL is not set, and the behaviour of \N.
       However, it does not affect what the \R escape sequence matches. By
       default, this is any Unicode newline sequence, for Perl compatibility.
       However, this can be changed; see the description of \R in the section
       entitled "Newline sequences" below. A change of \R setting can be
       combined with a change of newline convention.

   Specifying what \R matches

       It is possible to restrict \R to match only CR, LF, or CRLF (instead of
       the complete set of Unicode line endings) by setting the option
       PCRE2_BSR_ANYCRLF at compile time. This effect can also be achieved by
       starting a pattern with (*BSR_ANYCRLF). For completeness,
       (*BSR_UNICODE) is also recognized, corresponding to PCRE2_BSR_UNICODE.


EBCDIC CHARACTER CODES

       PCRE2 can be compiled to run in an environment that uses EBCDIC as its
       character code rather than ASCII or Unicode (typically a mainframe
       system). In the sections below, character code values are ASCII or
       Unicode; in an EBCDIC environment these characters may have different
       code values, and there are no code points greater than 255.


CHARACTERS AND METACHARACTERS

       A regular expression is a pattern that is matched against a subject
       string from left to right. Most characters stand for themselves in a
       pattern, and match the corresponding characters in the subject. As a
       trivial example, the pattern

         The quick brown fox

       matches a portion of a subject string that is identical to itself. When
       caseless matching is specified (the PCRE2_CASELESS option), letters are
       matched independently of case.

       The power of regular expressions comes from the ability to include
       alternatives and repetitions in the pattern. These are encoded in the
       pattern by the use of metacharacters, which do not stand for themselves
       but instead are interpreted in some special way.

       There are two different sets of metacharacters: those that are
       recognized anywhere in the pattern except within square brackets, and
       those that are recognized within square brackets. Outside square
       brackets, the metacharacters are as follows:

         \      general escape character with several uses
         ^      assert start of string (or line, in multiline mode)
         $      assert end of string (or line, in multiline mode)
         .      match any character except newline (by default)
         [      start character class definition
         |      start of alternative branch
         (      start subpattern
         )      end subpattern
         ?      extends the meaning of (
                also 0 or 1 quantifier
                also quantifier minimizer
         *      0 or more quantifier
         +      1 or more quantifier
                also "possessive quantifier"
         {      start min/max quantifier

       Part of a pattern that is in square brackets is called a "character
       class". In a character class the only metacharacters are:

         \      general escape character
         ^      negate the class, but only if the first character
         -      indicates character range
         [      POSIX character class (only if followed by POSIX
                  syntax)
         ]      terminates the character class

       The following sections describe the use of each of the metacharacters.


BACKSLASH

       The backslash character has several uses. Firstly, if it is followed by
       a character that is not a number or a letter, it takes away any special
       meaning that character may have. This use of backslash as an escape
       character applies both inside and outside character classes.

       For example, if you want to match a * character, you write \* in the
       pattern. This escaping action applies whether or not the following
       character would otherwise be interpreted as a metacharacter, so it is
       always safe to precede a non-alphanumeric with backslash to specify
       that it stands for itself. In particular, if you want to match a
       backslash, you write \\.

       In a UTF mode, only ASCII numbers and letters have any special meaning
       after a backslash. All other characters (in particular, those whose
       codepoints are greater than 127) are treated as literals.

       If a pattern is compiled with the PCRE2_EXTENDED option, most white
       space in the pattern (other than in a character class), and characters
       between a # outside a character class and the next newline, inclusive,
       are ignored. An escaping backslash can be used to include a white space
       or # character as part of the pattern.

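       Because it is always safe to put a backslash before a non-alphanumeric
       character, an application can turn arbitrary text into a pattern that
       matches that text literally. The following is a minimal sketch of the
       idea, not a PCRE2 function: quote_literal() is a hypothetical helper,
       the input is assumed to be a NUL-terminated 8-bit string, the default
       "C" locale is assumed for isalnum(), and bytes of 0x80 and above are
       copied unchanged so that multi-byte UTF-8 characters stay intact.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper, not part of PCRE2. Copies src to dst, inserting a
   backslash before every ASCII character that is not a letter or digit.
   Returns the number of bytes written (excluding the terminating NUL), or
   (size_t)-1 if dst, which has room for cap bytes, is too small. */
static size_t quote_literal(const char *src, char *dst, size_t cap)
{
    size_t out = 0;

    if (cap == 0)
        return (size_t)-1;
    for (; *src != '\0'; src++) {
        unsigned char c = (unsigned char)*src;
        int escape = c < 0x80 && !isalnum(c);
        size_t need = escape ? 2 : 1;

        if (out + need >= cap)             /* keep one byte for the NUL */
            return (size_t)-1;
        if (escape)
            dst[out++] = '\\';
        dst[out++] = (char)c;
    }
    dst[out] = '\0';
    return out;
}
```

       For instance, the text a*b becomes a\*b. White space and # are escaped
       as well, so the result keeps its meaning when PCRE2_EXTENDED is set.
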
       If you want to remove the special meaning from a sequence of
       characters, you can do so by putting them between \Q and \E. This is
       different from Perl in that $ and @ are handled as literals in \Q...\E
       sequences in PCRE2, whereas in Perl, $ and @ cause variable
       interpolation. Note the following examples:

         Pattern            PCRE2 matches   Perl matches

         \Qabc$xyz\E        abc$xyz         abc followed by the
                                              contents of $xyz
         \Qabc\$xyz\E       abc\$xyz        abc\$xyz
         \Qabc\E\$\Qxyz\E   abc$xyz         abc$xyz

       The \Q...\E sequence is recognized both inside and outside character
       classes. An isolated \E that is not preceded by \Q is ignored. If \Q is
       not followed by \E later in the pattern, the literal interpretation
       continues to the end of the pattern (that is, \E is assumed at the
       end). If the isolated \Q is inside a character class, this causes an
       error, because the character class is not terminated.

   Non-printing characters

       A second use of backslash provides a way of encoding non-printing
       characters in patterns in a visible manner. There is no restriction on
       the appearance of non-printing characters in a pattern, but when a
       pattern is being prepared by text editing, it is often easier to use
       one of the following escape sequences than the binary character it
       represents. In an ASCII or Unicode environment, these escapes are as
       follows:

         \a        alarm, that is, the BEL character (hex 07)
         \cx       "control-x", where x is any printable ASCII character
         \e        escape (hex 1B)
         \f        form feed (hex 0C)
         \n        linefeed (hex 0A)
         \r        carriage return (hex 0D)
         \t        tab (hex 09)
         \0dd      character with octal code 0dd
         \ddd      character with octal code ddd, or back reference
         \o{ddd..} character with octal code ddd..
         \xhh      character with hex code hh
         \x{hhh..} character with hex code hhh.. (default mode)
         \uhhhh    character with hex code hhhh (when PCRE2_ALT_BSUX is set)

       The precise effect of \cx on ASCII characters is as follows: if x is a
       lower case letter, it is converted to upper case. Then bit 6 of the
       character (hex 40) is inverted. Thus \cA to \cZ become hex 01 to hex 1A
       (A is 41, Z is 5A), but \c{ becomes hex 3B ({ is 7B), and \c; becomes
       hex 7B (; is 3B). If the code unit following \c has a value less than
       32 or greater than 126, a compile-time error occurs.

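       The \cx rule for ASCII can be written out as a short function. This is
       a sketch of the rule as described above, not PCRE2 source code;
       control_escape() is a hypothetical name, an ASCII execution
       environment is assumed, and -1 stands in for the compile-time error.

```c
/* Hypothetical helper illustrating the documented \cx rule in an ASCII
   environment. x is the code unit that follows \c. Returns the resulting
   character code, or -1 where the text describes a compile-time error. */
static int control_escape(int x)
{
    if (x < 32 || x > 126)
        return -1;                 /* outside the printable ASCII range */
    if (x >= 'a' && x <= 'z')
        x -= 'a' - 'A';            /* lower case is converted to upper case */
    return x ^ 0x40;               /* invert bit 6 (hex 40) */
}
```

       This reproduces the values given in the text: A and a both give hex
       01, Z gives hex 1A, { gives hex 3B, and ; gives hex 7B.
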
       When  PCRE2 is compiled in EBCDIC mode, \a, \e, \f, \n, \r, and \t gen-
 | 
						|
       erate the appropriate EBCDIC code values. The \c escape is processed as
 | 
						|
       specified for Perl in the perlebcdic document. The only characters that
 | 
						|
       are allowed after \c are A-Z, a-z, or one of @, [, \, ], ^,  _,  or  ?.
 | 
						|
       Any  other  character  provokes  a compile-time error. The sequence \c@
 | 
						|
       encodes character code 0; after \c the letters (in either case)  encode
 | 
						|
       characters 1-26 (hex 01 to hex 1A); [, \, ], ^, and _ encode characters
 | 
						|
       27-31 (hex 1B to hex 1F), and \c? becomes either 255  (hex  FF)  or  95
 | 
						|
       (hex 5F).
 | 
						|
 | 
						|
       Thus,  apart  from  \c?, these escapes generate the same character code
 | 
						|
       values as they do in an ASCII environment, though the meanings  of  the
 | 
						|
       values  mostly  differ. For example, \cG always generates code value 7,
 | 
						|
       which is BEL in ASCII but DEL in EBCDIC.
 | 
						|
 | 
						|
       The sequence \c? generates DEL (127, hex 7F) in an  ASCII  environment,
 | 
						|
       but  because  127  is  not a control character in EBCDIC, Perl makes it
 | 
						|
       generate the APC character. Unfortunately, there are  several  variants
 | 
						|
       of  EBCDIC.  In  most  of them the APC character has the value 255 (hex
 | 
						|
       FF), but in the one Perl calls POSIX-BC its value is 95  (hex  5F).  If
 | 
						|
       certain other characters have POSIX-BC values, PCRE2 makes \c? generate
 | 
						|
       95; otherwise it generates 255.
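
       The EBCDIC mapping for \c described above can be written out as a small
       table-driven function.  This is a sketch only: the name
       ebcdic_control_escape and the posix_bc flag are assumptions made for
       the example (PCRE2 itself infers the POSIX-BC case from other
       character values), and the argument is given as an ordinary Python
       character rather than an EBCDIC code unit.

```python
def ebcdic_control_escape(x: str, posix_bc: bool = False) -> int:
    r"""Return the code value generated by \cx under the EBCDIC rules."""
    if len(x) != 1:
        raise ValueError("exactly one character must follow \\c")
    if x == "@":
        return 0
    # Letters in either case encode 1-26 (hex 01 to hex 1A).
    if x.isascii() and x.isalpha():
        return ord(x.upper()) - ord("A") + 1
    # [, \, ], ^ and _ encode 27-31 (hex 1B to hex 1F).
    specials = "[\\]^_"
    if x in specials:
        return 27 + specials.index(x)
    # \c? is the APC character: 95 in POSIX-BC, otherwise 255.
    if x == "?":
        return 95 if posix_bc else 255
    raise ValueError("character not allowed after \\c in EBCDIC mode")
```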
 | 
						|
 | 
						|
       After \0 up to two further octal digits are read. If  there  are  fewer
 | 
						|
       than  two  digits,  just  those  that  are  present  are used. Thus the
 | 
						|
       sequence \0\x\015 specifies two binary zeros followed by a CR character
 | 
						|
       (code value 13). Make sure you supply two digits after the initial zero
 | 
						|
       if the pattern character that follows is itself an octal digit.
 | 
						|
 | 
						|
       The escape \o must be followed by a sequence of octal digits,  enclosed
 | 
						|
       in  braces.  An  error occurs if this is not the case. This escape is a
 | 
						|
       recent addition to Perl; it provides a way of specifying character code
 | 
						|
       points  as  octal  numbers  greater than 0777, and it also allows octal
 | 
						|
       numbers and back references to be unambiguously specified.
 | 
						|
 | 
						|
       For greater clarity and unambiguity, it is best to avoid following \ by
 | 
						|
       a digit greater than zero. Instead, use \o{} or \x{} to specify charac-
 | 
						|
       ter numbers, and \g{} to specify back references. The  following  para-
 | 
						|
       graphs describe the old, ambiguous syntax.
 | 
						|
 | 
						|
       The handling of a backslash followed by a digit other than 0 is compli-
 | 
						|
       cated, and Perl has changed over time, causing PCRE2 also to change.
 | 
						|
 | 
						|
       Outside a character class, PCRE2 reads the digit and any following dig-
 | 
						|
       its as a decimal number. If the number is less than 10, begins with the
 | 
						|
       digit 8 or 9, or if there are at least  that  many  previous  capturing
 | 
						|
       left  parentheses  in the expression, the entire sequence is taken as a
 | 
						|
       back reference. A description of how this works is given later, follow-
 | 
						|
       ing  the  discussion  of  parenthesized  subpatterns.  Otherwise, up to
 | 
						|
       three octal digits are read to form a character code.
 | 
						|
 | 
						|
       Inside a character class, PCRE2 handles \8 and \9 as the literal  char-
 | 
						|
       acters  "8"  and "9", and otherwise reads up to three octal digits fol-
 | 
						|
       lowing the backslash, using them to generate a data character. Any sub-
 | 
						|
       sequent  digits  stand for themselves. For example, outside a character
 | 
						|
       class:
 | 
						|
 | 
						|
         \040   is another way of writing an ASCII space
         \40    is the same, provided there are fewer than 40
                   previous capturing subpatterns
         \7     is always a back reference
         \11    might be a back reference, or another way of
                   writing a tab
         \011   is always a tab
         \0113  is a tab followed by the character "3"
         \113   might be a back reference, otherwise the
                   character with octal code 113
         \377   might be a back reference, otherwise
                   the value 255 (decimal)
         \81    is always a back reference
 | 
						|
 | 
						|
       Note that octal values of 100 or greater that are specified using  this
 | 
						|
       syntax  must  not be introduced by a leading zero, because no more than
 | 
						|
       three octal digits are ever read.
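
       The decision procedure for a backslash followed by digits, outside a
       character class, can be sketched as follows.  This is a minimal
       illustration of the rules in the preceding paragraphs, not PCRE2's
       parser; the function name and its return format are invented for the
       example, and the limits on character values in each mode are not
       checked.

```python
OCTAL = "01234567"


def interpret_backslash_digits(digits: str, group_count: int):
    """Classify the run of decimal digits that follows a backslash.

    group_count is the number of capturing left parentheses seen so far.
    Returns ("backref", number, "") or ("char", code, literal_rest).
    """
    if not digits or any(c not in "0123456789" for c in digits):
        raise ValueError("expected one or more decimal digits")
    if digits[0] != "0":
        number = int(digits)
        # Back reference: less than 10, starts with 8 or 9, or enough groups.
        if number < 10 or digits[0] in "89" or number <= group_count:
            return ("backref", number, "")
    # Otherwise read up to three octal digits (for \0, that is the zero
    # plus up to two more); any remaining digits stand for themselves.
    n = 0
    while n < 3 and n < len(digits) and digits[n] in OCTAL:
        n += 1
    return ("char", int(digits[:n], 8), digits[n:])
```

       For instance, interpret_backslash_digits("11", 5) reports a character
       with code 9 (a tab), whereas with 11 or more preceding groups the same
       digits are reported as back reference 11.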
 | 
						|
 | 
						|
       By default, after \x that is not followed by {, from zero to two  hexa-
 | 
						|
       decimal  digits  are  read (letters can be in upper or lower case). Any
 | 
						|
       number of hexadecimal digits may appear between \x{ and }. If a charac-
 | 
						|
       ter  other  than  a  hexadecimal digit appears between \x{ and }, or if
 | 
						|
       there is no terminating }, an error occurs.
 | 
						|
 | 
						|
       If the PCRE2_ALT_BSUX option is set, the interpretation  of  \x  is  as
 | 
						|
       just described only when it is followed by two hexadecimal digits.
       Otherwise, it matches a literal "x" character. In this mode, support
       for code points greater than 255 is provided by \u, which must be
       followed by four hexadecimal digits; otherwise it matches a literal "u"
 | 
						|
       character.
 | 
						|
 | 
						|
       Characters whose value is less than 256 can be defined by either of the
 | 
						|
       two syntaxes for \x (or by \u in PCRE2_ALT_BSUX mode). There is no dif-
 | 
						|
       ference  in  the way they are handled. For example, \xdc is exactly the
 | 
						|
       same as \x{dc} (or \u00dc in PCRE2_ALT_BSUX mode).
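
       The two readings of \x can be compared side by side in a short sketch.
       The function below is hypothetical (its name, arguments, and return
       value are chosen for illustration); it takes the pattern text that
       follows \x and returns the character code together with the number of
       characters consumed.  Rejecting an empty \x{} is an assumption, as the
       text above does not cover that case.

```python
HEX = "0123456789abcdefABCDEF"


def parse_x_escape(rest: str, alt_bsux: bool = False):
    r"""Interpret the text after \x; return (code, characters_consumed)."""
    if alt_bsux:
        # Only \xhh with exactly two hex digits is an escape in this mode.
        if len(rest) >= 2 and rest[0] in HEX and rest[1] in HEX:
            return int(rest[:2], 16), 2
        return ord("x"), 0          # otherwise a literal "x"
    if rest.startswith("{"):
        end = rest.find("}")
        if end == -1:
            raise ValueError("no terminating }")
        body = rest[1:end]
        if body == "" or any(c not in HEX for c in body):
            raise ValueError("non-hexadecimal content between \\x{ and }")
        return int(body, 16), end + 1
    # Default mode without braces: zero to two hexadecimal digits.
    n = 0
    while n < 2 and n < len(rest) and rest[n] in HEX:
        n += 1
    return (int(rest[:n], 16) if n else 0), n
```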
 | 
						|
 | 
						|
   Constraints on character values
 | 
						|
 | 
						|
       Characters that are specified using octal or  hexadecimal  numbers  are
 | 
						|
       limited to certain values, as follows:
 | 
						|
 | 
						|
         8-bit non-UTF mode    less than 0x100
         8-bit UTF-8 mode      no greater than 0x10ffff and a valid codepoint
         16-bit non-UTF mode   less than 0x10000
         16-bit UTF-16 mode    no greater than 0x10ffff and a valid codepoint
         32-bit non-UTF mode   less than 0x100000000
         32-bit UTF-32 mode    no greater than 0x10ffff and a valid codepoint
 | 
						|
 | 
						|
       Invalid  Unicode  codepoints  are  the  range 0xd800 to 0xdfff (the so-
 | 
						|
       called "surrogate" codepoints), and 0xffef.
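
       As a rough check of these limits, the following sketch tests a value
       against the table.  The function name is invented for the example.  It
       treats 0x10ffff itself as permitted in the UTF modes, since that is the
       highest Unicode code point, and it rejects 0xffef only because the
       sentence above lists it.

```python
def is_allowed_code_point(value: int, width: int, utf: bool) -> bool:
    """Check an octal or hexadecimal character value for a given mode.

    width is the code unit width in bits (8, 16, or 32).
    """
    if width not in (8, 16, 32):
        raise ValueError("width must be 8, 16, or 32")
    if value < 0:
        return False
    if not utf:
        # Non-UTF modes: the value must fit in one code unit.
        return value < (1 << width)
    if value > 0x10FFFF:
        return False
    if 0xD800 <= value <= 0xDFFF:       # surrogate code points
        return False
    if value == 0xFFEF:                 # as listed in the text above
        return False
    return True
```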
 | 
						|
 | 
						|
   Escape sequences in character classes
 | 
						|
 | 
						|
       All the sequences that define a single character value can be used both
 | 
						|
       inside  and  outside character classes. In addition, inside a character
 | 
						|
       class, \b is interpreted as the backspace character (hex 08).
 | 
						|
 | 
						|
       \N is not allowed in a character class. \B, \R, and \X are not  special
 | 
						|
       inside  a  character  class.  Like other unrecognized alphabetic escape
 | 
						|
       sequences, they cause  an  error.  Outside  a  character  class,  these
 | 
						|
       sequences have different meanings.
 | 
						|
 | 
						|
   Unsupported escape sequences
 | 
						|
 | 
						|
       In  Perl, the sequences \l, \L, \u, and \U are recognized by its string
 | 
						|
       handler and used  to  modify  the  case  of  following  characters.  By
 | 
						|
       default, PCRE2 does not support these escape sequences. However, if the
 | 
						|
       PCRE2_ALT_BSUX option is set, \U matches a "U" character, and \u can be
 | 
						|
       used  to define a character by code point, as described in the previous
 | 
						|
       section.
 | 
						|
 | 
						|
   Absolute and relative back references
 | 
						|
 | 
						|
       The sequence \g followed by a signed  or  unsigned  number,  optionally
 | 
						|
       enclosed  in braces, is an absolute or relative back reference. A named
 | 
						|
       back reference can be coded as \g{name}. Back references are  discussed
 | 
						|
       later, following the discussion of parenthesized subpatterns.
 | 
						|
 | 
						|
   Absolute and relative subroutine calls
 | 
						|
 | 
						|
       For  compatibility with Oniguruma, the non-Perl syntax \g followed by a
 | 
						|
       name or a number enclosed either in angle brackets or single quotes  is
 | 
						|
       an  alternative  syntax for referencing a subpattern as a "subroutine".
 | 
						|
       Details are discussed later.   Note  that  \g{...}  (Perl  syntax)  and
 | 
						|
       \g<...>  (Oniguruma  syntax)  are  not synonymous. The former is a back
 | 
						|
       reference; the latter is a subroutine call.
 | 
						|
 | 
						|
   Generic character types
 | 
						|
 | 
						|
       Another use of backslash is for specifying generic character types:
 | 
						|
 | 
						|
         \d     any decimal digit
         \D     any character that is not a decimal digit
         \h     any horizontal white space character
         \H     any character that is not a horizontal white space character
         \s     any white space character
         \S     any character that is not a white space character
         \v     any vertical white space character
         \V     any character that is not a vertical white space character
         \w     any "word" character
         \W     any "non-word" character
 | 
						|
 | 
						|
       There is also the single sequence \N, which matches a non-newline char-
 | 
						|
       acter.   This is the same as the "." metacharacter when PCRE2_DOTALL is
 | 
						|
       not set. Perl also uses \N to match characters by name; PCRE2 does  not
 | 
						|
       support this.
 | 
						|
 | 
						|
       Each  pair of lower and upper case escape sequences partitions the com-
 | 
						|
       plete set of characters into two disjoint  sets.  Any  given  character
 | 
						|
       matches  one, and only one, of each pair. The sequences can appear both
 | 
						|
       inside and outside character classes. They each match one character  of
 | 
						|
       the  appropriate  type.  If the current matching point is at the end of
 | 
						|
       the subject string, all of them fail, because there is no character  to
 | 
						|
       match.
 | 
						|
 | 
						|
       The  default  \s  characters  are HT (9), LF (10), VT (11), FF (12), CR
 | 
						|
       (13), and space (32), which are defined  as  white  space  in  the  "C"
 | 
						|
       locale. This list may vary if locale-specific matching is taking place.
 | 
						|
       For example, in some locales the "non-breaking space" character  (\xA0)
 | 
						|
       is recognized as white space, and in others the VT character is not.
 | 
						|
 | 
						|
       A  "word"  character is an underscore or any character that is a letter
 | 
						|
       or digit.  By default, the definition of letters  and  digits  is  con-
 | 
						|
       trolled by PCRE2's low-valued character tables, and may vary if locale-
 | 
						|
       specific matching is taking place (see "Locale support" in the pcre2api
 | 
						|
       page).  For  example,  in  a French locale such as "fr_FR" in Unix-like
 | 
						|
       systems, or "french" in Windows, some character codes greater than  127
 | 
						|
       are  used  for  accented letters, and these are then matched by \w. The
 | 
						|
       use of locales with Unicode is discouraged.
 | 
						|
 | 
						|
       By default, characters whose code points are  greater  than  127  never
 | 
						|
       match \d, \s, or \w, and always match \D, \S, and \W, although this may
 | 
						|
       be different for characters in the range 128-255  when  locale-specific
 | 
						|
       matching  is  happening.   These escape sequences retain their original
 | 
						|
       meanings from before Unicode support was available,  mainly  for  effi-
 | 
						|
       ciency  reasons.  If  the  PCRE2_UCP  option  is  set, the behaviour is
 | 
						|
       changed so that Unicode properties  are  used  to  determine  character
 | 
						|
       types, as follows:
 | 
						|
 | 
						|
         \d  any character that matches \p{Nd} (decimal digit)
         \s  any character that matches \p{Z} or \h or \v
         \w  any character that matches \p{L} or \p{N}, plus underscore
 | 
						|
 | 
						|
       The  upper case escapes match the inverse sets of characters. Note that
 | 
						|
       \d matches only decimal digits, whereas \w matches any  Unicode  digit,
 | 
						|
       as well as any Unicode letter, and underscore. Note also that PCRE2_UCP
 | 
						|
       affects \b and \B, because they are defined in  terms  of  \w  and  \W.
 | 
						|
       Matching these sequences is noticeably slower when PCRE2_UCP is set.
 | 
						|
 | 
						|
       The  sequences  \h, \H, \v, and \V, in contrast to the other sequences,
 | 
						|
       which match only ASCII characters by default, always match  a  specific
 | 
						|
       list  of  code  points, whether or not PCRE2_UCP is set. The horizontal
 | 
						|
       space characters are:
 | 
						|
 | 
						|
         U+0009     Horizontal tab (HT)
         U+0020     Space
         U+00A0     Non-break space
         U+1680     Ogham space mark
         U+180E     Mongolian vowel separator
         U+2000     En quad
         U+2001     Em quad
         U+2002     En space
         U+2003     Em space
         U+2004     Three-per-em space
         U+2005     Four-per-em space
         U+2006     Six-per-em space
         U+2007     Figure space
         U+2008     Punctuation space
         U+2009     Thin space
         U+200A     Hair space
         U+202F     Narrow no-break space
         U+205F     Medium mathematical space
         U+3000     Ideographic space
 | 
						|
 | 
						|
       The vertical space characters are:
 | 
						|
 | 
						|
         U+000A     Linefeed (LF)
         U+000B     Vertical tab (VT)
         U+000C     Form feed (FF)
         U+000D     Carriage return (CR)
         U+0085     Next line (NEL)
         U+2028     Line separator
         U+2029     Paragraph separator
 | 
						|
 | 
						|
       In 8-bit, non-UTF-8 mode, only the characters  with  code  points  less
 | 
						|
       than 256 are relevant.
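
       Because \h and \v are defined by fixed lists, they are easy to express
       as sets of code points.  The sketch below simply transcribes the two
       lists above; the names HSPACE, VSPACE, and the helper functions are
       chosen for illustration and are not PCRE2 identifiers.

```python
# Horizontal white space: the code points matched by \h.
HSPACE = frozenset(
    [0x0009, 0x0020, 0x00A0, 0x1680, 0x180E]
    + list(range(0x2000, 0x200B))       # U+2000 to U+200A inclusive
    + [0x202F, 0x205F, 0x3000]
)

# Vertical white space: the code points matched by \v.
VSPACE = frozenset([0x000A, 0x000B, 0x000C, 0x000D, 0x0085, 0x2028, 0x2029])


def is_hspace(code_point: int) -> bool:
    return code_point in HSPACE


def is_vspace(code_point: int) -> bool:
    return code_point in VSPACE


def relevant_in_8bit_non_utf():
    """Code points from both lists that can occur in 8-bit non-UTF mode."""
    return sorted(cp for cp in HSPACE | VSPACE if cp < 256)
```

       \H and \V are the complements: not is_hspace(cp) and not is_vspace(cp)
       respectively.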
 | 
						|
 | 
						|
   Newline sequences
 | 
						|
 | 
						|
       Outside  a  character class, by default, the escape sequence \R matches
 | 
						|
       any Unicode newline sequence. In 8-bit non-UTF-8 mode \R is  equivalent
 | 
						|
       to the following:
 | 
						|
 | 
						|
         (?>\r\n|\n|\x0b|\f|\r|\x85)
 | 
						|
 | 
						|
       This  is  an  example  of an "atomic group", details of which are given
 | 
						|
       below.  This particular group matches either the two-character sequence
 | 
						|
       CR  followed  by  LF,  or  one  of  the single characters LF (linefeed,
 | 
						|
       U+000A), VT (vertical tab, U+000B), FF (form feed,  U+000C),  CR  (car-
 | 
						|
       riage  return,  U+000D), or NEL (next line, U+0085). Because this is an
 | 
						|
       atomic group, the two-character sequence is treated as  a  single  unit
 | 
						|
       that cannot be split.
 | 
						|
 | 
						|
       In  other modes, two additional characters whose codepoints are greater
 | 
						|
       than 255 are added: LS (line separator, U+2028) and PS (paragraph sepa-
 | 
						|
       rator,  U+2029).  Unicode support is not needed for these characters to
 | 
						|
       be recognized.
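
       The default behaviour of \R can be modelled by a function that reports
       how many characters it would consume at a given position.  This is a
       sketch under stated assumptions, not PCRE2 code: the function name and
       the wide flag (false for 8-bit non-UTF-8 mode) are invented here.  The
       atomic treatment of CR LF is reflected by the fact that the function
       offers only one answer for that pair.

```python
def match_newline_sequence(subject: str, pos: int, wide: bool = True) -> int:
    r"""Return the number of characters \R matches at pos, or 0 for no match."""
    # The two-character sequence CR LF is tried first and never split.
    if subject.startswith("\r\n", pos):
        return 2
    if pos < len(subject):
        singles = "\n\x0b\x0c\r\x85"            # LF, VT, FF, CR, NEL
        if wide:
            singles += "\u2028\u2029"           # LS and PS
        if subject[pos] in singles:
            return 1
    return 0
```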
 | 
						|
 | 
						|
       It is possible to restrict \R to match only CR, LF, or CRLF (instead of
 | 
						|
       the  complete  set  of  Unicode  line  endings)  by  setting the option
 | 
						|
       PCRE2_BSR_ANYCRLF at compile time. (BSR is an abbreviation for
       "backslash R".) This can be made the default when PCRE2 is built;
       if this is
 | 
						|
       the case, the other behaviour can be requested via  the  PCRE2_BSR_UNI-
 | 
						|
       CODE  option. It is also possible to specify these settings by starting
 | 
						|
       a pattern string with one of the following sequences:
 | 
						|
 | 
						|
         (*BSR_ANYCRLF)   CR, LF, or CRLF only
         (*BSR_UNICODE)   any Unicode newline sequence
 | 
						|
 | 
						|
       These override the default and the options given to the compiling func-
 | 
						|
       tion.  Note that these special settings, which are not Perl-compatible,
 | 
						|
       are recognized only at the very start of a pattern, and that they  must
 | 
						|
       be  in upper case. If more than one of them is present, the last one is
 | 
						|
       used. They can be combined with a change  of  newline  convention;  for
 | 
						|
       example, a pattern can start with:
 | 
						|
 | 
						|
         (*ANY)(*BSR_ANYCRLF)
 | 
						|
 | 
						|
       They  can also be combined with the (*UTF) or (*UCP) special sequences.
 | 
						|
       Inside a character class, \R  is  treated  as  an  unrecognized  escape
 | 
						|
       sequence, and causes an error.
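
       The "start of pattern only, last one wins" rule for these settings can
       be illustrated with a short scanner.  This is a sketch, not PCRE2's
       implementation: the function name is invented, and the set of leading
       items it recognizes is limited to those named in this section, whereas
       PCRE2 accepts others as well.

```python
import re

# A leading item has the form (*NAME) with NAME in upper case.
_ITEM = re.compile(r"\(\*([A-Z_]+)\)")

# Only the items mentioned in this section; an assumption for the example.
_KNOWN = {"ANY", "BSR_ANYCRLF", "BSR_UNICODE", "UTF", "UCP"}


def leading_bsr_setting(pattern: str):
    """Return "ANYCRLF", "UNICODE", or None if the pattern sets neither."""
    setting = None
    pos = 0
    while True:
        m = _ITEM.match(pattern, pos)
        # Stop at the first thing that is not a recognized leading item.
        if m is None or m.group(1) not in _KNOWN:
            return setting
        if m.group(1).startswith("BSR_"):
            setting = m.group(1)[len("BSR_"):]   # a later item overrides
        pos = m.end()
```

       A result of None means that the build default or the compile-time
       option applies.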
 | 
						|
 | 
						|
   Unicode character properties
 | 
						|
 | 
						|
       When  PCRE2  is  built  with Unicode support (the default), three addi-
 | 
						|
       tional escape sequences that match characters with specific  properties
 | 
						|
       are  available.  In 8-bit non-UTF-8 mode, these sequences are of course
 | 
						|
       limited to testing characters whose codepoints are less than  256,  but
 | 
						|
       they do work in this mode.  The extra escape sequences are:
 | 
						|
 | 
						|
         \p{xx}   a character with the xx property
         \P{xx}   a character without the xx property
         \X       a Unicode extended grapheme cluster
 | 
						|
 | 
						|
       The  property  names represented by xx above are limited to the Unicode
 | 
						|
       script names, the general category properties, "Any", which matches any
 | 
						|
       character  (including  newline),  and  some  special  PCRE2  properties
 | 
						|
       (described in the next section).  Other Perl properties such as  "InMu-
 | 
						|
       sicalSymbols"  are  not supported by PCRE2.  Note that \P{Any} does not
 | 
						|
       match any characters, so always causes a match failure.
 | 
						|
 | 
						|
       Sets of Unicode characters are defined as belonging to certain scripts.
 | 
						|
       A  character from one of these sets can be matched using a script name.
 | 
						|
       For example:
 | 
						|
 | 
						|
         \p{Greek}
 | 
						|
         \P{Han}
 | 
						|
 | 
						|
       Those that are not part of an identified script are lumped together  as
 | 
						|
       "Common". The current list of scripts is:
 | 
						|
 | 
						|
       Ahom, Anatolian_Hieroglyphs, Arabic, Armenian, Avestan, Balinese,
       Bamum, Bassa_Vah, Batak, Bengali, Bopomofo, Brahmi, Braille, Buginese,
       Buhid, Canadian_Aboriginal, Carian, Caucasian_Albanian, Chakma, Cham,
       Cherokee, Common, Coptic, Cuneiform, Cypriot, Cyrillic, Deseret,
       Devanagari, Duployan, Egyptian_Hieroglyphs, Elbasan, Ethiopic,
       Georgian, Glagolitic, Gothic, Grantha, Greek, Gujarati, Gurmukhi, Han,
       Hangul, Hanunoo, Hatran, Hebrew, Hiragana, Imperial_Aramaic, Inherited,
       Inscriptional_Pahlavi, Inscriptional_Parthian, Javanese, Kaithi,
       Kannada, Katakana, Kayah_Li, Kharoshthi, Khmer, Khojki, Khudawadi, Lao,
       Latin, Lepcha, Limbu, Linear_A, Linear_B, Lisu, Lycian, Lydian,
       Mahajani, Malayalam, Mandaic, Manichaean, Meetei_Mayek, Mende_Kikakui,
       Meroitic_Cursive, Meroitic_Hieroglyphs, Miao, Modi, Mongolian, Mro,
       Multani, Myanmar, Nabataean, New_Tai_Lue, Nko, Ogham, Ol_Chiki,
       Old_Hungarian, Old_Italic, Old_North_Arabian, Old_Permic, Old_Persian,
       Old_South_Arabian, Old_Turkic, Oriya, Osmanya, Pahawh_Hmong, Palmyrene,
       Pau_Cin_Hau, Phags_Pa, Phoenician, Psalter_Pahlavi, Rejang, Runic,
       Samaritan, Saurashtra, Sharada, Shavian, Siddham, SignWriting, Sinhala,
       Sora_Sompeng, Sundanese, Syloti_Nagri, Syriac, Tagalog, Tagbanwa,
       Tai_Le, Tai_Tham, Tai_Viet, Takri, Tamil, Telugu, Thaana, Thai,
       Tibetan, Tifinagh, Tirhuta, Ugaritic, Vai, Warang_Citi, Yi.
 | 
						|
 | 
						|
       Each character has exactly one Unicode general category property, spec-
 | 
						|
       ified  by a two-letter abbreviation. For compatibility with Perl, nega-
 | 
						|
       tion can be specified by including a  circumflex  between  the  opening
 | 
						|
       brace  and  the  property  name.  For  example,  \p{^Lu} is the same as
 | 
						|
       \P{Lu}.
 | 
						|
 | 
						|
       If only one letter is specified with \p or \P, it includes all the gen-
 | 
						|
       eral  category properties that start with that letter. In this case, in
 | 
						|
       the absence of negation, the curly brackets in the escape sequence  are
 | 
						|
       optional; these two examples have the same effect:
 | 
						|
 | 
						|
         \p{L}
 | 
						|
         \pL
 | 
						|
 | 
						|
       The following general category property codes are supported:
 | 
						|
 | 
						|
         C     Other
         Cc    Control
         Cf    Format
         Cn    Unassigned
         Co    Private use
         Cs    Surrogate

         L     Letter
         Ll    Lower case letter
         Lm    Modifier letter
         Lo    Other letter
         Lt    Title case letter
         Lu    Upper case letter

         M     Mark
         Mc    Spacing mark
         Me    Enclosing mark
         Mn    Non-spacing mark

         N     Number
         Nd    Decimal number
         Nl    Letter number
         No    Other number

         P     Punctuation
         Pc    Connector punctuation
         Pd    Dash punctuation
         Pe    Close punctuation
         Pf    Final punctuation
         Pi    Initial punctuation
         Po    Other punctuation
         Ps    Open punctuation

         S     Symbol
         Sc    Currency symbol
         Sk    Modifier symbol
         Sm    Mathematical symbol
         So    Other symbol

         Z     Separator
         Zl    Line separator
         Zp    Paragraph separator
         Zs    Space separator
 | 
						|
 | 
						|
       The  special property L& is also supported: it matches a character that
 | 
						|
       has the Lu, Ll, or Lt property, in other words, a letter  that  is  not
 | 
						|
       classified as a modifier or "other".
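
       How the one-letter codes, the two-letter codes, L&, and the circumflex
       negation relate to each other can be shown with Python's unicodedata
       module.  This is only an illustration: Python's Unicode tables may be
       from a different Unicode version than those built into PCRE2, script
       names are not handled, and the function name is invented for the
       example.

```python
import unicodedata


def has_general_category(ch: str, prop: str) -> bool:
    r"""Test ch against the xx of \p{xx}, general categories only."""
    negate = prop.startswith("^")       # \p{^Lu} is the same as \P{Lu}
    if negate:
        prop = prop[1:]
    category = unicodedata.category(ch)     # always a two-letter code
    if prop == "Any":
        result = True
    elif prop == "L&":
        result = category in ("Lu", "Ll", "Lt")
    elif len(prop) == 1:
        # One letter covers every category that starts with that letter.
        result = category[0] == prop
    elif len(prop) == 2:
        result = category == prop
    else:
        raise ValueError("only general category codes, L& and Any are handled")
    return result != negate
```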
 | 
						|
 | 
						|
       The  Cs  (Surrogate)  property  applies only to characters in the range
 | 
						|
       U+D800 to U+DFFF. Such characters are not valid in Unicode strings  and
 | 
						|
       so  cannot  be  tested  by PCRE2, unless UTF validity checking has been
 | 
						|
       turned off (see the discussion of PCRE2_NO_UTF_CHECK  in  the  pcre2api
 | 
						|
       page). Perl does not support the Cs property.
 | 
						|
 | 
						|
       The  long  synonyms  for  property  names  that  Perl supports (such as
 | 
						|
       \p{Letter}) are not supported by PCRE2, nor is it permitted  to  prefix
 | 
						|
       any of these properties with "Is".
 | 
						|
 | 
						|
       No character that is in the Unicode table has the Cn (unassigned) prop-
 | 
						|
       erty.  Instead, this property is assumed for any code point that is not
 | 
						|
       in the Unicode table.
 | 
						|
 | 
						|
       Specifying  caseless  matching  does not affect these escape sequences.
 | 
						|
       For example, \p{Lu} always matches only upper  case  letters.  This  is
 | 
						|
       different from the behaviour of current versions of Perl.
 | 
						|
 | 
						|
       Matching  characters by Unicode property is not fast, because PCRE2 has
 | 
						|
       to do a multistage table lookup in order to find  a  character's  prop-
 | 
						|
       erty. That is why the traditional escape sequences such as \d and \w do
 | 
						|
       not use Unicode properties in PCRE2 by default,  though  you  can  make
 | 
						|
       them  do  so by setting the PCRE2_UCP option or by starting the pattern
 | 
						|
       with (*UCP).
 | 
						|
 | 
						|
   Extended grapheme clusters
 | 
						|
 | 
						|
       The \X escape matches any number of Unicode  characters  that  form  an
 | 
						|
       "extended grapheme cluster", and treats the sequence as an atomic group
 | 
						|
       (see below).  Unicode supports various kinds of composite character  by
 | 
						|
       giving  each  character  a grapheme breaking property, and having rules
 | 
						|
       that use these properties to define the boundaries of extended grapheme
 | 
						|
       clusters.  \X  always  matches  at least one character. Then it decides
 | 
						|
       whether to add additional characters according to the  following  rules
 | 
						|
       for ending a cluster:
 | 
						|
 | 
						|
       1. End at the end of the subject string.
 | 
						|
 | 
						|
       2.  Do not end between CR and LF; otherwise end after any control char-
 | 
						|
       acter.
 | 
						|
 | 
						|
       3. Do not break Hangul (a Korean  script)  syllable  sequences.  Hangul
 | 
						|
       characters  are of five types: L, V, T, LV, and LVT. An L character may
 | 
						|
       be followed by an L, V, LV, or LVT character; an LV or V character  may
 | 
						|
       be followed by a V or T character; an LVT or T character may be followed
       only by a T character.
 | 
						|
 | 
						|
       4. Do not end before extending characters or spacing marks.  Characters
 | 
						|
       with  the  "mark"  property  always have the "extend" grapheme breaking
 | 
						|
       property.
 | 
						|
 | 
						|
       5. Do not end after prepend characters.
 | 
						|
 | 
						|
       6. Otherwise, end the cluster.
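
       Rule 3 is the only one of these rules that can be stated without the
       Unicode grapheme breaking tables, so it makes a compact example.  The
       sketch below encodes the permitted Hangul successions as a lookup
       table; the names are invented for illustration, and the caller is
       assumed to know each character's Hangul type already.

```python
# For each Hangul type, the types that may follow it within one cluster.
HANGUL_MAY_FOLLOW = {
    "L": {"L", "V", "LV", "LVT"},
    "LV": {"V", "T"},
    "V": {"V", "T"},
    "LVT": {"T"},
    "T": {"T"},
}


def hangul_keeps_cluster(prev_type: str, next_type: str) -> bool:
    """Return True if rule 3 forbids ending the cluster between the two."""
    return next_type in HANGUL_MAY_FOLLOW.get(prev_type, set())
```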
 | 
						|
 | 
						|
   PCRE2's additional properties
 | 
						|
 | 
						|
       As well as the standard Unicode properties described above, PCRE2  sup-
 | 
						|
       ports  four  more  that  make it possible to convert traditional escape
 | 
						|
       sequences such as \w and \s to use Unicode properties. PCRE2 uses these
 | 
						|
       non-standard,  non-Perl  properties  internally  when PCRE2_UCP is set.
 | 
						|
       However, they may also be used explicitly. These properties are:
 | 
						|
 | 
						|
         Xan   Any alphanumeric character
         Xps   Any POSIX space character
         Xsp   Any Perl space character
         Xwd   Any Perl "word" character
 | 
						|
 | 
						|
       Xan matches characters that have either the L (letter) or the  N  (num-
 | 
						|
       ber)  property. Xps matches the characters tab, linefeed, vertical tab,
 | 
						|
       form feed, or carriage return, and any other character that has  the  Z
 | 
						|
       (separator)  property.   Xsp  is  the  same as Xps; in PCRE1 it used to
 | 
						|
       exclude vertical tab, for Perl compatibility,  but  Perl  changed.  Xwd
 | 
						|
       matches the same characters as Xan, plus underscore.
 | 
						|
 | 
						|
       There  is another non-standard property, Xuc, which matches any charac-
 | 
						|
       ter that can be represented by a Universal Character Name  in  C++  and
 | 
						|
       other  programming  languages.  These are the characters $, @, ` (grave
 | 
						|
       accent), and all characters with Unicode code points  greater  than  or
 | 
						|
       equal  to U+00A0, except for the surrogates U+D800 to U+DFFF. Note that
 | 
						|
       most base (ASCII) characters are excluded. (Universal  Character  Names
 | 
						|
       are  of  the  form \uHHHH or \UHHHHHHHH where H is a hexadecimal digit.
 | 
						|
       Note that the Xuc property does not match these sequences but the char-
 | 
						|
       acters that they represent.)
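
       The definitions of these five properties translate directly into
       predicates.  The following sketch uses Python's unicodedata module for
       the general category lookup, so its results depend on Python's Unicode
       version rather than PCRE2's; the lower-case function names are chosen
       for the example only.

```python
import unicodedata


def xan(ch: str) -> bool:
    """Xan: the character has the L (letter) or N (number) property."""
    return unicodedata.category(ch)[0] in "LN"


def xps(ch: str) -> bool:
    """Xps: tab, linefeed, vertical tab, form feed, carriage return, or Z."""
    return ch in "\t\n\x0b\x0c\r" or unicodedata.category(ch)[0] == "Z"


xsp = xps   # Xsp is the same as Xps


def xwd(ch: str) -> bool:
    """Xwd: the same characters as Xan, plus underscore."""
    return ch == "_" or xan(ch)


def xuc(ch: str) -> bool:
    """Xuc: $, @, grave accent, or any code point >= U+00A0 bar surrogates."""
    cp = ord(ch)
    return ch in "$@`" or (cp >= 0xA0 and not 0xD800 <= cp <= 0xDFFF)
```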
 | 
						|
 | 
						|
   Resetting the match start
 | 
						|
 | 
						|
       The  escape sequence \K causes any previously matched characters not to
 | 
						|
       be included in the final matched sequence. For example, the pattern:
 | 
						|
 | 
						|
         foo\Kbar
 | 
						|
 | 
						|
       matches "foobar", but reports that it has matched "bar".  This  feature
 | 
						|
       is  similar  to  a lookbehind assertion (described below).  However, in
 | 
						|
       this case, the part of the subject before the real match does not  have
 | 
						|
       to  be of fixed length, as lookbehind assertions do. The use of \K does
 | 
						|
       not interfere with the setting of captured  substrings.   For  example,
 | 
						|
       when the pattern
 | 
						|
 | 
						|
         (foo)\Kbar
 | 
						|
 | 
						|
       matches "foobar", the first substring is still set to "foo".
 | 
						|
 | 
						|
       Perl  documents  that  the  use  of  \K  within assertions is "not well
 | 
						|
       defined". In PCRE2, \K is acted upon when  it  occurs  inside  positive
 | 
						|
       assertions,  but  is  ignored  in negative assertions. Note that when a
 | 
						|
       pattern such as (?=ab\K) matches, the reported start of the  match  can
 | 
						|
       be greater than the end of the match.
 | 
						|
 | 
						|
   Simple assertions
 | 
						|
 | 
						|
       The  final use of backslash is for certain simple assertions. An asser-
 | 
						|
       tion specifies a condition that has to be met at a particular point  in
 | 
						|
       a  match, without consuming any characters from the subject string. The
 | 
						|
       use of subpatterns for more complicated assertions is described  below.
 | 
						|
       The backslashed assertions are:
 | 
						|
 | 
						|
         \b     matches at a word boundary
         \B     matches when not at a word boundary
         \A     matches at the start of the subject
         \Z     matches at the end of the subject
                 also matches before a newline at the end of the subject
         \z     matches only at the end of the subject
         \G     matches at the first matching position in the subject
 | 
						|
 | 
						|
       Inside  a  character  class, \b has a different meaning; it matches the
 | 
						|
       backspace character. If any other of  these  assertions  appears  in  a
 | 
						|
       character class, an "invalid escape sequence" error is generated.
 | 
						|
 | 
						|
       A  word  boundary is a position in the subject string where the current
 | 
						|
       character and the previous character do not both match \w or  \W  (i.e.
 | 
						|
       one  matches  \w  and the other matches \W), or the start or end of the
 | 
						|
       string if the first or last character matches \w,  respectively.  In  a
 | 
						|
       UTF  mode,  the  meanings  of  \w  and \W can be changed by setting the
 | 
						|
       PCRE2_UCP option. When this is done, it also affects \b and \B. Neither
 | 
						|
       PCRE2  nor Perl has a separate "start of word" or "end of word" metase-
 | 
						|
       quence. However, whatever follows \b normally determines which  it  is.
 | 
						|
       For example, the fragment \ba matches "a" at the start of a word.
 | 
						|
 | 
						|
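       The word boundary definition above can be checked mechanically. The
       following is a minimal sketch, not PCRE2 code: it uses Python's re
       module as a stand-in, assuming that its ASCII-mode \w and \b agree with
       PCRE2's defaults when PCRE2_UCP is not set. The helper names is_word
       and boundaries are hypothetical.

```python
import re

def is_word(ch):
    # ASCII-only \w: letters, digits, and underscore (assumed default tables).
    return ch is not None and ch.isascii() and (ch.isalnum() or ch == "_")

def boundaries(subject):
    # A boundary lies at a position where exactly one neighbour is a word
    # character; positions outside the subject count as non-word.
    found = []
    for i in range(len(subject) + 1):
        prev = subject[i - 1] if i > 0 else None
        cur = subject[i] if i < len(subject) else None
        if is_word(prev) != is_word(cur):
            found.append(i)
    return found

subject = "foo bar_1, baz"
by_definition = boundaries(subject)
by_regex = [m.start() for m in re.finditer(r"\b", subject, re.ASCII)]
```

       For this subject both lists hold the same offsets, which are the starts
       and ends of "foo", "bar_1", and "baz".
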
       The  \A,  \Z,  and \z assertions differ from the traditional circumflex
       and dollar (described in the next section) in that they only ever match
       at  the  very start and end of the subject string, whatever options are
       set. Thus, they are independent of multiline mode. These  three  asser-
       tions  are  not  affected  by the PCRE2_NOTBOL or PCRE2_NOTEOL options,
       which affect only the behaviour of the circumflex and dollar  metachar-
       acters.  However,  if the startoffset argument of pcre2_match() is non-
       zero, indicating that matching is to start at a point  other  than  the
       beginning  of  the subject, \A can never match.  The difference between
       \Z and \z is that \Z matches before a newline at the end of the  string
       as well as at the very end, whereas \z matches only at the end.

       The  \G assertion is true only when the current matching position is at
       the start point of the match, as specified by the startoffset  argument
       of  pcre2_match().  It differs from \A when the value of startoffset is
       non-zero. By calling  pcre2_match()  multiple  times  with  appropriate
       arguments,  you  can  mimic Perl's /g option, and it is in this kind of
       implementation where \G can be useful.

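       A typical case is a tokenizing loop in which each match must begin
       exactly where the previous one ended. The sketch below shows only the
       shape of such a loop. It is not PCRE2 code: Python's re module has no
       \G, so the sketch assumes that Pattern.match(subject, pos), which can
       match only at pos, plays the part of a \G-anchored pattern passed to
       pcre2_match() with startoffset set to pos. The pattern and the name
       tokenize are illustrative.

```python
import re

# Each token is a run of digits or a run of lower case letters.
TOKEN = re.compile(r"\d+|[a-z]+")

def tokenize(subject):
    pos = 0
    tokens = []
    while pos < len(subject):
        # match() succeeds only if a token starts exactly at pos.
        m = TOKEN.match(subject, pos)
        if m is None:
            break
        tokens.append(m.group())
        pos = m.end()  # the next attempt starts where this match ended
    return tokens, pos

tokens, stopped_at = tokenize("abc123def!x")
```

       The loop stops at the first position where no token starts, instead of
       skipping ahead to a later match as an unanchored search would.
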
       Note, however, that PCRE2's interpretation of \G, as the start  of  the
       current match, is subtly different from Perl's, which defines it as the
       end of the previous match. In Perl, these can  be  different  when  the
       previously  matched string was empty. Because PCRE2 does just one match
       at a time, it cannot reproduce this behaviour.

       If all the alternatives of a pattern begin with \G, the  expression  is
       anchored to the starting match position, and the "anchored" flag is set
       in the compiled regular expression.


CIRCUMFLEX AND DOLLAR

       The circumflex and dollar  metacharacters  are  zero-width  assertions.
       That  is,  they test for a particular condition being true without con-
       suming any characters from the subject string. These two metacharacters
       are  concerned  with matching the starts and ends of lines. If the new-
       line convention is set so that only the two-character sequence CRLF  is
       recognized  as  a newline, isolated CR and LF characters are treated as
       ordinary data characters, and are not recognized as newlines.

       Outside a character class, in the default matching mode, the circumflex
       character  is  an  assertion  that is true only if the current matching
       point is at the start of the subject string. If the  startoffset  argu-
       ment  of  pcre2_match() is non-zero, or if PCRE2_NOTBOL is set, circum-
       flex can never match if the PCRE2_MULTILINE option is unset.  Inside  a
       character  class,  circumflex  has  an  entirely different meaning (see
       below).

       Circumflex need not be the first character of the pattern if  a  number
       of  alternatives are involved, but it should be the first thing in each
       alternative in which it appears if the pattern is ever  to  match  that
       branch.  If all possible alternatives start with a circumflex, that is,
       if the pattern is constrained to match only at the start  of  the  sub-
       ject,  it  is  said  to be an "anchored" pattern. (There are also other
       constructs that can cause a pattern to be anchored.)

       The dollar character is an assertion that is true only if  the  current
       matching  point  is  at  the  end of the subject string, or immediately
       before a newline  at  the  end  of  the  string  (by  default),  unless
       PCRE2_NOTEOL is set. Note, however, that it does not actually match the
       newline. Dollar need not be the last character of the pattern if a num-
       ber of alternatives are involved, but it should be the last item in any
       branch in which it appears. Dollar has no special meaning in a  charac-
       ter class.

       The  meaning  of  dollar  can be changed so that it matches only at the
       very end of the string, by setting the PCRE2_DOLLAR_ENDONLY  option  at
       compile time. This does not affect the \Z assertion.

       The meanings of the circumflex and dollar metacharacters are changed if
       the PCRE2_MULTILINE option is set. When this  is  the  case,  a  dollar
       character  matches before any newlines in the string, as well as at the
       very end, and a circumflex matches immediately after internal  newlines
       as  well as at the start of the subject string. It does not match after
       a newline that ends the string, for compatibility with  Perl.  However,
       this can be changed by setting the PCRE2_ALT_CIRCUMFLEX option.

       For  example, the pattern /^abc$/ matches the subject string "def\nabc"
       (where \n represents a newline) in multiline mode, but  not  otherwise.
       Consequently,  patterns  that  are anchored in single line mode because
       all branches start with ^ are not anchored in  multiline  mode,  and  a
       match  for  circumflex  is  possible  when  the startoffset argument of
       pcre2_match() is non-zero. The PCRE2_DOLLAR_ENDONLY option  is  ignored
       if PCRE2_MULTILINE is set.

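       The /^abc$/ example can be tried out with the following minimal sketch.
       It uses Python's re module as a stand-in, on the assumption that its
       circumflex, dollar, and \A agree with PCRE2 for these inputs and that
       re.MULTILINE corresponds to PCRE2_MULTILINE. Python's \Z is left out on
       purpose, because it matches only at the very end, like PCRE2's \z.

```python
import re

subject = "def\nabc"

# Default mode: circumflex matches only at the start of the subject.
single_line = re.search(r"^abc$", subject)
# Multiline mode: circumflex also matches just after the internal newline.
multi_line = re.search(r"^abc$", subject, re.MULTILINE)
# \A is independent of multiline mode.
start_only = re.search(r"\Aabc", subject, re.MULTILINE)
# By default, dollar also matches just before a newline that ends the subject,
# without consuming that newline.
before_final_newline = re.search(r"abc$", "abc\n")
```
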
       When  the  newline  convention (see "Newline conventions" below) recog-
       nizes the two-character sequence CRLF as a newline, this is  preferred,
       even  if  the  single  characters CR and LF are also recognized as new-
       lines. For example, if the newline convention  is  "any",  a  multiline
       mode  circumflex matches before "xyz" in the string "abc\r\nxyz" rather
       than after CR, even though CR on its own is a valid newline.  (It  also
       matches at the very start of the string, of course.)

       Note  that  the sequences \A, \Z, and \z can be used to match the start
       and end of the subject in both modes, and if all branches of a  pattern
       start  with \A it is always anchored, whether or not PCRE2_MULTILINE is
       set.


FULL STOP (PERIOD, DOT) AND \N

       Outside a character class, a dot in the pattern matches any one charac-
       ter  in  the subject string except (by default) a character that signi-
       fies the end of a line.

       When a line ending is defined as a single character, dot never  matches
       that  character; when the two-character sequence CRLF is used, dot does
       not match CR if it is immediately followed  by  LF,  but  otherwise  it
       matches  all characters (including isolated CRs and LFs). When any Uni-
       code line endings are being recognized, dot does not match CR or LF  or
       any of the other line ending characters.

       The  behaviour  of  dot  with regard to newlines can be changed. If the
       PCRE2_DOTALL option is set, a dot matches any  one  character,  without
       exception.   If  the two-character sequence CRLF is present in the sub-
       ject string, it takes two dots to match it.

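       The effect of the dotall setting can be seen in the following minimal
       sketch. It uses Python's re module as a stand-in and assumes a newline
       convention of LF alone, which is the only one Python's re recognizes;
       re.DOTALL is taken to correspond to PCRE2_DOTALL.

```python
import re

# By default, dot does not match the line ending character (LF here).
default_dot = re.search(r"a.b", "a\nb")
# With the dotall option set, dot matches any one character.
dotall_dot = re.search(r"a.b", "a\nb", re.DOTALL)
# CRLF is two characters, so one dot is not enough even with dotall set.
one_dot_crlf = re.fullmatch(r"a.b", "a\r\nb", re.DOTALL)
two_dots_crlf = re.fullmatch(r"a..b", "a\r\nb", re.DOTALL)
```
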
       The handling of dot is entirely independent of the handling of  circum-
       flex  and  dollar,  the  only relationship being that they both involve
       newlines. Dot has no special meaning in a character class.

       The escape sequence \N behaves like  a  dot,  except  that  it  is  not
       affected  by  the  PCRE2_DOTALL  option. In other words, it matches any
       character except one that signifies the end of a line. Perl  also  uses
       \N to match characters by name; PCRE2 does not support this.


MATCHING A SINGLE CODE UNIT

       Outside  a character class, the escape sequence \C matches any one code
       unit, whether or not a UTF mode is set. In the 8-bit library, one  code
       unit  is  one  byte;  in the 16-bit library it is a 16-bit unit; in the
       32-bit library it is a 32-bit unit. Unlike a  dot,  \C  always  matches
       line-ending  characters.  The  feature  is provided in Perl in order to
       match individual bytes in UTF-8 mode, but it is unclear how it can use-
       fully be used.

       Because  \C  breaks  up characters into individual code units, matching
       one unit with \C in UTF-8 or UTF-16 mode means that  the  rest  of  the
       string  may  start  with  a malformed UTF character. This has undefined
       results, because PCRE2 assumes that it is matching character by charac-
       ter  in  a  valid UTF string (by default it checks the subject string's
       validity at the  start  of  processing  unless  the  PCRE2_NO_UTF_CHECK
       option is used).

       An   application   can   lock   out  the  use  of  \C  by  setting  the
       PCRE2_NEVER_BACKSLASH_C option when compiling a  pattern.  It  is  also
       possible to build PCRE2 with the use of \C permanently disabled.

       PCRE2  does  not allow \C to appear in lookbehind assertions (described
       below) in UTF-8 or UTF-16 modes, because this would make it  impossible
       to  calculate  the  length  of  the lookbehind. Neither the alternative
       matching function pcre2_dfa_match() nor the JIT optimizer  supports  \C
       in  these  UTF  modes.  The former gives a match-time error; the latter
       fails to optimize and so the match is always run using the interpreter.

       In the 32-bit library,  however,  \C  is  always  supported  (when  not
       explicitly  locked  out)  because it always matches a single code unit,
       whether or not UTF-32 is specified.

       In general, the \C escape sequence is best avoided. However, one way of
       using  it  that avoids the problem of malformed UTF-8 or UTF-16 charac-
       ters is to use a lookahead to check the length of the  next  character,
       as  in  this  pattern,  which could be used with a UTF-8 string (ignore
       white space and line breaks):

         (?| (?=[\x00-\x7f])(\C) |
             (?=[\x80-\x{7ff}])(\C)(\C) |
             (?=[\x{800}-\x{ffff}])(\C)(\C)(\C) |
             (?=[\x{10000}-\x{1fffff}])(\C)(\C)(\C)(\C))

       In this example, a group that starts  with  (?|  resets  the  capturing
       parentheses numbers in each alternative (see "Duplicate Subpattern Num-
       bers" below). The assertions at the start of each branch check the next
       UTF-8  character  for  values  whose encoding uses 1, 2, 3, or 4 bytes,
       respectively. The character's individual bytes are then captured by the
       appropriate number of \C groups.


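       The code point ranges tested by the four lookaheads correspond to the
       UTF-8 encoding lengths. The following minimal Python sketch checks that
       correspondence against Python's own UTF-8 encoder. It does not use \C,
       which Python's re module does not provide, and the name utf8_length is
       hypothetical.

```python
def utf8_length(code_point):
    # The limits mirror the upper bounds of the first three lookahead classes
    # in the pattern above; anything larger takes four bytes.
    if code_point <= 0x7F:
        return 1
    if code_point <= 0x7FF:
        return 2
    if code_point <= 0xFFFF:
        return 3
    return 4

# One sample character from each of the four ranges.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]
predicted = [utf8_length(ord(ch)) for ch in samples]
encoded = [len(ch.encode("utf-8")) for ch in samples]
```
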
SQUARE BRACKETS AND CHARACTER CLASSES

       An opening square bracket introduces a character class, terminated by a
       closing square bracket. A closing square bracket on its own is not spe-
       cial  by  default.  If a closing square bracket is required as a member
       of the class, it should be the first data character in the class (after
       an  initial  circumflex,  if present) or escaped with a backslash. This
       means that, by default, an empty class cannot be defined.  However,  if
       the  PCRE2_ALLOW_EMPTY_CLASS option is set, a closing square bracket at
       the start does end the (empty) class.

       A character class matches a single character in the subject. A  matched
       character must be in the set of characters defined by the class, unless
       the first character in the class definition is a circumflex,  in  which
       case the subject character must not be in the set defined by the class.
       If a circumflex is actually required as a member of the  class,  ensure
       it is not the first character, or escape it with a backslash.

       For  example, the character class [aeiou] matches any lower case vowel,
       while [^aeiou] matches any character that is not a  lower  case  vowel.
       Note that a circumflex is just a convenient notation for specifying the
       characters that are in the class by enumerating those that are  not.  A
       class  that starts with a circumflex is not an assertion; it still con-
       sumes a character from the subject string, and therefore  it  fails  if
       the current pointer is at the end of the string.

       When  caseless  matching  is set, any letters in a class represent both
       their upper case and lower case versions, so for  example,  a  caseless
       [aeiou]  matches  "A"  as well as "a", and a caseless [^aeiou] does not
       match "A", whereas a caseful version would.

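       The caseful and caseless behaviour of these two classes can be compared
       with the following minimal sketch. It uses Python's re module as a
       stand-in, assuming that re.IGNORECASE treats these classes in the same
       way as PCRE2_CASELESS.

```python
import re

caseful_vowel = re.fullmatch(r"[aeiou]", "A")
caseless_vowel = re.fullmatch(r"[aeiou]", "A", re.IGNORECASE)
caseful_negated = re.fullmatch(r"[^aeiou]", "A")
caseless_negated = re.fullmatch(r"[^aeiou]", "A", re.IGNORECASE)

# A negated class is not an assertion: it must consume a character, so it
# fails when the matching point is already at the end of the subject.
negated_at_end = re.search(r"x[^aeiou]", "x")
```
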
       Characters that might indicate line breaks are  never  treated  in  any
       special  way  when  matching  character  classes,  whatever line-ending
       sequence is in use,  and  whatever  setting  of  the  PCRE2_DOTALL  and
       PCRE2_MULTILINE  options  is  used. A class such as [^a] always matches
       one of these characters.

       The character escape sequences \d, \D, \h, \H, \p, \P, \s, \S, \v,  \V,
       \w, and \W may appear in a character class, and add the characters that
       they match to the class. For example, [\dABCDEF] matches any  hexadeci-
       mal  digit.  In UTF modes, the PCRE2_UCP option affects the meanings of
       \d, \s, \w and their upper case partners, just as  it  does  when  they
       appear  outside a character class, as described in the section entitled
       "Generic character types" above. The escape sequence \b has a different
       meaning  inside  a character class; it matches the backspace character.
       The sequences \B, \N, \R, and \X are not  special  inside  a  character
       class.  Like  any  other  unrecognized  escape sequences, they cause an
       error.

       The minus (hyphen) character can be used to specify a range of  charac-
       ters  in  a  character  class.  For  example,  [d-m] matches any letter
       between d and m, inclusive. If a  minus  character  is  required  in  a
       class,  it  must  be  escaped  with a backslash or appear in a position
       where it cannot be interpreted as indicating a range, typically as  the
       first or last character in the class, or immediately after a range. For
       example, [b-d-z] matches letters in the range b to d, a hyphen  charac-
       ter, or z.

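       The three ways of writing a hyphen described above can be compared with
       the following minimal sketch. It uses Python's re module as a stand-in,
       assuming that it parses these particular classes in the same way as
       PCRE2; the class [a\-z] is an extra illustration of the escaped form.

```python
import re

candidates = "abcde-yz"

# A plain range: only the letters d to m qualify.
plain_range = [c for c in candidates if re.fullmatch(r"[d-m]", c)]
# A hyphen immediately after a range is a literal hyphen.
after_range = [c for c in candidates if re.fullmatch(r"[b-d-z]", c)]
# An escaped hyphen is always a literal, so this is not the range a to z.
escaped = [c for c in candidates if re.fullmatch(r"[a\-z]", c)]
```
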
       Perl treats a hyphen as a literal if it appears before or after a POSIX
       class (see below) or a character type escape such as \d,  but  gives  a
       warning  in  its  warning mode, as this is most likely a user error. As
       PCRE2 has no facility for warning, an error is given in these cases.

       It is not possible to have the literal character "]" as the end charac-
       ter  of a range. A pattern such as [W-]46] is interpreted as a class of
       two characters ("W" and "-") followed by a literal string "46]", so  it
       would  match  "W46]"  or  "-46]". However, if the "]" is escaped with a
       backslash it is interpreted as the end of range, so [W-\]46] is  inter-
       preted  as a class containing a range followed by two other characters.
       The octal or hexadecimal representation of "]" can also be used to  end
       a range.

       Ranges normally include all code points between the start and end char-
       acters, inclusive. They can also be  used  for  code  points  specified
       numerically, for example [\000-\037]. Ranges can include any characters
       that are valid for the current mode.

       There is a special case in EBCDIC environments  for  ranges  whose  end
       points are both specified as literal letters in the same case. For com-
       patibility with Perl, EBCDIC code points within the range that are  not
       letters  are  omitted. For example, [h-k] matches only four characters,
       even though the codes for h and k are 0x88 and 0x92, a range of 11 code
       points.  However,  if  the range is specified numerically, for example,
       [\x88-\x92] or [h-\x92], all code points are included.

       If a range that includes letters is used when caseless matching is set,
       it matches the letters in either case. For example, [W-c] is equivalent
       to [][\\^_`wxyzabc], matched caselessly, and  in  a  non-UTF  mode,  if
       character  tables  for  a French locale are in use, [\xc8-\xcb] matches
       accented E characters in both cases.

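       The equivalence stated for [W-c] can be checked by enumeration. The
       following minimal sketch uses Python's re module as a stand-in, assuming
       that re.IGNORECASE handles ranges in the same way as PCRE2_CASELESS for
       ASCII subjects. The expected set is written out by hand from the
       equivalent class given above.

```python
import re

ascii_chars = [chr(n) for n in range(128)]

# Every ASCII character that the caseless range accepts.
matched = {c for c in ascii_chars
           if re.fullmatch(r"[W-c]", c, re.IGNORECASE)}

# The members of [][\\^_`wxyzabc], with the letters in both cases.
expected = set("[\\]^_`") | set("wxyzabc") | set("WXYZABC")
```
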
       A circumflex can conveniently be used with  the  upper  case  character
       types  to specify a more restricted set of characters than the matching
       lower case type.  For example, the class [^\W_] matches any  letter  or
       digit, but not underscore, whereas [\w] includes underscore. A positive
       character class should be read as "something OR something OR ..." and a
       negative class as "NOT something AND NOT something AND NOT ...".

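       The difference between [^\W_] and [\w] can be seen in the following
       minimal sketch, which uses Python's re module in ASCII mode as a
       stand-in for PCRE2 without PCRE2_UCP.

```python
import re

subject = "a_1-B"

# NOT a non-word character AND NOT underscore: letters and digits only.
letters_and_digits = re.findall(r"[^\W_]", subject, re.ASCII)
# The positive class keeps the underscore.
word_characters = re.findall(r"[\w]", subject, re.ASCII)
```
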
       The  only  metacharacters  that are recognized in character classes are
       backslash, hyphen (only where it can be  interpreted  as  specifying  a
       range),  circumflex  (only  at the start), opening square bracket (only
       when it can be interpreted as introducing a POSIX class name, or for  a
       special  compatibility  feature  -  see the next two sections), and the
       terminating  closing  square  bracket.  However,  escaping  other  non-
       alphanumeric characters does no harm.


POSIX CHARACTER CLASSES

       Perl supports the POSIX notation for character classes. This uses names
       enclosed by [: and :] within the enclosing square brackets. PCRE2  also
       supports this notation. For example,

         [01[:alpha:]%]

       matches "0", "1", any alphabetic character, or "%". The supported class
       names are:

         alnum    letters and digits
         alpha    letters
         ascii    character codes 0 - 127
         blank    space or tab only
         cntrl    control characters
         digit    decimal digits (same as \d)
         graph    printing characters, excluding space
         lower    lower case letters
         print    printing characters, including space
         punct    printing characters, excluding letters and digits and space
         space    white space (the same as \s from PCRE 8.34)
         upper    upper case letters
         word     "word" characters (same as \w)
         xdigit   hexadecimal digits

       The default "space" characters are HT (9), LF (10), VT (11),  FF  (12),
       CR  (13),  and space (32). If locale-specific matching is taking place,
       the list of space characters may be different; there may  be  fewer  or
       more of them. "Space" and \s match the same set of characters.

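       The default list of space characters can be enumerated as follows. This
       is a minimal sketch, not PCRE2 code: Python's re module does not accept
       POSIX class names, so it relies on the statement above that "space" and
       \s match the same set, and assumes that Python's ASCII-mode \s agrees
       with PCRE2's default tables.

```python
import re

# Code points below 128 that \s accepts when restricted to ASCII.
space_codes = [n for n in range(128)
               if re.fullmatch(r"\s", chr(n), re.ASCII)]
```
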
       The  name  "word"  is  a Perl extension, and "blank" is a GNU extension
       from Perl 5.8. Another Perl extension is negation, which  is  indicated
       by a ^ character after the colon. For example,

         [12[:^digit:]]

       matches "1", "2", or any non-digit. PCRE2 (and Perl) also recognize the
       POSIX syntax [.ch.] and [=ch=] where "ch" is a "collating element", but
       these are not supported, and an error is given if they are encountered.

       By default, characters with values greater than 127 do not match any of
       the POSIX character classes, although this may be different for charac-
       ters  in  the range 128-255 when locale-specific matching is happening.
       However, if the PCRE2_UCP option is passed to pcre2_compile(), some  of
       the  classes are changed so that Unicode character properties are used.
       This  is  achieved  by  replacing  certain  POSIX  classes  with  other
       sequences, as follows:

         [:alnum:]  becomes  \p{Xan}
         [:alpha:]  becomes  \p{L}
         [:blank:]  becomes  \h
         [:cntrl:]  becomes  \p{Cc}
         [:digit:]  becomes  \p{Nd}
         [:lower:]  becomes  \p{Ll}
         [:space:]  becomes  \p{Xps}
         [:upper:]  becomes  \p{Lu}
         [:word:]   becomes  \p{Xwd}

       Negated  versions, such as [:^alpha:] use \P instead of \p. Three other
       POSIX classes are handled specially in UCP mode:

       [:graph:] This matches characters that have glyphs that mark  the  page
                 when printed. In Unicode property terms, it matches all char-
                 acters with the L, M, N, P, S, or Cf properties, except for:

                   U+061C           Arabic Letter Mark
                   U+180E           Mongolian Vowel Separator
                   U+2066 - U+2069  Various "isolate"s


       [:print:] This matches the same  characters  as  [:graph:]  plus  space
                 characters  that  are  not controls, that is, characters with
                 the Zs property.

       [:punct:] This matches all characters that have the Unicode P (punctua-
                 tion)  property,  plus those characters with code points less
                 than 256 that have the S (Symbol) property.

       The other POSIX classes are unchanged, and match only  characters  with
       code points less than 256.


COMPATIBILITY FEATURE FOR WORD BOUNDARIES

       In  the POSIX.2 compliant library that was included in 4.4BSD Unix, the
       ugly syntax [[:<:]] and [[:>:]] is used for matching  "start  of  word"
       and "end of word". PCRE2 treats these items as follows:

         [[:<:]]  is converted to  \b(?=\w)
         [[:>:]]  is converted to  \b(?<=\w)

       Only these exact character sequences are recognized. A sequence such as
       [a[:<:]b] provokes an error for an unrecognized POSIX class name.  This
       support  is not compatible with Perl. It is provided to help migrations
       from other environments, and is best not used in any new patterns. Note
       that  \b matches at the start and the end of a word (see "Simple asser-
       tions" above), and in a Perl-style pattern the preceding  or  following
       character  normally  shows  which  is  wanted, without the need for the
       assertions that are used above in order  to  give  exactly  the   POSIX
       behaviour.


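       The two replacement sequences can be tried directly. The following
       minimal sketch uses Python's re module as a stand-in, assuming that its
       \b, lookahead, and lookbehind agree with PCRE2 for this subject. Python
       does not recognize [[:<:]] or [[:>:]] themselves, so only the converted
       forms are shown.

```python
import re

subject = "foo bar"

# \b(?=\w): a boundary that is followed by a word character.
start_of_word = [m.start() for m in re.finditer(r"\b(?=\w)", subject)]
# \b(?<=\w): a boundary that is preceded by a word character.
end_of_word = [m.start() for m in re.finditer(r"\b(?<=\w)", subject)]
```
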
VERTICAL BAR

       Vertical  bar characters are used to separate alternative patterns. For
       example, the pattern

         gilbert|sullivan

       matches either "gilbert" or "sullivan". Any number of alternatives  may
       appear,  and  an  empty  alternative  is  permitted (matching the empty
       string). The matching process tries each alternative in turn, from left
       to  right, and the first one that succeeds is used. If the alternatives
       are within a subpattern (defined below), "succeeds" means matching  the
       rest of the main pattern as well as the alternative in the subpattern.


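       The left-to-right rule, and what "succeeds" means inside a subpattern,
       can be seen in the following minimal sketch. It uses Python's re module
       as a stand-in, assuming that its backtracking matcher orders the
       alternatives in the same way as pcre2_match(). The patterns a|ab and
       (a|ab)c are illustrative.

```python
import re

# Either alternative matches on its own, but not the two joined together.
either = [re.fullmatch(r"gilbert|sullivan", s) is not None
          for s in ("gilbert", "sullivan", "gilbert sullivan")]

# At top level the first alternative that succeeds is used, even though a
# later one would match more of the subject.
first_wins = re.match(r"a|ab", "ab").group()

# Inside a subpattern the rest of the pattern must match as well, so the
# first alternative is abandoned in favour of the second.
rest_must_match = re.match(r"(a|ab)c", "abc").group(1)
```
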
INTERNAL OPTION SETTING

       The  settings of the PCRE2_CASELESS, PCRE2_MULTILINE, PCRE2_DOTALL, and
       PCRE2_EXTENDED options (which are Perl-compatible) can be changed  from
       within  the  pattern  by  a  sequence  of  Perl option letters enclosed
       between "(?" and ")".  The option letters are

         i  for PCRE2_CASELESS
         m  for PCRE2_MULTILINE
         s  for PCRE2_DOTALL
         x  for PCRE2_EXTENDED

       For example, (?im) sets caseless, multiline matching. It is also possi-
       ble to unset these options by preceding the letter with a hyphen, and a
       combined setting and unsetting such as (?im-sx), which sets PCRE2_CASE-
       LESS    and    PCRE2_MULTILINE   while   unsetting   PCRE2_DOTALL   and
       PCRE2_EXTENDED, is also permitted. If a letter appears both before  and
       after  the  hyphen, the option is unset. An empty options setting "(?)"
       is allowed. Needless to say, it has no effect.

       The PCRE2-specific options PCRE2_DUPNAMES  and  PCRE2_UNGREEDY  can  be
       changed  in  the  same  way as the Perl-compatible options by using the
       characters J and U respectively.

       When one of these option changes occurs at  top  level  (that  is,  not
       inside  subpattern parentheses), the change applies to the remainder of
       the pattern that follows. An option change  within  a  subpattern  (see
       below  for  a description of subpatterns) affects only that part of the
       subpattern that follows it, so

         (a(?i)b)c

       matches abc and aBc and no other strings  (assuming  PCRE2_CASELESS  is
       not  used).   By this means, options can be made to have different set-
       tings in different parts of the pattern. Any changes made in one alter-
       native do carry on into subsequent branches within the same subpattern.
       For example,

         (a(?i)b|c)

       matches "ab", "aB", "c", and "C", even though  when  matching  "C"  the
       first  branch  is  abandoned before the option setting. This is because
       the effects of option settings happen at compile time. There  would  be
       some very weird behaviour otherwise.

       As  a  convenient shorthand, if any option settings are required at the
       start of a non-capturing subpattern (see the next section), the  option
       letters may appear between the "?" and the ":". Thus the two patterns

         (?i:saturday|sunday)
         (?:(?i)saturday|sunday)

       match exactly the same set of strings.

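       Option settings that apply to only part of a pattern can be tried with
       the following minimal sketch. It uses Python's re module as a stand-in,
       with one caveat: recent versions of that module reject an unscoped (?i)
       in the middle of a pattern, so the sketch writes (a(?i:b))c, which is
       assumed to accept the same strings as the (a(?i)b)c example above.

```python
import re

# Only the "b" is matched caselessly; "a" and "c" stay case-sensitive.
scoped = re.compile(r"(a(?i:b))c")
accepted = [s for s in ("abc", "aBc", "Abc", "abC") if scoped.fullmatch(s)]

# The shorthand form: the option applies to the whole non-capturing group.
weekend = re.compile(r"(?i:saturday|sunday)")
weekend_hits = [s for s in ("SUNDAY", "Saturday", "monday")
                if weekend.fullmatch(s)]
```
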
       Note:  There  are  other  PCRE2-specific options that can be set by the
 | 
						|
       application when the compiling function is called. The pattern can con-
 | 
						|
       tain  special  leading  sequences  such as (*CRLF) to override what the
 | 
						|
       application has set or what has been defaulted. Details  are  given  in
 | 
						|
       the  section  entitled  "Newline  sequences"  above. There are also the
 | 
						|
       (*UTF) and (*UCP) leading sequences that can be used  to  set  UTF  and
 | 
						|
       Unicode  property  modes;  they are equivalent to setting the PCRE2_UTF
 | 
						|
       and PCRE2_UCP options, respectively. However, the application  can  set
 | 
						|
       the PCRE2_NEVER_UTF and PCRE2_NEVER_UCP options, which lock out the use
 | 
						|
       of the (*UTF) and (*UCP) sequences.
 | 
						|
 | 
						|
 | 
						|
SUBPATTERNS
 | 
						|
 | 
						|
       Subpatterns are delimited by parentheses (round brackets), which can be
 | 
						|
       nested.  Turning part of a pattern into a subpattern does two things:
 | 
						|
 | 
						|
       1. It localizes a set of alternatives. For example, the pattern
 | 
						|
 | 
						|
         cat(aract|erpillar|)
 | 
						|
 | 
						|
       matches  "cataract",  "caterpillar", or "cat". Without the parentheses,
 | 
						|
       it would match "cataract", "erpillar" or an empty string.
 | 
						|
 | 
						|
       2. It sets up the subpattern as  a  capturing  subpattern.  This  means
 | 
						|
       that, when the whole pattern matches, the portion of the subject string
 | 
						|
       that matched the subpattern is passed back to  the  caller,  separately
 | 
						|
       from  the portion that matched the whole pattern. (This applies only to
 | 
						|
       the traditional matching function; the DFA matching function  does  not
 | 
						|
       support capturing.)
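
       Both effects can be seen in a short sketch. It uses Python's re
       module as a stand-in for PCRE2 (an assumption: the two engines agree
       on plain grouping and alternation, but re is not PCRE2). The variable
       names are illustrative.

```python
import re

# With parentheses the alternation is confined to the suffix after "cat".
grouped = re.compile(r"cat(aract|erpillar|)")
# Without them, the alternation splits the whole pattern into three branches.
ungrouped = re.compile(r"cataract|erpillar|")

candidates = ("cataract", "caterpillar", "cat", "erpillar", "")
grouped_hits = [s for s in candidates if grouped.fullmatch(s)]
ungrouped_hits = [s for s in candidates if ungrouped.fullmatch(s)]

# The parenthesized part is also passed back separately as capture group 1.
suffix = grouped.fullmatch("caterpillar").group(1)
```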
 | 
						|
 | 
						|
       Opening parentheses are counted from left to right (starting from 1) to
 | 
						|
       obtain numbers for the  capturing  subpatterns.  For  example,  if  the
 | 
						|
       string "the red king" is matched against the pattern
 | 
						|
 | 
						|
         the ((red|white) (king|queen))
 | 
						|
 | 
						|
       the captured substrings are "red king", "red", and "king", and are num-
 | 
						|
       bered 1, 2, and 3, respectively.
 | 
						|
 | 
						|
       The fact that plain parentheses fulfil  two  functions  is  not  always
 | 
						|
       helpful.   There are often times when a grouping subpattern is required
 | 
						|
       without a capturing requirement. If an opening parenthesis is  followed
 | 
						|
       by  a question mark and a colon, the subpattern does not do any captur-
 | 
						|
       ing, and is not counted when computing the  number  of  any  subsequent
 | 
						|
       capturing  subpatterns. For example, if the string "the white queen" is
 | 
						|
       matched against the pattern
 | 
						|
 | 
						|
         the ((?:red|white) (king|queen))
 | 
						|
 | 
						|
       the captured substrings are "white queen" and "queen", and are numbered
 | 
						|
       1 and 2. The maximum number of capturing subpatterns is 65535.
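
       The numbering rules for the two patterns above can be checked with a
       minimal sketch. Python's re module is used as a stand-in for PCRE2
       here (an assumption: it numbers capturing and (?:...) groups the same
       way, but it is a different library).

```python
import re

capturing = re.match(r"the ((red|white) (king|queen))", "the red king")
non_capturing = re.match(r"the ((?:red|white) (king|queen))", "the white queen")

# groups() lists the captures in order of their opening parenthesis.
print(capturing.groups())      # ('red king', 'red', 'king')
print(non_capturing.groups())  # ('white queen', 'queen')
```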
 | 
						|
 | 
						|
       As  a  convenient shorthand, if any option settings are required at the
 | 
						|
       start of a non-capturing subpattern,  the  option  letters  may  appear
 | 
						|
       between the "?" and the ":". Thus the two patterns
 | 
						|
 | 
						|
         (?i:saturday|sunday)
 | 
						|
         (?:(?i)saturday|sunday)
 | 
						|
 | 
						|
       match exactly the same set of strings. Because alternative branches are
 | 
						|
       tried from left to right, and options are not reset until  the  end  of
 | 
						|
       the  subpattern is reached, an option setting in one branch does affect
 | 
						|
       subsequent branches, so the above patterns match "SUNDAY"  as  well  as
 | 
						|
       "Saturday".
 | 
						|
 | 
						|
 | 
						|
DUPLICATE SUBPATTERN NUMBERS
 | 
						|
 | 
						|
       Perl 5.10 introduced a feature whereby each alternative in a subpattern
 | 
						|
       uses the same numbers for its capturing parentheses. Such a  subpattern
 | 
						|
       starts  with (?| and is itself a non-capturing subpattern. For example,
 | 
						|
       consider this pattern:
 | 
						|
 | 
						|
         (?|(Sat)ur|(Sun))day
 | 
						|
 | 
						|
       Because the two alternatives are inside a (?| group, both sets of  cap-
 | 
						|
       turing  parentheses  are  numbered one. Thus, when the pattern matches,
 | 
						|
       you can look at captured substring number  one,  whichever  alternative
 | 
						|
       matched.  This  construct  is useful when you want to capture part, but
 | 
						|
       not all, of one of a number of alternatives. Inside a (?| group, paren-
 | 
						|
       theses  are  numbered as usual, but the number is reset at the start of
 | 
						|
       each branch. The numbers of any capturing parentheses that  follow  the
 | 
						|
       subpattern  start after the highest number used in any branch. The fol-
 | 
						|
       lowing example is taken from the Perl documentation. The numbers under-
 | 
						|
       neath show in which buffer the captured content will be stored.
 | 
						|
 | 
						|
         # before  ---------------branch-reset----------- after
 | 
						|
         / ( a )  (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x
 | 
						|
         # 1            2         2  3        2     3     4
 | 
						|
 | 
						|
       A  back  reference  to a numbered subpattern uses the most recent value
 | 
						|
       that is set for that number by any subpattern.  The  following  pattern
 | 
						|
       matches "abcabc" or "defdef":
 | 
						|
 | 
						|
         /(?|(abc)|(def))\1/
 | 
						|
 | 
						|
       In  contrast,  a subroutine call to a numbered subpattern always refers
 | 
						|
       to the first one in the pattern with the given  number.  The  following
 | 
						|
       pattern matches "abcabc" or "defabc":
 | 
						|
 | 
						|
         /(?|(abc)|(def))(?1)/
 | 
						|
 | 
						|
       A relative reference such as (?-1) is no different: it is just a conve-
 | 
						|
       nient way of computing an absolute group number.
 | 
						|
 | 
						|
       If a condition test for a subpattern's having matched refers to a  non-
 | 
						|
       unique  number, the test is true if any of the subpatterns of that num-
 | 
						|
       ber have matched.
 | 
						|
 | 
						|
       An alternative approach to using this "branch reset" feature is to  use
 | 
						|
       duplicate named subpatterns, as described in the next section.
 | 
						|
 | 
						|
 | 
						|
NAMED SUBPATTERNS
 | 
						|
 | 
						|
       Identifying  capturing  parentheses  by number is simple, but it can be
 | 
						|
       very hard to keep track of the numbers in complicated  regular  expres-
 | 
						|
       sions.  Furthermore,  if  an  expression  is  modified, the numbers may
 | 
						|
       change. To help with this difficulty, PCRE2 supports the naming of sub-
 | 
						|
       patterns. This feature was not added to Perl until release 5.10. Python
 | 
						|
       had the feature earlier, and PCRE1 introduced it at release 4.0,  using
 | 
						|
       the  Python syntax. PCRE2 supports both the Perl and the Python syntax.
 | 
						|
       Perl allows identically numbered subpatterns to have  different  names,
 | 
						|
       but PCRE2 does not.
 | 
						|
 | 
						|
       In  PCRE2, a subpattern can be named in one of three ways: (?<name>...)
 | 
						|
       or (?'name'...) as in Perl, or (?P<name>...) as in  Python.  References
 | 
						|
       to  capturing parentheses from other parts of the pattern, such as back
 | 
						|
       references, recursion, and conditions, can be made by name as  well  as
 | 
						|
       by number.
 | 
						|
 | 
						|
       Names  consist of up to 32 alphanumeric characters and underscores, but
 | 
						|
       must start with a non-digit.  Named  capturing  parentheses  are  still
 | 
						|
       allocated  numbers  as  well as names, exactly as if the names were not
 | 
						|
       present. The PCRE2 API provides function calls for extracting the name-
 | 
						|
       to-number  translation  table  from  a compiled pattern. There are also
 | 
						|
       convenience functions for extracting a captured substring by name.
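
       The relationship between names and numbers can be sketched with
       Python's re module, used here as a stand-in for PCRE2. This is an
       assumption made for runnability: re accepts only the (?P<name>...)
       spelling, and its groupindex attribute is merely analogous to the
       PCRE2 name-to-number table, not the PCRE2 API itself.

```python
import re

# Two named groups followed by one unnamed group.
date = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(\d{2})")
m = date.fullmatch("2015-06-30")

# Named groups still receive ordinary numbers, counted with the unnamed one.
name_to_number = dict(date.groupindex)   # {'year': 1, 'month': 2}
by_name = m.group("month")
by_number = m.group(2)
day = m.group(3)
```

       Here by_name and by_number both hold "06", which shows that the name
       is simply an alias for the group's number.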
 | 
						|
 | 
						|
       By default, a name must be unique within a pattern, but it is  possible
 | 
						|
       to  relax  this constraint by setting the PCRE2_DUPNAMES option at com-
 | 
						|
       pile time.  (Duplicate names are also always permitted for  subpatterns
 | 
						|
       with  the  same  number,  set up as described in the previous section.)
 | 
						|
       Duplicate names can be useful for patterns where only one  instance  of
 | 
						|
       the named parentheses can match.  Suppose you want to match the name of
 | 
						|
       a weekday, either as a 3-letter abbreviation or as the full  name,  and
 | 
						|
       in  both  cases  you  want  to  extract  the abbreviation. This pattern
 | 
						|
       (ignoring the line breaks) does the job:
 | 
						|
 | 
						|
         (?<DN>Mon|Fri|Sun)(?:day)?|
 | 
						|
         (?<DN>Tue)(?:sday)?|
 | 
						|
         (?<DN>Wed)(?:nesday)?|
 | 
						|
         (?<DN>Thu)(?:rsday)?|
 | 
						|
         (?<DN>Sat)(?:urday)?
 | 
						|
 | 
						|
       There are five capturing substrings, but only one is ever set  after  a
 | 
						|
       match.  (An alternative way of solving this problem is to use a "branch
 | 
						|
       reset" subpattern, as described in the previous section.)
 | 
						|
 | 
						|
       The convenience functions for extracting the data by name  returns  the
 | 
						|
       substring  for  the first (and in this example, the only) subpattern of
 | 
						|
       that name that matched. This saves searching  to  find  which  numbered
 | 
						|
       subpattern it was.
 | 
						|
 | 
						|
       If  you  make  a  back  reference to a non-unique named subpattern from
 | 
						|
       elsewhere in the pattern, the subpatterns to which the name refers  are
 | 
						|
       checked  in  the order in which they appear in the overall pattern. The
 | 
						|
       first one that is set is used for the reference. For example, this pat-
 | 
						|
       tern matches both "foofoo" and "barbar" but not "foobar" or "barfoo":
 | 
						|
 | 
						|
         (?:(?<n>foo)|(?<n>bar))\k<n>
 | 
						|
 | 
						|
 | 
						|
       If you make a subroutine call to a non-unique named subpattern, the one
 | 
						|
       that corresponds to the first occurrence of the name is  used.  In  the
 | 
						|
       absence of duplicate numbers (see the previous section) this is the one
 | 
						|
       with the lowest number.
 | 
						|
 | 
						|
       If you use a named reference in a condition test (see the section about
 | 
						|
       conditions below), either to check whether a subpattern has matched, or
 | 
						|
       to check for recursion, all subpatterns with the same name are  tested.
 | 
						|
       If  the condition is true for any one of them, the overall condition is
 | 
						|
       true. This is the same behaviour as  testing  by  number.  For  further
 | 
						|
       details  of  the  interfaces  for  handling  named subpatterns, see the
 | 
						|
       pcre2api documentation.
 | 
						|
 | 
						|
       Warning: You cannot use different names to distinguish between two sub-
 | 
						|
       patterns  with the same number because PCRE2 uses only the numbers when
 | 
						|
       matching. For this reason, an error is given at compile time if differ-
 | 
						|
       ent  names  are given to subpatterns with the same number. However, you
 | 
						|
       can always give the same name to subpatterns with the same number, even
 | 
						|
       when PCRE2_DUPNAMES is not set.
 | 
						|
 | 
						|
 | 
						|
REPETITION
 | 
						|
 | 
						|
       Repetition  is  specified  by  quantifiers, which can follow any of the
 | 
						|
       following items:
 | 
						|
 | 
						|
         a literal data character
 | 
						|
         the dot metacharacter
 | 
						|
         the \C escape sequence
 | 
						|
         the \X escape sequence
 | 
						|
         the \R escape sequence
 | 
						|
         an escape such as \d or \pL that matches a single character
 | 
						|
         a character class
 | 
						|
         a back reference
 | 
						|
         a parenthesized subpattern (including most assertions)
 | 
						|
         a subroutine call to a subpattern (recursive or otherwise)
 | 
						|
 | 
						|
       The general repetition quantifier specifies a minimum and maximum  num-
 | 
						|
       ber  of  permitted matches, by giving the two numbers in curly brackets
 | 
						|
       (braces), separated by a comma. The numbers must be  less  than  65536,
 | 
						|
       and the first must be less than or equal to the second. For example:
 | 
						|
 | 
						|
         z{2,4}
 | 
						|
 | 
						|
       matches  "zz",  "zzz",  or  "zzzz". A closing brace on its own is not a
 | 
						|
       special character. If the second number is omitted, but  the  comma  is
 | 
						|
       present,  there  is  no upper limit; if the second number and the comma
 | 
						|
       are both omitted, the quantifier specifies an exact number of  required
 | 
						|
       matches. Thus
 | 
						|
 | 
						|
         [aeiou]{3,}
 | 
						|
 | 
						|
       matches at least 3 successive vowels, but may match many more, whereas
 | 
						|
 | 
						|
         \d{8}
 | 
						|
 | 
						|
       matches  exactly  8  digits. An opening curly bracket that appears in a
 | 
						|
       position where a quantifier is not allowed, or one that does not  match
 | 
						|
       the  syntax of a quantifier, is taken as a literal character. For exam-
 | 
						|
       ple, {,6} is not a quantifier, but a literal string of four characters.
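
       The three quantifier forms above can be exercised with a short
       sketch. Python's re module stands in for PCRE2 (an assumption: the
       {n,m}, {n,} and {n} forms behave alike in both). The {,6} case is
       left out because Python treats it as a quantifier with a lower bound
       of zero, unlike the behaviour described here.

```python
import re


def whole(pattern, subject):
    """Return True if the pattern matches the entire subject."""
    return re.fullmatch(pattern, subject) is not None


# For each pattern, collect the repeat counts that are accepted.
between_two_and_four = [n for n in range(1, 7) if whole(r"z{2,4}", "z" * n)]
at_least_three = [n for n in range(1, 7) if whole(r"[aeiou]{3,}", "a" * n)]
exactly_eight = [n for n in range(6, 11) if whole(r"\d{8}", "1" * n)]
```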
 | 
						|
 | 
						|
       In UTF modes, quantifiers apply to characters rather than to individual
 | 
						|
       code  units. Thus, for example, \x{100}{2} matches two characters, each
 | 
						|
       of which is represented by a two-byte sequence in a UTF-8 string. Simi-
 | 
						|
       larly,  \X{3} matches three Unicode extended grapheme clusters, each of
 | 
						|
       which may be several code units long (and  they  may  be  of  different
 | 
						|
       lengths).
 | 
						|
 | 
						|
       The quantifier {0} is permitted, causing the expression to behave as if
 | 
						|
       the previous item and the quantifier were not present. This may be use-
 | 
						|
       ful  for  subpatterns that are referenced as subroutines from elsewhere
 | 
						|
       in the pattern (but see also the section entitled "Defining subpatterns
 | 
						|
       for  use  by  reference only" below). Items other than subpatterns that
 | 
						|
       have a {0} quantifier are omitted from the compiled pattern.
 | 
						|
 | 
						|
       For convenience, the three most common quantifiers have  single-charac-
 | 
						|
       ter abbreviations:
 | 
						|
 | 
						|
         *    is equivalent to {0,}
 | 
						|
         +    is equivalent to {1,}
 | 
						|
         ?    is equivalent to {0,1}
 | 
						|
 | 
						|
       It  is  possible  to construct infinite loops by following a subpattern
 | 
						|
       that can match no characters with a quantifier that has no upper limit,
 | 
						|
       for example:
 | 
						|
 | 
						|
         (a?)*
 | 
						|
 | 
						|
       Earlier  versions  of  Perl  and PCRE1 used to give an error at compile
 | 
						|
       time for such patterns. However, because there are cases where this can
 | 
						|
       be useful, such patterns are now accepted, but if any repetition of the
 | 
						|
       subpattern does in fact match no characters, the loop is forcibly  bro-
 | 
						|
       ken.
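
       A minimal sketch of the same pattern, using Python's re module as a
       stand-in for PCRE2 (an assumption: re is a separate engine, but it
       also terminates when an unlimited repeat matches no characters):

```python
import re

# The inner group can match the empty string and the outer * has no upper
# limit, yet both calls return instead of looping forever.
full = re.match(r"(a?)*", "aaa")
empty = re.match(r"(a?)*", "bbb")

consumed = full.group(0)    # the whole run of "a" characters
nothing = empty.span()      # an empty match at the start of the subject
```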
 | 
						|
 | 
						|
       By  default,  the quantifiers are "greedy", that is, they match as much
 | 
						|
       as possible (up to the maximum  number  of  permitted  times),  without
 | 
						|
       causing  the  rest of the pattern to fail. The classic example of where
 | 
						|
       this gives problems is in trying to match comments in C programs. These
 | 
						|
       appear  between  /*  and  */ and within the comment, individual * and /
 | 
						|
       characters may appear. An attempt to match C comments by  applying  the
 | 
						|
       pattern
 | 
						|
 | 
						|
         /\*.*\*/
 | 
						|
 | 
						|
       to the string
 | 
						|
 | 
						|
         /* first comment */  not comment  /* second comment */
 | 
						|
 | 
						|
       fails,  because it matches the entire string owing to the greediness of
 | 
						|
       the .*  item.
 | 
						|
 | 
						|
       If a quantifier is followed by a question mark, it ceases to be greedy,
 | 
						|
       and  instead  matches the minimum number of times possible, so the pat-
 | 
						|
       tern
 | 
						|
 | 
						|
         /\*.*?\*/
 | 
						|
 | 
						|
       does the right thing with the C comments. The meaning  of  the  various
 | 
						|
       quantifiers  is  not  otherwise  changed,  just the preferred number of
 | 
						|
       matches.  Do not confuse this use of question mark with its  use  as  a
 | 
						|
       quantifier  in its own right. Because it has two uses, it can sometimes
 | 
						|
       appear doubled, as in
 | 
						|
 | 
						|
         \d??\d
 | 
						|
 | 
						|
       which matches one digit by preference, but can match two if that is the
 | 
						|
       only way the rest of the pattern matches.
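
       The difference between the greedy and minimizing forms can be seen
       in a short sketch. Python's re module is used as a stand-in for PCRE2
       (an assumption: both engines give * and *? these meanings, but re is
       not PCRE2).

```python
import re

source = "/* first comment */  not comment  /* second comment */"

# Greedy: .* runs to the last "*/" in the subject.
greedy = re.findall(r"/\*.*\*/", source)
# Minimizing: .*? stops at the first "*/" after each "/*".
lazy = re.findall(r"/\*.*?\*/", source)

# \d??\d prefers one digit, but takes two when the whole subject must match.
one_digit = re.match(r"\d??\d", "12").group(0)
two_digits = re.fullmatch(r"\d??\d", "12").group(0)
```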
 | 
						|
 | 
						|
       If the PCRE2_UNGREEDY option is set (an option that is not available in
 | 
						|
       Perl), the quantifiers are not greedy by default, but  individual  ones
 | 
						|
       can  be  made  greedy  by following them with a question mark. In other
 | 
						|
       words, it inverts the default behaviour.
 | 
						|
 | 
						|
       When a parenthesized subpattern is quantified  with  a  minimum  repeat
 | 
						|
       count  that is greater than 1 or with a limited maximum, more memory is
 | 
						|
       required for the compiled pattern, in proportion to  the  size  of  the
 | 
						|
       minimum or maximum.
 | 
						|
 | 
						|
       If  a  pattern  starts  with  .*  or  .{0,} and the PCRE2_DOTALL option
 | 
						|
       (equivalent to Perl's /s) is set, thus allowing the dot to  match  new-
 | 
						|
       lines,  the  pattern  is  implicitly anchored, because whatever follows
 | 
						|
       will be tried against every character position in the  subject  string,
 | 
						|
       so  there  is  no  point  in retrying the overall match at any position
 | 
						|
       after the first. PCRE2 normally treats such a pattern as though it were
 | 
						|
       preceded by \A.
 | 
						|
 | 
						|
       In  cases  where  it  is known that the subject string contains no new-
 | 
						|
       lines, it is worth setting PCRE2_DOTALL in order to obtain  this  opti-
 | 
						|
       mization, or alternatively, using ^ to indicate anchoring explicitly.
 | 
						|
 | 
						|
       However,  there  are  some cases where the optimization cannot be used.
 | 
						|
       When .*  is inside capturing parentheses that are the subject of a back
 | 
						|
       reference elsewhere in the pattern, a match at the start may fail where
 | 
						|
       a later one succeeds. Consider, for example:
 | 
						|
 | 
						|
         (.*)abc\1
 | 
						|
 | 
						|
       If the subject is "xyz123abc123" the match point is the fourth  charac-
 | 
						|
       ter. For this reason, such a pattern is not implicitly anchored.
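
       The match point can be confirmed with a minimal sketch that uses
       Python's re module as a stand-in for PCRE2 (an assumption: re does
       not document an implicit-anchoring optimization, so this shows only
       where the match is found, not how PCRE2 optimizes it).

```python
import re

m = re.search(r"(.*)abc\1", "xyz123abc123")

start_offset = m.start()   # zero-based, so 3 means the fourth character
matched = m.group(0)
captured = m.group(1)
```

       Attempts at offsets 0, 1 and 2 fail because the text captured before
       "abc" does not recur after it; only "123" does.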
 | 
						|
 | 
						|
       Another  case where implicit anchoring is not applied is when the lead-
 | 
						|
       ing .* is inside an atomic group. Once again, a match at the start  may
 | 
						|
       fail where a later one succeeds. Consider this pattern:
 | 
						|
 | 
						|
         (?>.*?a)b
 | 
						|
 | 
						|
       It matches "ab" in the subject "aab". The use of the backtracking  con-
       trol  verbs (*PRUNE) and (*SKIP) also disables this optimization, and
 | 
						|
       there is an option, PCRE2_NO_DOTSTAR_ANCHOR, to do so explicitly.
 | 
						|
 | 
						|
       When a capturing subpattern is repeated, the value captured is the sub-
 | 
						|
       string that matched the final iteration. For example, after
 | 
						|
 | 
						|
         (tweedle[dume]{3}\s*)+
 | 
						|
 | 
						|
       has matched "tweedledum tweedledee" the value of the captured substring
 | 
						|
       is  "tweedledee".  However,  if there are nested capturing subpatterns,
 | 
						|
       the corresponding captured values may have been set in previous  itera-
 | 
						|
       tions. For example, after
 | 
						|
 | 
						|
         (a|(b))+
 | 
						|
 | 
						|
       matches "aba" the value of the second captured substring is "b".
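
       Both examples can be reproduced in a short sketch. Python's re
       module is used as a stand-in for PCRE2 (an assumption: it also
       retains a nested capture from an earlier iteration, but it is a
       separate engine).

```python
import re

last_iteration = re.match(r"(tweedle[dume]{3}\s*)+", "tweedledum tweedledee")
nested = re.match(r"(a|(b))+", "aba")

# Group 1 holds the final iteration; group 2 was set on the second iteration
# and is left untouched by the third.
outer = nested.group(1)
inner = nested.group(2)
```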
 | 
						|
 | 
						|
 | 
						|
ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS
 | 
						|
 | 
						|
       With  both  maximizing ("greedy") and minimizing ("ungreedy" or "lazy")
 | 
						|
       repetition, failure of what follows normally causes the  repeated  item
 | 
						|
       to  be  re-evaluated to see if a different number of repeats allows the
 | 
						|
       rest of the pattern to match. Sometimes it is useful to  prevent  this,
 | 
						|
       either to change the nature of the match, or to cause it to fail earlier
 | 
						|
       than it otherwise might, when the author of the pattern knows there  is
 | 
						|
       no point in carrying on.
 | 
						|
 | 
						|
       Consider,  for  example, the pattern \d+foo when applied to the subject
 | 
						|
       line
 | 
						|
 | 
						|
         123456bar
 | 
						|
 | 
						|
       After matching all 6 digits and then failing to match "foo", the normal
 | 
						|
       action  of  the matcher is to try again with only 5 digits matching the
 | 
						|
       \d+ item, and then with  4,  and  so  on,  before  ultimately  failing.
 | 
						|
       "Atomic  grouping"  (a  term taken from Jeffrey Friedl's book) provides
 | 
						|
       the means for specifying that once a subpattern has matched, it is  not
 | 
						|
       to be re-evaluated in this way.
 | 
						|
 | 
						|
       If  we  use atomic grouping for the previous example, the matcher gives
 | 
						|
       up immediately on failing to match "foo" the first time.  The  notation
 | 
						|
       is a kind of special parenthesis, starting with (?> as in this example:
 | 
						|
 | 
						|
         (?>\d+)foo
 | 
						|
 | 
						|
       This  kind  of  parenthesis "locks up" the  part of the pattern it con-
 | 
						|
       tains once it has matched, and a failure further into  the  pattern  is
 | 
						|
       prevented  from  backtracking into it. Backtracking past it to previous
 | 
						|
       items, however, works as normal.
 | 
						|
 | 
						|
       An alternative description is that a subpattern of  this  type  matches
 | 
						|
       exactly  the  string of characters that an identical standalone pattern
 | 
						|
       would match, if anchored at the current point in the subject string.
 | 
						|
 | 
						|
       Atomic grouping subpatterns are not capturing subpatterns. Simple cases
 | 
						|
       such as the above example can be thought of as a maximizing repeat that
 | 
						|
       must swallow everything it can. So, while both \d+ and  \d+?  are  pre-
 | 
						|
       pared  to  adjust  the number of digits they match in order to make the
 | 
						|
       rest of the pattern match, (?>\d+) can only match an entire sequence of
 | 
						|
       digits.
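
       The "swallow everything" behaviour can be sketched without the (?>
       syntax, which Python's re module supports only from version 3.11.
       The sketch below is an emulation, not the PCRE2 mechanism: it assumes
       the common idiom of capturing inside a lookahead and then consuming
       the capture with a back reference, which leaves the engine nothing to
       hand back.

```python
import re

subject = "1234"

# Ordinary greedy repeat: \d+ gives back the final digit so \d can match.
backtracking = re.fullmatch(r"\d+\d", subject)

# Emulated atomic group: the lookahead captures the whole digit run, the back
# reference consumes it, and no digit can be released afterwards.
atomic_like = re.fullmatch(r"(?=(\d+))\1\d", subject)

# When the rest of the pattern does follow the digits, the match succeeds.
still_matches = re.fullmatch(r"(?=(\d+))\1foo", "123456foo")
```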
 | 
						|
 | 
						|
       Atomic  groups in general can of course contain arbitrarily complicated
 | 
						|
       subpatterns, and can be nested. However, when  the  subpattern  for  an
 | 
						|
       atomic group is just a single repeated item, as in the example above, a
 | 
						|
       simpler notation, called a "possessive quantifier" can  be  used.  This
 | 
						|
       consists  of  an  additional  + character following a quantifier. Using
 | 
						|
       this notation, the previous example can be rewritten as
 | 
						|
 | 
						|
         \d++foo
 | 
						|
 | 
						|
       Note that a possessive quantifier can be used with an entire group, for
 | 
						|
       example:
 | 
						|
 | 
						|
         (abc|xyz){2,3}+
 | 
						|
 | 
						|
       Possessive   quantifiers   are   always  greedy;  the  setting  of  the
 | 
						|
       PCRE2_UNGREEDY option is ignored. They are a  convenient  notation  for
 | 
						|
       the  simpler  forms of atomic group. However, there is no difference in
 | 
						|
       the meaning of a possessive quantifier and the equivalent atomic group,
 | 
						|
       though  there  may  be a performance difference; possessive quantifiers
 | 
						|
       should be slightly faster.
 | 
						|
 | 
						|
       The possessive quantifier syntax is an extension to the Perl  5.8  syn-
 | 
						|
       tax.   Jeffrey  Friedl  originated the idea (and the name) in the first
 | 
						|
       edition of his book. Mike McCloskey liked it, so implemented it when he
 | 
						|
       built Sun's Java package, and PCRE1 copied it from there. It ultimately
 | 
						|
       found its way into Perl at release 5.10.
 | 
						|
 | 
						|
       PCRE2 has an optimization  that  automatically  "possessifies"  certain
 | 
						|
       simple  pattern constructs. For example, the sequence A+B is treated as
 | 
						|
       A++B because there is no point in backtracking into a sequence  of  A's
 | 
						|
       when  B  must  follow.   This  feature  can  be  disabled   by   the
       PCRE2_NO_AUTO_POSSESS  option,  or  by  starting  the  pattern  with
       (*NO_AUTO_POSSESS).
 | 
						|
 | 
						|
       When a pattern contains an unlimited repeat inside  a  subpattern  that
 | 
						|
       can  itself  be  repeated  an  unlimited number of times, the use of an
 | 
						|
       atomic group is the only way to avoid some  failing  matches  taking  a
 | 
						|
       very long time indeed. The pattern
 | 
						|
 | 
						|
         (\D+|<\d+>)*[!?]
 | 
						|
 | 
						|
       matches  an  unlimited number of substrings that either consist of non-
 | 
						|
       digits, or digits enclosed in <>, followed by either ! or  ?.  When  it
 | 
						|
       matches, it runs quickly. However, if it is applied to
 | 
						|
 | 
						|
         aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
 | 
						|
 | 
						|
       it  takes  a  long  time  before reporting failure. This is because the
 | 
						|
       string can be divided between the internal \D+ repeat and the  external
 | 
						|
       *  repeat  in  a  large  number of ways, and all have to be tried. (The
 | 
						|
       example uses [!?] rather than a single character at  the  end,  because
 | 
						|
       both  PCRE2  and Perl have an optimization that allows for fast failure
 | 
						|
       when a single character is used. They remember the last single  charac-
 | 
						|
       ter  that  is required for a match, and fail early if it is not present
 | 
						|
       in the string.) If the pattern is changed so that  it  uses  an  atomic
 | 
						|
       group, like this:
 | 
						|
 | 
						|
         ((?>\D+)|<\d+>)*[!?]
 | 
						|
 | 
						|
       sequences of non-digits cannot be broken, and failure happens quickly.
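
       A hedged sketch of the fast-failing form follows. Because Python's
       re module accepts (?> only from version 3.11, the atomic group is
       emulated with a lookahead capture plus a back reference (group 2);
       this is an assumed stand-in technique rather than PCRE2 itself. The
       unprotected pattern is deliberately not run, since it can take a very
       long time on this subject.

```python
import re

subject = "a" * 50   # a long run of non-digits with no "!" or "?" at the end

# Emulated form of ((?>\D+)|<\d+>)*[!?]; group 2 is the lookahead capture.
atomic_like = re.compile(r"((?=(\D+))\2|<\d+>)*[!?]")

# The run of non-digits is taken in one piece, so failure is reported quickly.
fails_fast = atomic_like.match(subject)

# A subject that fits the pattern is still matched.
still_matches = atomic_like.match("<123>!")
```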
 | 
						|
 | 
						|
 | 
						|
BACK REFERENCES
 | 
						|
 | 
						|
       Outside a character class, a backslash followed by a digit greater than
 | 
						|
       0 (and possibly further digits) is a back reference to a capturing sub-
 | 
						|
       pattern  earlier  (that is, to its left) in the pattern, provided there
 | 
						|
       have been that many previous capturing left parentheses.
 | 
						|
 | 
						|
       However, if the decimal number following the backslash is less than  8,
 | 
						|
       it  is  always  taken  as a back reference, and causes an error only if
 | 
						|
       there are not that many capturing left parentheses in the  entire  pat-
 | 
						|
       tern.  In  other words, the parentheses that are referenced need not be
 | 
						|
       to the left of the reference for numbers less than 8. A  "forward  back
 | 
						|
       reference"  of  this  type can make sense when a repetition is involved
 | 
						|
       and the subpattern to the right has participated in an  earlier  itera-
 | 
						|
       tion.
 | 
						|
 | 
						|
       It  is  not  possible to have a numerical "forward back reference" to a
 | 
						|
       subpattern whose number is 8  or  more  using  this  syntax  because  a
 | 
						|
       sequence  such  as  \50 is interpreted as a character defined in octal.
 | 
						|
       See the subsection entitled "Non-printing characters" above for further
 | 
						|
       details  of  the  handling of digits following a backslash. There is no
 | 
						|
       such problem when named parentheses are used. A back reference  to  any
 | 
						|
       subpattern is possible using named parentheses (see below).
 | 
						|
 | 
						|
       Another  way  of  avoiding  the ambiguity inherent in the use of digits
 | 
						|
       following a backslash is to use the \g  escape  sequence.  This  escape
 | 
						|
       must be followed by a signed or unsigned number, optionally enclosed in
 | 
						|
       braces. These examples are all identical:
 | 
						|
 | 
						|
         (ring), \1
 | 
						|
         (ring), \g1
 | 
						|
         (ring), \g{1}
 | 
						|
 | 
						|
       An unsigned number specifies an absolute reference without the  ambigu-
 | 
						|
       ity that is present in the older syntax. It is also useful when literal
 | 
						|
       digits follow the reference. A signed number is a  relative  reference.
 | 
						|
       Consider this example:
 | 
						|
 | 
						|
         (abc(def)ghi)\g{-1}
 | 
						|
 | 
						|
       The sequence \g{-1} is a reference to the most recently started captur-
       ing subpattern before \g, that is, it is equivalent to \2 in this exam-
       ple.   Similarly, \g{-2} would be equivalent to \1. The use of relative
 | 
						|
       references can be helpful in long patterns, and also in  patterns  that
 | 
						|
       are  created  by  joining  together  fragments  that contain references
 | 
						|
       within themselves.
 | 
						|
 | 
						|
       The sequence \g{+1} is a reference to the  next  capturing  subpattern.
 | 
						|
       This  kind  of forward reference can be useful in patterns that repeat.
 | 
						|
       Perl does not support the use of + in this way.
 | 
						|
 | 
						|
       A back reference matches whatever actually matched the  capturing  sub-
 | 
						|
       pattern  in  the  current subject string, rather than anything matching
 | 
						|
       the subpattern itself (see "Subpatterns as subroutines" below for a way
 | 
						|
       of doing that). So the pattern
 | 
						|
 | 
						|
         (sens|respons)e and \1ibility
 | 
						|
 | 
						|
       matches  "sense and sensibility" and "response and responsibility", but
 | 
						|
       not "sense and responsibility". If caseful matching is in force at  the
 | 
						|
       time  of the back reference, the case of letters is relevant. For exam-
 | 
						|
       ple,
 | 
						|
 | 
						|
         ((?i)rah)\s+\1
 | 
						|
 | 
						|
       matches "rah rah" and "RAH RAH", but not "RAH  rah",  even  though  the
 | 
						|
       original capturing subpattern is matched caselessly.
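
       Both examples can be tried in a short sketch that uses Python's re
       module as a stand-in for PCRE2 (an assumption made for runnability).
       Because recent Python versions reject (?i) in mid-pattern, the second
       pattern is written with the scoped form (?i:rah), which limits
       caseless matching to the capturing group in the same way.

```python
import re

pairing = re.compile(r"(sens|respons)e and \1ibility")
caseless_capture = re.compile(r"((?i:rah))\s+\1")


def whole(pattern, subject):
    """Return True if the compiled pattern matches the entire subject."""
    return pattern.fullmatch(subject) is not None
```

       With these definitions, whole(pairing, "sense and responsibility")
       and whole(caseless_capture, "RAH rah") are both false, because the
       back reference must repeat the captured text exactly.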
 | 
						|
 | 
						|
       There  are  several  different ways of writing back references to named
 | 
						|
       subpatterns. The .NET syntax \k{name} and the Perl syntax  \k<name>  or
 | 
						|
       \k'name'  are supported, as is the Python syntax (?P=name). Perl 5.10's
 | 
						|
       unified back reference syntax, in which \g can be used for both numeric
 | 
						|
       and  named  references,  is  also supported. We could rewrite the above
 | 
						|
       example in any of the following ways:
 | 
						|
 | 
						|
         (?<p1>(?i)rah)\s+\k<p1>
 | 
						|
         (?'p1'(?i)rah)\s+\k{p1}
 | 
						|
         (?P<p1>(?i)rah)\s+(?P=p1)
 | 
						|
         (?<p1>(?i)rah)\s+\g{p1}
 | 
						|
 | 
						|
       A subpattern that is referenced by  name  may  appear  in  the  pattern
 | 
						|
       before or after the reference.
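
       As an illustration of the named forms, the sketch below uses the
       Python spelling, (?P<name>...) with (?P=name), in Python's re module,
       which is assumed here as a stand-in for PCRE2. Two differences in
       that module should be noted: the caseless setting is written as a
       scoped (?i:...) group, and the named group has to appear before the
       reference. The helper name repeats_exactly is hypothetical.

```python
import re

# Stand-in for PCRE2: the group is matched caselessly, but the back
# reference (?P=p1) is outside the caseless scope and so is caseful.
NAMED = re.compile(r"(?P<p1>(?i:rah))\s+(?P=p1)")


def repeats_exactly(subject: str) -> bool:
    """True if the subject is the same word twice, in the same case."""
    return NAMED.fullmatch(subject) is not None
```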
 | 
						|
 | 
						|
       There  may be more than one back reference to the same subpattern. If a
 | 
						|
       subpattern has not actually been used in a particular match,  any  back
 | 
						|
       references to it always fail by default. For example, the pattern
 | 
						|
 | 
						|
         (a|(bc))\2
 | 
						|
 | 
						|
       always  fails  if  it starts to match "a" rather than "bc". However, if
 | 
						|
       the PCRE2_MATCH_UNSET_BACKREF option is set at  compile  time,  a  back
 | 
						|
       reference to an unset value matches an empty string.
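
       The default behaviour for an unset group can be sketched as follows.
       Python's re module is assumed as a stand-in; it behaves like PCRE2
       without PCRE2_MATCH_UNSET_BACKREF and has no equivalent of that
       option. The helper name try_match is hypothetical.

```python
import re

# A reference to group 2 fails whenever group 2 took no part in the match.
UNSET = re.compile(r"(a|(bc))\2")


def try_match(subject: str):
    """Return the text matched at the start of subject, or None."""
    m = UNSET.match(subject)
    return m.group(0) if m else None
```

       With this pattern, try_match("bcbc") succeeds, whereas any subject
       that starts with "a" gives no match.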
 | 
						|
 | 
						|
       Because  there may be many capturing parentheses in a pattern, all dig-
 | 
						|
       its following a backslash are taken as part of a potential back  refer-
 | 
						|
       ence  number.   If  the  pattern continues with a digit character, some
 | 
						|
       delimiter must  be  used  to  terminate  the  back  reference.  If  the
 | 
						|
       PCRE2_EXTENDED  option  is set, this can be white space. Otherwise, the
 | 
						|
       \g{ syntax or an empty comment (see "Comments" below) can be used.
 | 
						|
 | 
						|
   Recursive back references
 | 
						|
 | 
						|
       A back reference that occurs inside the parentheses to which it  refers
 | 
						|
       fails  when  the subpattern is first used, so, for example, (a\1) never
 | 
						|
       matches.  However, such references can be useful inside  repeated  sub-
 | 
						|
       patterns. For example, the pattern
 | 
						|
 | 
						|
         (a|b\1)+
 | 
						|
 | 
						|
       matches any number of "a"s and also "aba", "ababbaa" etc. At each iter-
 | 
						|
       ation of the subpattern,  the  back  reference  matches  the  character
 | 
						|
       string  corresponding  to  the previous iteration. In order for this to
 | 
						|
       work, the pattern must be such that the first iteration does  not  need
 | 
						|
       to  match the back reference. This can be done using alternation, as in
 | 
						|
       the example above, or by a quantifier with a minimum of zero.
 | 
						|
 | 
						|
       Back references of this type cause the group that they reference to  be
 | 
						|
       treated  as  an atomic group.  Once the whole group has been matched, a
 | 
						|
       subsequent matching failure cannot cause backtracking into  the  middle
 | 
						|
       of the group.
 | 
						|
 | 
						|
 | 
						|
ASSERTIONS
 | 
						|
 | 
						|
       An  assertion  is  a  test on the characters following or preceding the
 | 
						|
       current matching point that does not consume any characters. The simple
 | 
						|
       assertions  coded  as  \b,  \B,  \A,  \G, \Z, \z, ^ and $ are described
 | 
						|
       above.
 | 
						|
 | 
						|
       More complicated assertions are coded as  subpatterns.  There  are  two
 | 
						|
       kinds:  those  that  look  ahead of the current position in the subject
 | 
						|
       string, and those that look  behind  it.  An  assertion  subpattern  is
 | 
						|
       matched  in  the  normal way, except that it does not cause the current
 | 
						|
       matching position to be changed.
 | 
						|
 | 
						|
       Assertion subpatterns are not capturing subpatterns. If such an  asser-
 | 
						|
       tion  contains  capturing  subpatterns within it, these are counted for
 | 
						|
       the purposes of numbering the capturing subpatterns in the  whole  pat-
 | 
						|
       tern.  However,  substring  capturing  is carried out only for positive
 | 
						|
       assertions. (Perl sometimes, but not always, does do capturing in nega-
 | 
						|
       tive assertions.)
 | 
						|
 | 
						|
       WARNING:  If a positive assertion containing one or more capturing sub-
 | 
						|
       patterns succeeds, but failure to match later  in  the  pattern  causes
 | 
						|
       backtracking over this assertion, the captures within the assertion are
 | 
						|
       reset only if no higher numbered captures are  already  set.  This  is,
 | 
						|
       unfortunately,  a fundamental limitation of the current implementation;
 | 
						|
       it may get removed in a future reworking.
 | 
						|
 | 
						|
       For  compatibility  with  Perl,  most  assertion  subpatterns  may   be
 | 
						|
       repeated;  though  it  makes  no sense to assert the same thing several
 | 
						|
       times, the side effect of capturing  parentheses  may  occasionally  be
 | 
						|
       useful. However, an assertion that forms the condition for a
       conditional subpattern may not be quantified. In practice, for other
       assertions, there are only three cases:
 | 
						|
 | 
						|
       (1)  If  the  quantifier  is  {0}, the assertion is never obeyed during
 | 
						|
       matching.  However, it may  contain  internal  capturing  parenthesized
 | 
						|
       groups that are called from elsewhere via the subroutine mechanism.
 | 
						|
 | 
						|
       (2) If the quantifier is {0,n} where n is greater than zero, it is
       treated as if it were {0,1}. At run time, the rest of the pattern match
       is tried with and without the assertion, the order depending on the
       greediness of the quantifier.
 | 
						|
 | 
						|
       (3) If the minimum repetition is greater than zero, the  quantifier  is
 | 
						|
       ignored.   The  assertion  is  obeyed just once when encountered during
 | 
						|
       matching.
 | 
						|
 | 
						|
   Lookahead assertions
 | 
						|
 | 
						|
       Lookahead assertions start with (?= for positive assertions and (?! for
 | 
						|
       negative assertions. For example,
 | 
						|
 | 
						|
         \w+(?=;)
 | 
						|
 | 
						|
       matches  a word followed by a semicolon, but does not include the semi-
 | 
						|
       colon in the match, and
 | 
						|
 | 
						|
         foo(?!bar)
 | 
						|
 | 
						|
       matches any occurrence of "foo" that is not  followed  by  "bar".  Note
 | 
						|
       that the apparently similar pattern
 | 
						|
 | 
						|
         (?!foo)bar
 | 
						|
 | 
						|
       does  not  find  an  occurrence  of "bar" that is preceded by something
 | 
						|
       other than "foo"; it finds any occurrence of "bar" whatsoever,  because
 | 
						|
       the assertion (?!foo) is always true when the next three characters are
 | 
						|
       "bar". A lookbehind assertion is needed to achieve the other effect.
 | 
						|
 | 
						|
       If you want to force a matching failure at some point in a pattern, the
 | 
						|
       most  convenient  way  to  do  it  is with (?!) because an empty string
 | 
						|
       always matches, so an assertion that requires there not to be an  empty
 | 
						|
       string must always fail.  The backtracking control verb (*FAIL) or (*F)
 | 
						|
       is a synonym for (?!).
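
       The lookahead examples in this section can be checked with a short
       script. This is a sketch that assumes Python's re module as a
       stand-in for PCRE2 (the lookahead syntax is the same in both); the
       helper name first_match is hypothetical.

```python
import re


def first_match(pattern: str, subject: str):
    """Return (start, text) for the first match, or None."""
    m = re.search(pattern, subject)
    return (m.start(), m.group(0)) if m else None


# Positive lookahead: the semicolon is required but is not consumed.
word = first_match(r"\w+(?=;)", "total;")

# Negative lookahead: only the "foo" that is not followed by "bar".
foo = first_match(r"foo(?!bar)", "foobar foobaz")

# (?!foo)bar: the assertion inspects the same characters as "bar" itself,
# so it cannot exclude a preceding "foo".
bar = first_match(r"(?!foo)bar", "foobar")

# (?!) can never succeed, so a pattern that reaches it always fails.
never = first_match(r"a(?!)", "a")
```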
 | 
						|
 | 
						|
   Lookbehind assertions
 | 
						|
 | 
						|
       Lookbehind assertions start with (?<= for positive assertions and  (?<!
 | 
						|
       for negative assertions. For example,
 | 
						|
 | 
						|
         (?<!foo)bar
 | 
						|
 | 
						|
       does  find  an  occurrence  of "bar" that is not preceded by "foo". The
 | 
						|
       contents of a lookbehind assertion are restricted  such  that  all  the
 | 
						|
       strings it matches must have a fixed length. However, if there are sev-
 | 
						|
       eral top-level alternatives, they do not all  have  to  have  the  same
 | 
						|
       fixed length. Thus
 | 
						|
 | 
						|
         (?<=bullock|donkey)
 | 
						|
 | 
						|
       is permitted, but
 | 
						|
 | 
						|
         (?<!dogs?|cats?)
 | 
						|
 | 
						|
       causes  an  error at compile time. Branches that match different length
 | 
						|
       strings are permitted only at the top level of a lookbehind  assertion.
 | 
						|
       This is an extension compared with Perl, which requires all branches to
 | 
						|
       match the same length of string. An assertion such as
 | 
						|
 | 
						|
         (?<=ab(c|de))
 | 
						|
 | 
						|
       is not permitted, because its single top-level  branch  can  match  two
 | 
						|
       different  lengths,  but  it is acceptable to PCRE2 if rewritten to use
 | 
						|
       two top-level branches:
 | 
						|
 | 
						|
         (?<=abc|abde)
 | 
						|
 | 
						|
       In some cases, the escape sequence \K (see above) can be  used  instead
 | 
						|
       of a lookbehind assertion to get round the fixed-length restriction.
 | 
						|
 | 
						|
       The  implementation  of lookbehind assertions is, for each alternative,
 | 
						|
       to temporarily move the current position back by the fixed  length  and
 | 
						|
       then try to match. If there are insufficient characters before the cur-
 | 
						|
       rent position, the assertion fails.
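
       A minimal sketch of these lookbehind rules, assuming Python's re
       module as a stand-in for PCRE2. That module is stricter than PCRE2:
       it requires every alternative in a lookbehind to have the same
       length, so the (?<=bullock|donkey) example is not accepted there.
       The helper name match_starts is hypothetical.

```python
import re


def match_starts(pattern: str, subject: str):
    """Return the start offsets of all matches of pattern in subject."""
    return [m.start() for m in re.finditer(pattern, subject)]


# "bar" at offset 3 is preceded by "foo" and is therefore skipped.
starts = match_starts(r"(?<!foo)bar", "foobar bazbar bar")

# Only two characters precede "d", so the three-character lookbehind
# cannot be applied and the positive assertion fails.
too_short = match_starts(r"(?<=abc)d", "bcd")
```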
 | 
						|
 | 
						|
       In UTF-8 and UTF-16 modes, PCRE2 does not allow the  \C  escape  (which
 | 
						|
       matches  a single code unit even in a UTF mode) to appear in lookbehind
 | 
						|
       assertions, because it makes it impossible to calculate the  length  of
 | 
						|
       the  lookbehind.  The \X and \R escapes, which can match different num-
 | 
						|
       bers of code units, are never permitted in lookbehinds.
 | 
						|
 | 
						|
       "Subroutine" calls (see below) such as (?2) or (?&X) are  permitted  in
 | 
						|
       lookbehinds,  as  long as the subpattern matches a fixed-length string.
 | 
						|
       However, recursion, that is, a "subroutine" call into a group  that  is
 | 
						|
       already active, is not supported.
 | 
						|
 | 
						|
       Perl  does  not support back references in lookbehinds. PCRE2 does sup-
 | 
						|
       port  them,   but   only   if   certain   conditions   are   met.   The
 | 
						|
       PCRE2_MATCH_UNSET_BACKREF  option must not be set, there must be no use
 | 
						|
       of (?| in the pattern (it creates duplicate subpattern numbers), and if
 | 
						|
       the  back reference is by name, the name must be unique. Of course, the
 | 
						|
       referenced subpattern must itself be of  fixed  length.  The  following
 | 
						|
       pattern matches words containing at least two characters that begin and
 | 
						|
       end with the same character:
 | 
						|
 | 
						|
          \b(\w)\w++(?<=\1)
 | 
						|
 | 
						|
       Possessive quantifiers can  be  used  in  conjunction  with  lookbehind
 | 
						|
       assertions to specify efficient matching of fixed-length strings at the
 | 
						|
       end of subject strings. Consider a simple pattern such as
 | 
						|
 | 
						|
         abcd$
 | 
						|
 | 
						|
       when applied to a long string that does  not  match.  Because  matching
 | 
						|
       proceeds  from  left to right, PCRE2 will look for each "a" in the sub-
 | 
						|
       ject and then see if what follows matches the rest of the  pattern.  If
 | 
						|
       the pattern is specified as
 | 
						|
 | 
						|
         ^.*abcd$
 | 
						|
 | 
						|
       the  initial .* matches the entire string at first, but when this fails
 | 
						|
       (because there is no following "a"), it backtracks to match all but the
 | 
						|
       last  character,  then all but the last two characters, and so on. Once
 | 
						|
       again the search for "a" covers the entire string, from right to  left,
 | 
						|
       so we are no better off. However, if the pattern is written as
 | 
						|
 | 
						|
         ^.*+(?<=abcd)
 | 
						|
 | 
						|
       there can be no backtracking for the .*+ item because of the possessive
 | 
						|
       quantifier; it can match only the entire string. The subsequent lookbe-
 | 
						|
       hind  assertion  does  a single test on the last four characters. If it
 | 
						|
       fails, the match fails immediately. For  long  strings,  this  approach
 | 
						|
       makes a significant difference to the processing time.
 | 
						|
 | 
						|
   Using multiple assertions
 | 
						|
 | 
						|
       Several assertions (of any sort) may occur in succession. For example,
 | 
						|
 | 
						|
         (?<=\d{3})(?<!999)foo
 | 
						|
 | 
						|
       matches  "foo" preceded by three digits that are not "999". Notice that
 | 
						|
       each of the assertions is applied independently at the  same  point  in
 | 
						|
       the  subject  string.  First  there  is a check that the previous three
 | 
						|
       characters are all digits, and then there is  a  check  that  the  same
 | 
						|
       three characters are not "999". This pattern does not match "foo"
       preceded by six characters, the first three of which are digits and the
       last three of which are not "999". For example, it doesn't match
       "123abcfoo". A pattern to do that is
 | 
						|
 | 
						|
         (?<=\d{3}...)(?<!999)foo
 | 
						|
 | 
						|
       This time the first assertion looks at the  preceding  six  characters,
 | 
						|
       checking that the first three are digits, and then the second assertion
 | 
						|
       checks that the preceding three characters are not "999".
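
       The difference between the two patterns above can be seen by running
       them side by side. The sketch assumes Python's re module as a
       stand-in for PCRE2; the names ADJACENT, OFFSET and found are
       hypothetical.

```python
import re

# Both assertions look at the same three characters before "foo".
ADJACENT = re.compile(r"(?<=\d{3})(?<!999)foo")

# The first assertion looks back six characters, the second only three.
OFFSET = re.compile(r"(?<=\d{3}...)(?<!999)foo")


def found(pattern, subject: str) -> bool:
    """True if pattern matches anywhere in subject."""
    return pattern.search(subject) is not None
```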
 | 
						|
 | 
						|
       Assertions can be nested in any combination. For example,
 | 
						|
 | 
						|
         (?<=(?<!foo)bar)baz
 | 
						|
 | 
						|
       matches an occurrence of "baz" that is preceded by "bar" which in  turn
 | 
						|
       is not preceded by "foo", while
 | 
						|
 | 
						|
         (?<=\d{3}(?!999)...)foo
 | 
						|
 | 
						|
       is  another pattern that matches "foo" preceded by three digits and any
 | 
						|
       three characters that are not "999".
 | 
						|
 | 
						|
 | 
						|
CONDITIONAL SUBPATTERNS
 | 
						|
 | 
						|
       It is possible to cause the matching process to obey a subpattern  con-
 | 
						|
       ditionally  or to choose between two alternative subpatterns, depending
 | 
						|
       on the result of an assertion, or whether a specific capturing  subpat-
 | 
						|
       tern  has  already  been matched. The two possible forms of conditional
 | 
						|
       subpattern are:
 | 
						|
 | 
						|
         (?(condition)yes-pattern)
 | 
						|
         (?(condition)yes-pattern|no-pattern)
 | 
						|
 | 
						|
       If the condition is satisfied, the yes-pattern is used;  otherwise  the
 | 
						|
       no-pattern  (if  present)  is used. If there are more than two alterna-
 | 
						|
       tives in the subpattern, a compile-time error occurs. Each of  the  two
 | 
						|
       alternatives may itself contain nested subpatterns of any form, includ-
 | 
						|
       ing  conditional  subpatterns;  the  restriction  to  two  alternatives
 | 
						|
       applies only at the level of the condition. This pattern fragment is an
 | 
						|
       example where the alternatives are complex:
 | 
						|
 | 
						|
         (?(1) (A|B|C) | (D | (?(2)E|F) | E) )
 | 
						|
 | 
						|
 | 
						|
       There are five kinds of condition: references  to  subpatterns,  refer-
 | 
						|
       ences  to  recursion,  two pseudo-conditions called DEFINE and VERSION,
 | 
						|
       and assertions.
 | 
						|
 | 
						|
   Checking for a used subpattern by number
 | 
						|
 | 
						|
       If the text between the parentheses consists of a sequence  of  digits,
 | 
						|
       the condition is true if a capturing subpattern of that number has pre-
 | 
						|
       viously matched. If there is more than one  capturing  subpattern  with
 | 
						|
       the  same  number  (see  the earlier section about duplicate subpattern
 | 
						|
       numbers), the condition is true if any of them have matched. An  alter-
 | 
						|
       native  notation is to precede the digits with a plus or minus sign. In
 | 
						|
       this case, the subpattern number is relative rather than absolute.  The
 | 
						|
       most  recently opened parentheses can be referenced by (?(-1), the next
 | 
						|
       most recent by (?(-2), and so on. Inside loops it can also  make  sense
 | 
						|
       to refer to subsequent groups. The next parentheses to be opened can be
 | 
						|
       referenced as (?(+1), and so on. (The value zero in any of these  forms
 | 
						|
       is not used; it provokes a compile-time error.)
 | 
						|
 | 
						|
       Consider  the  following  pattern, which contains non-significant white
 | 
						|
       space to make it more readable (assume the PCRE2_EXTENDED  option)  and
 | 
						|
       to divide it into three parts for ease of discussion:
 | 
						|
 | 
						|
         ( \( )?    [^()]+    (?(1) \) )
 | 
						|
 | 
						|
       The  first  part  matches  an optional opening parenthesis, and if that
 | 
						|
       character is present, sets it as the first captured substring. The sec-
 | 
						|
       ond  part  matches one or more characters that are not parentheses. The
 | 
						|
       third part is a conditional subpattern that tests whether  or  not  the
 | 
						|
       first set of parentheses matched. If they did, that is, if the subject
 | 
						|
       started with an opening parenthesis, the condition is true, and so  the
 | 
						|
       yes-pattern  is  executed and a closing parenthesis is required. Other-
 | 
						|
       wise, since no-pattern is not present, the subpattern matches  nothing.
 | 
						|
       In  other  words,  this  pattern matches a sequence of non-parentheses,
 | 
						|
       optionally enclosed in parentheses.
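
       The behaviour described above can be sketched with Python's re
       module, assumed here as a stand-in for PCRE2. Its re.VERBOSE flag
       plays the role of PCRE2_EXTENDED, and it accepts the absolute form
       (?(1)...) but not the relative forms. The helper name accepts is
       hypothetical.

```python
import re

# Optional "(", then non-parentheses, then ")" only if group 1 was set.
OPTIONAL_PARENS = re.compile(r"( \( )?    [^()]+    (?(1) \) )", re.VERBOSE)


def accepts(subject: str) -> bool:
    """True if the whole subject is text optionally wrapped in ( )."""
    return OPTIONAL_PARENS.fullmatch(subject) is not None
```

       Subjects such as "abc" and "(abc)" are accepted; "(abc" and "abc)"
       are rejected because the parentheses are unbalanced.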
 | 
						|
 | 
						|
       If you were embedding this pattern in a larger one,  you  could  use  a
 | 
						|
       relative reference:
 | 
						|
 | 
						|
         ...other stuff... ( \( )?    [^()]+    (?(-1) \) ) ...
 | 
						|
 | 
						|
       This  makes  the  fragment independent of the parentheses in the larger
 | 
						|
       pattern.
 | 
						|
 | 
						|
   Checking for a used subpattern by name
 | 
						|
 | 
						|
       Perl uses the syntax (?(<name>)...) or (?('name')...)  to  test  for  a
 | 
						|
       used  subpattern  by  name.  For compatibility with earlier versions of
 | 
						|
       PCRE1, which had this facility before Perl, the syntax (?(name)...)  is
 | 
						|
       also  recognized.  Note,  however, that undelimited names consisting of
 | 
						|
       the letter R followed by digits are ambiguous (see the  following  sec-
 | 
						|
       tion).
 | 
						|
 | 
						|
       Rewriting the above example to use a named subpattern gives this:
 | 
						|
 | 
						|
         (?<OPEN> \( )?    [^()]+    (?(<OPEN>) \) )
 | 
						|
 | 
						|
       If  the  name used in a condition of this kind is a duplicate, the test
 | 
						|
       is applied to all subpatterns of the same name, and is true if any  one
 | 
						|
       of them has matched.
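
       A sketch of the named form, assuming Python's re module as a
       stand-in for PCRE2. That module spells the group (?P<OPEN>...) and
       the condition (?(OPEN)...), which corresponds to the undelimited
       (?(name)...) syntax mentioned above. The helper name open_paren is
       hypothetical.

```python
import re

NAMED_COND = re.compile(
    r"(?P<OPEN> \( )?    [^()]+    (?(OPEN) \) )", re.VERBOSE)


def open_paren(subject: str):
    """Return (matched, captured OPEN group) for a whole-subject match."""
    m = NAMED_COND.fullmatch(subject)
    return (m is not None, m.group("OPEN") if m else None)
```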
 | 
						|
 | 
						|
   Checking for pattern recursion
 | 
						|
 | 
						|
       "Recursion"  in  this sense refers to any subroutine-like call from one
 | 
						|
       part of the pattern to another, whether or not it  is  actually  recur-
 | 
						|
       sive.  See  the sections entitled "Recursive patterns" and "Subpatterns
 | 
						|
       as subroutines" below for details of recursion and subpattern calls.
 | 
						|
 | 
						|
       If a condition is the string (R), and there is no subpattern  with  the
 | 
						|
       name  R,  the condition is true if matching is currently in a recursion
 | 
						|
       or subroutine call to the whole pattern or any  subpattern.  If  digits
 | 
						|
       follow  the  letter  R,  and there is no subpattern with that name, the
 | 
						|
       condition is true if the most recent call is into a subpattern with the
 | 
						|
       given  number,  which must exist somewhere in the overall pattern. This
 | 
						|
       is a contrived example that is equivalent to a+b:
 | 
						|
 | 
						|
         ((?(R1)a+|(?1)b))
 | 
						|
 | 
						|
       However, in both cases, if there is a subpattern with a matching  name,
 | 
						|
       the  condition  tests  for  its  being set, as described in the section
 | 
						|
       above, instead of testing for recursion. For example, creating a  group
 | 
						|
       with  the  name  R1  by  adding (?<R1>) to the above pattern completely
 | 
						|
       changes its meaning.
 | 
						|
 | 
						|
       If a name preceded by ampersand follows the letter R, for example:
 | 
						|
 | 
						|
         (?(R&name)...)
 | 
						|
 | 
						|
       the condition is true if the most recent recursion is into a subpattern
 | 
						|
       of that name (which must exist within the pattern).
 | 
						|
 | 
						|
       This condition does not check the entire recursion stack. It tests only
 | 
						|
       the current level. If the name used in a condition of this  kind  is  a
 | 
						|
       duplicate, the test is applied to all subpatterns of the same name, and
 | 
						|
       is true if any one of them is the most recent recursion.
 | 
						|
 | 
						|
       At "top level", all these recursion test conditions are false.
 | 
						|
 | 
						|
   Defining subpatterns for use by reference only
 | 
						|
 | 
						|
       If the condition is the string (DEFINE), the condition is always false,
 | 
						|
       even  if there is a group with the name DEFINE. In this case, there may
 | 
						|
       be only one alternative in the subpattern. It is always skipped if con-
 | 
						|
       trol  reaches  this point in the pattern; the idea of DEFINE is that it
 | 
						|
       can be used to define subroutines that can  be  referenced  from  else-
 | 
						|
       where. (The use of subroutines is described below.) For example, a pat-
 | 
						|
       tern to match an IPv4 address such as "192.168.23.245" could be written
 | 
						|
       like this (ignore white space and line breaks):
 | 
						|
 | 
						|
         (?(DEFINE) (?<byte> 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d) )
 | 
						|
         \b (?&byte) (\.(?&byte)){3} \b
 | 
						|
 | 
						|
       The first part of the pattern is a DEFINE group inside which another
 | 
						|
       group named "byte" is defined. This matches an individual component  of
 | 
						|
       an  IPv4  address  (a number less than 256). When matching takes place,
 | 
						|
       this part of the pattern is skipped because DEFINE acts  like  a  false
 | 
						|
       condition.  The  rest of the pattern uses references to the named group
 | 
						|
       to match the four dot-separated components of an IPv4 address,  insist-
 | 
						|
       ing on a word boundary at each end.
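
       Engines without DEFINE can get a similar effect by building the
       pattern from a reusable fragment. The sketch below is such an
       emulation in Python's re module, which has neither DEFINE nor
       subroutine calls; it is textual substitution, not the PCRE2
       mechanism. The names BYTE, IPV4 and find_ipv4 are hypothetical.

```python
import re

# The "byte" definition from the pattern above, as a non-capturing group.
BYTE = r"(?:2[0-4]\d|25[0-5]|1\d\d|[1-9]?\d)"

# Each {BYTE} is replaced by the fragment text; {{3}} is a literal {3}.
IPV4 = re.compile(rf"\b{BYTE}(?:\.{BYTE}){{3}}\b")


def find_ipv4(text: str):
    """Return the first IPv4-like address in text, or None."""
    m = IPV4.search(text)
    return m.group(0) if m else None
```

       Unlike a subroutine call, substitution copies the fragment each time
       it is used, so the compiled pattern grows with every use.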
 | 
						|
 | 
						|
   Checking the PCRE2 version
 | 
						|
 | 
						|
       Programs  that link with a PCRE2 library can check the version by call-
 | 
						|
       ing pcre2_config() with appropriate arguments.  Users  of  applications
 | 
						|
       that  do  not have access to the underlying code cannot do this. A spe-
 | 
						|
       cial "condition" called VERSION exists to allow such users to  discover
 | 
						|
       which version of PCRE2 they are dealing with by using this condition to
 | 
						|
       match a string such as "yesno". VERSION must be followed either by  "="
 | 
						|
       or ">=" and a version number.  For example:
 | 
						|
 | 
						|
         (?(VERSION>=10.4)yes|no)
 | 
						|
 | 
						|
       This pattern matches "yes" if the PCRE2 version is greater than or
       equal to 10.4, or "no" otherwise. The fractional part of the version
       number may not contain more than two digits.
 | 
						|
 | 
						|
   Assertion conditions
 | 
						|
 | 
						|
       If  the  condition  is  not  in any of the above formats, it must be an
 | 
						|
       assertion.  This may be a positive or negative lookahead or  lookbehind
 | 
						|
       assertion.  Consider  this  pattern,  again  containing non-significant
 | 
						|
       white space, and with the two alternatives on the second line:
 | 
						|
 | 
						|
         (?(?=[^a-z]*[a-z])
 | 
						|
         \d{2}-[a-z]{3}-\d{2}  |  \d{2}-\d{2}-\d{2} )
 | 
						|
 | 
						|
       The condition  is  a  positive  lookahead  assertion  that  matches  an
 | 
						|
       optional  sequence of non-letters followed by a letter. In other words,
 | 
						|
       it tests for the presence of at least one letter in the subject.  If  a
 | 
						|
       letter  is found, the subject is matched against the first alternative;
 | 
						|
       otherwise it is  matched  against  the  second.  This  pattern  matches
 | 
						|
       strings  in  one  of the two forms dd-aaa-dd or dd-dd-dd, where aaa are
 | 
						|
       letters and dd are digits.
 | 
						|
 | 
						|
 | 
						|
COMMENTS
 | 
						|
 | 
						|
       There are two ways of including comments in patterns that are processed
 | 
						|
       by  PCRE2.  In  both  cases,  the start of the comment must not be in a
 | 
						|
       character class, nor in the middle of any  other  sequence  of  related
 | 
						|
       characters  such  as (?: or a subpattern name or number. The characters
 | 
						|
       that make up a comment play no part in the pattern matching.
 | 
						|
 | 
						|
       The sequence (?# marks the start of a comment that continues up to  the
 | 
						|
       next  closing parenthesis. Nested parentheses are not permitted. If the
 | 
						|
       PCRE2_EXTENDED option is set, an unescaped # character also  introduces
 | 
						|
       a  comment,  which in this case continues to immediately after the next
 | 
						|
       newline character or character sequence in the pattern.  Which  charac-
 | 
						|
       ters  are  interpreted as newlines is controlled by an option passed to
 | 
						|
       the compiling function or by a special sequence at  the  start  of  the
 | 
						|
       pattern,  as  described  in  the section entitled "Newline conventions"
 | 
						|
       above. Note that the end of this type of comment is a  literal  newline
 | 
						|
       sequence  in  the  pattern; escape sequences that happen to represent a
 | 
						|
       newline  do  not  count.  For  example,  consider  this  pattern   when
 | 
						|
       PCRE2_EXTENDED  is  set,  and  the default newline convention (a single
 | 
						|
       linefeed character) is in force:
 | 
						|
 | 
						|
         abc #comment \n still comment
 | 
						|
 | 
						|
       On encountering the # character, pcre2_compile() skips  along,  looking
 | 
						|
       for  a newline in the pattern. The sequence \n is still literal at this
 | 
						|
       stage, so it does not terminate the comment. Only an  actual  character
 | 
						|
       with the code value 0x0a (the default newline) does so.
 | 
						|
 | 
						|
 | 
						|
RECURSIVE PATTERNS
 | 
						|
 | 
						|
       Consider  the problem of matching a string in parentheses, allowing for
 | 
						|
       unlimited nested parentheses. Without the use of  recursion,  the  best
 | 
						|
       that  can  be  done  is  to use a pattern that matches up to some fixed
 | 
						|
       depth of nesting. It is not possible to  handle  an  arbitrary  nesting
 | 
						|
       depth.
 | 
						|
 | 
						|
       For some time, Perl has provided a facility that allows regular expres-
 | 
						|
       sions to recurse (amongst other things). It does this by  interpolating
 | 
						|
       Perl  code in the expression at run time, and the code can refer to the
 | 
						|
       expression itself. A Perl pattern using code interpolation to solve the
 | 
						|
       parentheses problem can be created like this:
 | 
						|
 | 
						|
         $re = qr{\( (?: (?>[^()]+) | (?p{$re}) )* \)}x;
 | 
						|
 | 
						|
       The (?p{...}) item interpolates Perl code at run time, and in this case
 | 
						|
       refers recursively to the pattern in which it appears.
 | 
						|
 | 
						|
       Obviously,  PCRE2  cannot  support  the  interpolation  of  Perl  code.
 | 
						|
       Instead,  it  supports  special syntax for recursion of the entire pat-
 | 
						|
       tern, and also for individual subpattern recursion. After its introduc-
 | 
						|
       tion  in  PCRE1  and  Python,  this  kind of recursion was subsequently
 | 
						|
       introduced into Perl at release 5.10.
 | 
						|
 | 
						|
       A special item that consists of (? followed by a  number  greater  than
 | 
						|
       zero  and  a  closing parenthesis is a recursive subroutine call of the
 | 
						|
       subpattern of the given number, provided that  it  occurs  inside  that
 | 
						|
       subpattern.  (If  not,  it is a non-recursive subroutine call, which is
 | 
						|
       described in the next section.) The special item  (?R)  or  (?0)  is  a
 | 
						|
       recursive call of the entire regular expression.
 | 
						|
 | 
						|
       This  PCRE2  pattern  solves the nested parentheses problem (assume the
 | 
						|
       PCRE2_EXTENDED option is set so that white space is ignored):
 | 
						|
 | 
						|
         \( ( [^()]++ | (?R) )* \)
 | 
						|
 | 
						|
       First it matches an opening parenthesis. Then it matches any number  of
 | 
						|
       substrings  which  can  either  be  a sequence of non-parentheses, or a
 | 
						|
       recursive match of the pattern itself (that is, a  correctly  parenthe-
 | 
						|
       sized substring).  Finally there is a closing parenthesis. Note the use
 | 
						|
       of a possessive quantifier to avoid backtracking into sequences of non-
 | 
						|
       parentheses.
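
       To make the control flow of the recursive pattern concrete, here is
       the same logic written as an ordinary recursive function. It is an
       illustrative sketch in Python, not a description of how PCRE2
       implements recursion, and the name match_nested is hypothetical.

```python
from typing import Optional


def match_nested(subject: str, pos: int = 0) -> Optional[int]:
    """Match one parenthesized group that starts at pos.

    Returns the offset just past the closing parenthesis, or None.
    """
    if pos >= len(subject) or subject[pos] != "(":
        return None                          # the leading \( must match
    pos += 1
    while pos < len(subject) and subject[pos] != ")":
        if subject[pos] == "(":
            end = match_nested(subject, pos)  # plays the role of (?R)
            if end is None:
                return None
            pos = end
        else:
            pos += 1                         # one character of [^()]++
    if pos < len(subject):
        return pos + 1                       # the final \)
    return None
```

       For example, match_nested("(ab(cd)ef)") returns 10, the length of
       the subject, while match_nested("(a(b)") returns None because the
       outer parenthesis is never closed.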
 | 
						|
 | 
						|
       If  this  were  part of a larger pattern, you would not want to recurse
 | 
						|
       the entire pattern, so instead you could use this:
 | 
						|
 | 
						|
         ( \( ( [^()]++ | (?1) )* \) )
 | 
						|
 | 
						|
       We have put the pattern into parentheses, and caused the  recursion  to
 | 
						|
       refer to them instead of the whole pattern.
 | 
						|
 | 
						|
       In  a  larger  pattern,  keeping  track  of  parenthesis numbers can be
 | 
						|
       tricky. This is made easier by the use of relative references.  Instead
 | 
						|
       of (?1) in the pattern above you can write (?-2) to refer to the second
 | 
						|
       most recently opened parentheses  preceding  the  recursion.  In  other
 | 
						|
       words,  a  negative  number counts capturing parentheses leftwards from
 | 
						|
       the point at which it is encountered.
 | 
						|
 | 
						|
       Be aware, however, that if duplicate subpattern numbers are in use,
       relative references refer to the earliest subpattern with the
       appropriate number. Consider, for example:
 | 
						|
 | 
						|
         (?|(a)|(b)) (c) (?-2)
 | 
						|
 | 
						|
       The first two capturing groups (a) and (b) are  both  numbered  1,  and
 | 
						|
       group  (c)  is  number  2. When the reference (?-2) is encountered, the
 | 
						|
       second most recently opened parentheses has the number 1, but it is the
 | 
						|
       first  such  group  (the (a) group) to which the recursion refers. This
 | 
						|
       would be the same if an absolute reference  (?1)  was  used.  In  other
 | 
						|
       words,  relative  references are just a shorthand for computing a group
 | 
						|
       number.
 | 
						|
 | 
						|
       It is also possible to refer to  subsequently  opened  parentheses,  by
 | 
						|
       writing  references  such  as (?+2). However, these cannot be recursive
 | 
						|
       because the reference is not inside the  parentheses  that  are  refer-
 | 
						|
       enced.  They are always non-recursive subroutine calls, as described in
 | 
						|
       the next section.
 | 
						|
 | 
						|
       An alternative approach is to use named parentheses.  The  Perl  syntax
 | 
						|
       for  this  is  (?&name);  PCRE1's earlier syntax (?P>name) is also sup-
 | 
						|
       ported. We could rewrite the above example as follows:
 | 
						|
 | 
						|
         (?<pn> \( ( [^()]++ | (?&pn) )* \) )
 | 
						|
 | 
						|
       If there is more than one subpattern with the same name,  the  earliest
 | 
						|
       one is used.
 | 
						|
 | 
						|
       The example pattern that we have been looking at contains nested unlim-
 | 
						|
       ited repeats, and so the use of a possessive  quantifier  for  matching
 | 
						|
       strings  of  non-parentheses  is important when applying the pattern to
 | 
						|
       strings that do not match. For example, when this pattern is applied to
 | 
						|
 | 
						|
         (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()
 | 
						|
 | 
						|
       it yields "no match" quickly. However, if a  possessive  quantifier  is
 | 
						|
       not  used, the match runs for a very long time indeed because there are
 | 
						|
       so many different ways the + and * repeats can carve  up  the  subject,
 | 
						|
       and all have to be tested before failure can be reported.
 | 
						|
 | 
						|
       At  the  end  of a match, the values of capturing parentheses are those
 | 
						|
       from the outermost level. If you want to obtain intermediate values,  a
 | 
						|
       callout function can be used (see below and the pcre2callout documenta-
 | 
						|
       tion). If the pattern above is matched against
 | 
						|
 | 
						|
         (ab(cd)ef)
 | 
						|
 | 
						|
       the value for the inner capturing parentheses  (numbered  2)  is  "ef",
 | 
						|
       which  is the last value taken on at the top level. If a capturing sub-
 | 
						|
       pattern is not matched at the top level, its final  captured  value  is
 | 
						|
       unset,  even  if  it was (temporarily) set at a deeper level during the
 | 
						|
       matching process.
 | 
						|
 | 
						|
       If there are more than 15 capturing parentheses in a pattern, PCRE2 has
 | 
						|
       to  obtain extra memory from the heap to store data during a recursion.
 | 
						|
       If  no  memory  can   be   obtained,   the   match   fails   with   the
 | 
						|
       PCRE2_ERROR_NOMEMORY error.
 | 
						|
 | 
						|
       Do  not  confuse  the (?R) item with the condition (R), which tests for
 | 
						|
       recursion.  Consider this pattern, which matches text in  angle  brack-
 | 
						|
       ets,  allowing for arbitrary nesting. Only digits are allowed in nested
 | 
						|
       brackets (that is, when recursing), whereas any characters are  permit-
 | 
						|
       ted at the outer level.
 | 
						|
 | 
						|
         < (?: (?(R) \d++  | [^<>]*+) | (?R)) * >
 | 
						|
 | 
						|
       In  this  pattern, (?(R) is the start of a conditional subpattern, with
 | 
						|
       two different alternatives for the recursive and  non-recursive  cases.
 | 
						|
       The (?R) item is the actual recursive call.
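
       The difference between the two branches of the condition can be
       sketched as a hand-written matcher in which a flag plays the role of
       the (R) test. This Python function is an illustration only; the name
       match_angle and its parameters are hypothetical, \d is taken to mean
       ASCII digits, and the match is attempted at one position only.

```python
def match_angle(subject, pos=0, recursing=False):
    """Return the index just past a bracketed group at pos, or None."""
    if pos >= len(subject) or subject[pos] != "<":
        return None
    pos += 1
    while True:
        start = pos
        if recursing:
            # (?(R) \d++ ...  inside a recursion only digits are accepted
            while pos < len(subject) and subject[pos] in "0123456789":
                pos += 1
        else:
            # ... | [^<>]*+)  at the outer level anything but < and >
            while pos < len(subject) and subject[pos] not in "<>":
                pos += 1
        if pos > start:
            continue
        # (?R)  the actual recursive call; deeper levels are "recursing"
        nested = match_angle(subject, pos, recursing=True)
        if nested is None:
            break
        pos = nested
    if pos < len(subject) and subject[pos] == ">":
        return pos + 1
    return None
```

       With this sketch, "<abc<123>def>" is accepted, whereas "<abc<12a>>" is
       rejected because a letter appears inside a nested pair of brackets.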
 | 
						|
 | 
						|
   Differences in recursion processing between PCRE2 and Perl
 | 
						|
 | 
						|
       Recursion  processing in PCRE2 differs from Perl in two important ways.
 | 
						|
       In PCRE2 (like Python, but unlike Perl), a recursive subpattern call is
 | 
						|
       always treated as an atomic group. That is, once it has matched some of
 | 
						|
       the subject string, it is never re-entered, even if it contains untried
 | 
						|
       alternatives  and  there  is a subsequent matching failure. This can be
 | 
						|
       illustrated by the following pattern, which purports to match a  palin-
 | 
						|
       dromic  string  that contains an odd number of characters (for example,
 | 
						|
       "a", "aba", "abcba", "abcdcba"):
 | 
						|
 | 
						|
         ^(.|(.)(?1)\2)$
 | 
						|
 | 
						|
       The idea is that it either matches a single character, or two identical
 | 
						|
       characters surrounding a sub-palindrome. In Perl, this pattern works;
       in PCRE2 it does not if the subject string is longer than three
       characters. Consider the subject string "abcba":
 | 
						|
 | 
						|
       At  the  top level, the first character is matched, but as it is not at
 | 
						|
       the end of the string, the first alternative fails; the second alterna-
 | 
						|
       tive is taken and the recursion kicks in. The recursive call to subpat-
 | 
						|
       tern 1 successfully matches the next character ("b").  (Note  that  the
 | 
						|
       beginning and end of line tests are not part of the recursion).
 | 
						|
 | 
						|
       Back  at  the top level, the next character ("c") is compared with what
 | 
						|
       subpattern 2 matched, which was "a". This fails. Because the  recursion
 | 
						|
       is  treated  as  an atomic group, there are now no backtracking points,
 | 
						|
       and so the entire match fails. (Perl is able, at  this  point,  to  re-
 | 
						|
       enter  the  recursion  and try the second alternative.) However, if the
 | 
						|
       pattern is written with the alternatives in the other order, things are
 | 
						|
       different:
 | 
						|
 | 
						|
         ^((.)(?1)\2|.)$
 | 
						|
 | 
						|
       This  time,  the recursing alternative is tried first, and continues to
 | 
						|
       recurse until it runs out of characters, at which point  the  recursion
 | 
						|
       fails.  But  this  time  we  do  have another alternative to try at the
 | 
						|
       higher level. That is the big difference:  in  the  previous  case  the
 | 
						|
       remaining  alternative is at a deeper recursion level, which PCRE2 can-
 | 
						|
       not use.
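
       The effect of atomic recursion on these two patterns can be modelled
       with a small simulation. The Python code below is a sketch written for
       this explanation, not PCRE2 or Perl code: the function names and the
       flags atomic and recurse_first are hypothetical. Group 1 is modelled
       as a generator of possible end positions; an atomic recursive call
       keeps only the first of them, whereas a backtracking call can try the
       others.

```python
def group1_ends(s, pos, atomic, recurse_first):
    """Yield end positions of group 1 when it starts matching at pos."""

    def single():                        # the alternative "."
        if pos < len(s):
            yield pos + 1

    def wrapped():                       # the alternative "(.)(?1)\2"
        if pos >= len(s):
            return
        c = s[pos]                       # captured by group 2
        inner = group1_ends(s, pos + 1, atomic, recurse_first)
        if atomic:
            first = next(inner, None)    # no re-entry after the first result
            inner = [] if first is None else [first]
        for end in inner:
            if end < len(s) and s[end] == c:
                yield end + 1            # \2 matched the same character

    order = (wrapped, single) if recurse_first else (single, wrapped)
    for alternative in order:
        yield from alternative()


def matches(s, atomic, recurse_first):
    """Model ^(...)$ : some end position of group 1 must be the string end."""
    return any(end == len(s)
               for end in group1_ends(s, 0, atomic, recurse_first))
```

       In this model, matches("abcba", atomic=True, recurse_first=False)
       fails, as described for the first pattern, while the same call with
       atomic=False or with recurse_first=True succeeds.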
 | 
						|
 | 
						|
       To change the pattern so that it matches all palindromic  strings,  not
 | 
						|
       just  those  with an odd number of characters, it is tempting to change
 | 
						|
       the pattern to this:
 | 
						|
 | 
						|
         ^((.)(?1)\2|.?)$
 | 
						|
 | 
						|
       Again, this works in Perl, but not in PCRE2, and for the  same  reason.
 | 
						|
       When  a  deeper  recursion has matched a single character, it cannot be
 | 
						|
       entered again in order to match an empty string.  The  solution  is  to
 | 
						|
       separate  the two cases, and write out the odd and even cases as alter-
 | 
						|
       natives at the higher level:
 | 
						|
 | 
						|
         ^(?:((.)(?1)\2|)|((.)(?3)\4|.))
 | 
						|
 | 
						|
       If you want to match typical palindromic phrases, the  pattern  has  to
 | 
						|
       ignore all non-word characters, which can be done like this:
 | 
						|
 | 
						|
         ^\W*+(?:((.)\W*+(?1)\W*+\2|)|((.)\W*+(?3)\W*+\4|\W*+.\W*+))\W*+$
 | 
						|
 | 
						|
       If  run  with  the  PCRE2_CASELESS option, this pattern matches phrases
 | 
						|
       such as "A man, a plan, a canal: Panama!" and it works  in  both  PCRE2
 | 
						|
       and  Perl.  Note the use of the possessive quantifier *+ to avoid back-
 | 
						|
       tracking into sequences of non-word  characters.  Without  this,  PCRE2
 | 
						|
       takes a great deal longer (ten times or more) to match typical phrases,
 | 
						|
       and Perl takes so long that you think it has gone into a loop.
 | 
						|
 | 
						|
       WARNING: The palindrome-matching patterns above work only if  the  sub-
 | 
						|
       ject  string  does not start with a palindrome that is shorter than the
 | 
						|
       entire string.  For example, although "abcba" is correctly matched,  if
 | 
						|
       the  subject is "ababa", PCRE2 finds the palindrome "aba" at the start,
 | 
						|
       then fails at top level because the end of the string does not  follow.
 | 
						|
       Once  again, it cannot jump back into the recursion to try other alter-
 | 
						|
       natives, so the entire match fails.
 | 
						|
 | 
						|
       The second way in which PCRE2 and Perl differ in their recursion
       processing is in the handling of captured values. In Perl, when a
       subpattern is called recursively or as a subroutine (see the next
       section), it has no access to any values that were captured outside the
       recursion, whereas in PCRE2 these values can be referenced. Consider this
       pattern:
 | 
						|
 | 
						|
         ^(.)(\1|a(?2))
 | 
						|
 | 
						|
       In PCRE2, this pattern matches "bab". The first capturing parentheses
       match "b"; then in the second group, the back reference \1 (which holds
       "b") fails to match the next character "a", so the second alternative
       matches "a" and then recurses. In the recursion, \1 does now match "b"
       and so the whole match succeeds. In Perl, the pattern fails to match
       because inside the recursive call \1 cannot access the externally set
       value.
 | 
						|
 | 
						|
 | 
						|
SUBPATTERNS AS SUBROUTINES
 | 
						|
 | 
						|
       If the syntax for a recursive subpattern call (either by number  or  by
 | 
						|
       name)  is  used outside the parentheses to which it refers, it operates
 | 
						|
       like a subroutine in a programming language. The called subpattern  may
 | 
						|
       be  defined  before or after the reference. A numbered reference can be
 | 
						|
       absolute or relative, as in these examples:
 | 
						|
 | 
						|
         (...(absolute)...)...(?2)...
         (...(relative)...)...(?-1)...
         (...(?+1)...(relative)...
 | 
						|
 | 
						|
       An earlier example pointed out that the pattern
 | 
						|
 | 
						|
         (sens|respons)e and \1ibility
 | 
						|
 | 
						|
       matches "sense and sensibility" and "response and responsibility",  but
 | 
						|
       not "sense and responsibility". If instead the pattern
 | 
						|
 | 
						|
         (sens|respons)e and (?1)ibility
 | 
						|
 | 
						|
       is  used, it does match "sense and responsibility" as well as the other
 | 
						|
       two strings. Another example is  given  in  the  discussion  of  DEFINE
 | 
						|
       above.
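
       The difference between a back reference and a subroutine call can be
       tried out with Python's standard re module. That module has no
       subroutine calls, so this sketch approximates (?1) by repeating the
       text of group 1 as a non-capturing group; that is an assumption made
       for illustration and it does not reproduce the atomic behaviour of a
       real PCRE2 subroutine call.

```python
import re

GROUP = r"(sens|respons)"

# \1 must match the same text that group 1 captured.
backref = re.compile(GROUP + r"e and \1ibility")

# (?1) re-runs the pattern of group 1; emulated here by repeating its text.
subroutine = re.compile(GROUP + r"e and (?:sens|respons)ibility")

for phrase in ("sense and sensibility", "sense and responsibility"):
    print(phrase,
          bool(backref.fullmatch(phrase)),
          bool(subroutine.fullmatch(phrase)))
```

       The back reference version rejects "sense and responsibility", whereas
       the emulated subroutine version accepts it.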
 | 
						|
 | 
						|
       All  subroutine  calls, whether recursive or not, are always treated as
 | 
						|
       atomic groups. That is, once a subroutine has matched some of the  sub-
 | 
						|
       ject string, it is never re-entered, even if it contains untried alter-
 | 
						|
       natives and there is  a  subsequent  matching  failure.  Any  capturing
 | 
						|
       parentheses  that  are  set  during the subroutine call revert to their
 | 
						|
       previous values afterwards.
 | 
						|
 | 
						|
       Processing options such as case-independence are fixed when  a  subpat-
 | 
						|
       tern  is defined, so if it is used as a subroutine, such options cannot
 | 
						|
       be changed for different calls. For example, consider this pattern:
 | 
						|
 | 
						|
         (abc)(?i:(?-1))
 | 
						|
 | 
						|
       It matches "abcabc". It does not match "abcABC" because the  change  of
 | 
						|
       processing option does not affect the called subpattern.
 | 
						|
 | 
						|
 | 
						|
ONIGURUMA SUBROUTINE SYNTAX
 | 
						|
 | 
						|
       For  compatibility with Oniguruma, the non-Perl syntax \g followed by a
 | 
						|
       name or a number enclosed either in angle brackets or single quotes, is
 | 
						|
       an  alternative  syntax  for  referencing a subpattern as a subroutine,
 | 
						|
       possibly recursively. Here are two of the examples used above,  rewrit-
 | 
						|
       ten using this syntax:
 | 
						|
 | 
						|
         (?<pn> \( ( (?>[^()]+) | \g<pn> )* \) )
         (sens|respons)e and \g'1'ibility
 | 
						|
 | 
						|
       PCRE2  supports an extension to Oniguruma: if a number is preceded by a
 | 
						|
       plus or a minus sign it is taken as a relative reference. For example:
 | 
						|
 | 
						|
         (abc)(?i:\g<-1>)
 | 
						|
 | 
						|
       Note that \g{...} (Perl syntax) and \g<...> (Oniguruma syntax) are  not
 | 
						|
       synonymous.  The former is a back reference; the latter is a subroutine
 | 
						|
       call.
 | 
						|
 | 
						|
 | 
						|
CALLOUTS
 | 
						|
 | 
						|
       Perl has a feature whereby using the sequence (?{...}) causes arbitrary
 | 
						|
       Perl  code to be obeyed in the middle of matching a regular expression.
 | 
						|
       This makes it possible, amongst other things, to extract different sub-
 | 
						|
       strings that match the same pair of parentheses when there is a repeti-
 | 
						|
       tion.
 | 
						|
 | 
						|
       PCRE2 provides a similar feature, but of course it  cannot  obey  arbi-
 | 
						|
       trary  Perl  code. The feature is called "callout". The caller of PCRE2
 | 
						|
       provides an external function by putting its entry  point  in  a  match
 | 
						|
       context  using  the function pcre2_set_callout(), and then passing that
 | 
						|
       context to pcre2_match() or pcre2_dfa_match(). If no match  context  is
 | 
						|
       passed, or if the callout entry point is set to NULL, callouts are dis-
 | 
						|
       abled.
 | 
						|
 | 
						|
       Within a regular expression, (?C<arg>) indicates a point at  which  the
 | 
						|
       external  function  is  to  be  called. There are two kinds of callout:
 | 
						|
       those with a numerical argument and those with a string argument.  (?C)
 | 
						|
       on  its  own with no argument is treated as (?C0). A numerical argument
 | 
						|
       allows the  application  to  distinguish  between  different  callouts.
 | 
						|
       String  arguments  were added for release 10.20 to make it possible for
 | 
						|
       script languages that use PCRE2 to embed short scripts within  patterns
 | 
						|
       in a similar way to Perl.
 | 
						|
 | 
						|
       During matching, when PCRE2 reaches a callout point, the external func-
 | 
						|
       tion is called. It is provided with the number or  string  argument  of
 | 
						|
       the  callout, the position in the pattern, and one item of data that is
 | 
						|
       also set in the match block. The callout function may cause matching to
 | 
						|
       proceed, to backtrack, or to fail.
 | 
						|
 | 
						|
       By  default,  PCRE2  implements  a  number of optimizations at matching
 | 
						|
       time, and one side-effect is that sometimes callouts  are  skipped.  If
 | 
						|
       you  need all possible callouts to happen, you need to set options that
 | 
						|
       disable the relevant optimizations. More details, including a  complete
 | 
						|
       description  of  the programming interface to the callout function, are
 | 
						|
       given in the pcre2callout documentation.
 | 
						|
 | 
						|
   Callouts with numerical arguments
 | 
						|
 | 
						|
       If you just want to have  a  means  of  identifying  different  callout
 | 
						|
       points,  put  a  number  less than 256 after the letter C. For example,
 | 
						|
       this pattern has two callout points:
 | 
						|
 | 
						|
         (?C1)abc(?C2)def
 | 
						|
 | 
						|
       If the PCRE2_AUTO_CALLOUT flag is passed to pcre2_compile(),  numerical
 | 
						|
       callouts  are  automatically installed before each item in the pattern.
 | 
						|
       They are all numbered 255. If there is a conditional group in the  pat-
 | 
						|
       tern whose condition is an assertion, an additional callout is inserted
 | 
						|
       just before the condition. An explicit callout may also be set at  this
 | 
						|
       position, as in this example:
 | 
						|
 | 
						|
         (?(?C9)(?=a)abc|def)
 | 
						|
 | 
						|
       Note that this applies only to assertion conditions, not to other types
 | 
						|
       of condition.
 | 
						|
 | 
						|
   Callouts with string arguments
 | 
						|
 | 
						|
       A delimited string may be used instead of a number as a  callout  argu-
 | 
						|
       ment.  The  starting  delimiter  must be one of ` ' " ^ % # $ { and the
 | 
						|
       ending delimiter is the same as the start, except for {, where the end-
 | 
						|
       ing  delimiter  is  }.  If  the  ending  delimiter is needed within the
 | 
						|
       string, it must be doubled. For example:
 | 
						|
 | 
						|
         (?C'ab ''c'' d')xyz(?C{any text})pqr
 | 
						|
 | 
						|
       The doubling is removed before the string  is  passed  to  the  callout
 | 
						|
       function.
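
       The delimiter and doubling rules can be sketched as a small parser.
       This Python function is an illustration of the rules stated above, not
       the code PCRE2 uses; the name parse_callout_string is hypothetical and
       the input is assumed to begin at the starting delimiter, that is, just
       after "(?C".

```python
def parse_callout_string(text):
    """Return (string, index just past the ending delimiter)."""
    closers = {c: c for c in "`'\"^%#$"}   # these delimiters close themselves
    closers["{"] = "}"                     # "{" is closed by "}"
    if not text or text[0] not in closers:
        raise ValueError("not a valid starting delimiter")
    end = closers[text[0]]
    out = []
    i = 1
    while i < len(text):
        if text[i] == end:
            if text[i + 1:i + 2] == end:   # doubled delimiter: keep one copy
                out.append(end)
                i += 2
                continue
            return "".join(out), i + 1     # single delimiter ends the string
        out.append(text[i])
        i += 1
    raise ValueError("missing ending delimiter")
```

       Applied to the text 'ab ''c'' d' from the example, the sketch returns
       the string ab 'c' d, with the doubling removed.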
 | 
						|
 | 
						|
 | 
						|
BACKTRACKING CONTROL
 | 
						|
 | 
						|
       Perl  5.10 introduced a number of "Special Backtracking Control Verbs",
 | 
						|
       which are still described in the Perl  documentation  as  "experimental
 | 
						|
       and  subject to change or removal in a future version of Perl". It goes
 | 
						|
       on to say: "Their usage in production code should  be  noted  to  avoid
 | 
						|
       problems during upgrades." The same remarks apply to the PCRE2 features
 | 
						|
       described in this section.
 | 
						|
 | 
						|
       The new verbs make use of what was previously invalid syntax: an  open-
 | 
						|
       ing parenthesis followed by an asterisk. They are generally of the form
 | 
						|
       (*VERB) or (*VERB:NAME). Some verbs take either form, possibly behaving
 | 
						|
       differently depending on whether or not a name is present.
 | 
						|
 | 
						|
       By  default,  for  compatibility  with  Perl, a name is any sequence of
 | 
						|
       characters that does not include a closing parenthesis. The name is not
 | 
						|
       processed  in  any  way,  and  it  is not possible to include a closing
 | 
						|
       parenthesis  in  the  name.   This  can  be  changed  by  setting   the
 | 
						|
       PCRE2_ALT_VERBNAMES  option,  but the result is no longer Perl-compati-
 | 
						|
       ble.
 | 
						|
 | 
						|
       When PCRE2_ALT_VERBNAMES is set, backslash  processing  is  applied  to
 | 
						|
       verb  names  and  only  an unescaped closing parenthesis terminates the
 | 
						|
       name. However, the only backslash items that are permitted are \Q,  \E,
 | 
						|
       and  sequences such as \x{100} that define character code points. Char-
 | 
						|
       acter type escapes such as \d are faulted.
 | 
						|
 | 
						|
       A closing parenthesis can be included in a name either as \) or between
 | 
						|
       \Q  and  \E. In addition to backslash processing, if the PCRE2_EXTENDED
 | 
						|
       option is also set, unescaped whitespace in verb names is skipped,  and
 | 
						|
       #-comments  are  recognized,  exactly  as  in  the rest of the pattern.
 | 
						|
       PCRE2_EXTENDED does not affect verb names unless PCRE2_ALT_VERBNAMES is
 | 
						|
       also set.
 | 
						|
 | 
						|
       The  maximum  length of a name is 255 in the 8-bit library and 65535 in
 | 
						|
       the 16-bit and 32-bit libraries. If the name is empty, that is, if  the
 | 
						|
       closing  parenthesis immediately follows the colon, the effect is as if
 | 
						|
       the colon were not there. Any number of these verbs may occur in a pat-
 | 
						|
       tern.
 | 
						|
 | 
						|
       Since  these  verbs  are  specifically related to backtracking, most of
 | 
						|
       them can be used only when the pattern is to be matched using the
       traditional matching function, because it uses a backtracking algorithm.
 | 
						|
       With the exception of (*FAIL), which behaves like  a  failing  negative
 | 
						|
       assertion, the backtracking control verbs cause an error if encountered
 | 
						|
       by the DFA matching function.
 | 
						|
 | 
						|
       The behaviour of these verbs in repeated  groups,  assertions,  and  in
 | 
						|
       subpatterns called as subroutines (whether or not recursively) is docu-
 | 
						|
       mented below.
 | 
						|
 | 
						|
   Optimizations that affect backtracking verbs
 | 
						|
 | 
						|
       PCRE2 contains some optimizations that are used to speed up matching by
 | 
						|
       running some checks at the start of each match attempt. For example, it
 | 
						|
       may know the minimum length of matching subject, or that  a  particular
 | 
						|
       character must be present. When one of these optimizations bypasses the
 | 
						|
       running of a match,  any  included  backtracking  verbs  will  not,  of
 | 
						|
       course, be processed. You can suppress the start-of-match optimizations
 | 
						|
       by setting the PCRE2_NO_START_OPTIMIZE option when  calling  pcre2_com-
 | 
						|
       pile(),  or by starting the pattern with (*NO_START_OPT). There is more
 | 
						|
       discussion of this option in the section entitled "Compiling a pattern"
 | 
						|
       in the pcre2api documentation.
 | 
						|
 | 
						|
       Experiments  with  Perl  suggest that it too has similar optimizations,
 | 
						|
       sometimes leading to anomalous results.
 | 
						|
 | 
						|
   Verbs that act immediately
 | 
						|
 | 
						|
       The following verbs act as soon as they are encountered. They  may  not
 | 
						|
       be followed by a name.
 | 
						|
 | 
						|
         (*ACCEPT)
 | 
						|
 | 
						|
       This  verb causes the match to end successfully, skipping the remainder
 | 
						|
       of the pattern. However, when it is inside a subpattern that is  called
 | 
						|
       as  a  subroutine, only that subpattern is ended successfully. Matching
 | 
						|
       then continues at the outer level. If (*ACCEPT) is triggered in a
       positive assertion, the assertion succeeds; in a negative assertion, the
 | 
						|
       assertion fails.
 | 
						|
 | 
						|
       If (*ACCEPT) is inside capturing parentheses, the data so far  is  cap-
 | 
						|
       tured. For example:
 | 
						|
 | 
						|
         A((?:A|B(*ACCEPT)|C)D)
 | 
						|
 | 
						|
       This  matches  "AB", "AAD", or "ACD"; when it matches "AB", "B" is cap-
 | 
						|
       tured by the outer parentheses.
 | 
						|
 | 
						|
         (*FAIL) or (*F)
 | 
						|
 | 
						|
       This verb causes a matching failure, forcing backtracking to occur.  It
 | 
						|
       is  equivalent to (?!) but easier to read. The Perl documentation notes
 | 
						|
       that it is probably useful only when combined  with  (?{})  or  (??{}).
 | 
						|
       Those  are, of course, Perl features that are not present in PCRE2. The
 | 
						|
       nearest equivalent is the callout feature, as for example in this  pat-
 | 
						|
       tern:
 | 
						|
 | 
						|
         a+(?C)(*FAIL)
 | 
						|
 | 
						|
       A  match  with the string "aaaa" always fails, but the callout is taken
 | 
						|
       before each backtrack happens (in this example, 10 times).
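
       The figure of 10 can be checked by counting. The Python sketch below
       models only the behaviour described here: every starting position is
       tried, a+ first takes the longest run of "a" and then gives back one
       character at a time, and the callout is reached once for each length.
       The function name count_callouts is hypothetical, and the model
       assumes that no optimization skips any of these steps.

```python
def count_callouts(subject):
    """Count how often (?C) is reached when a+(?C)(*FAIL) is applied."""
    calls = 0
    for start in range(len(subject)):
        run = 0
        while start + run < len(subject) and subject[start + run] == "a":
            run += 1
        # a+ matches run, run-1, ..., 1 characters; each attempt reaches the
        # callout and is then rejected by (*FAIL), forcing a backtrack.
        for _length in range(run, 0, -1):
            calls += 1
    return calls
```

       For "aaaa" the four starting positions contribute 4, 3, 2, and 1
       callouts respectively, which gives the total of 10.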
 | 
						|
 | 
						|
   Recording which path was taken
 | 
						|
 | 
						|
       There is one verb whose main purpose  is  to  track  how  a  match  was
 | 
						|
       arrived  at,  though  it  also  has a secondary use in conjunction with
 | 
						|
       advancing the match starting point (see (*SKIP) below).
 | 
						|
 | 
						|
         (*MARK:NAME) or (*:NAME)
 | 
						|
 | 
						|
       A name is always  required  with  this  verb.  There  may  be  as  many
 | 
						|
       instances  of  (*MARK) as you like in a pattern, and their names do not
 | 
						|
       have to be unique.
 | 
						|
 | 
						|
       When a match succeeds, the name of the  last-encountered  (*MARK:NAME),
 | 
						|
       (*PRUNE:NAME),  or  (*THEN:NAME) on the matching path is passed back to
 | 
						|
       the caller as described in  the  section  entitled  "Other  information
 | 
						|
       about  the  match" in the pcre2api documentation. Here is an example of
 | 
						|
       pcre2test output, where the "mark" modifier requests the retrieval  and
 | 
						|
       outputting of (*MARK) data:
 | 
						|
 | 
						|
           re> /X(*MARK:A)Y|X(*MARK:B)Z/mark
         data> XY
          0: XY
         MK: A
         XZ
          0: XZ
         MK: B
 | 
						|
 | 
						|
       The (*MARK) name is tagged with "MK:" in this output, and in this exam-
 | 
						|
       ple it indicates which of the two alternatives matched. This is a  more
 | 
						|
       efficient  way of obtaining this information than putting each alterna-
 | 
						|
       tive in its own capturing parentheses.
 | 
						|
 | 
						|
       If a verb with a name is encountered in a positive  assertion  that  is
 | 
						|
       true,  the  name  is recorded and passed back if it is the last-encoun-
 | 
						|
       tered. This does not happen for negative assertions or failing positive
 | 
						|
       assertions.
 | 
						|
 | 
						|
       After  a  partial match or a failed match, the last encountered name in
 | 
						|
       the entire match process is returned. For example:
 | 
						|
 | 
						|
           re> /X(*MARK:A)Y|X(*MARK:B)Z/mark
         data> XP
         No match, mark = B
 | 
						|
 | 
						|
       Note that in this unanchored example the  mark  is  retained  from  the
 | 
						|
       match attempt that started at the letter "X" in the subject. Subsequent
 | 
						|
       match attempts starting at "P" and then with an empty string do not get
 | 
						|
       as far as the (*MARK) item, but nevertheless do not reset it.
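
       The way the mark is retained can be modelled for this particular
       pattern. The Python sketch below is not PCRE2 code: the function name
       match_with_mark is hypothetical, and the two alternatives of
       X(*MARK:A)Y|X(*MARK:B)Z are written out by hand as (text before the
       mark, mark name, text after the mark).

```python
def match_with_mark(subject):
    """Return (matched text or None, last mark name encountered)."""
    alternatives = [("X", "A", "Y"), ("X", "B", "Z")]
    mark = None
    for start in range(len(subject) + 1):      # unanchored: try each start
        for before, name, after in alternatives:
            if subject.startswith(before, start):
                mark = name                    # (*MARK:NAME) was passed
                rest = start + len(before)
                if subject.startswith(after, rest):
                    return subject[start:rest + len(after)], mark
    return None, mark                          # no match: mark is kept
```

       For the subject "XP" the sketch reports no match with the mark "B",
       because later attempts never reach a (*MARK) and so do not change it.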
 | 
						|
 | 
						|
       If  you  are  interested  in  (*MARK)  values after failed matches, you
 | 
						|
       should probably set the PCRE2_NO_START_OPTIMIZE option (see  above)  to
 | 
						|
       ensure that the match is always attempted.
 | 
						|
 | 
						|
   Verbs that act after backtracking
 | 
						|
 | 
						|
       The following verbs do nothing when they are encountered. Matching con-
 | 
						|
       tinues with what follows, but if there is no subsequent match,  causing
 | 
						|
       a  backtrack  to  the  verb, a failure is forced. That is, backtracking
 | 
						|
       cannot pass to the left of the verb. However, when one of  these  verbs
 | 
						|
       appears inside an atomic group (which includes any group that is called
 | 
						|
       as a subroutine) or in an assertion that is true, its  effect  is  con-
 | 
						|
       fined  to that group, because once the group has been matched, there is
 | 
						|
       never any backtracking into it. In this situation, backtracking has  to
 | 
						|
       jump to the left of the entire atomic group or assertion.
 | 
						|
 | 
						|
       These  verbs  differ  in exactly what kind of failure occurs when back-
 | 
						|
       tracking reaches them. The behaviour described below  is  what  happens
 | 
						|
       when  the  verb is not in a subroutine or an assertion. Subsequent sec-
 | 
						|
       tions cover these special cases.
 | 
						|
 | 
						|
         (*COMMIT)
 | 
						|
 | 
						|
       This verb, which may not be followed by a name, causes the whole  match
 | 
						|
       to fail outright if there is a later matching failure that causes back-
 | 
						|
       tracking to reach it. Even if the pattern  is  unanchored,  no  further
 | 
						|
       attempts to find a match by advancing the starting point take place. If
 | 
						|
       (*COMMIT) is the only backtracking verb that is  encountered,  once  it
 | 
						|
       has  been  passed  pcre2_match() is committed to finding a match at the
 | 
						|
       current starting point, or not at all. For example:
 | 
						|
 | 
						|
         a+(*COMMIT)b
 | 
						|
 | 
						|
       This matches "xxaab" but not "aacaab". It can be thought of as  a  kind
 | 
						|
       of dynamic anchor, or "I've started, so I must finish." The name of the
 | 
						|
       most recently passed (*MARK) in the path is passed back when  (*COMMIT)
 | 
						|
       forces a match failure.
 | 
						|
 | 
						|
       If  there  is more than one backtracking verb in a pattern, a different
 | 
						|
       one that follows (*COMMIT) may be triggered first,  so  merely  passing
 | 
						|
       (*COMMIT) during a match does not always guarantee that a match must be
 | 
						|
       at this starting point.
 | 
						|
 | 
						|
       Note that (*COMMIT) at the start of a pattern is not  the  same  as  an
 | 
						|
       anchor,  unless PCRE2's start-of-match optimizations are turned off, as
 | 
						|
       shown in this output from pcre2test:
 | 
						|
 | 
						|
           re> /(*COMMIT)abc/
         data> xyzabc
          0: abc
         data>
         re> /(*COMMIT)abc/no_start_optimize
         data> xyzabc
         No match
 | 
						|
 | 
						|
       For the first pattern, PCRE2 knows that any match must start with  "a",
 | 
						|
       so  the optimization skips along the subject to "a" before applying the
 | 
						|
       pattern to the first set of data. The match attempt then succeeds.  The
 | 
						|
       second  pattern disables the optimization that skips along to the first
 | 
						|
       character. The pattern is now applied  starting  at  "x",  and  so  the
 | 
						|
       (*COMMIT)  causes  the  match to fail without trying any other starting
 | 
						|
       points.
 | 
						|
 | 
						|
         (*PRUNE) or (*PRUNE:NAME)
 | 
						|
 | 
						|
       This verb causes the match to fail at the current starting position  in
 | 
						|
       the subject if there is a later matching failure that causes backtrack-
 | 
						|
       ing to reach it. If the pattern is unanchored, the  normal  "bumpalong"
 | 
						|
       advance  to  the next starting character then happens. Backtracking can
 | 
						|
       occur as usual to the left of (*PRUNE), before it is reached,  or  when
 | 
						|
       matching  to  the  right  of  (*PRUNE), but if there is no match to the
 | 
						|
       right, backtracking cannot cross (*PRUNE). In simple cases, the use  of
 | 
						|
       (*PRUNE)  is just an alternative to an atomic group or possessive quan-
 | 
						|
       tifier, but there are some uses of (*PRUNE) that cannot be expressed in
 | 
						|
       any  other  way. In an anchored pattern (*PRUNE) has the same effect as
 | 
						|
       (*COMMIT).
 | 
						|
 | 
						|
       The behaviour of (*PRUNE:NAME) is not the same as (*MARK:NAME)(*PRUNE).
       It is like (*MARK:NAME) in that the name is remembered for passing back
       to the caller. However, (*SKIP:NAME) searches only for names set with
       (*MARK), ignoring those set by (*PRUNE) or (*THEN).
 | 
						|
 | 
						|
         (*SKIP)
 | 
						|
 | 
						|
       This verb, when given without a name, is like (*PRUNE), except that  if
 | 
						|
       the  pattern  is unanchored, the "bumpalong" advance is not to the next
 | 
						|
       character, but to the position in the subject where (*SKIP) was encoun-
 | 
						|
       tered.  (*SKIP)  signifies that whatever text was matched leading up to
 | 
						|
       it cannot be part of a successful match. Consider:
 | 
						|
 | 
						|
         a+(*SKIP)b
 | 
						|
 | 
						|
       If the subject is "aaaac...", after the first match attempt fails
       (starting at the first character in the string), the starting point
       skips on to start the next attempt at "c". Note that a possessive
       quantifier does not have the same effect as this example; although it
       would suppress backtracking during the first match attempt, the second
       attempt would start at the second character instead of skipping on to
       "c".
 | 
						|
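       The difference in starting points can be sketched with a toy model
       in C. This is only an illustration: attempt_starts() is a
       hypothetical helper that hard-codes the two patterns a+(*SKIP)b and
       a++b, ignores PCRE2's start-of-match optimizations, and is not how
       the library is implemented. It records the subject offset at which
       each match attempt begins.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy model only (hypothetical helper, not a PCRE2 function).
   use_skip = 1 models a+(*SKIP)b; use_skip = 0 models a++b.
   Writes the start offset of each match attempt to starts[] and
   returns how many were written, or -1 if the pattern matched. */
int attempt_starts(const char *subject, int use_skip,
                   size_t *starts, size_t max_starts)
{
    size_t len, start = 0;
    int n = 0;

    assert(subject != NULL && starts != NULL);
    len = strlen(subject);

    while (start < len && (size_t)n < max_starts) {
        size_t pos = start;
        starts[n++] = start;

        while (pos < len && subject[pos] == 'a')  /* a+ taken greedily */
            pos++;

        if (pos == start) {       /* no "a" here: the verb is not reached */
            start++;              /* normal bumpalong */
            continue;
        }
        if (pos < len && subject[pos] == 'b')
            return -1;            /* overall match */

        /* Failure after the run of "a"s. Neither form backtracks into
           the run; only the next starting point differs. */
        start = use_skip ? pos : start + 1;
    }
    return n;
}
```

       Under these assumptions, the subject "aaaac" gives attempts at
       offsets 0 and 4 for the (*SKIP) form, but at offsets 0, 1, 2, 3,
       and 4 for the possessive form.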
 | 
						|
         (*SKIP:NAME)
 | 
						|
 | 
						|
       When (*SKIP) has an associated name, its behaviour is modified. When it
 | 
						|
       is triggered, the previous path through the pattern is searched for the
 | 
						|
       most recent (*MARK) that has the  same  name.  If  one  is  found,  the
 | 
						|
       "bumpalong" advance is to the subject position that corresponds to that
 | 
						|
       (*MARK) instead of to where (*SKIP) was encountered. If no (*MARK) with
 | 
						|
       a matching name is found, the (*SKIP) is ignored.
 | 
						|
 | 
						|
       Note  that (*SKIP:NAME) searches only for names set by (*MARK:NAME). It
 | 
						|
       ignores names that are set by (*PRUNE:NAME) or (*THEN:NAME).
 | 
						|
 | 
						|
         (*THEN) or (*THEN:NAME)
 | 
						|
 | 
						|
       This verb causes a skip to the next innermost  alternative  when  back-
 | 
						|
       tracking  reaches  it.  That  is,  it  cancels any further backtracking
 | 
						|
       within the current alternative. Its name  comes  from  the  observation
 | 
						|
       that it can be used for a pattern-based if-then-else block:
 | 
						|
 | 
						|
         ( COND1 (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ ) ...
 | 
						|
 | 
						|
       If  the COND1 pattern matches, FOO is tried (and possibly further items
 | 
						|
       after the end of the group if FOO succeeds); on  failure,  the  matcher
 | 
						|
       skips  to  the second alternative and tries COND2, without backtracking
 | 
						|
       into COND1. If that succeeds and BAR fails, COND3 is tried.  If  subse-
 | 
						|
       quently  BAZ fails, there are no more alternatives, so there is a back-
 | 
						|
       track to whatever came before the  entire  group.  If  (*THEN)  is  not
 | 
						|
       inside an alternation, it acts like (*PRUNE).
 | 
						|
 | 
						|
       The behaviour of (*THEN:NAME) is not the same as (*MARK:NAME)(*THEN).
       It is like (*MARK:NAME) in that the name is remembered for passing back
       to the caller. However, (*SKIP:NAME) searches only for names set with
       (*MARK), ignoring those set by (*PRUNE) and (*THEN).
 | 
						|
 | 
						|
       A  subpattern that does not contain a | character is just a part of the
 | 
						|
       enclosing alternative; it is not a nested  alternation  with  only  one
 | 
						|
       alternative.  The effect of (*THEN) extends beyond such a subpattern to
 | 
						|
       the enclosing alternative. Consider this pattern, where A, B, etc.  are
 | 
						|
       complex  pattern fragments that do not contain any | characters at this
 | 
						|
       level:
 | 
						|
 | 
						|
         A (B(*THEN)C) | D
 | 
						|
 | 
						|
       If A and B are matched, but there is a failure in C, matching does  not
 | 
						|
       backtrack into A; instead it moves to the next alternative, that is, D.
 | 
						|
       However, if the subpattern containing (*THEN) is given an  alternative,
 | 
						|
       it behaves differently:
 | 
						|
 | 
						|
         A (B(*THEN)C | (*FAIL)) | D
 | 
						|
 | 
						|
       The  effect of (*THEN) is now confined to the inner subpattern. After a
 | 
						|
       failure in C, matching moves to (*FAIL), which causes the whole subpat-
 | 
						|
       tern  to  fail  because  there are no more alternatives to try. In this
 | 
						|
       case, matching does now backtrack into A.
 | 
						|
 | 
						|
       Note that a conditional subpattern is  not  considered  as  having  two
 | 
						|
       alternatives,  because  only  one  is  ever used. In other words, the |
 | 
						|
       character in a conditional subpattern has a different meaning. Ignoring
 | 
						|
       white space, consider:
 | 
						|
 | 
						|
         ^.*? (?(?=a) a | b(*THEN)c )
 | 
						|
 | 
						|
       If  the  subject  is  "ba", this pattern does not match. Because .*? is
 | 
						|
       ungreedy, it initially matches zero  characters.  The  condition  (?=a)
 | 
						|
       then  fails,  the  character  "b"  is  matched, but "c" is not. At this
 | 
						|
       point, matching does not backtrack to .*? as might perhaps be  expected
 | 
						|
       from  the  presence  of  the | character. The conditional subpattern is
 | 
						|
       part of the single alternative that comprises the whole pattern, and so
 | 
						|
       the  match  fails.  (If  there was a backtrack into .*?, allowing it to
 | 
						|
       match "b", the match would succeed.)
 | 
						|
 | 
						|
       The verbs just described provide four different "strengths" of  control
 | 
						|
       when subsequent matching fails. (*THEN) is the weakest, carrying on the
 | 
						|
       match at the next alternative. (*PRUNE) comes next, failing  the  match
 | 
						|
       at  the  current starting position, but allowing an advance to the next
 | 
						|
       character (for an unanchored pattern). (*SKIP) is similar, except  that
 | 
						|
       the advance may be more than one character. (*COMMIT) is the strongest,
 | 
						|
       causing the entire match to fail.
 | 
						|
 | 
						|
   More than one backtracking verb
 | 
						|
 | 
						|
       If more than one backtracking verb is present in  a  pattern,  the  one
 | 
						|
       that  is  backtracked  onto first acts. For example, consider this pat-
 | 
						|
       tern, where A, B, etc. are complex pattern fragments:
 | 
						|
 | 
						|
         (A(*COMMIT)B(*THEN)C|ABD)
 | 
						|
 | 
						|
       If A matches but B fails, the backtrack to (*COMMIT) causes the entire
       match to fail. However, if A and B match, but C fails, the backtrack to
       (*THEN) causes the next alternative (ABD) to be tried. This behaviour
       is consistent, but is not always the same as Perl's. It means that if
       two or more backtracking verbs appear in succession, all but the last
       of them have no effect. Consider this example:
 | 
						|
 | 
						|
         ...(*COMMIT)(*PRUNE)...
 | 
						|
 | 
						|
       If there is a matching failure to the right, backtracking onto (*PRUNE)
 | 
						|
       causes it to be triggered, and its action is taken. There can never  be
 | 
						|
       a backtrack onto (*COMMIT).
 | 
						|
 | 
						|
   Backtracking verbs in repeated groups
 | 
						|
 | 
						|
       PCRE2  differs  from  Perl  in  its  handling  of backtracking verbs in
 | 
						|
       repeated groups. For example, consider:
 | 
						|
 | 
						|
         /(a(*COMMIT)b)+ac/
 | 
						|
 | 
						|
       If the subject is "abac", Perl matches, but  PCRE2  fails  because  the
 | 
						|
       (*COMMIT) in the second repeat of the group acts.
 | 
						|
 | 
						|
   Backtracking verbs in assertions
 | 
						|
 | 
						|
       (*FAIL)  in  an assertion has its normal effect: it forces an immediate
 | 
						|
       backtrack.
 | 
						|
 | 
						|
       (*ACCEPT) in a positive assertion causes the assertion to succeed with-
 | 
						|
       out  any  further processing. In a negative assertion, (*ACCEPT) causes
 | 
						|
       the assertion to fail without any further processing.
 | 
						|
 | 
						|
       The other backtracking verbs are not treated specially if  they  appear
 | 
						|
       in  a  positive  assertion.  In  particular,  (*THEN) skips to the next
 | 
						|
       alternative in the innermost enclosing  group  that  has  alternations,
 | 
						|
       whether or not this is within the assertion.
 | 
						|
 | 
						|
       Negative  assertions  are,  however, different, in order to ensure that
 | 
						|
       changing a positive assertion into a  negative  assertion  changes  its
 | 
						|
       result. Backtracking into (*COMMIT), (*SKIP), or (*PRUNE) causes a neg-
 | 
						|
       ative assertion to be true, without considering any further alternative
 | 
						|
       branches in the assertion.  Backtracking into (*THEN) causes it to skip
 | 
						|
       to the next enclosing alternative within the assertion (the normal  be-
 | 
						|
       haviour),  but  if  the  assertion  does  not have such an alternative,
 | 
						|
       (*THEN) behaves like (*PRUNE).
 | 
						|
 | 
						|
   Backtracking verbs in subroutines
 | 
						|
 | 
						|
       These behaviours occur whether or not the subpattern is  called  recur-
 | 
						|
       sively.  Perl's treatment of subroutines is different in some cases.
 | 
						|
 | 
						|
       (*FAIL)  in  a subpattern called as a subroutine has its normal effect:
 | 
						|
       it forces an immediate backtrack.
 | 
						|
 | 
						|
       (*ACCEPT) in a subpattern called as a subroutine causes the  subroutine
 | 
						|
       match  to succeed without any further processing. Matching then contin-
 | 
						|
       ues after the subroutine call.
 | 
						|
 | 
						|
       (*COMMIT), (*SKIP), and (*PRUNE) in a subpattern called as a subroutine
 | 
						|
       cause the subroutine match to fail.
 | 
						|
 | 
						|
       (*THEN)  skips to the next alternative in the innermost enclosing group
 | 
						|
       within the subpattern that has alternatives. If there is no such  group
 | 
						|
       within the subpattern, (*THEN) causes the subroutine match to fail.
 | 
						|
 | 
						|
 | 
						|
SEE ALSO
 | 
						|
 | 
						|
       pcre2api(3),    pcre2callout(3),    pcre2matching(3),   pcre2syntax(3),
 | 
						|
       pcre2(3).
 | 
						|
 | 
						|
 | 
						|
AUTHOR
 | 
						|
 | 
						|
       Philip Hazel
 | 
						|
       University Computing Service
 | 
						|
       Cambridge, England.
 | 
						|
 | 
						|
 | 
						|
REVISION
 | 
						|
 | 
						|
       Last updated: 27 December 2016
 | 
						|
       Copyright (c) 1997-2016 University of Cambridge.
 | 
						|
------------------------------------------------------------------------------
 | 
						|
 | 
						|
 | 
						|
PCRE2PERFORM(3)            Library Functions Manual            PCRE2PERFORM(3)
 | 
						|
 | 
						|
 | 
						|
 | 
						|
NAME
 | 
						|
       PCRE2 - Perl-compatible regular expressions (revised API)
 | 
						|
 | 
						|
PCRE2 PERFORMANCE
 | 
						|
 | 
						|
       Two  aspects  of performance are discussed below: memory usage and pro-
 | 
						|
       cessing time. The way you express your pattern as a regular  expression
 | 
						|
       can affect both of them.
 | 
						|
 | 
						|
 | 
						|
COMPILED PATTERN MEMORY USAGE
 | 
						|
 | 
						|
       Patterns are compiled by PCRE2 into a reasonably efficient interpretive
 | 
						|
       code, so that most simple patterns do not  use  much  memory.  However,
 | 
						|
       there  is  one case where the memory usage of a compiled pattern can be
 | 
						|
       unexpectedly large. If a parenthesized subpattern has a quantifier with
 | 
						|
       a minimum greater than 1 and/or a limited maximum, the whole subpattern
 | 
						|
       is repeated in the compiled code. For example, the pattern
 | 
						|
 | 
						|
         (abc|def){2,4}
 | 
						|
 | 
						|
       is compiled as if it were
 | 
						|
 | 
						|
         (abc|def)(abc|def)((abc|def)(abc|def)?)?
 | 
						|
 | 
						|
       (Technical aside: It is done this way so that backtrack  points  within
 | 
						|
       each of the repetitions can be independently maintained.)
 | 
						|
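       The expansion can be reproduced textually with a few lines of C.
       This is a sketch of the "as if it were" form shown above, not
       PCRE2's compiler; expand_repeat() and append() are hypothetical
       names, and the helper assumes an already parenthesized group with
       1 <= min <= max.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Append src to dst (capacity cap, current length *len). Returns 0 on
   success, or -1 if the result would not fit. */
static int append(char *dst, size_t cap, size_t *len, const char *src)
{
    size_t n = strlen(src);
    if (*len + n + 1 > cap)
        return -1;
    memcpy(dst + *len, src, n + 1);
    *len += n;
    return 0;
}

/* Illustrative helper (not a PCRE2 function): write the textual
   expansion of group{min,max} into out. Returns the length of the
   expansion, or -1 if out is too small. */
long expand_repeat(const char *group, unsigned min, unsigned max,
                   char *out, size_t cap)
{
    size_t len = 0;
    unsigned i, optional;

    assert(group != NULL && out != NULL && cap > 0);
    assert(min >= 1 && min <= max);
    out[0] = '\0';
    optional = max - min;

    for (i = 0; i < min; i++)                /* mandatory copies */
        if (append(out, cap, &len, group) < 0) return -1;

    for (i = 0; i < optional; i++) {         /* nested optional copies */
        if (i + 1 < optional && append(out, cap, &len, "(") < 0) return -1;
        if (append(out, cap, &len, group) < 0) return -1;
    }
    if (optional > 0 && append(out, cap, &len, "?") < 0) return -1;
    for (i = 1; i < optional; i++)           /* close the nesting */
        if (append(out, cap, &len, ")?") < 0) return -1;

    return (long)len;
}
```

       Calling expand_repeat("(abc|def)", 2, 4, buf, sizeof buf) writes the
       40-character expansion shown above into buf. The output grows with
       both the size of the group and the repeat counts, which is why
       nested repeats with large counts become expensive.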
 | 
						|
       For  regular expressions whose quantifiers use only small numbers, this
 | 
						|
       is not usually a problem. However, if the numbers are large,  and  par-
 | 
						|
       ticularly  if  such repetitions are nested, the memory usage can become
 | 
						|
       an embarrassment. For example, the very simple pattern
 | 
						|
 | 
						|
         ((ab){1,1000}c){1,3}
 | 
						|
 | 
						|
       uses 51K bytes when compiled using the 8-bit  library.  When  PCRE2  is
 | 
						|
       compiled  with its default internal pointer size of two bytes, the size
 | 
						|
       limit on a compiled pattern is 64K code units in the 8-bit  and  16-bit
 | 
						|
       libraries, and this is reached with the above pattern if the outer rep-
 | 
						|
       etition is increased from 3 to 4. PCRE2 can be compiled to  use  larger
 | 
						|
       internal  pointers  and thus handle larger compiled patterns, but it is
 | 
						|
       better to try to rewrite your pattern to use less memory if you can.
 | 
						|
 | 
						|
       One way of reducing the memory usage for such patterns is to  make  use
 | 
						|
       of PCRE2's "subroutine" facility. Re-writing the above pattern as
 | 
						|
 | 
						|
         ((ab)(?2){0,999}c)(?1){0,2}
 | 
						|
 | 
						|
       reduces the memory requirements to 18K, and indeed it remains under 20K
 | 
						|
       even with the outer repetition increased to 100. However, this  pattern
 | 
						|
       is  not  exactly equivalent, because the "subroutine" calls are treated
 | 
						|
       as atomic groups into which there can be no backtracking if there is  a
 | 
						|
       subsequent  matching  failure.  Therefore, PCRE2 cannot do this kind of
 | 
						|
       rewriting automatically.  Furthermore, there is a  noticeable  loss  of
 | 
						|
       speed  when executing the modified pattern. Nevertheless, if the atomic
 | 
						|
       grouping is not a problem and the loss of  speed  is  acceptable,  this
 | 
						|
       kind  of rewriting will allow you to process patterns that PCRE2 cannot
 | 
						|
       otherwise handle.
 | 
						|
 | 
						|
 | 
						|
STACK USAGE AT RUN TIME
 | 
						|
 | 
						|
       When pcre2_match() is used for matching, certain kinds of  pattern  can
 | 
						|
       cause  it  to  use large amounts of the process stack. In some environ-
 | 
						|
       ments the default process stack is quite small, and if it runs out  the
 | 
						|
       result  is  often  SIGSEGV.  Rewriting your pattern can often help. The
 | 
						|
       pcre2stack documentation discusses this issue in detail.
 | 
						|
 | 
						|
 | 
						|
PROCESSING TIME
 | 
						|
 | 
						|
       Certain items in regular expression patterns are processed more
       efficiently than others. It is more efficient to use a character class
       like [aeiou] than a set of single-character alternatives such as
       (a|e|i|o|u). In general, the simplest construction that provides the
       required behaviour is usually the most efficient. Jeffrey Friedl's book
       "Mastering Regular Expressions" contains a lot of useful general
       discussion about optimizing regular expressions for efficient
       performance. This document contains a few observations about PCRE2.
 | 
						|
 | 
						|
       Using  Unicode  character  properties  (the  \p, \P, and \X escapes) is
 | 
						|
       slow, because PCRE2 has to use a multi-stage table lookup  whenever  it
 | 
						|
       needs  a  character's  property. If you can find an alternative pattern
 | 
						|
       that does not use character properties, it will probably be faster.
 | 
						|
 | 
						|
       By default, the escape sequences \b, \d, \s,  and  \w,  and  the  POSIX
 | 
						|
       character  classes  such  as  [:alpha:]  do not use Unicode properties,
 | 
						|
       partly for backwards compatibility, and partly for performance reasons.
 | 
						|
       However,  you  can  set  the PCRE2_UCP option or start the pattern with
 | 
						|
       (*UCP) if you want Unicode character properties to be  used.  This  can
 | 
						|
       double  the  matching  time  for  items  such  as \d, when matched with
 | 
						|
       pcre2_match(); the performance loss is less with a DFA  matching  func-
 | 
						|
       tion, and in both cases there is not much difference for \b.
 | 
						|
 | 
						|
       When  a pattern begins with .* not in atomic parentheses, nor in paren-
 | 
						|
       theses that are the subject of a backreference,  and  the  PCRE2_DOTALL
 | 
						|
       option  is  set,  the pattern is implicitly anchored by PCRE2, since it
 | 
						|
       can match only at the start of a subject string.  If  the  pattern  has
 | 
						|
       multiple top-level branches, they must all be anchorable. The optimiza-
 | 
						|
       tion can be disabled by  the  PCRE2_NO_DOTSTAR_ANCHOR  option,  and  is
 | 
						|
       automatically disabled if the pattern contains (*PRUNE) or (*SKIP).
 | 
						|
 | 
						|
       If  PCRE2_DOTALL  is  not  set,  PCRE2  cannot  make this optimization,
 | 
						|
       because the dot metacharacter does not then match a newline, and if the
 | 
						|
       subject  string contains newlines, the pattern may match from the char-
 | 
						|
       acter immediately following one of them instead of from the very start.
 | 
						|
       For example, the pattern
 | 
						|
 | 
						|
         .*second
 | 
						|
 | 
						|
       matches  the subject "first\nand second" (where \n stands for a newline
 | 
						|
       character), with the match starting at the seventh character. In  order
 | 
						|
       to  do  this, PCRE2 has to retry the match starting after every newline
 | 
						|
       in the subject.
 | 
						|
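       The set of retry points can be sketched in C. This is an
       illustration only: restart_offsets() is a hypothetical helper, not
       part of PCRE2, and it simply lists offset 0 plus the offset just
       after each newline, which are the places where a pattern beginning
       with .* could start matching when dot does not match newline.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative helper (not a PCRE2 function): store in out[] the
   offsets at which a match attempt would have to begin, and return
   how many were stored (at most max). */
size_t restart_offsets(const char *subject, size_t *out, size_t max)
{
    size_t i, n = 0;

    assert(subject != NULL && out != NULL);
    if (max > 0)
        out[n++] = 0;                        /* the very start */
    for (i = 0; subject[i] != '\0' && n < max; i++)
        if (subject[i] == '\n')
            out[n++] = i + 1;                /* just after a newline */
    return n;
}
```

       For the subject "first\nand second" this yields offsets 0 and 6;
       offset 6 is the seventh character, where the match described above
       begins. A subject without newlines yields only offset 0, which is
       why explicit anchoring costs nothing in that case.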
 | 
						|
       If you are using such a pattern with subject strings that do  not  con-
 | 
						|
       tain   newlines,   the   best   performance   is  obtained  by  setting
 | 
						|
       PCRE2_DOTALL, or starting the pattern with  ^.*  or  ^.*?  to  indicate
 | 
						|
       explicit anchoring. That saves PCRE2 from having to scan along the sub-
 | 
						|
       ject looking for a newline to restart at.
 | 
						|
 | 
						|
       Beware of patterns that contain nested indefinite  repeats.  These  can
 | 
						|
       take  a  long time to run when applied to a string that does not match.
 | 
						|
       Consider the pattern fragment
 | 
						|
 | 
						|
         ^(a+)*
 | 
						|
 | 
						|
       This can match "aaaa" in 16 different ways, and this  number  increases
 | 
						|
       very  rapidly  as the string gets longer. (The * repeat can match 0, 1,
 | 
						|
       2, 3, or 4 times, and for each of those cases other than 0 or 4, the  +
 | 
						|
       repeats  can  match  different numbers of times.) When the remainder of
 | 
						|
       the pattern is such that the entire match is going to fail,  PCRE2  has
 | 
						|
       in  principle  to  try  every  possible variation, and this can take an
 | 
						|
       extremely long time, even for relatively short strings.
 | 
						|
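       The figure of 16 can be checked with a short calculation in C. This
       is a counting sketch, not a model of the matching engine;
       count_ways() is a hypothetical helper. At each point the * repeat
       either stops (one way) or lets the next a+ iteration take j
       characters and continues on the remainder.

```c
#include <assert.h>

/* Number of distinct ways in which (a+)* can match at the start of a
   run of n "a" characters (hypothetical helper for illustration). */
unsigned long long count_ways(unsigned n)
{
    unsigned long long ways[64];
    unsigned i, j;

    assert(n < 64);               /* the result is 2^n; keep it in range */
    for (i = 0; i <= n; i++) {
        ways[i] = 1;              /* the * repeat stops here */
        for (j = 1; j <= i; j++)  /* next a+ iteration takes j characters */
            ways[i] += ways[i - j];
    }
    return ways[n];
}
```

       count_ways(4) is 16, and the count doubles with each additional
       character, so a run of 20 "a" characters already allows 1048576
       variations.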
 | 
						|
       An optimization catches some of the more simple cases such as
 | 
						|
 | 
						|
         (a+)*b
 | 
						|
 | 
						|
       where a literal character follows. Before  embarking  on  the  standard
 | 
						|
       matching  procedure, PCRE2 checks that there is a "b" later in the sub-
 | 
						|
       ject string, and if there is not, it fails the match immediately.  How-
 | 
						|
       ever,  when  there  is no following literal this optimization cannot be
 | 
						|
       used. You can see the difference by comparing the behaviour of
 | 
						|
 | 
						|
         (a+)*\d
 | 
						|
 | 
						|
       with the pattern above. The earlier pattern, which ends in a literal
       "b", gives a failure almost instantly when applied to a whole line of
       "a" characters, whereas the version ending in \d takes an appreciable
       time with strings longer than about 20 characters.
 | 
						|
 | 
						|
       In many cases, the solution to this kind of performance issue is to use
 | 
						|
       an atomic group or a possessive quantifier.
 | 
						|
 | 
						|
 | 
						|
AUTHOR
 | 
						|
 | 
						|
       Philip Hazel
 | 
						|
       University Computing Service
 | 
						|
       Cambridge, England.
 | 
						|
 | 
						|
 | 
						|
REVISION
 | 
						|
 | 
						|
       Last updated: 02 January 2015
 | 
						|
       Copyright (c) 1997-2015 University of Cambridge.
 | 
						|
------------------------------------------------------------------------------
 | 
						|
 | 
						|
 | 
						|
PCRE2POSIX(3)              Library Functions Manual              PCRE2POSIX(3)
 | 
						|
 | 
						|
 | 
						|
 | 
						|
NAME
 | 
						|
       PCRE2 - Perl-compatible regular expressions (revised API)
 | 
						|
 | 
						|
SYNOPSIS
 | 
						|
 | 
						|
       #include <pcre2posix.h>

       int regcomp(regex_t *preg, const char *pattern,
            int cflags);

       int regexec(const regex_t *preg, const char *string,
            size_t nmatch, regmatch_t pmatch[], int eflags);

       size_t regerror(int errcode, const regex_t *preg,
            char *errbuf, size_t errbuf_size);

       void regfree(regex_t *preg);
 | 
						|
 | 
						|
 | 
						|
DESCRIPTION
 | 
						|
 | 
						|
       This  set of functions provides a POSIX-style API for the PCRE2 regular
 | 
						|
       expression 8-bit library. See the pcre2api documentation for a descrip-
 | 
						|
       tion  of PCRE2's native API, which contains much additional functional-
 | 
						|
       ity. There are no POSIX-style wrappers for PCRE2's  16-bit  and  32-bit
 | 
						|
       libraries.
 | 
						|
 | 
						|
       The functions described here are just wrapper functions that ultimately
 | 
						|
       call the  PCRE2  native  API.  Their  prototypes  are  defined  in  the
 | 
						|
       pcre2posix.h  header  file,  and  on Unix systems the library itself is
 | 
						|
       called libpcre2-posix.a, so can be accessed by adding -lpcre2-posix  to
 | 
						|
       the  command  for  linking  an  application that uses them. Because the
 | 
						|
       POSIX functions call the native ones,  it  is  also  necessary  to  add
 | 
						|
       -lpcre2-8.
 | 
						|
 | 
						|
       Those  POSIX  option bits that can reasonably be mapped to PCRE2 native
 | 
						|
       options have been implemented. In addition, the option REG_EXTENDED  is
 | 
						|
       defined  with  the  value  zero. This has no effect, but since programs
 | 
						|
       that are written to the POSIX interface often use  it,  this  makes  it
 | 
						|
       easier  to  slot in PCRE2 as a replacement library. Other POSIX options
 | 
						|
       are not even defined.
 | 
						|
 | 
						|
       There are also some options that are not defined by POSIX.  These  have
 | 
						|
       been  added  at  the  request  of users who want to make use of certain
 | 
						|
       PCRE2-specific features via the POSIX calling interface.
 | 
						|
 | 
						|
       When PCRE2 is called via these functions, it is only the  API  that  is
 | 
						|
       POSIX-like  in  style.  The syntax and semantics of the regular expres-
 | 
						|
       sions themselves are still those of Perl, subject  to  the  setting  of
 | 
						|
       various  PCRE2 options, as described below. "POSIX-like in style" means
 | 
						|
       that the API approximates to the POSIX  definition;  it  is  not  fully
 | 
						|
       POSIX-compatible,  and  in  multi-unit  encoding domains it is probably
 | 
						|
       even less compatible.
 | 
						|
 | 
						|
       The header for these functions is supplied as pcre2posix.h to avoid any
 | 
						|
       potential  clash  with  other  POSIX  libraries.  It can, of course, be
 | 
						|
       renamed or aliased as regex.h, which is the "correct" name. It provides
 | 
						|
       two  structure  types,  regex_t  for  compiled internal forms, and reg-
 | 
						|
       match_t for returning captured substrings. It also  defines  some  con-
 | 
						|
       stants  whose  names  start  with  "REG_";  these  are used for setting
 | 
						|
       options and identifying error codes.
 | 
						|
 | 
						|
 | 
						|
COMPILING A PATTERN
 | 
						|
 | 
						|
       The function regcomp() is called to compile a pattern into an  internal
 | 
						|
       form.  The  pattern  is  a C string terminated by a binary zero, and is
 | 
						|
       passed in the argument pattern. The preg argument is  a  pointer  to  a
 | 
						|
       regex_t  structure that is used as a base for storing information about
 | 
						|
       the compiled regular expression.
 | 
						|
 | 
						|
       The argument cflags is either zero, or contains one or more of the bits
 | 
						|
       defined by the following macros:
 | 
						|
 | 
						|
         REG_DOTALL
 | 
						|
 | 
						|
       The  PCRE2_DOTALL  option  is set when the regular expression is passed
 | 
						|
       for compilation to the native function. Note  that  REG_DOTALL  is  not
 | 
						|
       part of the POSIX standard.
 | 
						|
 | 
						|
         REG_ICASE
 | 
						|
 | 
						|
       The  PCRE2_CASELESS option is set when the regular expression is passed
 | 
						|
       for compilation to the native function.
 | 
						|
 | 
						|
         REG_NEWLINE
 | 
						|
 | 
						|
       The PCRE2_MULTILINE option is set when the regular expression is passed
 | 
						|
       for  compilation  to the native function. Note that this does not mimic
 | 
						|
       the defined POSIX behaviour for REG_NEWLINE  (see  the  following  sec-
 | 
						|
       tion).
 | 
						|
 | 
						|
         REG_NOSUB
 | 
						|
 | 
						|
       When a pattern that is compiled with this flag is passed to regexec()
       for matching, the nmatch and pmatch arguments are ignored, and no
       captured strings are returned. Versions of the PCRE2 library prior to
       10.22 used to set the PCRE2_NO_AUTO_CAPTURE compile option, but this no
       longer happens because it disables the use of back references.
 | 
						|
 | 
						|
         REG_UCP
 | 
						|
 | 
						|
       The PCRE2_UCP option is set when the regular expression is passed for
       compilation to the native function. This causes PCRE2 to use Unicode
       properties when matching \d, \w, etc., instead of just recognizing
       ASCII values. Note that REG_UCP is not part of the POSIX standard.
 | 
						|
 | 
						|
         REG_UNGREEDY
 | 
						|
 | 
						|
       The PCRE2_UNGREEDY option is set when the regular expression is  passed
 | 
						|
       for  compilation  to the native function. Note that REG_UNGREEDY is not
 | 
						|
       part of the POSIX standard.
 | 
						|
 | 
						|
         REG_UTF
 | 
						|
 | 
						|
       The PCRE2_UTF option is set when the regular expression is  passed  for
 | 
						|
       compilation  to the native function. This causes the pattern itself and
 | 
						|
       all data strings used for matching it to be treated as  UTF-8  strings.
 | 
						|
       Note that REG_UTF is not part of the POSIX standard.
 | 
						|
 | 
						|
       In the absence of these flags, no options are passed to the native
       function. This means that the regex is compiled with PCRE2 default
       semantics. In particular, the way it handles newline characters in the
       subject string is the Perl way, not the POSIX way. Note that setting
       PCRE2_MULTILINE has only some of the effects specified for REG_NEWLINE.
       It does not affect the way newlines are matched by the dot
       metacharacter (they are not) or by a negative class such as [^a] (they
       are).
 | 
						|
 | 
						|
       The  yield of regcomp() is zero on success, and non-zero otherwise. The
 | 
						|
       preg structure is filled in on success, and one member of the structure
 | 
						|
       is  public: re_nsub contains the number of capturing subpatterns in the
 | 
						|
       regular expression. Various error codes are defined in the header file.
 | 
						|
 | 
						|
       NOTE: If the yield of regcomp() is non-zero, you must  not  attempt  to
 | 
						|
       use the contents of the preg structure. If, for example, you pass it to
 | 
						|
       regexec(), the result is undefined and your program is likely to crash.
 | 
						|
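       A minimal sketch of the calling sequence follows. The helper name
       compile_and_match() and the USE_PCRE2_POSIX macro are hypothetical;
       without that macro the sketch assumes the system's own regex.h,
       which provides the same four functions but with POSIX rather than
       Perl pattern semantics. The pattern used in the usage note is
       chosen to behave the same way under both.

```c
#include <assert.h>
#include <stddef.h>
#ifdef USE_PCRE2_POSIX           /* hypothetical build-time switch */
#include <pcre2posix.h>          /* link with -lpcre2-posix -lpcre2-8 */
#else
#include <regex.h>               /* assumption: the system POSIX header */
#endif

/* Hypothetical helper: compile pattern, match it against subject, and
   fill pmatch[0..nmatch-1]. If nsub is not NULL it receives re_nsub.
   Returns 0 on a match, otherwise the non-zero value returned by
   regcomp() or regexec(). */
int compile_and_match(const char *pattern, const char *subject,
                      int cflags, regmatch_t *pmatch, size_t nmatch,
                      size_t *nsub)
{
    regex_t preg;
    int rc;

    assert(pattern != NULL && subject != NULL);

    rc = regcomp(&preg, pattern, cflags);
    if (rc != 0)
        return rc;               /* preg must not be used after a failure */

    if (nsub != NULL)
        *nsub = preg.re_nsub;    /* number of capturing subpatterns */

    rc = regexec(&preg, subject, nmatch, pmatch, 0);
    regfree(&preg);              /* release the compiled pattern */
    return rc;
}
```

       For example, matching "([a-z]+)=([0-9]+)" against "key=42" with
       REG_EXTENDED returns zero, sets re_nsub to 2, and reports the
       offsets 0-6, 0-3, and 4-6 in pmatch[0], pmatch[1], and pmatch[2].
       The early return after a failed regcomp() is the point of the
       sketch: preg is never passed to regexec() or regfree() unless
       compilation succeeded.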
 | 
						|
 | 
						|
MATCHING NEWLINE CHARACTERS
 | 
						|
 | 
						|
       This area is not simple, because POSIX and Perl take different views of
 | 
						|
       things.   It  is not possible to get PCRE2 to obey POSIX semantics, but
 | 
						|
       then PCRE2 was never intended to be a POSIX engine. The following table
 | 
						|
       lists  the  different  possibilities for matching newline characters in
 | 
						|
       Perl and PCRE2:
 | 
						|
 | 
						|
                                 Default   Change with

         . matches newline          no     PCRE2_DOTALL
         newline matches [^a]       yes    not changeable
         $ matches \n at end        yes    PCRE2_DOLLAR_ENDONLY
         $ matches \n in middle     no     PCRE2_MULTILINE
         ^ matches \n in middle     no     PCRE2_MULTILINE
 | 
						|
 | 
						|
       This is the equivalent table for a POSIX-compatible pattern matcher:
 | 
						|
 | 
						|
                                 Default   Change with

         . matches newline          yes    REG_NEWLINE
         newline matches [^a]       yes    REG_NEWLINE
         $ matches \n at end        no     REG_NEWLINE
         $ matches \n in middle     no     REG_NEWLINE
         ^ matches \n in middle     no     REG_NEWLINE
 | 
						|
 | 
						|
       This behaviour is not what happens when PCRE2 is called via  its  POSIX
 | 
						|
       API.  By  default, PCRE2's behaviour is the same as Perl's, except that
 | 
						|
       there is no equivalent for PCRE2_DOLLAR_ENDONLY in Perl. In both  PCRE2
 | 
						|
       and Perl, there is no way to stop newline from matching [^a].
 | 
						|
 | 
						|
       Default  POSIX newline handling can be obtained by setting PCRE2_DOTALL
 | 
						|
       and PCRE2_DOLLAR_ENDONLY when  calling  pcre2_compile()  directly,  but
 | 
						|
       there  is  no  way  to make PCRE2 behave exactly as for the REG_NEWLINE
 | 
						|
       action. When using the POSIX API, passing REG_NEWLINE to  PCRE2's  reg-
 | 
						|
       comp() function causes PCRE2_MULTILINE to be passed to pcre2_compile(),
 | 
						|
       and REG_DOTALL passes PCRE2_DOTALL. There is no way to pass  PCRE2_DOL-
 | 
						|
       LAR_ENDONLY.
 | 
						|
 | 
						|
 | 
						|
MATCHING A PATTERN
 | 
						|
 | 
						|
       The  function  regexec()  is  called  to  match a compiled pattern preg
 | 
						|
       against a given string, which is by default terminated by a  zero  byte
 | 
						|
       (but  see  REG_STARTEND below), subject to the options in eflags. These
 | 
						|
       can be:
 | 
						|
 | 
						|
         REG_NOTBOL
 | 
						|
 | 
						|
       The PCRE2_NOTBOL option is set when calling the underlying PCRE2 match-
 | 
						|
       ing function.
 | 
						|
 | 
						|
         REG_NOTEMPTY
 | 
						|
 | 
						|
       The  PCRE2_NOTEMPTY  option  is  set  when calling the underlying PCRE2
 | 
						|
       matching function. Note that REG_NOTEMPTY is  not  part  of  the  POSIX
 | 
						|
       standard.  However, setting this option can give more POSIX-like behav-
 | 
						|
       iour in some situations.
 | 
						|
 | 
						|
         REG_NOTEOL
 | 
						|
 | 
						|
       The PCRE2_NOTEOL option is set when calling the underlying PCRE2 match-
 | 
						|
       ing function.
 | 
						|
 | 
						|
         REG_STARTEND
 | 
						|
 | 
						|
       The  string  is  considered to start at string + pmatch[0].rm_so and to
 | 
						|
       have a terminating NUL located at string + pmatch[0].rm_eo (there  need
 | 
						|
       not  actually  be  a  NUL at that location), regardless of the value of
 | 
						|
       nmatch. This is a BSD extension, compatible with but not  specified  by
 | 
						|
       IEEE  Standard  1003.2  (POSIX.2),  and  should be used with caution in
 | 
						|
       software intended to be portable to other systems. Note that a non-zero
 | 
						|
       rm_so does not imply REG_NOTBOL; REG_STARTEND affects only the location
 | 
						|
       of the string, not how it is matched. Setting REG_STARTEND and  passing
 | 
						|
       pmatch  as  NULL  are  mutually  exclusive;  the  error  REG_INVARG  is
 | 
						|
       returned.
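
       As a minimal sketch of REG_STARTEND, the helper below searches only a
       slice of a longer buffer. It is written against the system <regex.h>
       so that it builds without PCRE2; with PCRE2's POSIX wrapper the
       include would be pcre2posix.h instead. The name search_slice() is
       hypothetical, and the sketch assumes a platform that defines
       REG_STARTEND.

```c
#include <regex.h>
#include <stddef.h>

/* Search for a compiled pattern in the bytes subject[from] .. subject[to-1].
   On success the match offsets are left in *m. Returns the regexec() result:
   zero for a match, REG_NOMATCH when the slice contains no match. */
int search_slice(const regex_t *re, const char *subject,
                 size_t from, size_t to, regmatch_t *m)
{
  m->rm_so = (regoff_t)from;   /* first byte to be examined */
  m->rm_eo = (regoff_t)to;     /* treated as the end of the string */
  return regexec(re, subject, 1, m, REG_STARTEND);
}
```

       Because pmatch[0] carries the input range, the caller must always
       supply at least one regmatch_t, even when only a yes/no answer is
       wanted.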
 | 
						|
 | 
						|
       If the pattern was compiled with the REG_NOSUB flag, no data about  any
 | 
						|
       matched  strings  is  returned.  The  nmatch  and  pmatch  arguments of
 | 
						|
       regexec() are ignored (except possibly as input for REG_STARTEND).
 | 
						|
 | 
						|
       The value of nmatch may be zero, and the value of pmatch may be NULL
 | 
						|
       (unless  REG_STARTEND  is  set);  in both these cases no data about any
 | 
						|
       matched strings is returned.
 | 
						|
 | 
						|
       Otherwise, the portion of the string that was  matched,  and  also  any
 | 
						|
       captured substrings, are returned via the pmatch argument, which points
 | 
						|
       to an array of nmatch structures of  type  regmatch_t,  containing  the
 | 
						|
       members  rm_so  and  rm_eo.  These contain the byte offset to the first
 | 
						|
       character of each substring and the offset to the first character after
 | 
						|
       the  end of each substring, respectively. The 0th element of the vector
 | 
						|
       relates to the entire portion of string that  was  matched;  subsequent
 | 
						|
       elements relate to the capturing subpatterns of the regular expression.
 | 
						|
       Unused entries in the array have both structure members set to -1.
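
       The use of rm_so and rm_eo can be sketched as follows. This is an
       illustration only: capture_group() and MAX_GROUPS are hypothetical
       names, and the code includes the system <regex.h> so that it builds
       without PCRE2 (pcre2posix.h would be included instead when using
       PCRE2's wrapper). It also shows the check on regcomp()'s yield that
       must precede any use of the compiled structure.

```c
#include <regex.h>
#include <stddef.h>
#include <string.h>

/* Copy the text captured by group number `group` into out (size outsz).
   Returns 0 on success, or -1 if compilation fails, there is no match,
   the group did not take part in the match, or out is too small. */
int capture_group(const char *pattern, const char *subject,
                  size_t group, char *out, size_t outsz)
{
  enum { MAX_GROUPS = 10 };
  regex_t re;
  regmatch_t pmatch[MAX_GROUPS];
  int result = -1;

  if (group >= MAX_GROUPS) return -1;

  /* If regcomp() fails, re must not be used, so return at once. */
  if (regcomp(&re, pattern, REG_EXTENDED) != 0) return -1;

  if (regexec(&re, subject, MAX_GROUPS, pmatch, 0) == 0 &&
      pmatch[group].rm_so != -1)    /* -1 marks an unused entry */
    {
    size_t len = (size_t)(pmatch[group].rm_eo - pmatch[group].rm_so);
    if (len < outsz)
      {
      memcpy(out, subject + pmatch[group].rm_so, len);
      out[len] = '\0';
      result = 0;
      }
    }

  regfree(&re);
  return result;
}
```

       Group 0 is the whole match, so capture_group(p, s, 0, buf, size)
       extracts the entire matched portion of the subject.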
 | 
						|
 | 
						|
       A successful match yields  a  zero  return;  various  error  codes  are
 | 
						|
       defined  in  the  header  file,  of which REG_NOMATCH is the "expected"
 | 
						|
       failure code.
 | 
						|
 | 
						|
 | 
						|
ERROR MESSAGES
 | 
						|
 | 
						|
       The regerror() function maps a non-zero errorcode from either regcomp()
 | 
						|
       or  regexec()  to  a  printable message. If preg is not NULL, the error
 | 
						|
       should have arisen from the use of that structure. A message terminated
 | 
						|
       by  a binary zero is placed in errbuf. If the buffer is too short, only
 | 
						|
       the first errbuf_size - 1 characters of the error message are used. The
 | 
						|
       yield  of  the  function is the size of buffer needed to hold the whole
 | 
						|
       message, including the terminating zero. This  value  is  greater  than
 | 
						|
       errbuf_size if the message was truncated.
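
       Since the yield is the size needed for the whole message, one way to
       avoid truncation is to call regerror() twice, as in this sketch. The
       helper name error_text() is hypothetical, and the system <regex.h>
       is used here so that the code builds without PCRE2.

```c
#include <regex.h>
#include <stdlib.h>

/* Return a malloc()ed copy of the message for errcode, or NULL if memory
   cannot be obtained. The caller frees the result. */
char *error_text(int errcode, const regex_t *preg)
{
  char probe[1];
  /* With a one-byte buffer the message is truncated to nothing, but the
     return value is still the size needed for the whole message. */
  size_t needed = regerror(errcode, preg, probe, sizeof probe);
  char *text = malloc(needed);
  if (text != NULL) (void)regerror(errcode, preg, text, needed);
  return text;
}
```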
 | 
						|
 | 
						|
 | 
						|
MEMORY USAGE
 | 
						|
 | 
						|
       Compiling  a regular expression causes memory to be allocated and asso-
 | 
						|
       ciated with the preg structure. The function regfree() frees  all  such
 | 
						|
       memory,  after  which  preg may no longer be used as a compiled expres-
 | 
						|
       sion.
 | 
						|
 | 
						|
 | 
						|
AUTHOR
 | 
						|
 | 
						|
       Philip Hazel
 | 
						|
       University Computing Service
 | 
						|
       Cambridge, England.
 | 
						|
 | 
						|
 | 
						|
REVISION
 | 
						|
 | 
						|
       Last updated: 31 January 2016
 | 
						|
       Copyright (c) 1997-2016 University of Cambridge.
 | 
						|
------------------------------------------------------------------------------
 | 
						|
 | 
						|
 | 
						|
PCRE2SAMPLE(3)             Library Functions Manual             PCRE2SAMPLE(3)
 | 
						|
 | 
						|
 | 
						|
 | 
						|
NAME
 | 
						|
       PCRE2 - Perl-compatible regular expressions (revised API)
 | 
						|
 | 
						|
PCRE2 SAMPLE PROGRAM
 | 
						|
 | 
						|
       A  simple, complete demonstration program to get you started with using
 | 
						|
       PCRE2 is supplied in the file pcre2demo.c in the src directory  in  the
 | 
						|
       PCRE2 distribution. A listing of this program is given in the pcre2demo
 | 
						|
       documentation. If you do not have a copy of the PCRE2 distribution, you
 | 
						|
       can save this listing to re-create the contents of pcre2demo.c.
 | 
						|
 | 
						|
       The  demonstration  program compiles the regular expression that is its
 | 
						|
       first argument, and matches it against the subject string in its second
 | 
						|
       argument.  No  PCRE2  options are set, and default character tables are
 | 
						|
       used. If matching succeeds, the program outputs the portion of the sub-
 | 
						|
       ject  that  matched,  together  with  the contents of any captured sub-
 | 
						|
       strings.
 | 
						|
 | 
						|
       If the -g option is given on the command line, the program then goes on
 | 
						|
       to check for further matches of the same regular expression in the same
 | 
						|
       subject string. The logic is a little bit tricky because of the  possi-
 | 
						|
       bility  of  matching an empty string. Comments in the code explain what
 | 
						|
       is going on.
 | 
						|
 | 
						|
       The code in pcre2demo.c is an 8-bit program that uses the  PCRE2  8-bit
 | 
						|
       library.  It  handles  strings  and characters that are stored in 8-bit
 | 
						|
       code units.  By default, one character corresponds to  one  code  unit,
 | 
						|
       but  if  the  pattern starts with "(*UTF)", both it and the subject are
 | 
						|
       treated as UTF-8 strings, where characters  may  occupy  multiple  code
 | 
						|
       units.
 | 
						|
 | 
						|
       If  PCRE2  is installed in the standard include and library directories
 | 
						|
       for your operating system, you should be able to compile the demonstra-
 | 
						|
       tion program using a command like this:
 | 
						|
 | 
						|
         cc -o pcre2demo pcre2demo.c -lpcre2-8
 | 
						|
 | 
						|
       If PCRE2 is installed elsewhere, you may need to add additional options
 | 
						|
       to the command line. For example, on a Unix-like system that has  PCRE2
 | 
						|
       installed  in  /usr/local,  you  can  compile the demonstration program
 | 
						|
       using a command like this:
 | 
						|
 | 
						|
         cc -o pcre2demo -I/usr/local/include pcre2demo.c \
            -L/usr/local/lib -lpcre2-8
 | 
						|
 | 
						|
       Once you have built the demonstration program, you can run simple tests
 | 
						|
       like this:
 | 
						|
 | 
						|
         ./pcre2demo 'cat|dog' 'the cat sat on the mat'
         ./pcre2demo -g 'cat|dog' 'the dog sat on the cat'
 | 
						|
 | 
						|
       Note  that  there  is  a  much  more comprehensive test program, called
 | 
						|
       pcre2test, which supports many  more  facilities  for  testing  regular
 | 
						|
       expressions using all three PCRE2 libraries (8-bit, 16-bit, and 32-bit,
 | 
						|
       though not all three need be installed). The pcre2demo program is  pro-
 | 
						|
       vided as a relatively simple coding example.
 | 
						|
 | 
						|
       If you try to run pcre2demo when PCRE2 is not installed in the standard
 | 
						|
       library directory, you may get an error like  this  on  some  operating
 | 
						|
       systems (e.g. Solaris):
 | 
						|
 | 
						|
         ld.so.1: pcre2demo: fatal: libpcre2-8.so.0: open failed: No such file or directory
 | 
						|
 | 
						|
       This is caused by the way shared library support works  on  those  sys-
 | 
						|
       tems. You need to add
 | 
						|
 | 
						|
         -R/usr/local/lib
 | 
						|
 | 
						|
       (for example) to the compile command to get round this problem.
 | 
						|
 | 
						|
 | 
						|
AUTHOR
 | 
						|
 | 
						|
       Philip Hazel
 | 
						|
       University Computing Service
 | 
						|
       Cambridge, England.
 | 
						|
 | 
						|
 | 
						|
REVISION
 | 
						|
 | 
						|
       Last updated: 02 February 2016
 | 
						|
       Copyright (c) 1997-2016 University of Cambridge.
 | 
						|
------------------------------------------------------------------------------
 | 
						|
PCRE2SERIALIZE(3)          Library Functions Manual          PCRE2SERIALIZE(3)
 | 
						|
 | 
						|
 | 
						|
 | 
						|
NAME
 | 
						|
       PCRE2 - Perl-compatible regular expressions (revised API)
 | 
						|
 | 
						|
SAVING AND RE-USING PRECOMPILED PCRE2 PATTERNS
 | 
						|
 | 
						|
       int32_t pcre2_serialize_decode(pcre2_code **codes,
         int32_t number_of_codes, const uint8_t *bytes,
         pcre2_general_context *gcontext);

       int32_t pcre2_serialize_encode(pcre2_code **codes,
         int32_t number_of_codes, uint8_t **serialized_bytes,
         PCRE2_SIZE *serialized_size, pcre2_general_context *gcontext);

       void pcre2_serialize_free(uint8_t *bytes);

       int32_t pcre2_serialize_get_number_of_codes(const uint8_t *bytes);
 | 
						|
 | 
						|
       If  you  are running an application that uses a large number of regular
 | 
						|
       expression patterns, it may be useful to store them  in  a  precompiled
 | 
						|
       form  instead  of  having to compile them every time the application is
 | 
						|
       run. However, if you are using the just-in-time  optimization  feature,
 | 
						|
       it is not possible to save and reload the JIT data, because it is posi-
 | 
						|
       tion-dependent. The host on which the patterns  are  reloaded  must  be
 | 
						|
       running  the  same version of PCRE2, with the same code unit width, and
 | 
						|
       must also have the same endianness, pointer width and PCRE2_SIZE  type.
 | 
						|
       For  example, patterns compiled on a 32-bit system using PCRE2's 16-bit
 | 
						|
       library cannot be reloaded on a 64-bit system, nor can they be reloaded
 | 
						|
       using the 8-bit library.
 | 
						|
 | 
						|
 | 
						|
SECURITY CONCERNS
 | 
						|
 | 
						|
       The facility for saving and restoring compiled patterns is intended for
 | 
						|
       use within individual applications.  As  such,  the  data  supplied  to
 | 
						|
       pcre2_serialize_decode()  is expected to be trusted data, not data from
 | 
						|
       arbitrary external sources.  There  is  only  some  simple  consistency
 | 
						|
       checking, not complete validation of what is being re-loaded.
 | 
						|
 | 
						|
 | 
						|
SAVING COMPILED PATTERNS
 | 
						|
 | 
						|
       Before compiled patterns can be saved they must be serialized, that is,
 | 
						|
       converted to a stream of bytes. A single byte stream  may  contain  any
 | 
						|
       number  of  compiled patterns, but they must all use the same character
 | 
						|
       tables. A single copy of the tables is included in the byte stream (its
 | 
						|
       size is 1088 bytes). For more details of character tables, see the sec-
 | 
						|
       tion on locale support in the pcre2api documentation.
 | 
						|
 | 
						|
       The function pcre2_serialize_encode() creates a serialized byte  stream
 | 
						|
       from  a  list of compiled patterns. Its first two arguments specify the
 | 
						|
       list, being a pointer to a vector of pointers to compiled patterns, and
 | 
						|
       the length of the vector. The third and fourth arguments point to vari-
 | 
						|
       ables which are set to point to the created byte stream and its length,
 | 
						|
       respectively.  The  final  argument  is a pointer to a general context,
 | 
						|
       which can be used to specify custom memory  management  functions.  If
 | 
						|
       this  argument  is NULL, malloc() is used to obtain memory for the byte
 | 
						|
       stream. The yield of the function is the number of serialized patterns,
 | 
						|
       or one of the following negative error codes:
 | 
						|
 | 
						|
         PCRE2_ERROR_BADDATA      the number of patterns is zero or less
         PCRE2_ERROR_BADMAGIC     mismatch of id bytes in one of the patterns
         PCRE2_ERROR_MEMORY       memory allocation failed
         PCRE2_ERROR_MIXEDTABLES  the patterns do not all use the same tables
         PCRE2_ERROR_NULL         the 1st, 3rd, or 4th argument is NULL
 | 
						|
 | 
						|
       PCRE2_ERROR_BADMAGIC  means  either that a pattern's code has been cor-
 | 
						|
       rupted, or that a slot in the vector does not point to a compiled  pat-
 | 
						|
       tern.
 | 
						|
 | 
						|
       Once a set of patterns has been serialized you can save the data in any
 | 
						|
       appropriate manner. Here is sample code that compiles two patterns  and
 | 
						|
       writes them to a file. It assumes that the variable fd refers to a file
 | 
						|
       that is open for output. The error checking that should be present in a
 | 
						|
       real application has been omitted for simplicity.
 | 
						|
 | 
						|
         int errorcode;
         uint8_t *bytes;
         PCRE2_SIZE erroroffset;
         PCRE2_SIZE bytescount;
         pcre2_code *list_of_codes[2];

         /* Compile the two patterns that are to be saved. */
         list_of_codes[0] = pcre2_compile((PCRE2_SPTR)"first pattern",
           PCRE2_ZERO_TERMINATED, 0, &errorcode, &erroroffset, NULL);
         list_of_codes[1] = pcre2_compile((PCRE2_SPTR)"second pattern",
           PCRE2_ZERO_TERMINATED, 0, &errorcode, &erroroffset, NULL);

         /* Serialize both patterns into a single byte stream. */
         errorcode = pcre2_serialize_encode(list_of_codes, 2, &bytes,
           &bytescount, NULL);

         /* Write the byte stream to the output file. */
         errorcode = fwrite(bytes, 1, bytescount, fd);
 | 
						|
 | 
						|
       Note  that  the  serialized data is binary data that may contain any of
 | 
						|
       the 256 possible byte  values.  On  systems  that  make  a  distinction
 | 
						|
       between binary and non-binary data, be sure that the file is opened for
 | 
						|
       binary output.
 | 
						|
 | 
						|
       Serializing a set of patterns leaves the original  data  untouched,  so
 | 
						|
       they  can  still  be used for matching. Their memory must eventually be
 | 
						|
       freed in the usual way by calling pcre2_code_free(). When you have fin-
 | 
						|
       ished with the byte stream, it too must be freed by calling pcre2_seri-
 | 
						|
       alize_free().
 | 
						|
 | 
						|
 | 
						|
RE-USING PRECOMPILED PATTERNS
 | 
						|
 | 
						|
       In order to re-use a set of saved patterns  you  must  first  make  the
 | 
						|
       serialized  byte stream available in main memory (for example, by read-
 | 
						|
       ing from a file). The management of this memory  block  is  up  to  the
 | 
						|
       application.  You  can  use  the  pcre2_serialize_get_number_of_codes()
 | 
						|
       function to find out how many compiled patterns are in  the  serialized
 | 
						|
       data without actually decoding the patterns:
 | 
						|
 | 
						|
         uint8_t *bytes = <serialized data>;
         int32_t number_of_codes = pcre2_serialize_get_number_of_codes(bytes);
 | 
						|
 | 
						|
       The pcre2_serialize_decode() function reads a byte stream and recreates
 | 
						|
       the compiled patterns in new memory blocks, setting pointers to them in
 | 
						|
       a  vector.  The  first two arguments are a pointer to a suitable vector
 | 
						|
       and its length, and the third argument points to  a  byte  stream.  The
 | 
						|
       final  argument is a pointer to a general context, which can be used to
 | 
						|
       specify custom memory management functions for the  decoded  patterns.
 | 
						|
       If this argument is NULL, malloc() and free() are used. After deserial-
 | 
						|
       ization, the byte stream is no longer needed and can be discarded.
 | 
						|
 | 
						|
         pcre2_code *list_of_codes[2];
         uint8_t *bytes = <serialized data>;
         int32_t number_of_codes =
           pcre2_serialize_decode(list_of_codes, 2, bytes, NULL);
 | 
						|
 | 
						|
       If the vector is not large enough for all  the  patterns  in  the  byte
 | 
						|
       stream,  it  is  filled  with  those  that  fit,  and the remainder are
 | 
						|
       ignored. The yield of the function is the number of  decoded  patterns,
 | 
						|
       or one of the following negative error codes:
 | 
						|
 | 
						|
         PCRE2_ERROR_BADDATA            second argument is zero or less
         PCRE2_ERROR_BADMAGIC           mismatch of id bytes in the data
         PCRE2_ERROR_BADMODE            mismatch of code unit size or PCRE2 version
         PCRE2_ERROR_BADSERIALIZEDDATA  other sanity check failure
         PCRE2_ERROR_MEMORY             memory allocation failed
         PCRE2_ERROR_NULL               first or third argument is NULL
 | 
						|
 | 
						|
       PCRE2_ERROR_BADMAGIC  may mean that the data is corrupt, or that it was
 | 
						|
       compiled on a system with different endianness.
 | 
						|
 | 
						|
       Decoded patterns can be used for matching in the usual way, and must be
 | 
						|
       freed  by  calling pcre2_code_free(). However, be aware that there is a
 | 
						|
       potential race issue if you  are  using  multiple  patterns  that  were
 | 
						|
       decoded  from  a  single  byte stream in a multithreaded application. A
 | 
						|
       single copy of the character tables is used by all the decoded patterns
 | 
						|
       and a reference count is used to arrange for its memory to be automati-
 | 
						|
       cally freed when the last pattern is freed, but there is no locking  on
 | 
						|
       this  reference count. Therefore, if you want to call pcre2_code_free()
 | 
						|
       for these patterns in different threads,  you  must  arrange  your  own
 | 
						|
       locking,  and  ensure  that  pcre2_code_free()  cannot be called by two
 | 
						|
       threads at the same time.
 | 
						|
 | 
						|
       If a pattern was processed by pcre2_jit_compile() before being  serial-
 | 
						|
       ized,  the  JIT data is discarded and so is no longer available after a
 | 
						|
       save/restore cycle. You can, however, process a restored  pattern  with
 | 
						|
       pcre2_jit_compile() if you wish.
 | 
						|
 | 
						|
 | 
						|
AUTHOR
 | 
						|
 | 
						|
       Philip Hazel
 | 
						|
       University Computing Service
 | 
						|
       Cambridge, England.
 | 
						|
 | 
						|
 | 
						|
REVISION
 | 
						|
 | 
						|
       Last updated: 24 May 2016
 | 
						|
       Copyright (c) 1997-2016 University of Cambridge.
 | 
						|
------------------------------------------------------------------------------
 | 
						|
 | 
						|
 | 
						|
PCRE2STACK(3)              Library Functions Manual              PCRE2STACK(3)
 | 
						|
 | 
						|
 | 
						|
 | 
						|
NAME
 | 
						|
       PCRE2 - Perl-compatible regular expressions (revised API)
 | 
						|
 | 
						|
PCRE2 DISCUSSION OF STACK USAGE
 | 
						|
 | 
						|
       When  you  call  pcre2_match(),  it  makes  use of an internal function
 | 
						|
       called match(). This calls itself recursively at branch points  in  the
 | 
						|
       pattern,  in  order  to  remember the state of the match so that it can
 | 
						|
       back up and try a different alternative after a  failure.  As  matching
 | 
						|
       proceeds  deeper  and deeper into the tree of possibilities, the recur-
 | 
						|
       sion depth increases. The match() function is also called in other cir-
 | 
						|
       cumstances,  for  example,  whenever  a  parenthesized  sub-pattern  is
 | 
						|
       entered, and in certain cases of repetition.
 | 
						|
 | 
						|
       Not all calls of match() increase the recursion depth; for an item such
 | 
						|
       as  a* it may be called several times at the same level, after matching
 | 
						|
       different numbers of a's. Furthermore, in a number of cases  where  the
 | 
						|
       result  of  the  recursive call would immediately be passed back as the
 | 
						|
       result of the current call (a "tail recursion"), the function  is  just
 | 
						|
       restarted instead.
 | 
						|
 | 
						|
       Each  time the internal match() function is called recursively, it uses
 | 
						|
       memory from the process stack. For certain kinds of pattern  and  data,
 | 
						|
       very  large  amounts of stack may be needed, despite the recognition of
 | 
						|
       "tail recursion". Note that if  PCRE2  is  compiled  with  the  -fsani-
 | 
						|
       tize=address  option  of  the  GCC compiler, the stack requirements are
 | 
						|
       greatly increased.
 | 
						|
 | 
						|
       The above comments apply when pcre2_match() is run in its normal inter-
 | 
						|
       pretive manner. If the compiled pattern was processed by pcre2_jit_com-
 | 
						|
       pile(), and just-in-time compiling  was  successful,  and  the  options
 | 
						|
       passed  to  pcre2_match()  were  not incompatible, the matching process
 | 
						|
       uses the JIT-compiled code instead of the  match()  function.  In  this
 | 
						|
       case, the memory requirements are handled entirely differently. See the
 | 
						|
       pcre2jit documentation for details.
 | 
						|
 | 
						|
       The  pcre2_dfa_match()  function  operates  in  a  different   way   to
 | 
						|
       pcre2_match(),  and uses recursion only when there is a regular expres-
 | 
						|
       sion recursion or subroutine call in the  pattern.  This  includes  the
 | 
						|
       processing  of assertion and "once-only" subpatterns, which are handled
 | 
						|
       like subroutine calls.  Normally, these are never very  deep,  and  the
 | 
						|
       limit  on  the  complexity  of  pcre2_dfa_match()  is controlled by the
 | 
						|
       amount of workspace it is given.  However, it is possible to write pat-
 | 
						|
       terns  with  runaway  infinite  recursions;  such  patterns  will cause
 | 
						|
       pcre2_dfa_match() to run out of stack unless a limit  is  applied  (see
 | 
						|
       below).
 | 
						|
 | 
						|
       The   comments   in   the   next   three   sections  do  not  apply  to
 | 
						|
       pcre2_dfa_match(); they are relevant only for pcre2_match() without the
 | 
						|
       JIT optimization.
 | 
						|
 | 
						|
   Reducing pcre2_match()'s stack usage
 | 
						|
 | 
						|
       You  can often reduce the amount of recursion, and therefore the amount
 | 
						|
       of stack used, by modifying the pattern that  is  being  matched.  Con-
 | 
						|
       sider, for example, this pattern:
 | 
						|
 | 
						|
         ([^<]|<(?!inet))+
 | 
						|
 | 
						|
       It  matches  from wherever it starts until it encounters "<inet" or the
 | 
						|
       end of the data, and is the kind of pattern that  might  be  used  when
 | 
						|
       processing an XML file. Each iteration of the outer parentheses matches
 | 
						|
       either one character that is not "<" or a "<" that is not  followed  by
 | 
						|
       "inet".  However,  each  time  a  parenthesis is processed, a recursion
 | 
						|
       occurs, so this formulation uses a stack frame for each matched charac-
 | 
						|
       ter.  For  a long string, a lot of stack is required. Consider now this
 | 
						|
       rewritten pattern, which matches exactly the same strings:
 | 
						|
 | 
						|
         ([^<]++|<(?!inet))+
 | 
						|
 | 
						|
       This uses very much less stack, because runs of characters that do  not
 | 
						|
       contain  "<" are "swallowed" in one item inside the parentheses. Recur-
 | 
						|
       sion happens only when a "<" character that is not followed  by  "inet"
 | 
						|
       is  encountered  (and  we assume this is relatively rare). A possessive
 | 
						|
       quantifier is used to stop any backtracking into the  runs  of  non-"<"
 | 
						|
       characters, but that is not related to stack usage.
 | 
						|
 | 
						|
       This  example shows that one way of avoiding stack problems when match-
 | 
						|
       ing long subject strings is to write repeated parenthesized subpatterns
 | 
						|
       to match more than one character whenever possible.
 | 
						|
 | 
						|
   Compiling PCRE2 to use heap instead of stack for pcre2_match()
 | 
						|
 | 
						|
       In  environments  where  stack memory is constrained, you might want to
 | 
						|
       compile PCRE2 to use heap memory instead of stack for remembering back-
 | 
						|
       up points when pcre2_match() is running. This makes it run more slowly,
 | 
						|
       however. Details of how to do this are given in the pcre2build documen-
 | 
						|
       tation.  When built in this way, instead of using the stack, PCRE2 gets
 | 
						|
       memory for remembering backup points from the  heap.  By  default,  the
 | 
						|
       memory is obtained by calling the system malloc() function, but you can
 | 
						|
       arrange to supply your own memory management function. For details, see
 | 
						|
       the section entitled "The match context" in the pcre2api documentation.
 | 
						|
       Since the block sizes are always the same, it may be possible to imple-
 | 
						|
       ment  a customized memory handler that is more efficient than the stan-
 | 
						|
       dard function. The memory blocks obtained for this purpose are retained
 | 
						|
       and  re-used  if  possible while pcre2_match() is running. They are all
 | 
						|
       freed just before it exits.
 | 
						|
 | 
						|
   Limiting pcre2_match()'s stack usage
 | 
						|
 | 
						|
       You can set limits on the number of times the internal match() function
 | 
						|
       is  called,  both  in  total  and  recursively. If a limit is exceeded,
 | 
						|
       pcre2_match() returns an error code.  Setting  suitable  limits  should
 | 
						|
       prevent  it from running out of stack. The default values of the limits
 | 
						|
       are very large, and unlikely ever to operate. They can be changed  when
 | 
						|
       PCRE2  is built, and they can also be set when pcre2_match() is called.
 | 
						|
       For details of these interfaces, see the pcre2build  documentation  and
 | 
						|
       the section entitled "The match context" in the pcre2api documentation.
 | 
						|
 | 
						|
       As a very rough rule of thumb, you should reckon on about 500 bytes per
 | 
						|
       recursion. Thus, if you want to limit your  stack  usage  to  8Mb,  you
 | 
						|
       should  set  the  limit at 16000 recursions. A 64Mb stack, on the other
 | 
						|
       hand, can support around 128000 recursions.
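
       The arithmetic behind this rule of thumb is a single division, shown
       here as a small sketch. The function name recursions_for_stack() is
       hypothetical, and the 500-byte frame size is only the rough estimate
       quoted above.

```c
#include <stddef.h>

/* Estimate how many match() recursions fit in stack_bytes of stack when
   each recursion is assumed to need frame_bytes. */
size_t recursions_for_stack(size_t stack_bytes, size_t frame_bytes)
{
  if (frame_bytes == 0) return 0;   /* avoid division by zero */
  return stack_bytes / frame_bytes;
}
```

       With 500 bytes per recursion, 8*1024*1024 bytes gives 16777, which
       the text rounds down to 16000, and 64*1024*1024 bytes gives 134217,
       rounded down to 128000. Rounding down leaves some stack for the rest
       of the program.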
 | 
						|
 | 
						|
       The pcre2test test program has a modifier called  "find_limits"  which,
 | 
						|
       if  applied  to  a  subject line, causes it to find the smallest limits
 | 
						|
       that allow a pattern to match. This is done by calling  pcre2_match()
 | 
						|
       repeatedly with different limits.
 | 
						|
 | 
						|
   Limiting pcre2_dfa_match()'s stack usage
 | 
						|
 | 
						|
       The recursion limit, as described above for pcre2_match(), also applies
 | 
						|
       to pcre2_dfa_match(), whose use of recursive function calls for  recur-
 | 
						|
       sions in the pattern can lead to runaway stack usage. The non-recursive
 | 
						|
       match limit is not relevant for DFA matching, and is ignored.
 | 
						|
 | 
						|
   Changing stack size in Unix-like systems
 | 
						|
 | 
						|
       In Unix-like environments, there is not often a problem with the  stack
 | 
						|
       unless  very  long  strings  are  involved, though the default limit on
 | 
						|
       stack size varies from system to system. Values from 8Mb  to  64Mb  are
 | 
						|
       common. You can find your default limit by running the command:
 | 
						|
 | 
						|
         ulimit -s
 | 
						|
 | 
						|
       Unfortunately,  the  effect  of  running out of stack is often SIGSEGV,
 | 
						|
       though sometimes a more explicit error message is given. You  can  nor-
 | 
						|
       mally increase the limit on stack size by code such as this:
 | 
						|
 | 
						|
         #include <sys/resource.h>

         struct rlimit rlim;
         getrlimit(RLIMIT_STACK, &rlim);     /* read soft and hard limits */
         rlim.rlim_cur = 100*1024*1024;      /* request a 100Mb soft limit */
         setrlimit(RLIMIT_STACK, &rlim);
 | 
						|
 | 
						|
       This  reads  the current limits (soft and hard) using getrlimit(), then
 | 
						|
       attempts to increase the soft limit to  100Mb  using  setrlimit().  You
 | 
						|
       must do this before calling pcre2_match().
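
       A slightly more defensive variant is sketched below. It checks the
       result of both system calls, leaves the limit alone when it is
       already large enough, and never asks for more than the hard limit.
       The name ensure_stack_limit() is hypothetical.

```c
#include <sys/resource.h>

/* Make sure the soft stack limit is at least `wanted` bytes, without going
   above the hard limit. Returns 0 on success and -1 if either system call
   fails. */
int ensure_stack_limit(rlim_t wanted)
{
  struct rlimit rlim;

  if (getrlimit(RLIMIT_STACK, &rlim) != 0) return -1;

  /* Nothing to do if the soft limit is already large enough. */
  if (rlim.rlim_cur == RLIM_INFINITY || rlim.rlim_cur >= wanted) return 0;

  /* The soft limit may not exceed the hard limit. */
  if (rlim.rlim_max != RLIM_INFINITY && wanted > rlim.rlim_max)
    wanted = rlim.rlim_max;

  rlim.rlim_cur = wanted;
  return (setrlimit(RLIMIT_STACK, &rlim) == 0) ? 0 : -1;
}
```

       A program would call, for example, ensure_stack_limit(100*1024*1024)
       once during start-up. When the hard limit is lower than the request,
       the soft limit is raised only as far as the hard limit, so a zero
       return does not guarantee the full amount.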
 | 
						|
 | 
						|
   Changing stack size in Mac OS X

       Using setrlimit(), as described above, should also work on Mac OS X. It
       is also possible to set a stack size when linking a program. There is a
       discussion   about   stack  sizes  in  Mac  OS  X  at  this  web  site:
       http://developer.apple.com/qa/qa2005/qa1419.html.


AUTHOR

       Philip Hazel
       University Computing Service
       Cambridge, England.


REVISION

       Last updated: 23 December 2016
       Copyright (c) 1997-2016 University of Cambridge.
------------------------------------------------------------------------------
 | 
						|
 | 
						|
PCRE2SYNTAX(3)             Library Functions Manual             PCRE2SYNTAX(3)



NAME
       PCRE2 - Perl-compatible regular expressions (revised API)

PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY

       The  full syntax and semantics of the regular expressions that are sup-
       ported by PCRE2 are described in the pcre2pattern  documentation.  This
       document contains a quick-reference summary of the syntax.


QUOTING

         \x         where x is non-alphanumeric is a literal x
         \Q...\E    treat enclosed characters as literal


ESCAPED CHARACTERS

       This table applies to ASCII and Unicode environments.

         \a         alarm, that is, the BEL character (hex 07)
         \cx        "control-x", where x is any ASCII printing character
         \e         escape (hex 1B)
         \f         form feed (hex 0C)
         \n         newline (hex 0A)
         \r         carriage return (hex 0D)
         \t         tab (hex 09)
         \0dd       character with octal code 0dd
         \ddd       character with octal code ddd, or backreference
         \o{ddd..}  character with octal code ddd..
         \U         "U" if PCRE2_ALT_BSUX is set (otherwise is an error)
         \uhhhh     character with hex code hhhh (if PCRE2_ALT_BSUX is set)
         \xhh       character with hex code hh
         \x{hhh..}  character with hex code hhh..

       Note that \0dd is always an octal code. The treatment of backslash fol-
       lowed by a non-zero digit is complicated; for details see  the  section
       "Non-printing  characters"  in  the  pcre2pattern  documentation, where
       details of escape processing in EBCDIC environments are also given.

       When \x is not followed by {, from zero to two hexadecimal  digits  are
       read, but if PCRE2_ALT_BSUX is set, \x must be followed by two hexadec-
       imal digits to be recognized as  a  hexadecimal  escape;  otherwise  it
       matches  a literal "x".  Likewise, if \u (in ALT_BSUX mode) is not fol-
       lowed by four hexadecimal digits, it matches a literal "u".
 | 
						|
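       The rule for unbraced \x can be illustrated with a small stand-alone
       parser. This is only a sketch of the behaviour described above, not
       PCRE2's own code. The function names are hypothetical, and treating \x
       with no digits as character zero is an assumption drawn from "from
       zero to two hexadecimal digits are read":

```c
#include <ctype.h>
#include <stddef.h>

/* Numeric value of a character already known to be a hexadecimal digit. */
static unsigned int hex_digit_value(char c)
{
    if (c >= '0' && c <= '9') return (unsigned int)(c - '0');
    if (c >= 'a' && c <= 'f') return (unsigned int)(c - 'a' + 10);
    return (unsigned int)(c - 'A' + 10);
}

/* Interpret the text that follows an unbraced "\x". Stores the resulting
   character in *value and returns how many characters after the "x" were
   consumed. `alt_bsux` is non-zero to mimic PCRE2_ALT_BSUX.
   (Hypothetical helper; an illustration of the rule, not library code.) */
size_t parse_unbraced_x(const char *after_x, int alt_bsux, unsigned int *value)
{
    size_t n = 0;
    unsigned int v = 0;

    /* Read at most two hexadecimal digits. */
    while (n < 2 && isxdigit((unsigned char)after_x[n])) {
        v = v * 16 + hex_digit_value(after_x[n]);
        n++;
    }

    /* In ALT_BSUX mode exactly two digits are required; otherwise the
       sequence is just a literal "x" and nothing further is consumed. */
    if (alt_bsux && n < 2) {
        *value = 'x';
        return 0;
    }

    *value = v;   /* assumption: zero digits yields character 0 */
    return n;
}
```

       For the input "4z", the default mode consumes one digit and yields
       character 4, whereas the ALT_BSUX mode consumes nothing and yields a
       literal "x".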
 | 
						|
 | 
						|
CHARACTER TYPES

         .          any character except newline;
                      in dotall mode, any character whatsoever
         \C         one code unit, even in UTF mode (best avoided)
         \d         a decimal digit
         \D         a character that is not a decimal digit
         \h         a horizontal white space character
         \H         a character that is not a horizontal white space character
         \N         a character that is not a newline
         \p{xx}     a character with the xx property
         \P{xx}     a character without the xx property
         \R         a newline sequence
         \s         a white space character
         \S         a character that is not a white space character
         \v         a vertical white space character
         \V         a character that is not a vertical white space character
         \w         a "word" character
         \W         a "non-word" character
         \X         a Unicode extended grapheme cluster

       \C is dangerous because it may leave the current matching point in  the
       middle of a UTF-8 or UTF-16 character. The application can lock out the
       use of \C by setting the PCRE2_NEVER_BACKSLASH_C  option.  It  is  also
       possible to build PCRE2 with the use of \C permanently disabled.

       By  default,  \d, \s, and \w match only ASCII characters, even in UTF-8
       mode or in the 16-bit and 32-bit libraries. However, if locale-specific
       matching  is  happening,  \s and \w may also match characters with code
       points in the range 128-255. If the PCRE2_UCP option is set, the behav-
       iour of these escape sequences is changed to use Unicode properties and
       they match many more characters.
 | 
						|
 | 
						|
GENERAL CATEGORY PROPERTIES FOR \p and \P

         C          Other
         Cc         Control
         Cf         Format
         Cn         Unassigned
         Co         Private use
         Cs         Surrogate

         L          Letter
         Ll         Lower case letter
         Lm         Modifier letter
         Lo         Other letter
         Lt         Title case letter
         Lu         Upper case letter
         L&         Ll, Lu, or Lt

         M          Mark
         Mc         Spacing mark
         Me         Enclosing mark
         Mn         Non-spacing mark

         N          Number
         Nd         Decimal number
         Nl         Letter number
         No         Other number

         P          Punctuation
         Pc         Connector punctuation
         Pd         Dash punctuation
         Pe         Close punctuation
         Pf         Final punctuation
         Pi         Initial punctuation
         Po         Other punctuation
         Ps         Open punctuation

         S          Symbol
         Sc         Currency symbol
         Sk         Modifier symbol
         Sm         Mathematical symbol
         So         Other symbol

         Z          Separator
         Zl         Line separator
         Zp         Paragraph separator
         Zs         Space separator
 | 
						|
 | 
						|
PCRE2 SPECIAL CATEGORY PROPERTIES FOR \p and \P

         Xan        Alphanumeric: union of properties L and N
         Xps        POSIX space: property Z or tab, NL, VT, FF, CR
         Xsp        Perl space: property Z or tab, NL, VT, FF, CR
         Xuc        Universally-named character: one that can be
                      represented by a Universal Character Name
         Xwd        Perl word: property Xan or underscore

       Perl and POSIX space are now the same. Perl added VT to its space char-
       acter set at release 5.18.
 | 
						|
 | 
						|
SCRIPT NAMES FOR \p AND \P

       Ahom,   Anatolian_Hieroglyphs,  Arabic,  Armenian,  Avestan,  Balinese,
       Bamum, Bassa_Vah, Batak, Bengali, Bopomofo, Brahmi, Braille,  Buginese,
       Buhid,  Canadian_Aboriginal,  Carian, Caucasian_Albanian, Chakma, Cham,
       Cherokee,  Common,  Coptic,  Cuneiform,  Cypriot,  Cyrillic,   Deseret,
       Devanagari,  Duployan,  Egyptian_Hieroglyphs,  Elbasan, Ethiopic, Geor-
       gian, Glagolitic, Gothic,  Grantha,  Greek,  Gujarati,  Gurmukhi,  Han,
       Hangul, Hanunoo, Hatran, Hebrew, Hiragana, Imperial_Aramaic, Inherited,
       Inscriptional_Pahlavi, Inscriptional_Parthian, Javanese,  Kaithi,  Kan-
       nada,  Katakana,  Kayah_Li,  Kharoshthi, Khmer, Khojki, Khudawadi, Lao,
       Latin, Lepcha, Limbu, Linear_A, Linear_B, Lisu, Lycian,  Lydian,  Maha-
       jani,  Malayalam,  Mandaic,  Manichaean,  Meetei_Mayek,  Mende_Kikakui,
       Meroitic_Cursive, Meroitic_Hieroglyphs,  Miao,  Modi,  Mongolian,  Mro,
       Multani,   Myanmar,   Nabataean,  New_Tai_Lue,  Nko,  Ogham,  Ol_Chiki,
       Old_Hungarian, Old_Italic, Old_North_Arabian, Old_Permic,  Old_Persian,
       Old_South_Arabian, Old_Turkic, Oriya, Osmanya, Pahawh_Hmong, Palmyrene,
       Pau_Cin_Hau,  Phags_Pa,  Phoenician,  Psalter_Pahlavi,  Rejang,  Runic,
       Samaritan, Saurashtra, Sharada, Shavian, Siddham, SignWriting, Sinhala,
       Sora_Sompeng,  Sundanese,  Syloti_Nagri,  Syriac,  Tagalog,   Tagbanwa,
       Tai_Le,   Tai_Tham,  Tai_Viet,  Takri,  Tamil,  Telugu,  Thaana,  Thai,
       Tibetan, Tifinagh, Tirhuta, Ugaritic, Vai, Warang_Citi, Yi.
 | 
						|
 | 
						|
CHARACTER CLASSES

         [...]       positive character class
         [^...]      negative character class
         [x-y]       range (can be used for hex characters)
         [[:xxx:]]   positive POSIX named set
         [[:^xxx:]]  negative POSIX named set

         alnum       alphanumeric
         alpha       alphabetic
         ascii       0-127
         blank       space or tab
         cntrl       control character
         digit       decimal digit
         graph       printing, excluding space
         lower       lower case letter
         print       printing, including space
         punct       printing, excluding alphanumeric
         space       white space
         upper       upper case letter
         word        same as \w
         xdigit      hexadecimal digit

       In PCRE2, POSIX character set names recognize only ASCII characters  by
       default,  but  some of them use Unicode properties if PCRE2_UCP is set.
       You can use \Q...\E inside a character class.
 | 
						|
 | 
						|
QUANTIFIERS

         ?           0 or 1, greedy
         ?+          0 or 1, possessive
         ??          0 or 1, lazy
         *           0 or more, greedy
         *+          0 or more, possessive
         *?          0 or more, lazy
         +           1 or more, greedy
         ++          1 or more, possessive
         +?          1 or more, lazy
         {n}         exactly n
         {n,m}       at least n, no more than m, greedy
         {n,m}+      at least n, no more than m, possessive
         {n,m}?      at least n, no more than m, lazy
         {n,}        n or more, greedy
         {n,}+       n or more, possessive
         {n,}?       n or more, lazy
 | 
						|
 | 
						|
ANCHORS AND SIMPLE ASSERTIONS

         \b          word boundary
         \B          not a word boundary
         ^           start of subject
                       also after an internal newline in multiline mode
                       (after any newline if PCRE2_ALT_CIRCUMFLEX is set)
         \A          start of subject
         $           end of subject
                       also before newline at end of subject
                       also before internal newline in multiline mode
         \Z          end of subject
                       also before newline at end of subject
         \z          end of subject
         \G          first matching position in subject


MATCH POINT RESET

         \K          reset start of match

       \K is honoured in positive assertions, but ignored in negative ones.


ALTERNATION

         expr|expr|expr...
 | 
						|
 | 
						|
CAPTURING

         (...)           capturing group
         (?<name>...)    named capturing group (Perl)
         (?'name'...)    named capturing group (Perl)
         (?P<name>...)   named capturing group (Python)
         (?:...)         non-capturing group
         (?|...)         non-capturing group; reset group numbers for
                          capturing groups in each alternative


ATOMIC GROUPS

         (?>...)         atomic, non-capturing group


COMMENT

         (?#....)        comment (not nestable)
 | 
						|
 | 
						|
OPTION SETTING

         (?i)            caseless
         (?J)            allow duplicate names
         (?m)            multiline
         (?s)            single line (dotall)
         (?U)            default ungreedy (lazy)
         (?x)            extended (ignore white space)
         (?-...)         unset option(s)

       The following are recognized only at the very start  of  a  pattern  or
       after  one  of the newline or \R options with similar syntax. More than
       one of them may appear.

         (*LIMIT_MATCH=d) set the match limit to d (decimal number)
         (*LIMIT_RECURSION=d) set the recursion limit to d (decimal number)
         (*NOTEMPTY)     set PCRE2_NOTEMPTY when matching
         (*NOTEMPTY_ATSTART) set PCRE2_NOTEMPTY_ATSTART when matching
         (*NO_AUTO_POSSESS) no auto-possessification (PCRE2_NO_AUTO_POSSESS)
         (*NO_DOTSTAR_ANCHOR) no .* anchoring (PCRE2_NO_DOTSTAR_ANCHOR)
         (*NO_JIT)       disable JIT optimization
         (*NO_START_OPT) no start-match optimization (PCRE2_NO_START_OPTIMIZE)
         (*UTF)          set appropriate UTF mode for the library in use
         (*UCP)          set PCRE2_UCP (use Unicode properties for \d etc)

       Note that LIMIT_MATCH and LIMIT_RECURSION can only reduce the value  of
       the limits set by the caller of pcre2_match() or pcre2_dfa_match(), not
       increase them. The application can lock  out  the  use  of  (*UTF)  and
       (*UCP)  by  setting  the  PCRE2_NEVER_UTF  or  PCRE2_NEVER_UCP options,
       respectively, at compile time.
 | 
						|
 | 
						|
NEWLINE CONVENTION

       These are recognized only at the very start of  the  pattern  or  after
       option settings with a similar syntax.

         (*CR)           carriage return only
         (*LF)           linefeed only
         (*CRLF)         carriage return followed by linefeed
         (*ANYCRLF)      all three of the above
         (*ANY)          any Unicode newline sequence


WHAT \R MATCHES

       These  are  recognized  only  at the very start of the pattern or after
       option setting with a similar syntax.

         (*BSR_ANYCRLF)  CR, LF, or CRLF
         (*BSR_UNICODE)  any Unicode newline sequence
 | 
						|
 | 
						|
LOOKAHEAD AND LOOKBEHIND ASSERTIONS

         (?=...)         positive look ahead
         (?!...)         negative look ahead
         (?<=...)        positive look behind
         (?<!...)        negative look behind

       Each top-level branch of a look behind must be of a fixed length.
 | 
						|
 | 
						|
 | 
						|
BACKREFERENCES

         \n              reference by number (can be ambiguous)
         \gn             reference by number
         \g{n}           reference by number
         \g+n            relative reference by number (PCRE2 extension)
         \g-n            relative reference by number
         \g{+n}          relative reference by number (PCRE2 extension)
         \g{-n}          relative reference by number
         \k<name>        reference by name (Perl)
         \k'name'        reference by name (Perl)
         \g{name}        reference by name (Perl)
         \k{name}        reference by name (.NET)
         (?P=name)       reference by name (Python)


SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)

         (?R)            recurse whole pattern
         (?n)            call subpattern by absolute number
         (?+n)           call subpattern by relative number
         (?-n)           call subpattern by relative number
         (?&name)        call subpattern by name (Perl)
         (?P>name)       call subpattern by name (Python)
         \g<name>        call subpattern by name (Oniguruma)
         \g'name'        call subpattern by name (Oniguruma)
         \g<n>           call subpattern by absolute number (Oniguruma)
         \g'n'           call subpattern by absolute number (Oniguruma)
         \g<+n>          call subpattern by relative number (PCRE2 extension)
         \g'+n'          call subpattern by relative number (PCRE2 extension)
         \g<-n>          call subpattern by relative number (PCRE2 extension)
         \g'-n'          call subpattern by relative number (PCRE2 extension)
 | 
						|
 | 
						|
CONDITIONAL PATTERNS

         (?(condition)yes-pattern)
         (?(condition)yes-pattern|no-pattern)

         (?(n)               absolute reference condition
         (?(+n)              relative reference condition
         (?(-n)              relative reference condition
         (?(<name>)          named reference condition (Perl)
         (?('name')          named reference condition (Perl)
         (?(name)            named reference condition (PCRE2, deprecated)
         (?(R)               overall recursion condition
         (?(Rn)              specific numbered group recursion condition
         (?(R&name)          specific named group recursion condition
         (?(DEFINE)          define subpattern for reference
         (?(VERSION[>]=n.m)  test PCRE2 version
         (?(assert)          assertion condition

       Note the ambiguity of (?(R) and (?(Rn) which might be  named  reference
       conditions  or  recursion  tests.  Such a condition is interpreted as a
       reference condition if the relevant named group exists.
 | 
						|
 | 
						|
BACKTRACKING CONTROL

       The following act immediately they are reached:

         (*ACCEPT)       force successful match
         (*FAIL)         force backtrack; synonym (*F)
         (*MARK:NAME)    set name to be passed back; synonym (*:NAME)

       The following act only when a subsequent match failure causes  a  back-
       track to reach them. They all force a match failure, but they differ in
       what happens afterwards. Those that advance the start-of-match point do
       so only if the pattern is not anchored.

         (*COMMIT)       overall failure, no advance of starting point
         (*PRUNE)        advance to next starting character
         (*PRUNE:NAME)   equivalent to (*MARK:NAME)(*PRUNE)
         (*SKIP)         advance to current matching position
         (*SKIP:NAME)    advance to position corresponding to an earlier
                         (*MARK:NAME); if not found, the (*SKIP) is ignored
         (*THEN)         local failure, backtrack to next alternation
         (*THEN:NAME)    equivalent to (*MARK:NAME)(*THEN)
 | 
						|
 | 
						|
CALLOUTS

         (?C)            callout (assumed number 0)
         (?Cn)           callout with numerical data n
         (?C"text")      callout with string data

       The allowed string delimiters are ` ' " ^ % # $ (which are the same for
       the start and the end), and the starting delimiter { matched  with  the
       ending  delimiter  }. To encode the ending delimiter within the string,
       double it.
 | 
						|
 | 
						|
SEE ALSO

       pcre2pattern(3),   pcre2api(3),   pcre2callout(3),    pcre2matching(3),
       pcre2(3).


AUTHOR

       Philip Hazel
       University Computing Service
       Cambridge, England.


REVISION

       Last updated: 23 December 2016
       Copyright (c) 1997-2016 University of Cambridge.
------------------------------------------------------------------------------
 | 
						|
 | 
						|
PCRE2UNICODE(3)            Library Functions Manual            PCRE2UNICODE(3)



NAME
       PCRE2 - Perl-compatible regular expressions (revised API)

UNICODE AND UTF SUPPORT

       When PCRE2 is built with Unicode support (which is the default), it has
       knowledge of Unicode character properties and can process text  strings
       in  UTF-8, UTF-16, or UTF-32 format (depending on the code unit width).
       However, by default, PCRE2 assumes that one code unit is one character.
       To  process  a  pattern  as a UTF string, where a character may require
       more than one  code  unit,  you  must  call  pcre2_compile()  with  the
       PCRE2_UTF  option  flag,  or  the  pattern must start with the sequence
       (*UTF). When either of these is the case, both the pattern and any sub-
       ject  strings  that  are  matched against it are treated as UTF strings
       instead of strings of individual one-code-unit characters.
 | 
						|
 | 
						|
       If you do not need Unicode support you can build PCRE2 without  it,  in
       which case the library will be smaller.


UNICODE PROPERTY SUPPORT

       When  PCRE2 is built with Unicode support, the escape sequences \p{..},
       \P{..}, and \X can be used. The Unicode properties that can  be  tested
       are  limited to the general category properties such as Lu for an upper
       case letter or Nd for a decimal number, the Unicode script  names  such
       as Arabic or Han, and the derived properties Any and L&. Full lists are
       given in the pcre2pattern and pcre2syntax documentation. Only the short
       names  for  properties are supported. For example, \p{L} matches a let-
       ter. Its Perl synonym, \p{Letter}, is not supported.   Furthermore,  in
       Perl,  many properties may optionally be prefixed by "Is", for compati-
       bility with Perl 5.6. PCRE2 does not support this.
 | 
						|
 | 
						|
WIDE CHARACTERS AND UTF MODES

       Codepoints less than 256 can be specified in patterns by either  braced
       or unbraced hexadecimal escape sequences (for example, \x{b3} or \xb3).
       Larger values have to use braced sequences. Unbraced octal code  points
       up to \777 are also recognized; larger ones can be coded using \o{...}.

       In  UTF modes, repeat quantifiers apply to complete UTF characters, not
       to individual code units.

       In UTF modes, the dot metacharacter matches one UTF  character  instead
       of a single code unit.

       The escape sequence \C can be used to match a single code unit in a UTF
       mode, but its use can lead to some strange effects because it breaks up
       multi-unit  characters  (see  the description of \C in the pcre2pattern
       documentation).

       The use of \C is not supported by  the  alternative  matching  function
       pcre2_dfa_match() when in UTF-8 or UTF-16 mode, that is, when a charac-
       ter may consist of more than one code unit. The  use  of  \C  in  these
       modes  provokes a match-time error. Also, the JIT optimization does not
       support \C in these modes. If JIT optimization is requested for a UTF-8
       or  UTF-16  pattern  that contains \C, it will not succeed, and so when
       pcre2_match() is called, the matching will be carried out by the normal
       interpretive function.

       The character escapes \b, \B, \d, \D, \s, \S, \w, and \W correctly test
       characters of any code value, but,  by  default,  the  characters  that
       PCRE2  recognizes as digits, spaces, or word characters remain the same
       set as in non-UTF mode, all  with  code  points  less  than  256.  This
       remains  true  even  when  PCRE2  is  built to include Unicode support,
       because to do otherwise would slow down matching in many common  cases.
       Note  that  this also applies to \b and \B, because they are defined in
       terms of \w and \W. If you want to test for  a  wider  sense  of,  say,
       "digit",  you  can  use explicit Unicode property tests such as \p{Nd}.
       Alternatively, if you set the PCRE2_UCP option, the way that the  char-
       acter  escapes  work  is changed so that Unicode properties are used to
       determine which characters match. There are more details in the section
       on generic character types in the pcre2pattern documentation.
 | 
						|
 | 
						|
       Similarly,  characters that match the POSIX named character classes are
       all low-valued characters, unless the PCRE2_UCP option is set.

       However, the special  horizontal  and  vertical  white  space  matching
       escapes (\h, \H, \v, and \V) do match all the appropriate Unicode char-
       acters, whether or not PCRE2_UCP is set.

       Case-insensitive matching in UTF mode makes use of Unicode  properties.
       A  few  Unicode characters such as Greek sigma have more than two code-
       points that are case-equivalent, and these are treated as such.
 | 
						|
 | 
						|
VALIDITY OF UTF STRINGS

       When the PCRE2_UTF option is set, the strings passed  as  patterns  and
       subjects are (by default) checked for validity on entry to the relevant
       functions.  If an invalid UTF string is passed, a negative  error  code
       is  returned.  The  code  unit offset to the offending character can be
       extracted from the match data block by  calling  pcre2_get_startchar(),
       which is used for this purpose after a UTF error.
 | 
						|
 | 
						|
       UTF-16 and UTF-32 strings can indicate their endianness  by  a  special
       code  known as a byte-order mark (BOM). The PCRE2 functions do not han-
       dle this, expecting strings to be in host byte order.
 | 
						|
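       Because PCRE2 does not interpret a BOM, an application that reads
       UTF-16 data from an external source has to deal with it before calling
       the library. The following is one possible sketch; the function name
       utf16_to_host_order() is hypothetical, and the input is assumed to be
       an array of 16-bit values read without any byte swapping:

```c
#include <stddef.h>
#include <stdint.h>

/* If `units` starts with a BOM, convert the remaining code units to host
   byte order in place, decrement *length, and return a pointer just past
   the BOM. Without a BOM the data is assumed to be in host order already.
   (Hypothetical helper; not part of PCRE2.) */
uint16_t *utf16_to_host_order(uint16_t *units, size_t *length)
{
    if (*length == 0)
        return units;

    if (units[0] == 0xFEFF) {           /* BOM reads correctly: host order */
        (*length)--;
        return units + 1;
    }

    if (units[0] == 0xFFFE) {           /* BOM reads byte-swapped */
        for (size_t i = 1; i < *length; i++)
            units[i] = (uint16_t)((units[i] << 8) | (units[i] >> 8));
        (*length)--;
        return units + 1;
    }

    return units;                       /* no BOM present */
}
```

       The returned pointer and updated length can then be passed to the
       16-bit library functions as a subject in host byte order.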
 | 
						|
       A UTF string is checked before any other processing takes place. In the
       case of pcre2_match()  and  pcre2_dfa_match()  calls  with  a  non-zero
       starting  offset, the check is applied only to that part of the subject
       that could be inspected during matching, and there is a check that  the
       starting  offset points to the first code unit of a character or to the
       end of the subject. If there are no lookbehind assertions in  the  pat-
       tern,  the check starts at the starting offset. Otherwise, it starts at
       the length of the longest lookbehind before the starting offset, or  at
       the  start  of the subject if there are not that many characters before
       the starting offset. Note that the sequences \b and \B are  one-charac-
       ter lookbehinds.

       In  addition  to checking the format of the string, there is a check to
       ensure that all code points lie in the range U+0 to U+10FFFF, excluding
       the  surrogate  area. The so-called "non-character" code points are not
       excluded because Unicode corrigendum #9 makes it clear that they should
       not be.

       Characters  in  the "Surrogate Area" of Unicode are reserved for use by
       UTF-16, where they are used in pairs to encode code points with  values
       greater  than  0xFFFF. The code points that are encoded by UTF-16 pairs
       are available independently in the  UTF-8  and  UTF-32  encodings.  (In
       other  words,  the  whole  surrogate  thing is a fudge for UTF-16 which
       unfortunately messes up UTF-8 and UTF-32.)
 | 
						|
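       The pairing arithmetic used by UTF-16 can be written out as follows. The
       function names utf16_combine() and utf16_split()  are  illustrative  and
       are not part of PCRE2; both assume their arguments are already known to
       be in range.

```c
#include <stdint.h>

/* Combine a high surrogate (0xD800-0xDBFF) and a low surrogate
   (0xDC00-0xDFFF) into the code point they encode. */
static uint32_t utf16_combine(uint16_t high, uint16_t low)
{
    return 0x10000u + ((uint32_t)(high - 0xD800u) << 10)
                    + (uint32_t)(low - 0xDC00u);
}

/* Split a code point in the range 0x10000 to 0x10FFFF into a pair. */
static void utf16_split(uint32_t cp, uint16_t *high, uint16_t *low)
{
    cp -= 0x10000u;
    *high = (uint16_t)(0xD800u + (cp >> 10));     /* top 10 bits */
    *low  = (uint16_t)(0xDC00u + (cp & 0x3FFu));  /* bottom 10 bits */
}
```

       For example, the pair 0xD83D, 0xDE00 encodes the code point 0x1F600.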
 | 
						|
       In some situations, you may already know that your strings  are  valid,
       and  therefore  want  to  skip these checks in order to improve perfor-
       mance, for example in the case of a long subject string that  is  being
       scanned  repeatedly.   If you set the PCRE2_NO_UTF_CHECK option at com-
       pile time or at match time, PCRE2 assumes that the pattern  or  subject
       it is given (respectively) contains only valid UTF code unit sequences.

       Passing  PCRE2_NO_UTF_CHECK  to pcre2_compile() just disables the check
       for the pattern; it does not also apply to subject strings. If you want
       to  disable the check for a subject string you must pass this option to
       pcre2_match() or pcre2_dfa_match().

       If you pass an invalid UTF string when PCRE2_NO_UTF_CHECK is  set,  the
       result is undefined and your program may crash or loop indefinitely.

   Errors in UTF-8 strings

       The following negative error codes are given for invalid UTF-8 strings:

         PCRE2_ERROR_UTF8_ERR1
         PCRE2_ERROR_UTF8_ERR2
         PCRE2_ERROR_UTF8_ERR3
         PCRE2_ERROR_UTF8_ERR4
         PCRE2_ERROR_UTF8_ERR5

       The  string  ends  with a truncated UTF-8 character; the code specifies
       how many bytes are missing (1 to 5). Although RFC 3629 restricts  UTF-8
       characters  to  be  no longer than 4 bytes, the encoding scheme (origi-
       nally defined by RFC 2279) allows for  up  to  6  bytes,  and  this  is
       checked first; hence the possibility of 4 or 5 missing bytes.
 | 
						|
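       The number of missing bytes follows from the first byte of the character
       and the number of bytes left in the string. The sketch below shows  the
       arithmetic  under  the  RFC 2279 length rules; utf8_length_from_lead()
       and utf8_missing_bytes() are hypothetical names, and the code is not
       taken from PCRE2.

```c
#include <stddef.h>
#include <stdint.h>

/* Sequence length implied by a first byte under the RFC 2279 scheme
   (1 to 6), or 0 if the byte cannot start a character. */
static int utf8_length_from_lead(uint8_t lead)
{
    if (lead < 0x80) return 1;
    if (lead < 0xC0) return 0;    /* continuation byte */
    if (lead < 0xE0) return 2;
    if (lead < 0xF0) return 3;
    if (lead < 0xF8) return 4;
    if (lead < 0xFC) return 5;
    if (lead < 0xFE) return 6;
    return 0;                     /* 0xfe and 0xff */
}

/* Number of bytes missing from the character starting at p when only
   `remaining` bytes are left in the string; 0 if it is not truncated. */
static int utf8_missing_bytes(const uint8_t *p, size_t remaining)
{
    int need;
    if (remaining == 0) return 0;
    need = utf8_length_from_lead(p[0]);
    if (need == 0 || (size_t)need <= remaining) return 0;
    return need - (int)remaining;
}
```

       A result of 4 or 5 can arise only from a first byte that announces a  5-
       or 6-byte character, which is why those counts are possible at all.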
 | 
						|
         PCRE2_ERROR_UTF8_ERR6
         PCRE2_ERROR_UTF8_ERR7
         PCRE2_ERROR_UTF8_ERR8
         PCRE2_ERROR_UTF8_ERR9
         PCRE2_ERROR_UTF8_ERR10

       The two most significant bits of the 2nd, 3rd, 4th, 5th, or 6th byte of
       the character do not have the binary value 0b10 (that  is,  either  the
       most significant bit is 0, or the next bit is 1).

         PCRE2_ERROR_UTF8_ERR11
         PCRE2_ERROR_UTF8_ERR12

       A  character that is valid by the RFC 2279 rules is either 5 or 6 bytes
       long; these code points are excluded by RFC 3629.

         PCRE2_ERROR_UTF8_ERR13

       A 4-byte character has a value greater than 0x10ffff; these code points
       are excluded by RFC 3629.

         PCRE2_ERROR_UTF8_ERR14

       A  3-byte  character  has  a  value in the range 0xd800 to 0xdfff; this
       range of code points is reserved by RFC 3629 for use with UTF-16,   and
       so is excluded from UTF-8.

         PCRE2_ERROR_UTF8_ERR15
         PCRE2_ERROR_UTF8_ERR16
         PCRE2_ERROR_UTF8_ERR17
         PCRE2_ERROR_UTF8_ERR18
         PCRE2_ERROR_UTF8_ERR19

       A  2-, 3-, 4-, 5-, or 6-byte character is "overlong", that is, it codes
       for a value that can be represented by fewer bytes, which  is  invalid.
       For  example,  the two bytes 0xc0, 0xae give the value 0x2e, whose cor-
       rect coding uses just one byte.
 | 
						|
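       The overlong test amounts to decoding the payload bits and comparing the
       result  with the smallest value that genuinely needs that many bytes. A
       minimal sketch follows; utf8_payload() and utf8_is_overlong() are hypo-
       thetical  names,  and the code assumes that the first and continuation
       bytes have already passed the bit-pattern checks described above.

```c
#include <stdint.h>

/* Decode the payload bits of a sequence of `len` bytes (2 to 6). */
static uint32_t utf8_payload(const uint8_t *p, int len)
{
    /* The first byte carries 7 - len payload bits. */
    uint32_t value = p[0] & (0x7Fu >> len);
    for (int i = 1; i < len; i++)
        value = (value << 6) | (p[i] & 0x3Fu);   /* 6 bits per byte */
    return value;
}

/* True if the sequence encodes a value that fits in fewer bytes. */
static int utf8_is_overlong(const uint8_t *p, int len)
{
    /* Smallest value that needs `len` bytes, indexed by len. */
    static const uint32_t min_value[7] =
        { 0, 0, 0x80, 0x800, 0x10000, 0x200000, 0x4000000 };
    if (len < 2 || len > 6) return 0;
    return utf8_payload(p, len) < min_value[len];
}
```

       For the bytes 0xc0, 0xae the payload is 0x2e, which is below  0x80,  so
       the sequence is reported as overlong.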
 | 
						|
         PCRE2_ERROR_UTF8_ERR20

       The two most significant bits of the first byte of a character have the
       binary  value 0b10 (that is, the most significant bit is 1 and the sec-
       ond is 0). Such a byte can only validly occur as the second  or  subse-
       quent byte of a multi-byte character.

         PCRE2_ERROR_UTF8_ERR21

       The  first byte of a character has the value 0xfe or 0xff. These values
       can never occur in a valid UTF-8 string.
 | 
						|
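       The last two conditions depend only on the first byte  of  a  character.
       They can be sketched as below. The enum and the function name are illus-
       trative; they are not PCRE2 identifiers and the enum values are not the
       library's error numbers.

```c
#include <stdint.h>

enum lead_class {
    LEAD_OK,             /* may start a character */
    LEAD_CONTINUATION,   /* 10xxxxxx in first position (the ERR20 case) */
    LEAD_NEVER_VALID     /* 0xfe or 0xff (the ERR21 case) */
};

/* Classify a byte found where the first byte of a character is expected. */
static enum lead_class classify_lead_byte(uint8_t b)
{
    if (b >= 0xFE) return LEAD_NEVER_VALID;
    if ((b & 0xC0) == 0x80) return LEAD_CONTINUATION;
    return LEAD_OK;
}
```

       Note that LEAD_OK here still includes the first bytes  of  5-  and  6-byte
       sequences, which are rejected by the separate RFC 3629 length check.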
 | 
						|
   Errors in UTF-16 strings

       The following  negative  error  codes  are  given  for  invalid  UTF-16
       strings:

         PCRE2_ERROR_UTF16_ERR1  Missing low surrogate at end of string
         PCRE2_ERROR_UTF16_ERR2  Invalid low surrogate follows high surrogate
         PCRE2_ERROR_UTF16_ERR3  Isolated low surrogate
 | 
						|
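       The three conditions can be told apart with a single pass over the  code
       units.  The following sketch shows one way to do it; the enum, its val-
       ues, and utf16_check() are hypothetical and do not reproduce the PCRE2
       implementation or its error numbers.

```c
#include <stddef.h>
#include <stdint.h>

enum utf16_problem {
    UTF16_VALID = 0,
    UTF16_MISSING_LOW_AT_END,  /* high surrogate is the last code unit */
    UTF16_INVALID_LOW,         /* high surrogate not followed by a low one */
    UTF16_ISOLATED_LOW         /* low surrogate with no preceding high one */
};

/* Scan len code units; on failure store the offending offset. */
static enum utf16_problem utf16_check(const uint16_t *s, size_t len,
                                      size_t *bad_offset)
{
    for (size_t i = 0; i < len; i++) {
        uint16_t u = s[i];
        if ((u & 0xF800) != 0xD800) continue;       /* not a surrogate */
        *bad_offset = i;
        if ((u & 0xFC00) == 0xDC00) return UTF16_ISOLATED_LOW;
        /* u is a high surrogate from here on */
        if (i + 1 == len) return UTF16_MISSING_LOW_AT_END;
        if ((s[i + 1] & 0xFC00) != 0xDC00) return UTF16_INVALID_LOW;
        i++;                                        /* skip the low half */
    }
    return UTF16_VALID;
}
```

       The offset is written only when a problem is  found,  so  callers  should
       read it only after a non-zero return.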
 | 
						|
 | 
						|
   Errors in UTF-32 strings

       The  following  negative  error  codes  are  given  for  invalid UTF-32
       strings:

         PCRE2_ERROR_UTF32_ERR1  Surrogate character (0xd800 to 0xdfff)
         PCRE2_ERROR_UTF32_ERR2  Code point is greater than 0x10ffff
 | 
						|
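       In UTF-32 each code unit is a complete code point, so both tests reduce
       to  range  comparisons.  A minimal sketch, using hypothetical names that
       are not part of PCRE2:

```c
#include <stddef.h>
#include <stdint.h>

enum utf32_problem {
    UTF32_VALID = 0,
    UTF32_SURROGATE,   /* value in 0xd800 to 0xdfff */
    UTF32_TOO_LARGE    /* value greater than 0x10ffff */
};

/* Scan len code units; on failure store the offending offset. */
static enum utf32_problem utf32_check(const uint32_t *s, size_t len,
                                      size_t *bad_offset)
{
    for (size_t i = 0; i < len; i++) {
        if (s[i] > 0x10FFFFu) {
            *bad_offset = i;
            return UTF32_TOO_LARGE;
        }
        if (s[i] >= 0xD800u && s[i] <= 0xDFFFu) {
            *bad_offset = i;
            return UTF32_SURROGATE;
        }
    }
    return UTF32_VALID;
}
```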
 | 
						|
 | 
						|
AUTHOR

       Philip Hazel
       University Computing Service
       Cambridge, England.


REVISION

       Last updated: 03 July 2016
       Copyright (c) 1997-2016 University of Cambridge.
------------------------------------------------------------------------------