| -----------------------------------------------------------------------------
 | |
| This file contains a concatenation of the PCRE2 man pages, converted to plain
 | |
| text format for ease of searching with a text editor, or for use on systems
 | |
| that do not have a man page processor. The small individual files that give
 | |
| synopses of each function in the library have not been included. Neither has
 | |
| the pcre2demo program. There are separate text files for the pcre2grep and
 | |
| pcre2test commands.
 | |
| -----------------------------------------------------------------------------
 | |
| 
 | |
| 
 | |
| PCRE2(3)                   Library Functions Manual                   PCRE2(3)
 | |
| 
 | |
| 
 | |
| 
 | |
| NAME
 | |
|        PCRE2 - Perl-compatible regular expressions (revised API)
 | |
| 
 | |
| INTRODUCTION
 | |
| 
 | |
|        PCRE2 is the name used for a revised API for the PCRE library, which is
 | |
|        a set of functions, written in C,  that  implement  regular  expression
 | |
|        pattern matching using the same syntax and semantics as Perl, with just
 | |
|        a few differences. Some features that appeared in Python and the origi-
 | |
|        nal  PCRE  before  they  appeared  in Perl are also available using the
 | |
|        Python syntax. There is also some support for one or two .NET and Onig-
 | |
|        uruma  syntax  items,  and  there are options for requesting some minor
 | |
|        changes that give better ECMAScript (aka JavaScript) compatibility.
 | |
| 
 | |
|        The source code for PCRE2 can be compiled to support 8-bit, 16-bit,  or
 | |
|        32-bit  code units, which means that up to three separate libraries may
 | |
|        be installed.  The original work to extend PCRE to  16-bit  and  32-bit
 | |
|        code  units  was  done  by Zoltan Herczeg and Christian Persch, respec-
 | |
|        tively. In all three cases, strings can be interpreted  either  as  one
 | |
|        character  per  code  unit, or as UTF-encoded Unicode, with support for
 | |
|        Unicode general category properties. Unicode  support  is  optional  at
 | |
|        build  time  (but  is  the default). However, processing strings as UTF
 | |
|        code units must be enabled explicitly at run time. The version of  Uni-
 | |
|        code in use can be discovered by running
 | |
| 
 | |
|          pcre2test -C
 | |
| 
 | |
|        The  three  libraries  contain  identical sets of functions, with names
 | |
|        ending in _8,  _16,  or  _32,  respectively  (for  example,  pcre2_com-
 | |
|        pile_8()).  However,  by defining PCRE2_CODE_UNIT_WIDTH to be 8, 16, or
 | |
|        32, a program that uses just one code unit width can be  written  using
 | |
|        generic names such as pcre2_compile(), and the documentation is written
 | |
|        assuming that this is the case.
 | |
| 
 | |
|        In addition to the Perl-compatible matching function, PCRE2 contains an
 | |
|        alternative  function that matches the same compiled patterns in a dif-
 | |
|        ferent way. In certain circumstances, the alternative function has some
 | |
|        advantages.   For  a discussion of the two matching algorithms, see the
 | |
|        pcre2matching page.
 | |
| 
 | |
|        Details of exactly which Perl regular expression features are  and  are
 | |
|        not  supported  by  PCRE2  are  given  in  separate  documents. See the
 | |
|        pcre2pattern and pcre2compat pages. There is a syntax  summary  in  the
 | |
|        pcre2syntax page.
 | |
| 
 | |
|        Some  features  of PCRE2 can be included, excluded, or changed when the
 | |
|        library is built. The pcre2_config() function makes it possible  for  a
 | |
|        client  to  discover  which  features are available. The features them-
 | |
|        selves are described in the pcre2build page. Documentation about build-
 | |
|        ing  PCRE2 for various operating systems can be found in the README and
 | |
|        NON-AUTOTOOLS_BUILD files in the source distribution.
 | |
| 
 | |
|        The libraries contain  a number of undocumented internal functions  and
 | |
|        data  tables  that  are  used by more than one of the exported external
 | |
|        functions, but which are not intended  for  use  by  external  callers.
 | |
|        Their  names  all begin with "_pcre2", which hopefully will not provoke
 | |
|        any name clashes. In some environments, it is possible to control which
 | |
|        external  symbols  are  exported when a shared library is built, and in
 | |
|        these cases the undocumented symbols are not exported.
 | |
| 
 | |
| 
 | |
| SECURITY CONSIDERATIONS
 | |
| 
 | |
|        If you are using PCRE2 in a non-UTF application that permits  users  to
 | |
|        supply  arbitrary  patterns  for  compilation, you should be aware of a
 | |
|        feature that allows users to turn on UTF support from within a pattern.
 | |
|        For  example, an 8-bit pattern that begins with "(*UTF)" turns on UTF-8
 | |
|        mode, which interprets patterns and subjects as strings of  UTF-8  code
 | |
|        units instead of individual 8-bit characters. This causes both the pat-
 | |
|        tern and any data against which it is matched to be checked  for  UTF-8
 | |
|        validity.  If the data string is very long, such a check might use suf-
 | |
|        ficiently many resources as to cause your application to  lose  perfor-
 | |
|        mance.
 | |
| 
 | |
|        One  way  of guarding against this possibility is to use the pcre2_pat-
 | |
|        tern_info() function  to  check  the  compiled  pattern's  options  for
 | |
|        PCRE2_UTF.  Alternatively,  you can set the PCRE2_NEVER_UTF option when
 | |
|        calling pcre2_compile(). This causes a compile-time error if a  pattern
 | |
|        contains a UTF-setting sequence.
 | |
| 
 | |
|        The  use  of Unicode properties for character types such as \d can also
 | |
|        be enabled from within the pattern, by specifying "(*UCP)".  This  fea-
 | |
|        ture can be disallowed by setting the PCRE2_NEVER_UCP option.
 | |
| 
 | |
|        If  your  application  is one that supports UTF, be aware that validity
 | |
|        checking can take time. If the same data string is to be  matched  many
 | |
|        times,  you  can  use  the PCRE2_NO_UTF_CHECK option for the second and
 | |
|        subsequent matches to avoid running redundant checks.
 | |
| 
 | |
|        The use of the \C escape sequence in a UTF-8 or UTF-16 pattern can lead
 | |
|        to  problems,  because  it  may leave the current matching point in the
 | |
|        middle of  a  multi-code-unit  character.  The  PCRE2_NEVER_BACKSLASH_C
 | |
|        option  can  be  used to lock out the use of \C, causing a compile-time
 | |
|        error if it is encountered.
 | |
| 
 | |
|        Another way that performance can be hit is by running  a  pattern  that
 | |
|        has  a  very  large search tree against a string that will never match.
 | |
|        Nested unlimited repeats in a pattern are a common example. PCRE2  pro-
 | |
|        vides  some  protection  against  this: see the pcre2_set_match_limit()
 | |
|        function in the pcre2api page.
 | |
| 
 | |
| 
 | |
| USER DOCUMENTATION
 | |
| 
 | |
|        The user documentation for PCRE2 comprises a number of  different  sec-
 | |
|        tions.  In the "man" format, each of these is a separate "man page". In
 | |
|        the HTML format, each is a separate page, linked from the  index  page.
 | |
|        In  the  plain  text  format,  the  descriptions  of  the pcre2grep and
 | |
|        pcre2test programs are in files called pcre2grep.txt and pcre2test.txt,
 | |
|        respectively.  The remaining sections, except for the pcre2demo section
 | |
|        (which is a program listing), and the short pages for individual  func-
 | |
|        tions,  are  concatenated in pcre2.txt, for ease of searching. The sec-
 | |
|        tions are as follows:
 | |
| 
 | |
|          pcre2              this document
 | |
|          pcre2-config       show PCRE2 installation configuration information
 | |
|          pcre2api           details of PCRE2's native C API
 | |
|          pcre2build         building PCRE2
 | |
|          pcre2callout       details of the callout feature
 | |
|          pcre2compat        discussion of Perl compatibility
 | |
|          pcre2demo          a demonstration C program that uses PCRE2
 | |
|          pcre2grep          description of the pcre2grep command (8-bit only)
 | |
|          pcre2jit           discussion of just-in-time optimization support
 | |
|          pcre2limits        details of size and other limits
 | |
|          pcre2matching      discussion of the two matching algorithms
 | |
|          pcre2partial       details of the partial matching facility
 | |
|          pcre2pattern       syntax and semantics of supported regular
 | |
|                               expression patterns
 | |
|          pcre2perform       discussion of performance issues
 | |
|          pcre2posix         the POSIX-compatible C API for the 8-bit library
 | |
|          pcre2sample        discussion of the pcre2demo program
 | |
|          pcre2stack         discussion of stack usage
 | |
|          pcre2syntax        quick syntax reference
 | |
|          pcre2test          description of the pcre2test command
 | |
|          pcre2unicode       discussion of Unicode and UTF support
 | |
| 
 | |
|        In the "man" and HTML formats, there is also a short page  for  each  C
 | |
|        library function, listing its arguments and results.
 | |
| 
 | |
| 
 | |
| AUTHOR
 | |
| 
 | |
|        Philip Hazel
 | |
|        University Computing Service
 | |
|        Cambridge, England.
 | |
| 
 | |
|        Putting  an  actual email address here is a spam magnet. If you want to
 | |
|        email me, use my two initials, followed by the two digits  10,  at  the
 | |
|        domain cam.ac.uk.
 | |
| 
 | |
| 
 | |
| REVISION
 | |
| 
 | |
|        Last updated: 13 April 2015
 | |
|        Copyright (c) 1997-2015 University of Cambridge.
 | |
| ------------------------------------------------------------------------------
 | |
| 
 | |
| 
 | |
| PCRE2API(3)                Library Functions Manual                PCRE2API(3)
 | |
| 
 | |
| 
 | |
| 
 | |
| NAME
 | |
|        PCRE2 - Perl-compatible regular expressions (revised API)
 | |
| 
 | |
|        #include <pcre2.h>
 | |
| 
 | |
|        PCRE2  is  a  new API for PCRE. This document contains a description of
 | |
|        all its functions. See the pcre2 document for an overview  of  all  the
 | |
|        PCRE2 documentation.
 | |
| 
 | |
| 
 | |
| PCRE2 NATIVE API BASIC FUNCTIONS
 | |
| 
 | |
|        pcre2_code *pcre2_compile(PCRE2_SPTR pattern, PCRE2_SIZE length,
 | |
|          uint32_t options, int *errorcode, PCRE2_SIZE *erroroffset,
 | |
|          pcre2_compile_context *ccontext);
 | |
| 
 | |
|        void pcre2_code_free(pcre2_code *code);
 | |
| 
 | |
|        pcre2_match_data *pcre2_match_data_create(uint32_t ovecsize,
 | |
|          pcre2_general_context *gcontext);
 | |
| 
 | |
|        pcre2_match_data *pcre2_match_data_create_from_pattern(
 | |
|          const pcre2_code *code, pcre2_general_context *gcontext);
 | |
| 
 | |
|        int pcre2_match(const pcre2_code *code, PCRE2_SPTR subject,
 | |
|          PCRE2_SIZE length, PCRE2_SIZE startoffset,
 | |
|          uint32_t options, pcre2_match_data *match_data,
 | |
|          pcre2_match_context *mcontext);
 | |
| 
 | |
|        int pcre2_dfa_match(const pcre2_code *code, PCRE2_SPTR subject,
 | |
|          PCRE2_SIZE length, PCRE2_SIZE startoffset,
 | |
|          uint32_t options, pcre2_match_data *match_data,
 | |
|          pcre2_match_context *mcontext,
 | |
|          int *workspace, PCRE2_SIZE wscount);
 | |
| 
 | |
|        void pcre2_match_data_free(pcre2_match_data *match_data);
 | |
| 
 | |
| 
 | |
| PCRE2 NATIVE API AUXILIARY MATCH FUNCTIONS
 | |
| 
 | |
|        PCRE2_SPTR pcre2_get_mark(pcre2_match_data *match_data);
 | |
| 
 | |
|        uint32_t pcre2_get_ovector_count(pcre2_match_data *match_data);
 | |
| 
 | |
|        PCRE2_SIZE *pcre2_get_ovector_pointer(pcre2_match_data *match_data);
 | |
| 
 | |
|        PCRE2_SIZE pcre2_get_startchar(pcre2_match_data *match_data);
 | |
| 
 | |
| 
 | |
| PCRE2 NATIVE API GENERAL CONTEXT FUNCTIONS
 | |
| 
 | |
|        pcre2_general_context *pcre2_general_context_create(
 | |
|          void *(*private_malloc)(PCRE2_SIZE, void *),
 | |
|          void (*private_free)(void *, void *), void *memory_data);
 | |
| 
 | |
|        pcre2_general_context *pcre2_general_context_copy(
 | |
|          pcre2_general_context *gcontext);
 | |
| 
 | |
|        void pcre2_general_context_free(pcre2_general_context *gcontext);
 | |
| 
 | |
| 
 | |
| PCRE2 NATIVE API COMPILE CONTEXT FUNCTIONS
 | |
| 
 | |
|        pcre2_compile_context *pcre2_compile_context_create(
 | |
|          pcre2_general_context *gcontext);
 | |
| 
 | |
|        pcre2_compile_context *pcre2_compile_context_copy(
 | |
|          pcre2_compile_context *ccontext);
 | |
| 
 | |
|        void pcre2_compile_context_free(pcre2_compile_context *ccontext);
 | |
| 
 | |
|        int pcre2_set_bsr(pcre2_compile_context *ccontext,
 | |
|          uint32_t value);
 | |
| 
 | |
|        int pcre2_set_character_tables(pcre2_compile_context *ccontext,
 | |
|          const unsigned char *tables);
 | |
| 
 | |
|        int pcre2_set_newline(pcre2_compile_context *ccontext,
 | |
|          uint32_t value);
 | |
| 
 | |
|        int pcre2_set_parens_nest_limit(pcre2_compile_context *ccontext,
 | |
|          uint32_t value);
 | |
| 
 | |
|        int pcre2_set_compile_recursion_guard(pcre2_compile_context *ccontext,
 | |
|          int (*guard_function)(uint32_t, void *), void *user_data);
 | |
| 
 | |
| 
 | |
| PCRE2 NATIVE API MATCH CONTEXT FUNCTIONS
 | |
| 
 | |
|        pcre2_match_context *pcre2_match_context_create(
 | |
|          pcre2_general_context *gcontext);
 | |
| 
 | |
|        pcre2_match_context *pcre2_match_context_copy(
 | |
|          pcre2_match_context *mcontext);
 | |
| 
 | |
|        void pcre2_match_context_free(pcre2_match_context *mcontext);
 | |
| 
 | |
|        int pcre2_set_callout(pcre2_match_context *mcontext,
 | |
|          int (*callout_function)(pcre2_callout_block *, void *),
 | |
|          void *callout_data);
 | |
| 
 | |
|        int pcre2_set_match_limit(pcre2_match_context *mcontext,
 | |
|          uint32_t value);
 | |
| 
 | |
|        int pcre2_set_recursion_limit(pcre2_match_context *mcontext,
 | |
|          uint32_t value);
 | |
| 
 | |
|        int pcre2_set_recursion_memory_management(
 | |
|          pcre2_match_context *mcontext,
 | |
|          void *(*private_malloc)(PCRE2_SIZE, void *),
 | |
|          void (*private_free)(void *, void *), void *memory_data);
 | |
| 
 | |
| 
 | |
| PCRE2 NATIVE API STRING EXTRACTION FUNCTIONS
 | |
| 
 | |
|        int pcre2_substring_copy_byname(pcre2_match_data *match_data,
 | |
|          PCRE2_SPTR name, PCRE2_UCHAR *buffer, PCRE2_SIZE *bufflen);
 | |
| 
 | |
|        int pcre2_substring_copy_bynumber(pcre2_match_data *match_data,
 | |
|          uint32_t number, PCRE2_UCHAR *buffer,
 | |
|          PCRE2_SIZE *bufflen);
 | |
| 
 | |
|        void pcre2_substring_free(PCRE2_UCHAR *buffer);
 | |
| 
 | |
|        int pcre2_substring_get_byname(pcre2_match_data *match_data,
 | |
|          PCRE2_SPTR name, PCRE2_UCHAR **bufferptr, PCRE2_SIZE *bufflen);
 | |
| 
 | |
|        int pcre2_substring_get_bynumber(pcre2_match_data *match_data,
 | |
|          uint32_t number, PCRE2_UCHAR **bufferptr,
 | |
|          PCRE2_SIZE *bufflen);
 | |
| 
 | |
|        int pcre2_substring_length_byname(pcre2_match_data *match_data,
 | |
|          PCRE2_SPTR name, PCRE2_SIZE *length);
 | |
| 
 | |
|        int pcre2_substring_length_bynumber(pcre2_match_data *match_data,
 | |
|          uint32_t number, PCRE2_SIZE *length);
 | |
| 
 | |
|        int pcre2_substring_nametable_scan(const pcre2_code *code,
 | |
|          PCRE2_SPTR name, PCRE2_SPTR *first, PCRE2_SPTR *last);
 | |
| 
 | |
|        int pcre2_substring_number_from_name(const pcre2_code *code,
 | |
|          PCRE2_SPTR name);
 | |
| 
 | |
|        void pcre2_substring_list_free(PCRE2_SPTR *list);
 | |
| 
 | |
|        int pcre2_substring_list_get(pcre2_match_data *match_data,
 | |
|          PCRE2_UCHAR ***listptr, PCRE2_SIZE **lengthsptr);
 | |
| 
 | |
| 
 | |
| PCRE2 NATIVE API STRING SUBSTITUTION FUNCTION
 | |
| 
 | |
|        int pcre2_substitute(const pcre2_code *code, PCRE2_SPTR subject,
 | |
|          PCRE2_SIZE length, PCRE2_SIZE startoffset,
 | |
|          uint32_t options, pcre2_match_data *match_data,
 | |
|          pcre2_match_context *mcontext, PCRE2_SPTR replacement,
 | |
|          PCRE2_SIZE rlength, PCRE2_UCHAR *outputbuffer,
 | |
|          PCRE2_SIZE *outlengthptr);
 | |
| 
 | |
| 
 | |
| PCRE2 NATIVE API JIT FUNCTIONS
 | |
| 
 | |
|        int pcre2_jit_compile(pcre2_code *code, uint32_t options);
 | |
| 
 | |
|        int pcre2_jit_match(const pcre2_code *code, PCRE2_SPTR subject,
 | |
|          PCRE2_SIZE length, PCRE2_SIZE startoffset,
 | |
|          uint32_t options, pcre2_match_data *match_data,
 | |
|          pcre2_match_context *mcontext);
 | |
| 
 | |
|        void pcre2_jit_free_unused_memory(pcre2_general_context *gcontext);
 | |
| 
 | |
|        pcre2_jit_stack *pcre2_jit_stack_create(PCRE2_SIZE startsize,
 | |
|          PCRE2_SIZE maxsize, pcre2_general_context *gcontext);
 | |
| 
 | |
|        void pcre2_jit_stack_assign(pcre2_match_context *mcontext,
 | |
|          pcre2_jit_callback callback_function, void *callback_data);
 | |
| 
 | |
|        void pcre2_jit_stack_free(pcre2_jit_stack *jit_stack);
 | |
| 
 | |
| 
 | |
| PCRE2 NATIVE API SERIALIZATION FUNCTIONS
 | |
| 
 | |
|        int32_t pcre2_serialize_decode(pcre2_code **codes,
 | |
|          int32_t number_of_codes, const uint8_t *bytes,
 | |
|          pcre2_general_context *gcontext);
 | |
| 
 | |
|        int32_t pcre2_serialize_encode(pcre2_code **codes,
 | |
|          int32_t number_of_codes, uint8_t **serialized_bytes,
 | |
|          PCRE2_SIZE *serialized_size, pcre2_general_context *gcontext);
 | |
| 
 | |
|        void pcre2_serialize_free(uint8_t *bytes);
 | |
| 
 | |
|        int32_t pcre2_serialize_get_number_of_codes(const uint8_t *bytes);
 | |
| 
 | |
| 
 | |
| PCRE2 NATIVE API AUXILIARY FUNCTIONS
 | |
| 
 | |
|        int pcre2_get_error_message(int errorcode, PCRE2_UCHAR *buffer,
 | |
|          PCRE2_SIZE bufflen);
 | |
| 
 | |
|        const unsigned char *pcre2_maketables(pcre2_general_context *gcontext);
 | |
| 
 | |
|        int pcre2_pattern_info(const pcre2_code *code, uint32_t what, void *where);
 | |
| 
 | |
|        int pcre2_callout_enumerate(const pcre2_code *code,
 | |
|          int (*callback)(pcre2_callout_enumerate_block *, void *),
 | |
|          void *user_data);
 | |
| 
 | |
|        int pcre2_config(uint32_t what, void *where);
 | |
| 
 | |
| 
 | |
| PCRE2 8-BIT, 16-BIT, AND 32-BIT LIBRARIES
 | |
| 
 | |
|        There  are  three PCRE2 libraries, supporting 8-bit, 16-bit, and 32-bit
 | |
|        code units, respectively. However,  there  is  just  one  header  file,
 | |
|        pcre2.h.   This  contains the function prototypes and other definitions
 | |
|        for all three libraries. One, two, or all three can be installed simul-
 | |
|        taneously.  On  Unix-like  systems the libraries are called libpcre2-8,
 | |
|        libpcre2-16, and libpcre2-32, and they can also co-exist with the orig-
 | |
|        inal PCRE libraries.
 | |
| 
 | |
|        Character  strings are passed to and from a PCRE2 library as a sequence
 | |
|        of unsigned integers in code units  of  the  appropriate  width.  Every
 | |
|        PCRE2  function  comes  in three different forms, one for each library,
 | |
|        for example:
 | |
| 
 | |
|          pcre2_compile_8()
 | |
|          pcre2_compile_16()
 | |
|          pcre2_compile_32()
 | |
| 
 | |
|        There are also three different sets of data types:
 | |
| 
 | |
|          PCRE2_UCHAR8, PCRE2_UCHAR16, PCRE2_UCHAR32
 | |
|          PCRE2_SPTR8,  PCRE2_SPTR16,  PCRE2_SPTR32
 | |
| 
 | |
|        The UCHAR types define unsigned code units of the  appropriate  widths.
 | |
|        For  example,  PCRE2_UCHAR16 is usually defined as `uint16_t'. The SPTR
 | |
|        types are constant pointers to the equivalent  UCHAR  types,  that  is,
 | |
|        they are pointers to vectors of unsigned code units.
 | |
| 
 | |
|        Many  applications use only one code unit width. For their convenience,
 | |
|        macros are defined whose names are the generic forms such as pcre2_com-
 | |
|        pile()  and  PCRE2_SPTR.  These  macros  use  the  value  of  the macro
 | |
|        PCRE2_CODE_UNIT_WIDTH to generate the appropriate width-specific  func-
 | |
|        tion and macro names.  PCRE2_CODE_UNIT_WIDTH is not defined by default.
 | |
|        An application must define it to be  8,  16,  or  32  before  including
 | |
|        pcre2.h in order to make use of the generic names.
 | |
| 
 | |
|        Applications  that use more than one code unit width can be linked with
 | |
|        more than one PCRE2 library, but must define  PCRE2_CODE_UNIT_WIDTH  to
 | |
|        be  0  before  including pcre2.h, and then use the real function names.
 | |
|        Any code that is to be included in an environment where  the  value  of
 | |
|        PCRE2_CODE_UNIT_WIDTH  is  unknown  should  also  use the real function
 | |
|        names. (Unfortunately, it is not possible in C code to save and restore
 | |
|        the value of a macro.)
 | |
| 
 | |
|        If  PCRE2_CODE_UNIT_WIDTH  is  not  defined before including pcre2.h, a
 | |
|        compiler error occurs.
 | |
| 
 | |
|        When using multiple libraries in an application,  you  must  take  care
 | |
|        when  processing  any  particular  pattern to use only functions from a
 | |
|        single library.  For example, if you want to run a match using  a  pat-
 | |
|        tern  that  was  compiled  with pcre2_compile_16(), you must do so with
 | |
|        pcre2_match_16(), not pcre2_match_8().
 | |
| 
 | |
|        In the function summaries above, and in the rest of this  document  and
 | |
|        other  PCRE2  documents,  functions  and data types are described using
 | |
|        their generic names, without the 8, 16, or 32 suffix.
 | |
| 
 | |
| 
 | |
| PCRE2 API OVERVIEW
 | |
| 
 | |
|        PCRE2 has its own native API, which  is  described  in  this  document.
 | |
|        There are also some wrapper functions for the 8-bit library that corre-
 | |
|        spond to the POSIX regular expression API, but they do not give  access
 | |
|        to all the functionality. They are described in the pcre2posix documen-
 | |
|        tation. Both these APIs define a set of C function calls.
 | |
| 
 | |
|        The native API C data types, function prototypes,  option  values,  and
 | |
|        error codes are defined in the header file pcre2.h, which contains def-
 | |
|        initions of PCRE2_MAJOR and PCRE2_MINOR, the major  and  minor  release
 | |
|        numbers  for the library. Applications can use these to include support
 | |
|        for different releases of PCRE2.
 | |
| 
 | |
|        In a Windows environment, if you want to statically link an application
 | |
|        program  against  a non-dll PCRE2 library, you must define PCRE2_STATIC
 | |
|        before including pcre2.h.
 | |
| 
 | |
|        The functions pcre2_compile() and pcre2_match() are used for  compiling
 | |
|        and  matching regular expressions in a Perl-compatible manner. A sample
 | |
|        program that demonstrates the simplest way of using them is provided in
 | |
|        the file called pcre2demo.c in the PCRE2 source distribution. A listing
 | |
|        of this program is  given  in  the  pcre2demo  documentation,  and  the
 | |
|        pcre2sample documentation describes how to compile and run it.
 | |
| 
 | |
|        Just-in-time  compiler support is an optional feature of PCRE2 that can
 | |
|        be built in appropriate hardware environments. It greatly speeds up the
 | |
|        matching  performance of many patterns. Programs can request that it be
 | |
|        used if available, by calling pcre2_jit_compile() after a  pattern  has
 | |
|        been successfully compiled by pcre2_compile(). This does nothing if JIT
 | |
|        support is not available.
 | |
| 
 | |
|        More complicated programs might need to  make  use  of  the  specialist
 | |
|        functions    pcre2_jit_stack_create(),    pcre2_jit_stack_free(),   and
 | |
|        pcre2_jit_stack_assign() in order to  control  the  JIT  code's  memory
 | |
|        usage.
 | |
| 
 | |
|        JIT matching is automatically used by pcre2_match() if it is available.
 | |
|        There is also a direct interface for JIT matching, which gives improved
 | |
|        performance.  The  JIT-specific functions are discussed in the pcre2jit
 | |
|        documentation.
 | |
| 
 | |
|        A second matching function, pcre2_dfa_match(), which is  not  Perl-com-
 | |
|        patible,  is  also  provided.  This  uses a different algorithm for the
 | |
|        matching. The alternative algorithm finds all possible  matches  (at  a
 | |
|        given  point  in  the subject), and scans the subject just once (unless
 | |
|        there are lookbehind assertions).  However,  this  algorithm  does  not
 | |
|        return  captured  substrings.  A  description of the two matching algo-
 | |
|        rithms  and  their  advantages  and  disadvantages  is  given  in   the
 | |
|        pcre2matching    documentation.   There   is   no   JIT   support   for
 | |
|        pcre2_dfa_match().
 | |
| 
 | |
|        In addition to the main compiling and  matching  functions,  there  are
 | |
|        convenience functions for extracting captured substrings from a subject
 | |
|        string that has been matched by pcre2_match(). They are:
 | |
| 
 | |
|          pcre2_substring_copy_byname()
 | |
|          pcre2_substring_copy_bynumber()
 | |
|          pcre2_substring_get_byname()
 | |
|          pcre2_substring_get_bynumber()
 | |
|          pcre2_substring_list_get()
 | |
|          pcre2_substring_length_byname()
 | |
|          pcre2_substring_length_bynumber()
 | |
|          pcre2_substring_nametable_scan()
 | |
|          pcre2_substring_number_from_name()
 | |
| 
 | |
|        pcre2_substring_free() and pcre2_substring_list_free()  are  also  pro-
 | |
|        vided, to free the memory used for extracted strings.
 | |
| 
 | |
|        The  function  pcre2_substitute()  can be called to match a pattern and
 | |
|        return a copy of the subject string with substitutions for  parts  that
 | |
|        were matched.
 | |
| 
 | |
|        Finally,  there  are functions for finding out information about a com-
 | |
|        piled pattern (pcre2_pattern_info()) and about the  configuration  with
 | |
|        which PCRE2 was built (pcre2_config()).
 | |
| 
 | |
| 
 | |
| STRING LENGTHS AND OFFSETS
 | |
| 
 | |
|        The  PCRE2  API  uses  string  lengths and offsets into strings of code
 | |
|        units in several places. These values are always  of  type  PCRE2_SIZE,
 | |
|        which  is an unsigned integer type, currently always defined as size_t.
 | |
|        The largest  value  that  can  be  stored  in  such  a  type  (that  is
 | |
|        ~(PCRE2_SIZE)0)  is reserved as a special indicator for zero-terminated
 | |
|        strings and unset offsets.  Therefore, the longest string that  can  be
 | |
|        handled is one less than this maximum.
 | |
| 
 | |
| 
 | |
| NEWLINES
 | |
| 
 | |
|        PCRE2 supports five different conventions for indicating line breaks in
 | |
|        strings: a single CR (carriage return) character, a  single  LF  (line-
 | |
|        feed) character, the two-character sequence CRLF, any of the three pre-
 | |
|        ceding, or any Unicode newline sequence. The Unicode newline  sequences
 | |
|        are  the  three just mentioned, plus the single characters VT (vertical
 | |
|        tab, U+000B), FF (form feed, U+000C), NEL (next line, U+0085), LS (line
 | |
|        separator, U+2028), and PS (paragraph separator, U+2029).
 | |
| 
 | |
|        Each  of  the first three conventions is used by at least one operating
 | |
|        system as its standard newline sequence. When PCRE2 is built, a default
 | |
|        can  be  specified.  The default default is LF, which is the Unix stan-
 | |
|        dard. However, the newline convention can be changed by an  application
 | |
|        when calling pcre2_compile(), or it can be specified by special text at
 | |
|        the start of the pattern itself; this overrides any other settings. See
 | |
|        the pcre2pattern page for details of the special character sequences.
 | |
| 
 | |
|        In  the  PCRE2  documentation  the  word "newline" is used to mean "the
 | |
|        character or pair of characters that indicate a line break". The choice
 | |
|        of  newline convention affects the handling of the dot, circumflex, and
 | |
|        dollar metacharacters, the handling of #-comments in /x mode, and, when
 | |
|        CRLF  is a recognized line ending sequence, the match position advance-
 | |
|        ment for a non-anchored pattern. There is more detail about this in the
 | |
|        section on pcre2_match() options below.
 | |
| 
 | |
|        The  choice of newline convention does not affect the interpretation of
 | |
|        the \n or \r escape sequences, nor does it affect what \R matches; this
 | |
|        has its own separate convention.
 | |
| 
 | |
| 
 | |
| MULTITHREADING
 | |
| 
 | |
|        In  a multithreaded application it is important to keep thread-specific
 | |
|        data separate from data that can be shared between threads.  The  PCRE2
 | |
|        library  code  itself  is  thread-safe: it contains no static or global
 | |
|        variables. The API is designed to be  fairly  simple  for  non-threaded
 | |
|        applications  while at the same time ensuring that multithreaded appli-
 | |
|        cations can use it.
 | |
| 
 | |
|        There are several different blocks of data that are used to pass infor-
 | |
|        mation between the application and the PCRE2 libraries.
 | |
| 
 | |
|        (1) A pointer to the compiled form of a pattern is returned to the user
 | |
|        when pcre2_compile() is successful. The data in the compiled pattern is
 | |
|        fixed,  and  does not change when the pattern is matched. Therefore, it
 | |
|        is thread-safe, that is, the same compiled pattern can be used by  more
 | |
|        than one thread simultaneously. An application can compile all its pat-
 | |
|        terns at the start, before forking off multiple threads that use  them.
 | |
|        However,  if  the  just-in-time  optimization feature is being used, it
 | |
|        needs separate memory stack areas for each  thread.  See  the  pcre2jit
 | |
|        documentation for more details.
 | |
| 
 | |
|        (2)  The  next section below introduces the idea of "contexts" in which
 | |
|        PCRE2 functions are called. A context is nothing more than a collection
 | |
|        of parameters that control the way PCRE2 operates. Grouping a number of
 | |
|        parameters together in a context is a convenient way of passing them to
 | |
|        a  PCRE2  function without using lots of arguments. The parameters that
 | |
|        are stored in contexts are in some sense  "advanced  features"  of  the
 | |
|        API. Many straightforward applications will not need to use contexts.
 | |
| 
 | |
|        In a multithreaded application, if the parameters in a context are val-
 | |
|        ues that are never changed, the same context can be  used  by  all  the
 | |
|        threads. However, if any thread needs to change any value in a context,
 | |
|        it must make its own thread-specific copy.
 | |
| 
 | |
|        (3) The matching functions need a block of memory for working space and
 | |
|        for  storing  the results of a match. This includes details of what was
 | |
|        matched, as well as additional  information  such  as  the  name  of  a
 | |
|        (*MARK)  setting. Each thread must provide its own version of this mem-
 | |
|        ory.
 | |
| 
 | |
| 
 | |
| PCRE2 CONTEXTS
 | |
| 
 | |
|        Some PCRE2 functions have a lot of parameters, many of which  are  used
 | |
|        only  by  specialist  applications,  for example, those that use custom
 | |
|        memory management or non-standard character tables.  To  keep  function
 | |
|        argument  lists  at a reasonable size, and at the same time to keep the
 | |
|        API extensible, "uncommon" parameters are passed to  certain  functions
 | |
|        in  a  context instead of directly. A context is just a block of memory
 | |
|        that holds the parameter values.  Applications  that  do  not  need  to
 | |
|        adjust  any  of  the  context  parameters  can pass NULL when a context
 | |
|        pointer is required.
 | |
| 
 | |
|        There are three different types of context: a general context  that  is
 | |
|        relevant  for  several  PCRE2 operations, a compile-time context, and a
 | |
|        match-time context.
 | |
| 
 | |
|    The general context
 | |
| 
 | |
|        At present, this context just  contains  pointers  to  (and  data  for)
 | |
|        external  memory  management  functions  that  are  called from several
 | |
|        places in the PCRE2 library. The context is named "general" rather than
 | |
|        specifically "memory" because in future other fields may be added. If
 | |
|        you do not want to supply your own custom memory management  functions,
 | |
|        you  do not need to bother with a general context. A general context is
 | |
|        created by:
 | |
| 
 | |
|        pcre2_general_context *pcre2_general_context_create(
 | |
|          void *(*private_malloc)(PCRE2_SIZE, void *),
 | |
|          void (*private_free)(void *, void *), void *memory_data);
 | |
| 
 | |
|        The two function pointers specify custom memory  management  functions,
 | |
|        whose prototypes are:
 | |
| 
 | |
|          void *private_malloc(PCRE2_SIZE, void *);
 | |
|          void  private_free(void *, void *);
 | |
| 
 | |
|        Whenever code in PCRE2 calls these functions, the final argument is the
 | |
|        value of memory_data. Either of the first two arguments of the creation
 | |
|        function  may be NULL, in which case the system memory management func-
 | |
|        tions malloc() and free() are used. (This is not currently  useful,  as
 | |
|        there  are  no  other  fields in a general context, but in future there
 | |
|        might be.)  The private_malloc() function  is  used  (if  supplied)  to
 | |
|        obtain  memory  for storing the context, and all three values are saved
 | |
|        as part of the context.
 | |
| 
 | |
|        Whenever PCRE2 creates a data block of any kind, the block  contains  a
 | |
|        pointer  to the free() function that matches the malloc() function that
 | |
|        was used. When the time comes to  free  the  block,  this  function  is
 | |
|        called.
 | |
| 
 | |
|        A general context can be copied by calling:
 | |
| 
 | |
|        pcre2_general_context *pcre2_general_context_copy(
 | |
|          pcre2_general_context *gcontext);
 | |
| 
 | |
|        The memory used for a general context should be freed by calling:
 | |
| 
 | |
|        void pcre2_general_context_free(pcre2_general_context *gcontext);
 | |
| 
 | |
| 
 | |
|    The compile context
 | |
| 
 | |
|        A  compile context is required if you want to change the default values
 | |
|        of any of the following compile-time parameters:
 | |
| 
 | |
|          What \R matches (Unicode newlines or CR, LF, CRLF only)
 | |
|          PCRE2's character tables
 | |
|          The newline character sequence
 | |
|          The compile time nested parentheses limit
 | |
|          An external function for stack checking
 | |
| 
 | |
|        A compile context is also required if you are using custom memory  man-
 | |
|        agement.   If  none of these apply, just pass NULL as the context argu-
 | |
|        ment of pcre2_compile().
 | |
| 
 | |
|        A compile context is created, copied, and freed by the following  func-
 | |
|        tions:
 | |
| 
 | |
|        pcre2_compile_context *pcre2_compile_context_create(
 | |
|          pcre2_general_context *gcontext);
 | |
| 
 | |
|        pcre2_compile_context *pcre2_compile_context_copy(
 | |
|          pcre2_compile_context *ccontext);
 | |
| 
 | |
|        void pcre2_compile_context_free(pcre2_compile_context *ccontext);
 | |
| 
 | |
|        A  compile  context  is created with default values for its parameters.
 | |
|        These can be changed by calling the following functions, which return 0
 | |
|        on success, or PCRE2_ERROR_BADDATA if invalid data is detected.
 | |
| 
 | |
|        int pcre2_set_bsr(pcre2_compile_context *ccontext,
 | |
|          uint32_t value);
 | |
| 
 | |
|        The  value  must  be PCRE2_BSR_ANYCRLF, to specify that \R matches only
 | |
|        CR, LF, or CRLF, or PCRE2_BSR_UNICODE, to specify that \R  matches  any
 | |
|        Unicode line ending sequence. The value is used by the JIT compiler and
 | |
|        by  the  two  interpreted   matching   functions,   pcre2_match()   and
 | |
|        pcre2_dfa_match().
 | |
| 
 | |
|        int pcre2_set_character_tables(pcre2_compile_context *ccontext,
 | |
|          const unsigned char *tables);
 | |
| 
 | |
|        The  value  must  be  the result of a call to pcre2_maketables(), whose
 | |
|        only argument is a general context. This function builds a set of char-
 | |
|        acter tables in the current locale.
 | |
| 
 | |
|        int pcre2_set_newline(pcre2_compile_context *ccontext,
 | |
|          uint32_t value);
 | |
| 
 | |
|        This specifies which characters or character sequences are to be recog-
 | |
|        nized as newlines. The value must be one of PCRE2_NEWLINE_CR  (carriage
 | |
|        return only), PCRE2_NEWLINE_LF (linefeed only), PCRE2_NEWLINE_CRLF (the
 | |
|        two-character sequence CR followed by LF),  PCRE2_NEWLINE_ANYCRLF  (any
 | |
|        of the above), or PCRE2_NEWLINE_ANY (any Unicode newline sequence).
 | |
| 
 | |
|        When a pattern is compiled with the PCRE2_EXTENDED option, the value of
 | |
|        this parameter affects the recognition of white space and  the  end  of
 | |
|        internal comments starting with #. The value is saved with the compiled
 | |
|        pattern for subsequent use by the JIT compiler and by  the  two  inter-
 | |
|        preted matching functions, pcre2_match() and pcre2_dfa_match().
 | |
| 
 | |
|        int pcre2_set_parens_nest_limit(pcre2_compile_context *ccontext,
 | |
|          uint32_t value);
 | |
| 
 | |
|        This parameter adjusts the limit, set when PCRE2 is built (default 250),
 | |
|        on the depth of parenthesis nesting in  a  pattern.  This  limit  stops
 | |
|        rogue patterns using up too much system stack when being compiled.
 | |
| 
 | |
|        int pcre2_set_compile_recursion_guard(pcre2_compile_context *ccontext,
 | |
|          int (*guard_function)(uint32_t, void *), void *user_data);
 | |
| 
 | |
|        There  is at least one application that runs PCRE2 in threads with very
 | |
|        limited system stack, where running out of stack is to  be  avoided  at
 | |
|        all  costs. The parenthesis limit above cannot take account of how much
 | |
|        stack is actually available. For a finer  control,  you  can  supply  a
 | |
|        function  that  is  called whenever pcre2_compile() starts to compile a
 | |
|        parenthesized part of a pattern. This function  can  check  the  actual
 | |
|        stack size (or anything else that it wants to, of course).
 | |
| 
 | |
|        The first argument to the guard function gives the current depth of
 | |
|        nesting, and the second is user data that is set up by the last
 | |
|        argument of pcre2_set_compile_recursion_guard(). The guard function
 | |
|        should return zero if all is well, or non-zero to force an error.
 | |
| 
 | |
|    The match context
 | |
| 
 | |
|        A match context is required if you want to change the default values of
 | |
|        any of the following match-time parameters:
 | |
| 
 | |
|          A callout function
 | |
|          The limit for calling match()
 | |
|          The limit for calling match() recursively
 | |
| 
 | |
|        A match context is also required if you are using custom memory manage-
 | |
|        ment.  If none of these apply, just pass NULL as the  context  argument
 | |
|        of pcre2_match(), pcre2_dfa_match(), or pcre2_jit_match().
 | |
| 
 | |
|        A  match  context  is created, copied, and freed by the following func-
 | |
|        tions:
 | |
| 
 | |
|        pcre2_match_context *pcre2_match_context_create(
 | |
|          pcre2_general_context *gcontext);
 | |
| 
 | |
|        pcre2_match_context *pcre2_match_context_copy(
 | |
|          pcre2_match_context *mcontext);
 | |
| 
 | |
|        void pcre2_match_context_free(pcre2_match_context *mcontext);
 | |
| 
 | |
|        A match context is created with  default  values  for  its  parameters.
 | |
|        These can be changed by calling the following functions, which return 0
 | |
|        on success, or PCRE2_ERROR_BADDATA if invalid data is detected.
 | |
| 
 | |
|        int pcre2_set_callout(pcre2_match_context *mcontext,
 | |
|          int (*callout_function)(pcre2_callout_block *, void *),
 | |
|          void *callout_data);
 | |
| 
 | |
|        This sets up a "callout" function, which PCRE2 will call  at  specified
 | |
|        points during a matching operation. Details are given in the pcre2call-
 | |
|        out documentation.
 | |
| 
 | |
|        int pcre2_set_match_limit(pcre2_match_context *mcontext,
 | |
|          uint32_t value);
 | |
| 
 | |
|        The match_limit parameter provides a means  of  preventing  PCRE2  from
 | |
|        using up too many resources when processing patterns that are not going
 | |
|        to match, but which have a very large number of possibilities in  their
 | |
|        search  trees. The classic example is a pattern that uses nested unlim-
 | |
|        ited repeats.
 | |
| 
 | |
|        Internally, pcre2_match() uses a  function  called  match(),  which  it
 | |
|        calls  repeatedly (sometimes recursively). The limit set by match_limit
 | |
|        is imposed on the number of times this  function  is  called  during  a
 | |
|        match, which has the effect of limiting the amount of backtracking that
 | |
|        can take place. For patterns that are not anchored, the count  restarts
 | |
|        from  zero  for  each position in the subject string. This limit is not
 | |
|        relevant to pcre2_dfa_match(), which ignores it.
 | |
| 
 | |
|        When pcre2_match() is called with a pattern that was successfully  pro-
 | |
|        cessed by pcre2_jit_compile(), the way in which matching is executed is
 | |
|        entirely different. However, there is still the possibility of  runaway
 | |
|        matching  that  goes  on  for  a very long time, and so the match_limit
 | |
|        value is also used in this case (but in a different way) to  limit  how
 | |
|        long the matching can continue.
 | |
| 
 | |
|        The default value for the limit can be set when PCRE2 is built; the
 | |
|        built-in default is 10 million, which handles all but the most extreme
 | |
|        cases.    If    the    limit   is   exceeded,   pcre2_match()   returns
 | |
|        PCRE2_ERROR_MATCHLIMIT. A value for the match limit may  also  be  sup-
 | |
|        plied by an item at the start of a pattern of the form
 | |
| 
 | |
|          (*LIMIT_MATCH=ddd)
 | |
| 
 | |
|        where  ddd  is  a  decimal  number.  However, such a setting is ignored
 | |
|        unless ddd is less than the limit set by the  caller  of  pcre2_match()
 | |
|        or, if no such limit is set, less than the default.
 | |
| 
 | |
|        int pcre2_set_recursion_limit(pcre2_match_context *mcontext,
 | |
|          uint32_t value);
 | |
| 
 | |
|        The recursion_limit parameter is similar to match_limit, but instead of
 | |
|        limiting the total number of times that match() is  called,  it  limits
 | |
|        the  depth  of  recursion. The recursion depth is a smaller number than
 | |
|        the total number of calls, because not all calls to match() are  recur-
 | |
|        sive.  This limit is of use only if it is set smaller than match_limit.
 | |
| 
 | |
|        Limiting the recursion depth limits the amount of system stack that can
 | |
|        be used, or, when PCRE2 has been compiled to use  memory  on  the  heap
 | |
|        instead  of the stack, the amount of heap memory that can be used. This
 | |
|        limit is not relevant, and is ignored, when matching is done using  JIT
 | |
|        compiled code or by the pcre2_dfa_match() function.
 | |
| 
 | |
|        The default value for recursion_limit can be set when PCRE2 is built;
 | |
|        the built-in default is the same value as the default for match_limit.
 | |
|        If  the limit is exceeded, pcre2_match() returns PCRE2_ERROR_RECURSION-
 | |
|        LIMIT. A value for the recursion limit may also be supplied by an  item
 | |
|        at the start of a pattern of the form
 | |
| 
 | |
|          (*LIMIT_RECURSION=ddd)
 | |
| 
 | |
|        where  ddd  is  a  decimal  number.  However, such a setting is ignored
 | |
|        unless ddd is less than the limit set by the  caller  of  pcre2_match()
 | |
|        or, if no such limit is set, less than the default.
 | |
| 
 | |
|        int pcre2_set_recursion_memory_management(
 | |
|          pcre2_match_context *mcontext,
 | |
|          void *(*private_malloc)(PCRE2_SIZE, void *),
 | |
|          void (*private_free)(void *, void *), void *memory_data);
 | |
| 
 | |
|        This function sets up two additional custom memory management functions
 | |
|        for use by pcre2_match() when PCRE2 is compiled to  use  the  heap  for
 | |
|        remembering backtracking data, instead of recursive function calls that
 | |
|        use the system stack. There is a discussion about PCRE2's  stack  usage
 | |
|        in  the  pcre2stack documentation. See the pcre2build documentation for
 | |
|        details of how to build PCRE2.
 | |
| 
 | |
|        Using the heap for recursion is a non-standard way of  building  PCRE2,
 | |
|        for  use  in  environments  that  have  limited  stacks. Because of the
 | |
|        greater use of memory management, pcre2_match() runs more slowly.
 | |
|        Functions that are different from the general custom memory functions are
 | |
|        provided so that special-purpose external code can  be  used  for  this
 | |
|        case,  because  the memory blocks are all the same size. The blocks are
 | |
|        retained by pcre2_match() until it is about to exit so that they can be
 | |
|        re-used  when  possible during the match. In the absence of these func-
 | |
|        tions, the normal custom memory management functions are used, if  sup-
 | |
|        plied, otherwise the system functions.
 | |
| 
 | |
| 
 | |
| CHECKING BUILD-TIME OPTIONS
 | |
| 
 | |
|        int pcre2_config(uint32_t what, void *where);
 | |
| 
 | |
|        The  function  pcre2_config()  makes  it possible for a PCRE2 client to
 | |
|        discover which optional features have  been  compiled  into  the  PCRE2
 | |
|        library.  The  pcre2build  documentation  has  more details about these
 | |
|        optional features.
 | |
| 
 | |
|        The first argument for pcre2_config() specifies  which  information  is
 | |
|        required.  The  second  argument  is a pointer to memory into which the
 | |
|        information is placed. If NULL is  passed,  the  function  returns  the
 | |
|        amount  of  memory  that  is  needed for the requested information. For
 | |
|        calls that return  numerical  values,  the  value  is  in  bytes;  when
 | |
|        requesting  these  values,  where should point to appropriately aligned
 | |
|        memory. For calls that return strings, the required length is given  in
 | |
|        code units, not counting the terminating zero.
 | |
| 
 | |
|        When  requesting information, the returned value from pcre2_config() is
 | |
|        non-negative on success, or the negative error code  PCRE2_ERROR_BADOP-
 | |
|        TION  if the value in the first argument is not recognized. The follow-
 | |
|        ing information is available:
 | |
| 
 | |
|          PCRE2_CONFIG_BSR
 | |
| 
 | |
|        The output is a uint32_t integer whose value indicates  what  character
 | |
|        sequences  the  \R  escape  sequence  matches  by  default.  A value of
 | |
|        PCRE2_BSR_UNICODE  means  that  \R  matches  any  Unicode  line  ending
 | |
|        sequence;  a  value of PCRE2_BSR_ANYCRLF means that \R matches only CR,
 | |
|        LF, or CRLF. The default can be overridden when a pattern is compiled.
 | |
| 
 | |
|          PCRE2_CONFIG_JIT
 | |
| 
 | |
|        The output is a uint32_t integer that is set  to  one  if  support  for
 | |
|        just-in-time compiling is available; otherwise it is set to zero.
 | |
| 
 | |
|          PCRE2_CONFIG_JITTARGET
 | |
| 
 | |
|        The  where  argument  should point to a buffer that is at least 48 code
 | |
|        units long.  (The  exact  length  required  can  be  found  by  calling
 | |
|        pcre2_config()  with  where  set  to NULL.) The buffer is filled with a
 | |
|        string that contains the name of the architecture  for  which  the  JIT
 | |
|        compiler  is  configured,  for  example  "x86  32bit  (little  endian +
 | |
|        unaligned)". If JIT support is not available, PCRE2_ERROR_BADOPTION  is
 | |
|        returned,  otherwise the number of code units used is returned. This is
 | |
|        the length of the string, plus one unit for the terminating zero.
 | |
| 
 | |
|          PCRE2_CONFIG_LINKSIZE
 | |
| 
 | |
|        The output is a uint32_t integer that contains the number of bytes used
 | |
|        for  internal  linkage  in  compiled regular expressions. When PCRE2 is
 | |
|        configured, the value can be set to 2, 3, or 4, with the default  being
 | |
|        2.  This is the value that is returned by pcre2_config(). However, when
 | |
|        the 16-bit library is compiled, a value of 3 is rounded up  to  4,  and
 | |
|        when  the  32-bit  library  is compiled, internal linkages always use 4
 | |
|        bytes, so the configured value is not relevant.
 | |
| 
 | |
|        The default value of 2 for the 8-bit and 16-bit libraries is sufficient
 | |
|        for  all but the most massive patterns, since it allows the size of the
 | |
|        compiled pattern to be up to 64K code units. Larger values allow larger
 | |
|        regular  expressions  to be compiled by those two libraries, but at the
 | |
|        expense of slower matching.
 | |
| 
 | |
|          PCRE2_CONFIG_MATCHLIMIT
 | |
| 
 | |
|        The output is a uint32_t integer that gives the default limit  for  the
 | |
|        number  of  internal  matching function calls in a pcre2_match() execu-
 | |
|        tion. Further details are given with pcre2_match() below.
 | |
| 
 | |
|          PCRE2_CONFIG_NEWLINE
 | |
| 
 | |
|        The output is a uint32_t integer  whose  value  specifies  the  default
 | |
|        character  sequence that is recognized as meaning "newline". The values
 | |
|        are:
 | |
| 
 | |
|          PCRE2_NEWLINE_CR       Carriage return (CR)
 | |
|          PCRE2_NEWLINE_LF       Linefeed (LF)
 | |
|          PCRE2_NEWLINE_CRLF     Carriage return, linefeed (CRLF)
 | |
|          PCRE2_NEWLINE_ANY      Any Unicode line ending
 | |
|          PCRE2_NEWLINE_ANYCRLF  Any of CR, LF, or CRLF
 | |
| 
 | |
|        The default should normally correspond to  the  standard  sequence  for
 | |
|        your operating system.
 | |
| 
 | |
|          PCRE2_CONFIG_PARENSLIMIT
 | |
| 
 | |
|        The  output is a uint32_t integer that gives the maximum depth of nest-
 | |
|        ing of parentheses (of any kind) in a pattern. This limit is imposed to
 | |
|        cap  the  amount of system stack used when a pattern is compiled. It is
 | |
|        specified when PCRE2 is built; the default is 250. This limit does  not
 | |
|        take  into  account  the  stack that may already be used by the calling
 | |
|        application. For  finer  control  over  compilation  stack  usage,  see
 | |
|        pcre2_set_compile_recursion_guard().
 | |
| 
 | |
|          PCRE2_CONFIG_RECURSIONLIMIT
 | |
| 
 | |
|        The  output  is a uint32_t integer that gives the default limit for the
 | |
|        depth of recursion when calling the internal  matching  function  in  a
 | |
|        pcre2_match()  execution.  Further details are given with pcre2_match()
 | |
|        below.
 | |
| 
 | |
|          PCRE2_CONFIG_STACKRECURSE
 | |
| 
 | |
|        The output is a uint32_t integer that is set to one if internal  recur-
 | |
|        sion  when  running  pcre2_match() is implemented by recursive function
 | |
|        calls that use the system stack to remember their state.  This  is  the
 | |
|        usual  way that PCRE2 is compiled. The output is zero if PCRE2 was com-
 | |
|        piled to use blocks of data on the heap instead of  recursive  function
 | |
|        calls.
 | |
| 
 | |
|          PCRE2_CONFIG_UNICODE_VERSION
 | |
| 
 | |
|        The  where  argument  should point to a buffer that is at least 24 code
 | |
|        units long.  (The  exact  length  required  can  be  found  by  calling
 | |
|        pcre2_config()  with  where  set  to  NULL.) If PCRE2 has been compiled
 | |
|        without Unicode support, the buffer is filled with  the  text  "Unicode
 | |
|        not  supported".  Otherwise,  the  Unicode version string (for example,
 | |
|        "7.0.0") is inserted. The number of code units used is  returned.  This
 | |
|        is the length of the string plus one unit for the terminating zero.
 | |
| 
 | |
|          PCRE2_CONFIG_UNICODE
 | |
| 
 | |
|        The  output is a uint32_t integer that is set to one if Unicode support
 | |
|        is available; otherwise it is set to zero. Unicode support implies  UTF
 | |
|        support.
 | |
| 
 | |
|          PCRE2_CONFIG_VERSION
 | |
| 
 | |
|        The  where  argument  should point to a buffer that is at least 12 code
 | |
|        units long.  (The  exact  length  required  can  be  found  by  calling
 | |
|        pcre2_config()  with  where set to NULL.) The buffer is filled with the
 | |
|        PCRE2 version string, zero-terminated. The number of code units used is
 | |
|        returned. This is the length of the string plus one unit for the termi-
 | |
|        nating zero.
 | |
| 
 | |
| 
 | |
| COMPILING A PATTERN
 | |
| 
 | |
|        pcre2_code *pcre2_compile(PCRE2_SPTR pattern, PCRE2_SIZE length,
 | |
|          uint32_t options, int *errorcode, PCRE2_SIZE *erroroffset,
 | |
|          pcre2_compile_context *ccontext);
 | |
| 
 | |
|        void pcre2_code_free(pcre2_code *code);
 | |
| 
 | |
|        The pcre2_compile() function compiles a pattern into an internal  form.
 | |
|        The  pattern  is  defined  by a pointer to a string of code units and a
 | |
|        length. If the pattern is zero-terminated, the length can be specified
 | |
|        as  PCRE2_ZERO_TERMINATED. The function returns a pointer to a block of
 | |
|        memory that contains the compiled pattern and related data. The  caller
 | |
|        must  free the memory by calling pcre2_code_free() when it is no longer
 | |
|        needed.
 | |
| 
 | |
|        NOTE: When one of the matching functions is  called,  pointers  to  the
 | |
|        compiled pattern and the subject string are set in the match data block
 | |
|        so that they can be referenced by the extraction functions. After  run-
 | |
|        ning  a  match,  you  must  not  free  a compiled pattern (or a subject
 | |
|        string) until after all operations on the match data block  have  taken
 | |
|        place.
 | |
| 
 | |
|        If  the  compile context argument ccontext is NULL, memory for the com-
 | |
|        piled pattern  is  obtained  by  calling  malloc().  Otherwise,  it  is
 | |
|        obtained  from  the  same memory function that was used for the compile
 | |
|        context.
 | |
| 
 | |
|        The options argument contains various bit settings that affect the com-
 | |
|        pilation.  It  should be zero if no options are required. The available
 | |
|        options are described below. Some of them (in  particular,  those  that
 | |
|        are  compatible with Perl, but some others as well) can also be set and
 | |
|        unset from within the pattern (see  the  detailed  description  in  the
 | |
|        pcre2pattern documentation).
 | |
| 
 | |
|        For  those options that can be different in different parts of the pat-
 | |
|        tern, the contents of the options argument specifies their settings  at
 | |
|        the  start  of  compilation.  The PCRE2_ANCHORED and PCRE2_NO_UTF_CHECK
 | |
|        options can be set at the time of matching as well as at compile time.
 | |
| 
 | |
|        Other, less frequently required compile-time parameters  (for  example,
 | |
|        the newline setting) can be provided in a compile context (as described
 | |
|        above).
 | |
| 
 | |
|        If errorcode or erroroffset is NULL, pcre2_compile() returns NULL imme-
 | |
|        diately.  Otherwise, if compilation of a pattern fails, pcre2_compile()
 | |
|        returns NULL, having set these variables to an error code and an offset
 | |
|        (number   of   code   units)  within  the  pattern,  respectively.  The
 | |
|        pcre2_get_error_message() function provides a textual message for  each
 | |
|        error code. Compilation errors are positive numbers, but UTF formatting
 | |
|        errors are negative numbers. For an invalid UTF-8 or UTF-16 string, the
 | |
|        offset is that of the first code unit of the failing character.
 | |
| 
 | |
|        Some  errors are not detected until the whole pattern has been scanned;
 | |
|        in these cases, the offset passed back is the length  of  the  pattern.
 | |
|        Note  that  the  offset is in code units, not characters, even in a UTF
 | |
|        mode. It may sometimes point into the middle of a UTF-8 or UTF-16 char-
 | |
|        acter.
 | |
| 
 | |
|        This  code  fragment shows a typical straightforward call to pcre2_com-
 | |
|        pile():
 | |
| 
 | |
|          pcre2_code *re;
 | |
|          PCRE2_SIZE erroffset;
 | |
|          int errorcode;
 | |
|          re = pcre2_compile(
 | |
|            (PCRE2_SPTR)"^A.*Z",    /* the pattern */
 | |
|            PCRE2_ZERO_TERMINATED,  /* the pattern is zero-terminated */
 | |
|            0,                      /* default options */
 | |
|            &errorcode,             /* for error code */
 | |
|            &erroffset,             /* for error offset */
 | |
|            NULL);                  /* no compile context */
 | |
| 
 | |
|        The following names for option bits are defined in the  pcre2.h  header
 | |
|        file:
 | |
| 
 | |
|          PCRE2_ANCHORED
 | |
| 
 | |
|        If this bit is set, the pattern is forced to be "anchored", that is, it
 | |
|        is constrained to match only at the first matching point in the  string
 | |
|        that  is being searched (the "subject string"). This effect can also be
 | |
|        achieved by appropriate constructs in the pattern itself, which is  the
 | |
|        only way to do it in Perl.
 | |
| 
 | |
|          PCRE2_ALLOW_EMPTY_CLASS
 | |
| 
 | |
|        By  default, for compatibility with Perl, a closing square bracket that
 | |
|        immediately follows an opening one is treated as a data  character  for
 | |
|        the  class.  When  PCRE2_ALLOW_EMPTY_CLASS  is  set,  it terminates the
 | |
|        class, which therefore contains no characters and so can never match.
 | |
| 
 | |
|          PCRE2_ALT_BSUX
 | |
| 
 | |
|        This option requests alternative handling of three escape sequences,
 | |
|        which makes PCRE2's behaviour more like ECMAScript (aka JavaScript).
 | |
|        When it is set:
 | |
| 
 | |
|        (1) \U matches an upper case "U" character; by default \U causes a com-
 | |
|        pile time error (Perl uses \U to upper case subsequent characters).
 | |
| 
 | |
|        (2) \u matches a lower case "u" character unless it is followed by four
 | |
|        hexadecimal digits, in which case the hexadecimal  number  defines  the
 | |
|        code  point  to match. By default, \u causes a compile time error (Perl
 | |
|        uses it to upper case the following character).
 | |
| 
 | |
|        (3) \x matches a lower case "x" character unless it is followed by  two
 | |
|        hexadecimal  digits,  in  which case the hexadecimal number defines the
 | |
|        code point to match. By default, as in Perl, a  hexadecimal  number  is
 | |
|        always expected after \x, but it may have zero, one, or two digits (so,
 | |
|        for example, \xz matches a binary zero character followed by z).
 | |
| 
 | |
|          PCRE2_ALT_CIRCUMFLEX
 | |
| 
 | |
|        In  multiline  mode  (when  PCRE2_MULTILINE  is  set),  the  circumflex
 | |
|        metacharacter  matches at the start of the subject (unless PCRE2_NOTBOL
 | |
|        is set), and also after any internal  newline.  However,  it  does  not
 | |
|        match after a newline at the end of the subject, for compatibility with
 | |
|        Perl. If you want a multiline circumflex also to match after  a  termi-
 | |
|        nating newline, you must set PCRE2_ALT_CIRCUMFLEX.
 | |
| 
 | |
|          PCRE2_AUTO_CALLOUT
 | |
| 
 | |
|        If  this  bit  is  set,  pcre2_compile()  automatically inserts callout
 | |
|        items, all with number 255, before each pattern item. For discussion of
 | |
|        the callout facility, see the pcre2callout documentation.
 | |
| 
 | |
|          PCRE2_CASELESS
 | |
| 
 | |
|        If  this  bit is set, letters in the pattern match both upper and lower
 | |
|        case letters in the subject. It is equivalent to Perl's /i option,  and
 | |
|        it can be changed within a pattern by a (?i) option setting.
 | |
| 
 | |
|          PCRE2_DOLLAR_ENDONLY
 | |
| 
 | |
|        If  this bit is set, a dollar metacharacter in the pattern matches only
 | |
|        at the end of the subject string. Without this option,  a  dollar  also
 | |
|        matches  immediately before a newline at the end of the string (but not
 | |
|        before any other newlines). The PCRE2_DOLLAR_ENDONLY option is  ignored
 | |
|        if  PCRE2_MULTILINE  is  set.  There is no equivalent to this option in
 | |
|        Perl, and no way to set it within a pattern.
 | |
| 
 | |
|          PCRE2_DOTALL
 | |
| 
 | |
|        If this bit is set, a dot metacharacter  in  the  pattern  matches  any
 | |
|        character,  including  one  that  indicates a newline. However, it only
 | |
|        ever matches one character, even if newlines are coded as CRLF. Without
 | |
|        this option, a dot does not match when the current position in the sub-
 | |
|        ject is at a newline. This option is equivalent to  Perl's  /s  option,
 | |
|        and it can be changed within a pattern by a (?s) option setting. A neg-
 | |
|        ative class such as [^a] always matches newline characters, independent
 | |
|        of the setting of this option.
 | |
| 
 | |
|          PCRE2_DUPNAMES
 | |
| 
 | |
|        If  this  bit is set, names used to identify capturing subpatterns need
 | |
|        not be unique. This can be helpful for certain types of pattern when it
 | |
|        is  known  that  only  one instance of the named subpattern can ever be
 | |
|        matched. There are more details of named subpatterns  below;  see  also
 | |
|        the pcre2pattern documentation.
 | |
| 
 | |
|          PCRE2_EXTENDED
 | |
| 
 | |
|        If  this  bit  is  set,  most white space characters in the pattern are
 | |
|        totally ignored except when escaped or inside a character  class.  How-
 | |
|        ever,  white  space  is  not  allowed within sequences such as (?> that
 | |
|        introduce various parenthesized subpatterns, nor within numerical quan-
 | |
|        tifiers  such  as {1,3}.  Ignorable white space is permitted between an
 | |
|        item and a following quantifier and between a quantifier and a  follow-
 | |
|        ing + that indicates possessiveness.
 | |
| 
 | |
|        PCRE2_EXTENDED  also causes characters between an unescaped # outside a
 | |
|        character class and the next newline, inclusive, to be  ignored,  which
 | |
|        makes it possible to include comments inside complicated patterns. Note
 | |
|        that the end of this type of comment is a literal newline  sequence  in
 | |
|        the pattern; escape sequences that happen to represent a newline do not
 | |
|        count. PCRE2_EXTENDED is equivalent to Perl's /x option, and it can  be
 | |
|        changed within a pattern by a (?x) option setting.
 | |
| 
 | |
|        Which characters are interpreted as newlines can be specified by a set-
 | |
|        ting in the compile context that is passed to pcre2_compile() or  by  a
 | |
|        special  sequence at the start of the pattern, as described in the sec-
 | |
|        tion entitled "Newline conventions" in the pcre2pattern  documentation.
 | |
|        A default is defined when PCRE2 is built.
 | |
| 
 | |
|          PCRE2_FIRSTLINE
 | |
| 
 | |
|        If  this  option  is  set,  an  unanchored pattern is required to match
 | |
|        before or at the first  newline  in  the  subject  string,  though  the
 | |
|        matched text may continue over the newline.
 | |
| 
 | |
|          PCRE2_MATCH_UNSET_BACKREF
 | |
| 
 | |
|        If  this  option  is set, a back reference to an unset subpattern group
 | |
|        matches an empty string (by default this causes  the  current  matching
 | |
|        alternative  to  fail).   A  pattern such as (\1)(a) succeeds when this
 | |
|        option is set (assuming it can find an "a" in the subject), whereas  it
 | |
|        fails  by  default,  for  Perl compatibility. Setting this option makes
 | |
|        PCRE2 behave more like ECMAScript (aka JavaScript).
 | |
| 
 | |
|          PCRE2_MULTILINE
 | |
| 
 | |
|        By default, for the purposes of matching "start of line"  and  "end  of
 | |
|        line",  PCRE2  treats the subject string as consisting of a single line
 | |
|        of characters, even if it actually contains  newlines.  The  "start  of
 | |
|        line"  metacharacter  (^)  matches only at the start of the string, and
 | |
|        the "end of line" metacharacter ($) matches only  at  the  end  of  the
 | |
|        string,  or  before  a  terminating  newline  (except  when  PCRE2_DOL-
 | |
|        LAR_ENDONLY is set). Note, however, that unless  PCRE2_DOTALL  is  set,
 | |
|        the "any character" metacharacter (.) does not match at a newline. This
 | |
|        behaviour (for ^, $, and dot) is the same as Perl.
 | |
| 
 | |
|        When  PCRE2_MULTILINE  is  set, the "start of line" and "end of line"
 | |
|        constructs  match  immediately following or immediately before internal
 | |
|        newlines in the subject string, respectively, as well as  at  the  very
 | |
|        start  and  end.  This is equivalent to Perl's /m option, and it can be
 | |
|        changed within a pattern by a (?m) option setting. Note that the "start
 | |
|        of line" metacharacter does not match after a newline at the end of the
 | |
|        subject, for compatibility with Perl.  However, you can change this  by
 | |
|        setting  the PCRE2_ALT_CIRCUMFLEX option. If there are no newlines in a
 | |
|        subject string, or no occurrences of ^  or  $  in  a  pattern,  setting
 | |
|        PCRE2_MULTILINE has no effect.
 | |
| 
 | |
|          PCRE2_NEVER_BACKSLASH_C
 | |
| 
 | |
|        This  option  locks out the use of \C in the pattern that is being com-
 | |
|        piled.  This escape can  cause  unpredictable  behaviour  in  UTF-8  or
 | |
|        UTF-16  modes,  because  it may leave the current matching point in the
 | |
|        middle of a multi-code-unit character. This option  may  be  useful  in
 | |
|        applications that process patterns from external sources.
 | |
| 
 | |
|          PCRE2_NEVER_UCP
 | |
| 
 | |
|        This  option  locks  out the use of Unicode properties for handling \B,
 | |
|        \b, \D, \d, \S, \s, \W, \w, and some of the POSIX character classes, as
 | |
|        described  for  the  PCRE2_UCP option below. In particular, it prevents
 | |
|        the creator of the pattern from enabling this facility by starting  the
 | |
|        pattern  with  (*UCP).  This  option may be useful in applications that
 | |
|        process  patterns  from  external  sources.  The  option   combination
 | |
|        PCRE2_UCP and PCRE2_NEVER_UCP causes an error.
 | |
| 
 | |
|          PCRE2_NEVER_UTF
 | |
| 
 | |
|        This  option  locks out interpretation of the pattern as UTF-8, UTF-16,
 | |
|        or UTF-32, depending on which library is in use. In particular, it pre-
 | |
|        vents  the  creator of the pattern from switching to UTF interpretation
 | |
|        by starting the pattern with (*UTF).  This  option  may  be  useful  in
 | |
|        applications  that process patterns from external sources. The combina-
 | |
|        tion of PCRE2_UTF and PCRE2_NEVER_UTF causes an error.
 | |
| 
 | |
|          PCRE2_NO_AUTO_CAPTURE
 | |
| 
 | |
|        If this option is set, it disables the use of numbered capturing paren-
 | |
|        theses  in the pattern. Any opening parenthesis that is not followed by
 | |
|        ? behaves as if it were followed by ?: but named parentheses can  still
 | |
|        be  used  for  capturing  (and  they acquire numbers in the usual way).
 | |
|        There is no equivalent of this option in Perl.
 | |
| 
 | |
|          PCRE2_NO_AUTO_POSSESS
 | |
| 
 | |
|        If this option is set, it disables "auto-possessification", which is an
 | |
|        optimization  that,  for example, turns a+b into a++b in order to avoid
 | |
|        backtracks into a+ that can never be successful. However,  if  callouts
 | |
|        are  in  use,  auto-possessification means that some callouts are never
 | |
|        taken. You can set this option if you want the matching functions to do
 | |
|        a  full  unoptimized  search and run all the callouts, but it is mainly
 | |
|        provided for testing purposes.
 | |
| 
 | |
|          PCRE2_NO_DOTSTAR_ANCHOR
 | |
| 
 | |
|        If this option is set, it disables an optimization that is applied when
 | |
|        .*  is  the  first significant item in a top-level branch of a pattern,
 | |
|        and all the other branches also start with .* or with \A or  \G  or  ^.
 | |
|        The  optimization  is  automatically disabled for .* if it is inside an
 | |
|        atomic group or a capturing group that is the subject of a back  refer-
 | |
|        ence,  or  if  the pattern contains (*PRUNE) or (*SKIP). When the opti-
 | |
|        mization is not disabled, such a pattern is automatically  anchored  if
 | |
|        PCRE2_DOTALL is set for all the .* items and PCRE2_MULTILINE is not set
 | |
|        for any ^ items. Otherwise, the fact that any match must  start  either
 | |
|        at  the start of the subject or following a newline is remembered. Like
 | |
|        other optimizations, this can cause callouts to be skipped.
 | |
| 
 | |
|          PCRE2_NO_START_OPTIMIZE
 | |
| 
 | |
|        This is an option whose main effect is at matching time.  It  does  not
 | |
|        change what pcre2_compile() generates, but it does affect the output of
 | |
|        the JIT compiler.
 | |
| 
 | |
|        There are a number of optimizations that may occur at the  start  of  a
 | |
|        match,  in  order  to speed up the process. For example, if it is known
 | |
|        that an unanchored match must start  with  a  specific  character,  the
 | |
|        matching  code searches the subject for that character, and fails imme-
 | |
|        diately if it cannot find it, without actually running the main  match-
 | |
|        ing  function.  This means that a special item such as (*COMMIT) at the
 | |
|        start of a pattern is not considered until after  a  suitable  starting
 | |
|        point  for  the  match  has  been found. Also, when callouts or (*MARK)
 | |
|        items are in use, these "start-up" optimizations can cause them  to  be
 | |
|        skipped  if  the pattern is never actually used. The start-up optimiza-
 | |
|        tions are in effect a pre-scan of the subject that takes  place  before
 | |
|        the pattern is run.
 | |
| 
 | |
|        The PCRE2_NO_START_OPTIMIZE option disables the start-up optimizations,
 | |
|        possibly causing performance to suffer,  but  ensuring  that  in  cases
 | |
|        where  the  result is "no match", the callouts do occur, and that items
 | |
|        such as (*COMMIT) and (*MARK) are considered at every possible starting
 | |
|        position in the subject string.
 | |
| 
 | |
|        Setting  PCRE2_NO_START_OPTIMIZE  may  change the outcome of a matching
 | |
|        operation.  Consider the pattern
 | |
| 
 | |
|          (*COMMIT)ABC
 | |
| 
 | |
|        When this is compiled, PCRE2 records the fact that a match  must  start
 | |
|        with  the  character  "A".  Suppose the subject string is "DEFABC". The
 | |
|        start-up optimization scans along the subject, finds "A" and  runs  the
 | |
|        first  match attempt from there. The (*COMMIT) item means that the pat-
 | |
|        tern must match the current starting position, which in this  case,  it
 | |
|        does.  However,  if  the same match is run with PCRE2_NO_START_OPTIMIZE
 | |
|        set, the initial scan along the subject string  does  not  happen.  The
 | |
|        first  match  attempt  is  run  starting  from "D" and when this fails,
 | |
|        (*COMMIT) prevents any further matches  being  tried,  so  the  overall
 | |
|        result is "no match". There are also other start-up optimizations.  For
 | |
|        example, a minimum length for the subject may be recorded. Consider the
 | |
|        pattern
 | |
| 
 | |
|          (*MARK:A)(X|Y)
 | |
| 
 | |
|        The  minimum  length  for  a  match is one character. If the subject is
 | |
|        "ABC", there will be attempts to match "ABC", "BC", and "C". An attempt
 | |
|        to match an empty string at the end of the subject does not take place,
 | |
|        because PCRE2 knows that the subject is  now  too  short,  and  so  the
 | |
|        (*MARK)  is  never encountered. In this case, the optimization does not
 | |
|        affect the overall match result, which is still "no match", but it does
 | |
|        affect the auxiliary information that is returned.
 | |
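       The two examples above can be imitated with a toy model in plain C. It
       does not use PCRE2 and does not claim to reflect its internals; the
       function names are hypothetical, and each function hard-codes only the
       behaviour described in the text.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy model of matching (*COMMIT)ABC. Returns the match offset, or -1 for
   "no match". start_optimize selects whether the pre-scan takes place. */
static long match_commit_abc(const char *subject, int start_optimize)
{
  const char *start = subject;

  assert(subject != NULL);
  if (start_optimize)
    {
    /* Pre-scan: move to the first "A"; fail at once if there is none. */
    start = strchr(subject, 'A');
    if (start == NULL) return -1;
    }
  /* One attempt only: if it fails, (*COMMIT) rules out later start points. */
  if (strncmp(start, "ABC", 3) == 0) return (long)(start - subject);
  return -1;
}

/* Number of starting positions that are tried for a pattern whose minimum
   match length is min_len characters. */
static size_t start_positions_tried(size_t subject_len, size_t min_len,
  int start_optimize)
{
  if (!start_optimize) return subject_len + 1;  /* includes the very end */
  if (min_len > subject_len) return 0;
  return subject_len - min_len + 1;
}
```

       In this model "DEFABC" gives offset 3 with the pre-scan and "no match"
       without it, and the three-character subject "ABC" with a minimum
       length of one is tried at three positions with the optimization and
       at four without it.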
| 
 | |
|          PCRE2_NO_UTF_CHECK
 | |
| 
 | |
|        When  PCRE2_UTF  is set, the validity of the pattern as a UTF string is
 | |
|        automatically checked. There are  discussions  about  the  validity  of
 | |
|        UTF-8  strings,  UTF-16 strings, and UTF-32 strings in the pcre2unicode
 | |
|        document.  If an invalid UTF sequence is found, pcre2_compile() returns
 | |
|        a negative error code.
 | |
| 
 | |
|        If you know that your pattern is valid, and you want to skip this check
 | |
|        for performance reasons, you can  set  the  PCRE2_NO_UTF_CHECK  option.
 | |
|        When  it  is set, the effect of passing an invalid UTF string as a pat-
 | |
|        tern is undefined. It may cause your program to  crash  or  loop.  Note
 | |
|        that   this   option   can   also   be   passed  to  pcre2_match()  and
 | |
|        pcre2_dfa_match(), to suppress validity checking of the subject string.
 | |
| 
 | |
|          PCRE2_UCP
 | |
| 
 | |
|        This option changes the way PCRE2 processes \B, \b, \D, \d, \S, \s, \W,
 | |
|        \w,  and  some  of  the POSIX character classes. By default, only ASCII
 | |
|        characters are recognized, but if PCRE2_UCP is set, Unicode  properties
 | |
|        are  used instead to classify characters. More details are given in the
 | |
|        section on generic character types in the pcre2pattern page. If you set
 | |
|        PCRE2_UCP,  matching one of the items it affects takes much longer. The
 | |
|        option is available only if PCRE2 has been compiled with  Unicode  sup-
 | |
|        port.
 | |
| 
 | |
|          PCRE2_UNGREEDY
 | |
| 
 | |
|        This  option  inverts  the "greediness" of the quantifiers so that they
 | |
|        are not greedy by default, but become greedy if followed by "?". It  is
 | |
|        not  compatible  with Perl. It can also be set by a (?U) option setting
 | |
|        within the pattern.
 | |
| 
 | |
|          PCRE2_UTF
 | |
| 
 | |
|        This option causes PCRE2 to regard both the  pattern  and  the  subject
 | |
|        strings  that  are  subsequently processed as strings of UTF characters
 | |
|        instead of single-code-unit strings. It  is  available  when  PCRE2  is
 | |
|        built  to  include  Unicode  support (which is the default). If Unicode
 | |
|        support is not available, the use of this  option  provokes  an  error.
 | |
|        Details  of how this option changes the behaviour of PCRE2 are given in
 | |
|        the pcre2unicode page.
 | |
| 
 | |
| 
 | |
| COMPILATION ERROR CODES
 | |
| 
 | |
|        There are over 80 positive error codes that pcre2_compile() may  return
 | |
|        if it finds an error in the pattern. There are also some negative error
 | |
|        codes that are used for invalid UTF strings.  These  are  the  same  as
 | |
|        given  by pcre2_match() and pcre2_dfa_match(), and are described in the
 | |
|        pcre2unicode page. The pcre2_get_error_message() function can be called
 | |
|        to obtain a textual error message from any error code.
 | |
| 
 | |
| 
 | |
| JUST-IN-TIME (JIT) COMPILATION
 | |
| 
 | |
|        int pcre2_jit_compile(pcre2_code *code, uint32_t options);
 | |
| 
 | |
|        int pcre2_jit_match(const pcre2_code *code, PCRE2_SPTR subject,
 | |
|          PCRE2_SIZE length, PCRE2_SIZE startoffset,
 | |
|          uint32_t options, pcre2_match_data *match_data,
 | |
|          pcre2_match_context *mcontext);
 | |
| 
 | |
|        void pcre2_jit_free_unused_memory(pcre2_general_context *gcontext);
 | |
| 
 | |
|        pcre2_jit_stack *pcre2_jit_stack_create(PCRE2_SIZE startsize,
 | |
|          PCRE2_SIZE maxsize, pcre2_general_context *gcontext);
 | |
| 
 | |
|        void pcre2_jit_stack_assign(pcre2_match_context *mcontext,
 | |
|          pcre2_jit_callback callback_function, void *callback_data);
 | |
| 
 | |
|        void pcre2_jit_stack_free(pcre2_jit_stack *jit_stack);
 | |
| 
 | |
|        These  functions  provide  support  for  JIT compilation, which, if the
 | |
|        just-in-time compiler is available, further processes a  compiled  pat-
 | |
|        tern into machine code that executes much faster than the pcre2_match()
 | |
|        interpretive matching function. Full details are given in the  pcre2jit
 | |
|        documentation.
 | |
| 
 | |
|        JIT  compilation  is  a heavyweight optimization. It can take some time
 | |
|        for patterns to be analyzed, and for one-off matches  and  simple  pat-
 | |
|        terns  the benefit of faster execution might be offset by a much slower
 | |
|        compilation time.  Most, but not all patterns can be optimized  by  the
 | |
|        JIT compiler.
 | |
| 
 | |
| 
 | |
| LOCALE SUPPORT
 | |
| 
 | |
|        PCRE2  handles caseless matching, and determines whether characters are
 | |
|        letters, digits, or whatever, by reference to a set of tables,  indexed
 | |
|        by  character  code  point.  This applies only to characters whose code
 | |
|        points are less than 256. By default, higher-valued code  points  never
 | |
|        match  escapes  such  as \w or \d.  However, if PCRE2 is built with UTF
 | |
|        support, all characters can be tested with  \p  and  \P,  or,  alterna-
 | |
|        tively,  the  PCRE2_UCP  option  can be set when a pattern is compiled;
 | |
|        this causes \w and friends to use Unicode property support  instead  of
 | |
|        the built-in tables.
 | |
| 
 | |
|        The  use  of  locales  with Unicode is discouraged. If you are handling
 | |
|        characters with code points greater than 127,  you  should  either  use
 | |
|        Unicode support, or use locales, but not try to mix the two.
 | |
| 
 | |
|        PCRE2  contains  an  internal  set of character tables that are used by
 | |
|        default.  These are sufficient for  many  applications.  Normally,  the
 | |
|        internal tables recognize only ASCII characters. However, when PCRE2 is
 | |
|        built, it is possible to cause the internal tables to be rebuilt in the
 | |
|        default "C" locale of the local system, which may cause them to be dif-
 | |
|        ferent.
 | |
| 
 | |
|        The internal tables can be overridden by tables supplied by the  appli-
 | |
|        cation  that  calls  PCRE2.  These may be created in a different locale
 | |
|        from the default.  As more and more applications change to  using  Uni-
 | |
|        code, the need for this locale support is expected to die away.
 | |
| 
 | |
|        External  tables  are built by calling the pcre2_maketables() function,
 | |
|        in the relevant locale. The result can be passed to pcre2_compile()  as
 | |
|        often   as  necessary,  by  creating  a  compile  context  and  calling
 | |
|        pcre2_set_character_tables() to set the  tables  pointer  therein.  For
 | |
|        example,  to  build  and use tables that are appropriate for the French
 | |
|        locale (where accented characters with  values  greater  than  127  are
 | |
|        treated as letters), the following code could be used:
 | |
| 
 | |
|          setlocale(LC_CTYPE, "fr_FR");
 | |
|          tables = pcre2_maketables(NULL);
 | |
|          ccontext = pcre2_compile_context_create(NULL);
 | |
|          pcre2_set_character_tables(ccontext, tables);
 | |
|          re = pcre2_compile(..., ccontext);
 | |
| 
 | |
|        The  locale  name "fr_FR" is used on Linux and other Unix-like systems;
 | |
|        if you are using Windows, the name for the French locale  is  "french".
 | |
|        It  is the caller's responsibility to ensure that the memory containing
 | |
|        the tables remains available for as long as it is needed.
 | |
| 
 | |
|        The pointer that is passed (via the compile context) to pcre2_compile()
 | |
|        is  saved  with  the  compiled pattern, and the same tables are used by
 | |
|        pcre2_match() and pcre2_dfa_match(). Thus, for any single pattern, com-
 | |
|        pilation  and  matching  both  happen in the same locale, but different
 | |
|        patterns can be processed in different locales.
 | |
| 
 | |
| 
 | |
| INFORMATION ABOUT A COMPILED PATTERN
 | |
| 
 | |
|        int pcre2_pattern_info(const pcre2_code *code, uint32_t what, void *where);
 | |
| 
 | |
|        The pcre2_pattern_info() function returns general information  about  a
 | |
|        compiled pattern. For information about callouts, see the next section.
 | |
|        The first argument for pcre2_pattern_info() is a pointer  to  the  com-
 | |
|        piled pattern. The second argument specifies which piece of information
 | |
|        is required, and the third argument is  a  pointer  to  a  variable  to
 | |
|        receive  the data. If the third argument is NULL, the first argument is
 | |
|        ignored, and the function returns the size in  bytes  of  the  variable
 | |
|        that is required for the information requested. Otherwise, the yield of
 | |
|        the function is zero for success, or one of the following negative num-
 | |
|        bers:
 | |
| 
 | |
|          PCRE2_ERROR_NULL           the argument code was NULL
 | |
|          PCRE2_ERROR_BADMAGIC       the "magic number" was not found
 | |
|          PCRE2_ERROR_BADOPTION      the value of what was invalid
 | |
|          PCRE2_ERROR_UNSET          the requested field is not set
 | |
| 
 | |
|        The  "magic  number" is placed at the start of each compiled pattern as
 | |
|        a  simple check against passing an arbitrary memory pointer. Here is  a
 | |
|        typical  call of pcre2_pattern_info(), to obtain the length of the com-
 | |
|        piled pattern:
 | |
| 
 | |
|          int rc;
 | |
|          size_t length;
 | |
|          rc = pcre2_pattern_info(
 | |
|            re,               /* result of pcre2_compile() */
 | |
|            PCRE2_INFO_SIZE,  /* what is required */
 | |
|            &length);         /* where to put the data */
 | |
| 
 | |
|        The possible values for the second argument are defined in pcre2.h, and
 | |
|        are as follows:
 | |
| 
 | |
|          PCRE2_INFO_ALLOPTIONS
 | |
|          PCRE2_INFO_ARGOPTIONS
 | |
| 
 | |
|        Return a copy of the pattern's options. The third argument should point
 | |
|        to a  uint32_t  variable.  PCRE2_INFO_ARGOPTIONS  returns  exactly  the
 | |
|        options  that were passed to pcre2_compile(), whereas PCRE2_INFO_ALLOP-
 | |
|        TIONS returns the compile options as modified by any  top-level  option
 | |
|        settings  at  the start of the pattern itself. In other words, they are
 | |
|        the options that will be in force when matching starts. For example, if
 | |
|        the  pattern  /(?im)abc(?-i)d/  is  compiled  with  the  PCRE2_EXTENDED
 | |
|        option,   the   result   is   PCRE2_CASELESS,   PCRE2_MULTILINE,    and
 | |
|        PCRE2_EXTENDED.
 | |
| 
 | |
|        A  pattern compiled without PCRE2_ANCHORED is automatically anchored by
 | |
|        PCRE2 if the first significant item in every top-level branch is one of
 | |
|        the following:
 | |
| 
 | |
|          ^     unless PCRE2_MULTILINE is set
 | |
|          \A    always
 | |
|          \G    always
 | |
|          .*    sometimes - see below
 | |
| 
 | |
|        When  .* is the first significant item, anchoring is possible only when
 | |
|        all the following are true:
 | |
| 
 | |
|          .* is not in an atomic group
 | |
|          .* is not in a capturing group that is the subject
 | |
|               of a back reference
 | |
|          PCRE2_DOTALL is in force for .*
 | |
|          Neither (*PRUNE) nor (*SKIP) appears in the pattern.
 | |
|          PCRE2_NO_DOTSTAR_ANCHOR is not set.
 | |
| 
 | |
|        For patterns that are auto-anchored, the PCRE2_ANCHORED bit is  set  in
 | |
|        the options returned for PCRE2_INFO_ALLOPTIONS.
 | |
| 
 | |
|          PCRE2_INFO_BACKREFMAX
 | |
| 
 | |
|        Return  the  number  of  the highest back reference in the pattern. The
 | |
|        third  argument should point to a uint32_t variable. Named  subpatterns
 | |
|        acquire  numbers  as well as names, and these count towards the highest
 | |
|        back reference.  Back references such as \4 or \g{12}  match  the  cap-
 | |
|        tured  characters of the given group, but in addition, the check that a
 | |
|        capturing group is set in a conditional subpattern such as (?(3)a|b) is
 | |
|        also  a  back  reference.  Zero is returned if there are no back refer-
 | |
|        ences.
 | |
| 
 | |
|          PCRE2_INFO_BSR
 | |
| 
 | |
|        The output is a uint32_t whose value indicates what character sequences
 | |
|        the \R escape sequence matches. A value of PCRE2_BSR_UNICODE means that
 | |
|        \R matches any Unicode line ending sequence; a value of  PCRE2_BSR_ANY-
 | |
|        CRLF means that \R matches only CR, LF, or CRLF.
 | |
| 
 | |
|          PCRE2_INFO_CAPTURECOUNT
 | |
| 
 | |
|        Return  the  number  of capturing subpatterns in the pattern. The third
 | |
|        argument should point to a uint32_t variable.
 | |
| 
 | |
|          PCRE2_INFO_FIRSTCODETYPE
 | |
| 
 | |
|        Return information about the first code unit of any matched string, for
 | |
|        a  non-anchored pattern. The third argument should point to a  uint32_t
 | |
|        variable.
 | |
| 
 | |
|        If there is a fixed first value, for example, the  letter  "c"  from  a
 | |
|        pattern  such  as  (cat|cow|coyote),  1  is returned, and the character
 | |
|        value can be retrieved using PCRE2_INFO_FIRSTCODEUNIT. If there  is  no
 | |
|        fixed  first  value, but it is known that a match can occur only at the
 | |
|        start of the subject or following  a  newline  in  the  subject,  2  is
 | |
|        returned. Otherwise, and for anchored patterns, 0 is returned.
 | |
| 
 | |
|          PCRE2_INFO_FIRSTCODEUNIT
 | |
| 
 | |
|        Return  the  value  of the first code unit of any matched string in the
 | |
|        situation where PCRE2_INFO_FIRSTCODETYPE returns 1; otherwise return 0.
 | |
|        The  third  argument should point to a  uint32_t variable. In the 8-bit
 | |
|        library, the value is always less than 256. In the 16-bit  library  the
 | |
|        value  can  be  up  to 0xffff. In the 32-bit library in UTF-32 mode the
 | |
|        value can be up to 0x10ffff, and up to 0xffffffff when not using UTF-32
 | |
|        mode.
 | |
| 
 | |
|          PCRE2_INFO_FIRSTBITMAP
 | |
| 
 | |
|        In  the absence of a single first code unit for a non-anchored pattern,
 | |
|        pcre2_compile() may construct a 256-bit table that defines a fixed  set
 | |
|        of  values for the first code unit in any match. For example, a pattern
 | |
|        that starts with [abc] results in a table with  three  bits  set.  When
 | |
|        code  unit  values greater than 255 are supported, the flag bit for 255
 | |
|        means "any code unit of value 255 or above". If such a table  was  con-
 | |
|        structed,  a pointer to it is returned. Otherwise NULL is returned. The
 | |
|        third argument should point to a const uint8_t * variable.
 | |
| 
 | |
|          PCRE2_INFO_HASCRORLF
 | |
| 
 | |
|        Return 1 if the pattern contains any explicit  matches  for  CR  or  LF
 | |
|        characters, otherwise 0. The third argument should point to a  uint32_t
 | |
|        variable. An explicit match is either a literal CR or LF character,  or
 | |
|        \r or \n.
 | |
| 
 | |
|          PCRE2_INFO_JCHANGED
 | |
| 
 | |
|        Return  1  if  the (?J) or (?-J) option setting is used in the pattern,
 | |
|        otherwise 0. The third argument should point to  a  uint32_t  variable.
 | |
|        (?J)  and  (?-J) set and unset the local PCRE2_DUPNAMES option, respec-
 | |
|        tively.
 | |
| 
 | |
|          PCRE2_INFO_JITSIZE
 | |
| 
 | |
|        If the compiled pattern was successfully  processed  by  pcre2_jit_com-
 | |
|        pile(),  return  the  size  of  the JIT compiled code, otherwise return
 | |
|        zero. The third argument should point to a size_t variable.
 | |
| 
 | |
|          PCRE2_INFO_LASTCODETYPE
 | |
| 
 | |
|        Return  1 if there is a rightmost literal code unit that must exist  in
 | |
|        any  matched string, other than at its start. The third argument should
 | |
|        point to a  uint32_t  variable.  If  there  is  no  such  value,  0  is
 | |
|        returned.  When  1  is  returned,  the  code  unit  value itself can be
 | |
|        retrieved using PCRE2_INFO_LASTCODEUNIT.
 | |
| 
 | |
|        For anchored patterns, a last literal value is recorded only if it fol-
 | |
|        lows  something  of  variable  length.  For  example,  for  the pattern
 | |
|        /^a\d+z\d+/  the  returned  value  is  1  (with   "z"   returned   from
 | |
|        PCRE2_INFO_LASTCODEUNIT), but for /^a\dz\d/ the returned value is 0.
 | |
| 
 | |
|          PCRE2_INFO_LASTCODEUNIT
 | |
| 
 | |
|        Return  the value of the rightmost literal code unit that must exist in
 | |
|        any matched string, other than at its start, if such a value  has  been
 | |
|        recorded.  The  third argument should point to a  uint32_t variable. If
 | |
|        there is no such value, 0 is returned.
 | |
| 
 | |
|          PCRE2_INFO_MATCHEMPTY
 | |
| 
 | |
|        Return 1 if the pattern can match an empty  string,  otherwise  0.  The
 | |
|        third argument should point to a uint32_t variable.
 | |
| 
 | |
|          PCRE2_INFO_MATCHLIMIT
 | |
| 
 | |
|        If  the  pattern  set  a  match  limit by including an item of the form
 | |
|        (*LIMIT_MATCH=nnnn) at the start, the  value  is  returned.  The  third
 | |
|        argument  should  point to an unsigned 32-bit integer. If no such value
 | |
|        has been set,  the  call  to  pcre2_pattern_info()  returns  the  error
 | |
|        PCRE2_ERROR_UNSET.
 | |
| 
 | |
|          PCRE2_INFO_MAXLOOKBEHIND
 | |
| 
 | |
|        Return the number of characters (not code units) in the longest lookbe-
 | |
|        hind assertion in the pattern. The third argument should  point  to  an
 | |
|        unsigned  32-bit  integer. This information is useful when doing multi-
 | |
|        segment matching using the partial matching facilities. Note  that  the
 | |
|        simple assertions \b and \B require a one-character lookbehind. \A also
 | |
|        registers a one-character  lookbehind,  though  it  does  not  actually
 | |
|        inspect  the  previous  character.  This is to ensure that at least one
 | |
|        character from the old segment is retained when a new segment  is  pro-
 | |
|        cessed. Otherwise, if there are no lookbehinds in the pattern, \A might
 | |
|        match incorrectly at the start of a new segment.
 | |
| 
 | |
|          PCRE2_INFO_MINLENGTH
 | |
| 
 | |
|        If a minimum length for matching  subject  strings  was  computed,  its
 | |
|        value  is  returned.  Otherwise the returned value is 0. The value is a
 | |
|        number of characters, which in UTF mode may be different from the  num-
 | |
|        ber  of  code  units.   The  third argument should point to a  uint32_t
 | |
|        variable. The value is a lower bound to  the  length  of  any  matching
 | |
|        string.  There  may  not be any strings of that length that do actually
 | |
|        match, but every string that does match is at least that long.
 | |
| 
 | |
|          PCRE2_INFO_NAMECOUNT
 | |
|          PCRE2_INFO_NAMEENTRYSIZE
 | |
|          PCRE2_INFO_NAMETABLE
 | |
| 
 | |
|        PCRE2 supports the use of named as well as numbered capturing parenthe-
 | |
|        ses.  The names are just an additional way of identifying the parenthe-
 | |
|        ses, which still acquire numbers. Several convenience functions such as
 | |
|        pcre2_substring_get_byname()  are provided for extracting captured sub-
 | |
|        strings by name. It is also possible to extract the data  directly,  by
 | |
|        first  converting  the  name to a number in order to access the correct
 | |
|        pointers in the output vector (described with pcre2_match() below).  To
 | |
|        do  the  conversion,  you  need to use the name-to-number map, which is
 | |
|        described by these three values.
 | |
| 
 | |
|        The map consists of a number of  fixed-size  entries.  PCRE2_INFO_NAME-
 | |
|        COUNT  gives  the number of entries, and PCRE2_INFO_NAMEENTRYSIZE gives
 | |
|        the size of each entry in code units; both of these return  a  uint32_t
 | |
|        value. The entry size depends on the length of the longest name.
 | |
| 
 | |
|        PCRE2_INFO_NAMETABLE returns a pointer to the first entry of the table.
 | |
|        This is a PCRE2_SPTR pointer to a block of code  units.  In  the  8-bit
 | |
|        library,  the  first two bytes of each entry are the number of the cap-
 | |
|        turing parenthesis, most significant byte first. In the 16-bit library,
 | |
|        the  pointer  points  to 16-bit code units, the first of which contains
 | |
|        the parenthesis number. In the 32-bit library, the  pointer  points  to
 | |
|        32-bit  code units, the first of which contains the parenthesis number.
 | |
|        The rest of the entry is the corresponding name, zero terminated.
 | |
| 
 | |
|        The names are in alphabetical order. If (?| is used to create  multiple
 | |
|        groups  with  the same number, as described in the section on duplicate
 | |
|        subpattern numbers in the pcre2pattern page, the groups  may  be  given
 | |
|        the  same  name,  but  there  is only one entry in the table. Different
 | |
|        names for groups of the same number are not permitted.
 | |
| 
 | |
|        Duplicate names for subpatterns with different numbers  are  permitted,
 | |
|        but  only  if  PCRE2_DUPNAMES  is  set. They appear in the table in the
 | |
|        order in which they were found in the pattern. In the  absence  of  (?|
 | |
|        this  is  the  order of increasing number; when (?| is used this is not
 | |
|        necessarily the case because later subpatterns may have lower numbers.
 | |
| 
 | |
|        As a simple example of the name/number table,  consider  the  following
 | |
|        pattern  after  compilation by the 8-bit library (assume PCRE2_EXTENDED
 | |
|        is set, so white space - including newlines - is ignored):
 | |
| 
 | |
|          (?<date> (?<year>(\d\d)?\d\d) -
 | |
|          (?<month>\d\d) - (?<day>\d\d) )
 | |
| 
 | |
|        There are four named subpatterns, so the table has  four  entries,  and
 | |
|        each  entry  in the table is eight bytes long. The table is as follows,
 | |
|        with non-printing bytes shown in hexadecimal, and undefined bytes shown
 | |
|        as ??:
 | |
| 
 | |
|          00 01 d  a  t  e  00 ??
 | |
|          00 05 d  a  y  00 ?? ??
 | |
|          00 04 m  o  n  t  h  00
 | |
|          00 02 y  e  a  r  00 ??
 | |
| 
 | |
|        When  writing  code  to  extract  data from named subpatterns using the
 | |
|        name-to-number map, remember that the length of the entries  is  likely
 | |
|        to be different for each compiled pattern.
 | |
| 
 | |
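The table walk described above can be sketched as follows for the 8-bit
library. The function group_number() and the array example_table are
hypothetical names introduced for illustration; the three table arguments
stand for the values that PCRE2_INFO_NAMETABLE, PCRE2_INFO_NAMECOUNT, and
PCRE2_INFO_NAMEENTRYSIZE would supply. The undefined (??) bytes of the example
are written as zero here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper for the 8-bit library: scan a name table laid out as
   described above and return the group number for 'name', or 0 if the name
   is not present (group numbers start at 1). */
static uint32_t group_number(const uint8_t *table, uint32_t name_count,
                             uint32_t entry_size, const char *name)
{
  assert(entry_size >= 3 || name_count == 0);  /* 2 number bytes + NUL */
  for (uint32_t i = 0; i < name_count; i++)
    {
    const uint8_t *entry = table + (size_t)i * entry_size;
    if (strcmp((const char *)(entry + 2), name) == 0)
      return ((uint32_t)entry[0] << 8) | entry[1];   /* MSB first */
    }
  return 0;
}

/* The four-entry, eight-byte-per-entry table from the example above. */
static const uint8_t example_table[4 * 8] = {
  0, 1, 'd', 'a', 't', 'e', 0, 0,
  0, 5, 'd', 'a', 'y', 0, 0, 0,
  0, 4, 'm', 'o', 'n', 't', 'h', 0,
  0, 2, 'y', 'e', 'a', 'r', 0, 0
};
```

With this table, looking up "month" yields 4 and looking up "day" yields 5.
When PCRE2_DUPNAMES is in use, the sketch returns only the first entry for a
name. In most programs the library function
pcre2_substring_number_from_name() is the simpler choice; the manual walk is
mainly useful when all the names need to be listed.
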
|          PCRE2_INFO_NEWLINE
 | |
| 
 | |
|        The output is a uint32_t with one of the following values:
 | |
| 
 | |
|          PCRE2_NEWLINE_CR       Carriage return (CR)
 | |
|          PCRE2_NEWLINE_LF       Linefeed (LF)
 | |
|          PCRE2_NEWLINE_CRLF     Carriage return, linefeed (CRLF)
 | |
|          PCRE2_NEWLINE_ANY      Any Unicode line ending
 | |
|          PCRE2_NEWLINE_ANYCRLF  Any of CR, LF, or CRLF
 | |
| 
 | |
|        This  specifies  the default character sequence that will be recognized
 | |
|        as meaning "newline" while matching.
 | |
| 
 | |
|          PCRE2_INFO_RECURSIONLIMIT
 | |
| 
 | |
|        If the pattern set a recursion limit by including an item of  the  form
 | |
|        (*LIMIT_RECURSION=nnnn)  at the start, the value is returned. The third
 | |
|        argument should point to an unsigned 32-bit integer. If no  such  value
 | |
|        has  been  set,  the  call  to  pcre2_pattern_info()  returns the error
 | |
|        PCRE2_ERROR_UNSET.
 | |
| 
 | |
|          PCRE2_INFO_SIZE
 | |
| 
 | |
|        Return the size of  the  compiled  pattern  in  bytes  (for  all  three
 | |
|        libraries).  The third argument should point to a size_t variable. This
 | |
|        value includes the size of the general data  block  that  precedes  the
 | |
|        code  units of the compiled pattern itself. The value that is used when
 | |
|        pcre2_compile() is getting memory in which to place the  compiled  pat-
 | |
|        tern  may  be  slightly  larger than the value returned by this option,
 | |
|        because there are cases where the code that calculates the size has  to
 | |
|        over-estimate.  Processing  a  pattern  with  the JIT compiler does not
 | |
|        alter the value returned by this option.
 | |
| 
 | |
| 
 | |
| INFORMATION ABOUT A PATTERN'S CALLOUTS
 | |
| 
 | |
|        int pcre2_callout_enumerate(const pcre2_code *code,
 | |
|          int (*callback)(pcre2_callout_enumerate_block *, void *),
 | |
|          void *user_data);
 | |
| 
 | |
|        A script language that supports the use of string arguments in callouts
 | |
|        might  like  to  scan  all the callouts in a pattern before running the
 | |
|        match. This can be done by calling pcre2_callout_enumerate(). The first
 | |
|        argument  is  a  pointer  to a compiled pattern, the second points to a
 | |
|        callback function, and the third is arbitrary user data.  The  callback
 | |
|        function  is  called  for  every callout in the pattern in the order in
 | |
|        which they appear. Its first argument is a pointer to a callout enumer-
 | |
|        ation  block,  and  its second argument is the user_data value that was
 | |
|        passed to pcre2_callout_enumerate(). The contents of the  callout  enu-
 | |
|        meration  block  are described in the pcre2callout documentation, which
 | |
|        also gives further details about callouts.
 | |
| 
 | |
| 
 | |
| SERIALIZATION AND PRECOMPILING
 | |
| 
 | |
|        It is possible to save compiled patterns  on  disc  or  elsewhere,  and
 | |
|        reload  them  later, subject to a number of restrictions. The functions
 | |
|        whose names begin with pcre2_serialize_ are used for this purpose. They
 | |
|        are described in the pcre2serialize documentation.
 | |
| 
 | |
| 
 | |
| THE MATCH DATA BLOCK
 | |
| 
 | |
|        pcre2_match_data *pcre2_match_data_create(uint32_t ovecsize,
 | |
|          pcre2_general_context *gcontext);
 | |
| 
 | |
|        pcre2_match_data *pcre2_match_data_create_from_pattern(
 | |
|          const pcre2_code *code, pcre2_general_context *gcontext);
 | |
| 
 | |
|        void pcre2_match_data_free(pcre2_match_data *match_data);
 | |
| 
 | |
|        Information  about  a  successful  or unsuccessful match is placed in a
 | |
|        match data block, which is an opaque  structure  that  is  accessed  by
 | |
|        function  calls.  In particular, the match data block contains a vector
 | |
|        of offsets into the subject string that define the matched part of  the
 | |
|        subject  and  any  substrings  that were captured. This is known as the
 | |
|        ovector.
 | |
| 
 | |
|        Before calling pcre2_match(), pcre2_dfa_match(),  or  pcre2_jit_match()
 | |
|        you must create a match data block by calling one of the creation func-
 | |
|        tions above. For pcre2_match_data_create(), the first argument  is  the
 | |
|        number  of  pairs  of  offsets  in  the ovector. One pair of offsets is
 | |
|        required to identify the string that matched the  whole  pattern,  with
 | |
|        another  pair  for  each  captured substring. For example, a value of 4
 | |
|        creates enough space to record the matched portion of the subject  plus
 | |
|        three  captured  substrings.  A  minimum  of  1  pair  is  imposed   by
 | |
|        pcre2_match_data_create(), so it is always possible to return the over-
 | |
|        all matched string.
 | |
| 
 | |
|        The second argument of pcre2_match_data_create() is a pointer to a gen-
 | |
|        eral context, which can specify custom memory management for  obtaining
 | |
|        the memory for the match data block. If you are not using custom memory
 | |
|        management, pass NULL, which causes malloc() to be used.
 | |
| 
 | |
|        For pcre2_match_data_create_from_pattern(), the  first  argument  is  a
 | |
|        pointer to a compiled pattern. The ovector is created to be exactly the
 | |
|        right size to hold all the substrings a pattern might capture. The sec-
 | |
|        ond  argument is again a pointer to a general context, but in this case
 | |
|        if NULL is passed, the memory is obtained using the same allocator that
 | |
|        was used for the compiled pattern (custom or default).
 | |
| 
 | |
|        A  match  data block can be used many times, with the same or different
 | |
|        compiled patterns. You can extract information from a match data  block
 | |
|        after  a  match  operation  has  finished,  using  functions  that  are
 | |
|        described in the sections on  matched  strings  and  other  match  data
 | |
|        below.
 | |
| 
 | |
|        When  a  call  of  pcre2_match()  fails, valid data is available in the
 | |
|        match   block   only   when   the   error    is    PCRE2_ERROR_NOMATCH,
 | |
|        PCRE2_ERROR_PARTIAL,  or  one  of  the  error  codes for an invalid UTF
 | |
|        string. Exactly what is available depends on the error, and is detailed
 | |
|        below.
 | |
| 
 | |
|        When  one of the matching functions is called, pointers to the compiled
 | |
|        pattern and the subject string are set in the match data block so  that
 | |
|        they  can  be  referenced  by the extraction functions. After running a
 | |
|        match, you must not free a compiled pattern or a subject  string  until
 | |
|        after  all  operations  on  the  match data block (for that match) have
 | |
|        taken place.
 | |
| 
 | |
|        When a match data block itself is no longer needed, it should be  freed
 | |
|        by calling pcre2_match_data_free().
 | |
| 
 | |
| 
 | |
| MATCHING A PATTERN: THE TRADITIONAL FUNCTION
 | |
| 
 | |
|        int pcre2_match(const pcre2_code *code, PCRE2_SPTR subject,
 | |
|          PCRE2_SIZE length, PCRE2_SIZE startoffset,
 | |
|          uint32_t options, pcre2_match_data *match_data,
 | |
|          pcre2_match_context *mcontext);
 | |
| 
 | |
|        The  function pcre2_match() is called to match a subject string against
 | |
|        a compiled pattern, which is passed in the code argument. You can  call
 | |
|        pcre2_match() with the same code argument as many times as you like, in
 | |
|        order to find multiple matches in the subject string or to  match  dif-
 | |
|        ferent subject strings with the same pattern.
 | |
| 
 | |
|        This  function  is  the  main  matching facility of the library, and it
 | |
|        operates in a Perl-like manner. For specialist use  there  is  also  an
 | |
|        alternative  matching function, which is described below in the section
 | |
|        about the pcre2_dfa_match() function.
 | |
| 
 | |
|        Here is an example of a simple call to pcre2_match():
 | |
| 
 | |
|          pcre2_match_data *md = pcre2_match_data_create(4, NULL);
 | |
|          int rc = pcre2_match(
 | |
|            re,             /* result of pcre2_compile() */
 | |
|            "some string",  /* the subject string */
 | |
|            11,             /* the length of the subject string */
 | |
|            0,              /* start at offset 0 in the subject */
 | |
|            0,              /* default options */
 | |
|            md,             /* the match data block */
 | |
|            NULL);          /* a match context; NULL means use defaults */
 | |
| 
 | |
|        If the subject string is zero-terminated, the length can  be  given  as
 | |
|        PCRE2_ZERO_TERMINATED. A match context must be provided if certain less
 | |
|        common matching parameters are to be changed. For details, see the sec-
 | |
|        tion on the match context above.
 | |
| 
 | |
|    The string to be matched by pcre2_match()
 | |
| 
 | |
|        The  subject string is passed to pcre2_match() as a pointer in subject,
 | |
|        a length in length, and a starting offset in  startoffset.  The  length
 | |
|        and  offset  are  in  code units, not characters.  That is, they are in
 | |
|        bytes for the 8-bit library, 16-bit code units for the 16-bit  library,
 | |
|        and  32-bit  code units for the 32-bit library, whether or not UTF pro-
 | |
|        cessing is enabled.
 | |
| 
 | |
|        If startoffset is greater than the length of the subject, pcre2_match()
 | |
|        returns  PCRE2_ERROR_BADOFFSET.  When  the starting offset is zero, the
 | |
|        search for a match starts at the beginning of the subject, and this  is
 | |
|        by far the most common case. In UTF-8 or UTF-16 mode, the starting off-
 | |
|        set must point to the start of a character, or to the end of  the  sub-
 | |
|        ject  (in  UTF-32 mode, one code unit equals one character, so all off-
 | |
|        sets are valid). Like the  pattern  string,  the  subject  may  contain
 | |
|        binary zeroes.
 | |
| 
 | |
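An application that computes starting offsets itself may want to check them
before calling pcre2_match() in UTF-8 mode. The following is a minimal sketch
of such a check for the 8-bit library. The function name
utf8_start_offset_ok() is hypothetical, and the subject is assumed to be valid
UTF-8.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical pre-check for the 8-bit library in UTF-8 mode: returns 1 if
   'offset' is acceptable as a starting offset for a subject of 'length'
   bytes, that is, if it is the end of the subject or the first byte of a
   character. Assumes that the subject itself is valid UTF-8. */
static int utf8_start_offset_ok(const uint8_t *subject, size_t length,
                                size_t offset)
{
  assert(subject != NULL || length == 0);
  if (offset > length) return 0;   /* beyond the end of the subject */
  if (offset == length) return 1;  /* the end of the subject is allowed */
  return (subject[offset] & 0xc0) != 0x80;  /* not a continuation byte */
}
```

For the four-byte subject "a", 0xc3, 0xa9, "z", offsets 0, 1, 3, and 4 are
accepted, whereas offset 2 is rejected because it points into the middle of
the two-byte character.
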
|        A  non-zero  starting offset is useful when searching for another match
 | |
|        in the same subject by calling pcre2_match()  again  after  a  previous
 | |
|        success.   Setting  startoffset  differs  from passing over a shortened
 | |
|        string and setting PCRE2_NOTBOL in the case of a  pattern  that  begins
 | |
|        with any kind of lookbehind. For example, consider the pattern
 | |
| 
 | |
|          \Biss\B
 | |
| 
 | |
|        which  finds  occurrences  of "iss" in the middle of words. (\B matches
 | |
|        only if the current position in the subject is not  a  word  boundary.)
 | |
|        When applied to the string "Mississippi" the first call to pcre2_match()
 | |
|        finds the first occurrence. If pcre2_match() is called again with  just
 | |
|        the  remainder  of  the  subject,  namely "issippi", it does not match,
 | |
|        because \B is always false at the start of the subject, which is deemed
 | |
|        to  be  a word boundary. However, if pcre2_match() is passed the entire
 | |
|        string again, but with startoffset set to 4, it finds the second occur-
 | |
|        rence  of "iss" because it is able to look behind the starting point to
 | |
|        discover that it is preceded by a letter.
 | |
| 
 | |
|        Finding all the matches in a subject is tricky  when  the  pattern  can
 | |
|        match an empty string. It is possible to emulate Perl's /g behaviour by
 | |
|        first  trying  the  match  again  at  the   same   offset,   with   the
 | |
|        PCRE2_NOTEMPTY_ATSTART  and  PCRE2_ANCHORED  options,  and then if that
 | |
|        fails, advancing the starting  offset  and  trying  an  ordinary  match
 | |
|        again.  There  is  some  code  that  demonstrates how to do this in the
 | |
|        pcre2demo sample program. In the most general case, you have  to  check
 | |
|        to  see  if the newline convention recognizes CRLF as a newline, and if
 | |
|        so, and the current character is CR followed by LF, advance the  start-
 | |
|        ing offset by two characters instead of one.
 | |
| 
 | |
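The "advance the starting offset" step can be sketched as below for an 8-bit
subject. This is an illustration under stated assumptions, not the code from
pcre2demo: the function name advance_past_empty_match() is hypothetical, and
the caller is assumed to have already determined whether the newline
convention recognizes CRLF and whether the pattern was compiled with
PCRE2_UTF.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper for an 8-bit subject: after the anchored, not-empty
   retry at 'offset' has failed, return the offset at which the next
   ordinary match attempt should start. 'crlf_is_newline' is non-zero when
   the newline convention recognizes CRLF (CRLF, ANYCRLF, or ANY); 'utf' is
   non-zero when the pattern was compiled with PCRE2_UTF. */
static size_t advance_past_empty_match(const uint8_t *subject, size_t length,
                                       size_t offset, int crlf_is_newline,
                                       int utf)
{
  assert(offset < length);         /* at the end there is nothing to retry */
  if (crlf_is_newline && offset + 1 < length &&
      subject[offset] == '\r' && subject[offset + 1] == '\n')
    return offset + 2;             /* step over CR and LF together */
  offset++;                        /* otherwise advance by one character */
  if (utf)
    while (offset < length && (subject[offset] & 0xc0) == 0x80)
      offset++;                    /* skip UTF-8 continuation bytes */
  return offset;
}
```

A caller would stop the loop when the empty match is at the end of the
subject, and otherwise pass the returned value as startoffset in the next
call to pcre2_match().
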
|        If  a  non-zero starting offset is passed when the pattern is anchored,
 | |
|        one attempt to match at the given offset is made. This can only succeed
 | |
|        if  the  pattern  does  not require the match to be at the start of the
 | |
|        subject.
 | |
| 
 | |
|    Option bits for pcre2_match()
 | |
| 
 | |
|        The unused bits of the options argument for pcre2_match() must be zero.
 | |
|        The  only  bits  that  may  be  set  are  PCRE2_ANCHORED, PCRE2_NOTBOL,
 | |
|        PCRE2_NOTEOL,          PCRE2_NOTEMPTY,          PCRE2_NOTEMPTY_ATSTART,
 | |
|        PCRE2_NO_UTF_CHECK,  PCRE2_PARTIAL_HARD,  and PCRE2_PARTIAL_SOFT. Their
 | |
|        action is described below.
 | |
| 
 | |
|        Setting PCRE2_ANCHORED at match time is not supported by  the  just-in-
 | |
|        time  (JIT)  compiler.  If  it is set, JIT matching is disabled and the
 | |
|        normal interpretive code in pcre2_match() is run. The remaining options
 | |
|        are supported for JIT matching.
 | |
| 
 | |
|          PCRE2_ANCHORED
 | |
| 
 | |
|        The PCRE2_ANCHORED option limits pcre2_match() to matching at the first
 | |
|        matching position. If a pattern was compiled  with  PCRE2_ANCHORED,  or
 | |
|        turned  out to be anchored by virtue of its contents, it cannot be made
 | |
|        unanchored at matching time. Note that setting the option at match time
 | |
|        disables JIT matching.
 | |
| 
 | |
|          PCRE2_NOTBOL
 | |
| 
 | |
|        This option specifies that the first character of the subject string is not
 | |
|        the beginning of a line, so the  circumflex  metacharacter  should  not
 | |
|        match  before  it.  Setting  this without having set PCRE2_MULTILINE at
 | |
|        compile time causes circumflex never to match. This option affects only
 | |
|        the behaviour of the circumflex metacharacter. It does not affect \A.
 | |
| 
 | |
|          PCRE2_NOTEOL
 | |
| 
 | |
|        This option specifies that the end of the subject string is not the end
 | |
|        of a line, so the dollar metacharacter should not match it nor  (except
 | |
|        in  multiline mode) a newline immediately before it. Setting this with-
 | |
|        out having set PCRE2_MULTILINE at compile time causes dollar  never  to
 | |
|        match. This option affects only the behaviour of the dollar metacharac-
 | |
|        ter. It does not affect \Z or \z.
 | |
| 
 | |
|          PCRE2_NOTEMPTY
 | |
| 
 | |
|        An empty string is not considered to be a valid match if this option is
 | |
|        set.  If  there are alternatives in the pattern, they are tried. If all
 | |
|        the alternatives match the empty string, the entire  match  fails.  For
 | |
|        example, if the pattern
 | |
| 
 | |
|          a?b?
 | |
| 
 | |
|        is  applied  to  a  string not beginning with "a" or "b", it matches an
 | |
|        empty string at the start of the subject. With PCRE2_NOTEMPTY set, this
 | |
|        match  is  not valid, so pcre2_match() searches further into the string
 | |
|        for occurrences of "a" or "b".
 | |
| 
 | |
|          PCRE2_NOTEMPTY_ATSTART
 | |
| 
 | |
|        This is like PCRE2_NOTEMPTY, except that it locks out an  empty  string
 | |
|        match only at the first matching position, that is, at the start of the
 | |
|        subject plus the starting offset. An empty string match  later  in  the
 | |
|        subject  is  permitted.   If  the pattern is anchored, such a match can
 | |
|        occur only if the pattern contains \K.
 | |
| 
 | |
|          PCRE2_NO_UTF_CHECK
 | |
| 
 | |
|        When PCRE2_UTF is set at compile time, the validity of the subject as a
 | |
|        UTF  string  is  checked  by default when pcre2_match() is subsequently
 | |
|        called.  The entire string is checked before any other processing takes
 | |
|        place,  and a negative error code is returned if the check fails. There
 | |
|        are several UTF error codes for each code unit width, corresponding  to
 | |
|        different  problems with the code unit sequence. The value of startoff-
 | |
|        set is also checked, to ensure that it points to the start of a charac-
 | |
|        ter  or  to  the  end  of  the subject. There are discussions about the
 | |
|        validity of UTF-8 strings, UTF-16 strings, and UTF-32  strings  in  the
 | |
|        pcre2unicode page.
 | |
| 
 | |
|        If  you  know  that  your  subject is valid, and you want to skip these
 | |
|        checks for performance reasons,  you  can  set  the  PCRE2_NO_UTF_CHECK
 | |
|        option  when  calling  pcre2_match(). You might want to do this for the
 | |
|        second and subsequent calls to pcre2_match() if you are making repeated
 | |
|        calls to find all the matches in a single subject string.
 | |
| 
 | |
|        NOTE:  When PCRE2_NO_UTF_CHECK is set, the effect of passing an invalid
 | |
|        string as a subject, or an invalid value of startoffset, is  undefined.
 | |
|        Your program may crash or loop indefinitely.
 | |
| 
 | |
|          PCRE2_PARTIAL_HARD
 | |
|          PCRE2_PARTIAL_SOFT
 | |
| 
 | |
|        These  options  turn  on  the partial matching feature. A partial match
 | |
|        occurs if the end of the subject string is  reached  successfully,  but
 | |
|        there  are not enough subject characters to complete the match. If this
 | |
|        happens when PCRE2_PARTIAL_SOFT (but not  PCRE2_PARTIAL_HARD)  is  set,
 | |
|        matching  continues  by  testing any remaining alternatives. Only if no
 | |
|        complete match can be found is PCRE2_ERROR_PARTIAL returned instead  of
 | |
|        PCRE2_ERROR_NOMATCH.  In other words, PCRE2_PARTIAL_SOFT specifies that
 | |
|        the caller is prepared to handle a partial match, but only if  no  com-
 | |
|        plete match can be found.
 | |
| 
 | |
|        If  PCRE2_PARTIAL_HARD is set, it overrides PCRE2_PARTIAL_SOFT. In this
 | |
|        case, if a partial match is found,  pcre2_match()  immediately  returns
 | |
|        PCRE2_ERROR_PARTIAL,  without  considering  any  other alternatives. In
 | |
|        other words, when PCRE2_PARTIAL_HARD is set, a partial match is consid-
 | |
|        ered to be more important than an alternative complete match.
 | |
| 
 | |
|        There is a more detailed discussion of partial and multi-segment match-
 | |
|        ing, with examples, in the pcre2partial documentation.
 | |
| 
 | |
| 
 | |
| NEWLINE HANDLING WHEN MATCHING
 | |
| 
 | |
|        When PCRE2 is built, a default newline convention is set; this is  usu-
 | |
|        ally  the standard convention for the operating system. The default can
 | |
|        be overridden in a  compile  context.   During  matching,  the  newline
 | |
|        choice  affects  the  behaviour  of  the  dot,  circumflex,  and dollar
 | |
|        metacharacters. It may also alter the way the match  starting  position
 | |
|        is advanced after a match failure for an unanchored pattern.
 | |
| 
 | |
|        When PCRE2_NEWLINE_CRLF, PCRE2_NEWLINE_ANYCRLF, or PCRE2_NEWLINE_ANY is
 | |
|        set as the newline convention, and a match attempt  for  an  unanchored
 | |
|        pattern fails when the current starting position is at a CRLF sequence,
 | |
|        and the pattern contains no explicit matches for CR or  LF  characters,
 | |
|        the  match  position  is  advanced by two characters instead of one, in
 | |
|        other words, to after the CRLF.
 | |
| 
 | |
|        The above rule is a compromise that makes the most common cases work as
 | |
|        expected.  For  example,  if  the  pattern is .+A (and the PCRE2_DOTALL
 | |
|        option is not set), it does not match the string "\r\nA" because, after
 | |
|        failing  at the start, it skips both the CR and the LF before retrying.
 | |
|        However, the pattern [\r\n]A does match that string,  because  it  con-
 | |
|        tains an explicit CR or LF reference, and so advances only by one char-
 | |
|        acter after the first failure.
 | |
| 
 | |
|        An explicit match for CR or LF is either a literal appearance of one of
 | |
|        those  characters  in  the  pattern,  or  one  of  the  \r or \n escape
 | |
|        sequences. Implicit matches such as [^X] do not  count,  nor  does  \s,
 | |
|        even though it includes CR and LF in the characters that it matches.
 | |
| 
 | |
|        Notwithstanding  the above, anomalous effects may still occur when CRLF
 | |
|        is a valid newline sequence and explicit \r or \n escapes appear in the
 | |
|        pattern.
 | |
| 
 | |
| 
 | |
| HOW PCRE2_MATCH() RETURNS A STRING AND CAPTURED SUBSTRINGS
 | |
| 
 | |
|        uint32_t pcre2_get_ovector_count(pcre2_match_data *match_data);
 | |
| 
 | |
|        PCRE2_SIZE *pcre2_get_ovector_pointer(pcre2_match_data *match_data);
 | |
| 
 | |
|        In  general, a pattern matches a certain portion of the subject, and in
 | |
|        addition, further substrings from the subject  may  be  picked  out  by
 | |
|        parenthesized  parts  of  the  pattern.  Following the usage in Jeffrey
 | |
|        Friedl's book, this is called "capturing"  in  what  follows,  and  the
 | |
|        phrase  "capturing subpattern" or "capturing group" is used for a frag-
 | |
|        ment of a pattern that picks out a substring.  PCRE2  supports  several
 | |
|        other kinds of parenthesized subpattern that do not cause substrings to
 | |
|        be captured. The pcre2_pattern_info() function can be used to find  out
 | |
|        how many capturing subpatterns there are in a compiled pattern.
 | |
| 
 | |
|        A  successful match returns the overall matched string and any captured
 | |
|        substrings to the caller via a vector of  PCRE2_SIZE  values.  This  is
 | |
|        called  the ovector, and is contained within the match data block.  You
 | |
|        can obtain direct access to  the  ovector  by  calling  pcre2_get_ovec-
 | |
|        tor_pointer()  to  find  its  address, and pcre2_get_ovector_count() to
 | |
|        find the number of pairs of values it contains. Alternatively, you  can
 | |
|        use the auxiliary functions for accessing captured substrings by number
 | |
|        or by name (see below).
 | |
| 
 | |
|        Within the ovector, the first in each pair of values is set to the off-
 | |
|        set of the first code unit of a substring, and the second is set to the
 | |
|        offset of the first code unit after the end of a substring. These  val-
 | |
|        ues  are always code unit offsets, not character offsets. That is, they
 | |
|        are byte offsets in the 8-bit library, 16-bit  offsets  in  the  16-bit
 | |
|        library, and 32-bit offsets in the 32-bit library.
 | |
| 
 | |
|        After  a  partial  match  (error  return PCRE2_ERROR_PARTIAL), only the
 | |
|        first pair of offsets (that is, ovector[0]  and  ovector[1])  are  set.
 | |
|        They  identify  the part of the subject that was partially matched. See
 | |
|        the pcre2partial documentation for details of partial matching.
 | |
| 
 | |
|        After a successful match, the first pair of offsets identifies the por-
 | |
|        tion  of the subject string that was matched by the entire pattern. The
 | |
|        next pair is used for the first capturing subpattern, and  so  on.  The
 | |
|        value  returned  by pcre2_match() is one more than the highest numbered
 | |
|        pair that has been set. For example, if two substrings have  been  cap-
 | |
|        tured,  the returned value is 3. If there are no capturing subpatterns,
 | |
|        the return value from a successful match is 1, indicating that just the
 | |
|        first pair of offsets has been set.
 | |
| 
 | |
|        If  a  pattern uses the \K escape sequence within a positive assertion,
 | |
|        the reported start of a successful match can be greater than the end of
 | |
|        the  match.   For  example,  if the pattern (?=ab\K) is matched against
 | |
|        "ab", the start and end offset values for the match are 2 and 0.
 | |
| 
 | |
|        If a capturing subpattern group is matched repeatedly within  a  single
 | |
|        match  operation, it is the last portion of the subject that it matched
 | |
|        that is returned.
 | |
| 
 | |
|        If the ovector is too small to hold all the captured substring offsets,
 | |
|        as  much  as possible is filled in, and the function returns a value of
 | |
|        zero. If captured substrings are not of interest, pcre2_match() may  be
 | |
|        called with a match data block whose ovector is of minimum length (that
 | |
|        is, one pair). However, if the pattern contains back references and the
 | |
|        ovector is not big enough to remember the related substrings, PCRE2 has
 | |
|        to get additional memory for use during matching. Thus  it  is  usually
 | |
|        advisable to set up a match data block containing an ovector of reason-
 | |
|        able size.
 | |
| 
 | |
|        It is possible for capturing subpattern number n+1 to match  some  part
 | |
|        of the subject when subpattern n has not been used at all. For example,
 | |
|        if the string "abc" is matched  against  the  pattern  (a|(z))(bc)  the
 | |
|        return from the function is 4, and subpatterns 1 and 3 are matched, but
 | |
|        2 is not. When this happens, both values in  the  offset  pairs  corre-
 | |
|        sponding to unused subpatterns are set to PCRE2_UNSET.
 | |
| 
 | |
|        Offset  values  that correspond to unused subpatterns at the end of the
 | |
|        expression are also set to PCRE2_UNSET.  For  example,  if  the  string
 | |
|        "abc" is matched against the pattern (abc)(x(yz)?)? subpatterns 2 and 3
 | |
|        are not matched.  The return from the function is 2, because the  high-
 | |
|        est  used  capturing  subpattern  number  is 1. The offsets for the sec-
 | |
|        ond and third capturing  subpatterns  (assuming  the  vector  is  large
 | |
|        enough, of course) are set to PCRE2_UNSET.
 | |
| 
 | |
|        Elements in the ovector that do not correspond to capturing parentheses
 | |
|        in the pattern are never changed. That is, if a pattern contains n cap-
 | |
|        turing parentheses, no more than ovector[0] to ovector[2n+1] are set by
 | |
|        pcre2_match(). The other elements retain whatever  values  they  previ-
 | |
|        ously had.
 | |
| 
 | |
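The rules above for reading the ovector can be sketched as follows. So that
the fragment compiles without pcre2.h, it uses size_t in place of PCRE2_SIZE
and a stand-in macro, SKETCH_UNSET, in place of PCRE2_UNSET; both
substitutions are assumptions made for illustration, and real code should use
the definitions from the header. The function name report_groups() is
hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Stand-in for PCRE2_UNSET so that this sketch compiles without pcre2.h;
   real code should include pcre2.h and use PCRE2_UNSET and PCRE2_SIZE. */
#define SKETCH_UNSET (~(size_t)0)

/* Hypothetical helper: print each group covered by a successful return
   value 'rc' and return how many of those groups were actually set. */
static int report_groups(const char *subject, const size_t *ovector, int rc)
{
  int set = 0;
  assert(rc > 0);                  /* 0 means the ovector was too small */
  for (int i = 0; i < rc; i++)
    {
    size_t start = ovector[2 * i];
    size_t end = ovector[2 * i + 1];
    if (start == SKETCH_UNSET)
      printf("%d: <unset>\n", i);
    else if (start > end)          /* possible with \K in an assertion */
      printf("%d: <start %zu after end %zu>\n", i, start, end);
    else
      {
      printf("%d: %.*s\n", i, (int)(end - start), subject + start);
      set++;
      }
    }
  return set;
}
```

For the (a|(z))(bc) example above, the ovector holds the pairs (0,3), (0,1),
(unset,unset), and (1,3) with a return value of 4, so the sketch reports
groups 0, 1, and 3 as set and group 2 as unset.
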
| 
 | |
| OTHER INFORMATION ABOUT A MATCH
 | |
| 
 | |
|        PCRE2_SPTR pcre2_get_mark(pcre2_match_data *match_data);
 | |
| 
 | |
|        PCRE2_SIZE pcre2_get_startchar(pcre2_match_data *match_data);
 | |
| 
 | |
|        As  well as the offsets in the ovector, other information about a match
 | |
|        is retained in the match data block and can be retrieved by  the  above
 | |
|        functions  in  appropriate  circumstances.  If they are called at other
 | |
|        times, the result is undefined.
 | |
| 
 | |
|        After a successful match, a partial match (PCRE2_ERROR_PARTIAL),  or  a
 | |
|        failure  to  match  (PCRE2_ERROR_NOMATCH), a (*MARK) name may be avail-
 | |
|        able, and pcre2_get_mark() can be called. It returns a pointer  to  the
 | |
|        zero-terminated  name,  which is within the compiled pattern. Otherwise
 | |
|        NULL is returned. After a successful match, the (*MARK)  name  that  is
 | |
|        returned  is  the last one encountered on the matching path through the
 | |
|        pattern. After a "no match" or a partial match,  the  last  encountered
 | |
|        (*MARK) name is returned. For example, consider this pattern:
 | |
| 
 | |
|          ^(*MARK:A)((*MARK:B)a|b)c
 | |
| 
 | |
|        When  it  matches "bc", the returned mark is A. The B mark is "seen" in
 | |
|        the first branch of the group, but it is not on the matching  path.  On
 | |
|        the  other  hand,  when  this pattern fails to match "bx", the returned
 | |
|        mark is B.
 | |
| 
 | |
|        After a successful match, a partial match, or one of  the  invalid  UTF
 | |
|        errors  (for example, PCRE2_ERROR_UTF8_ERR5), pcre2_get_startchar() can
 | |
|        be called. After a successful or partial match it returns the code unit
 | |
|        offset  of  the character at which the match started. For a non-partial
 | |
|        match, this can be different to the value of ovector[0] if the  pattern
 | |
|        contains  the  \K escape sequence. After a partial match, however, this
 | |
|        value is always the same as ovector[0] because \K does not  affect  the
 | |
|        result of a partial match.
 | |
| 
 | |
|        After  a UTF check failure, pcre2_get_startchar() can be used to obtain
 | |
|        the code unit offset of the invalid UTF character. Details are given in
 | |
|        the pcre2unicode page.
 | |
| 
 | |
| 
 | |
| ERROR RETURNS FROM pcre2_match()
 | |
| 
 | |
|        If  pcre2_match() fails, it returns a negative number. This can be con-
 | |
|        verted to a text string by calling pcre2_get_error_message().  Negative
 | |
|        error  codes  are  also returned by other functions, and are documented
 | |
|        with them.  The codes are given names in the header file. If UTF check-
 | |
|        ing is in force and an invalid UTF subject string is detected, one of a
 | |
|        number of UTF-specific negative error codes is  returned.  Details  are
 | |
|        given in the pcre2unicode page. The following are the other errors that
 | |
|        may be returned by pcre2_match():
 | |
| 
 | |
|          PCRE2_ERROR_NOMATCH
 | |
| 
 | |
|        The subject string did not match the pattern.
 | |
| 
 | |
|          PCRE2_ERROR_PARTIAL
 | |
| 
 | |
|        The subject string did not match, but it did match partially.  See  the
 | |
|        pcre2partial documentation for details of partial matching.
 | |
| 
 | |
|          PCRE2_ERROR_BADMAGIC
 | |
| 
 | |
|        PCRE2 stores a 4-byte "magic number" at the start of the compiled code,
 | |
|        to catch the case when it is passed a junk pointer. This is  the  error
 | |
|        that is returned when the magic number is not present.
 | |
| 
 | |
|          PCRE2_ERROR_BADMODE
 | |
| 
 | |
|        This  error  is  given  when  a  pattern that was compiled by the 8-bit
 | |
|        library is passed to a 16-bit  or  32-bit  library  function,  or  vice
 | |
|        versa.
 | |
| 
 | |
|          PCRE2_ERROR_BADOFFSET
 | |
| 
 | |
|        The value of startoffset was greater than the length of the subject.
 | |
| 
 | |
|          PCRE2_ERROR_BADOPTION
 | |
| 
 | |
|        An unrecognized bit was set in the options argument.
 | |
| 
 | |
|          PCRE2_ERROR_BADUTFOFFSET
 | |
| 
 | |
|        The UTF code unit sequence that was passed as a subject was checked and
 | |
|        found to be valid (the PCRE2_NO_UTF_CHECK option was not set), but  the
 | |
|        value  of startoffset did not point to the beginning of a UTF character
 | |
|        or the end of the subject.
 | |
| 
 | |
|          PCRE2_ERROR_CALLOUT
 | |
| 
 | |
|        This error is never generated by pcre2_match() itself. It  is  provided
 | |
|        for  use  by  callout  functions  that  want  to cause pcre2_match() or
 | |
|        pcre2_callout_enumerate() to return a distinctive error code.  See  the
 | |
|        pcre2callout documentation for details.
 | |
| 
 | |
|          PCRE2_ERROR_INTERNAL
 | |
| 
 | |
|        An  unexpected  internal error has occurred. This error could be caused
 | |
|        by a bug in PCRE2 or by overwriting of the compiled pattern.
 | |
| 
 | |
|          PCRE2_ERROR_JIT_BADOPTION
 | |
| 
 | |
|        This error is returned when a pattern  that  was  successfully  studied
 | |
|        using  JIT is being matched, but the matching mode (partial or complete
 | |
|        match) does not correspond to any JIT compilation mode.  When  the  JIT
 | |
|        fast  path  function  is used, this error may also be given for invalid
 | |
|        options. See the pcre2jit documentation for more details.
 | |
| 
 | |
|          PCRE2_ERROR_JIT_STACKLIMIT
 | |
| 
 | |
|        This error is returned when a pattern  that  was  successfully  studied
 | |
|        using  JIT  is being matched, but the memory available for the just-in-
 | |
|        time processing stack is not large enough. See the pcre2jit  documenta-
 | |
|        tion for more details.
 | |
| 
 | |
|          PCRE2_ERROR_MATCHLIMIT
 | |
| 
 | |
|        The backtracking limit was reached.
 | |
| 
 | |
|          PCRE2_ERROR_NOMEMORY
 | |
| 
 | |
|        If  a  pattern  contains  back  references,  but the ovector is not big
 | |
|        enough to remember the referenced substrings, PCRE2  gets  a  block  of
 | |
|        memory at the start of matching to use for this purpose. There are some
 | |
|        other special cases where extra memory is needed during matching.  This
 | |
|        error is given when memory cannot be obtained.
 | |
| 
 | |
|          PCRE2_ERROR_NULL
 | |
| 
 | |
|        Either the code, subject, or match_data argument was passed as NULL.
 | |
| 
 | |
|          PCRE2_ERROR_RECURSELOOP
 | |
| 
 | |
|        This  error  is  returned  when  pcre2_match() detects a recursion loop
 | |
|        within the pattern. Specifically, it means that either the  whole  pat-
 | |
|        tern or a subpattern has been called recursively for the second time at
 | |
|        the same position in the subject  string.  Some  simple  patterns  that
 | |
|        might  do  this are detected and faulted at compile time, but more com-
 | |
|        plicated cases, in particular mutual recursions between  two  different
 | |
|        subpatterns, cannot be detected until matching is attempted.
 | |
| 
 | |
|          PCRE2_ERROR_RECURSIONLIMIT
 | |
| 
 | |
|        The internal recursion limit was reached.
 | |
| 
 | |
| 
 | |
| EXTRACTING CAPTURED SUBSTRINGS BY NUMBER
 | |
| 
 | |
|        int pcre2_substring_length_bynumber(pcre2_match_data *match_data,
 | |
|          uint32_t number, PCRE2_SIZE *length);
 | |
| 
 | |
|        int pcre2_substring_copy_bynumber(pcre2_match_data *match_data,
 | |
|          uint32_t number, PCRE2_UCHAR *buffer,
 | |
|          PCRE2_SIZE *bufflen);
 | |
| 
 | |
|        int pcre2_substring_get_bynumber(pcre2_match_data *match_data,
 | |
|          uint32_t number, PCRE2_UCHAR **bufferptr,
 | |
|          PCRE2_SIZE *bufflen);
 | |
| 
 | |
|        void pcre2_substring_free(PCRE2_UCHAR *buffer);
 | |
| 
 | |
|        Captured  substrings  can  be accessed directly by using the ovector as
 | |
|        described above.  For convenience, auxiliary functions are provided for
 | |
|        extracting   captured  substrings  as  new,  separate,  zero-terminated
 | |
|        strings. A substring that contains a binary zero is correctly extracted
 | |
|        and  has  a  further  zero  added on the end, but the result is not, of
 | |
|        course, a C string.
 | |
| 
 | |
|        The functions in this section identify substrings by number. The number
 | |
|        zero refers to the entire matched substring, with higher numbers refer-
 | |
|        ring to substrings captured by parenthesized groups.  After  a  partial
 | |
|        match,  only  substring  zero  is  available. An attempt to extract any
 | |
|        other substring gives the error PCRE2_ERROR_PARTIAL. The  next  section
 | |
|        describes similar functions for extracting captured substrings by name.
 | |
| 
 | |
|        If  a  pattern uses the \K escape sequence within a positive assertion,
 | |
|        the reported start of a successful match can be greater than the end of
 | |
|        the  match.   For  example,  if the pattern (?=ab\K) is matched against
 | |
|        "ab", the start and end offset values for the match are  2  and  0.  In
 | |
|        this  situation,  calling  these functions with a zero substring number
 | |
|        extracts a zero-length empty string.
 | |
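
       The direct route through the ovector can be sketched as follows. This
       is an illustrative model for 8-bit subjects, not the library's own
       code: copy_group, SKETCH_UNSET and the return values -1 and -2 are
       hypothetical. It shows the in/out buffer length convention and the
       treatment of a \K match whose start is greater than its end.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for PCRE2_UNSET (assumption: same value as in pcre2.h). */
#define SKETCH_UNSET (~(size_t)0)

/*
 * Copies group `n` of a completed match from `subject` into `buf`.
 * On entry *buflen is the buffer capacity; on success it is updated to
 * the substring length, excluding the terminating zero.
 * Returns 0 on success, -1 if the group has no slot or is unset,
 * and -2 if the buffer is too small.
 */
int copy_group(const char *subject, const size_t *ovector, size_t pairs,
               size_t n, char *buf, size_t *buflen)
{
    size_t start, end, len;
    assert(subject != NULL && ovector != NULL);
    assert(buf != NULL && buflen != NULL);
    if (n >= pairs) return -1;
    start = ovector[2 * n];
    end = ovector[2 * n + 1];
    if (start == SKETCH_UNSET || end == SKETCH_UNSET) return -1;
    /* With \K inside an assertion, start may exceed end: treat as empty. */
    len = (start > end) ? 0 : end - start;
    if (len + 1 > *buflen) return -2;
    memcpy(buf, subject + start, len);
    buf[len] = '\0';
    *buflen = len;
    return 0;
}
```

       With the subject "ab" and the offsets 2 and 0 from the (?=ab\K)
       example, the call succeeds and stores an empty string of length zero.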
| 
 | |
|        You can find the length in code units of a captured  substring  without
 | |
|        extracting  it  by calling pcre2_substring_length_bynumber(). The first
 | |
|        argument is a pointer to the match data block, the second is the  group
 | |
|        number,  and the third is a pointer to a variable into which the length
 | |
|        is placed. If you just want to know whether or not  the  substring  has
 | |
|        been captured, you can pass the third argument as NULL.
 | |
| 
 | |
|        The  pcre2_substring_copy_bynumber()  function  copies  a captured sub-
 | |
|        string into a supplied buffer,  whereas  pcre2_substring_get_bynumber()
 | |
|        copies  it  into  new memory, obtained using the same memory allocation
 | |
|        function that was used for the match data block. The  first  two  argu-
 | |
|        ments  of  these  functions are a pointer to the match data block and a
 | |
|        capturing group number.
 | |
| 
 | |
|        The final arguments of pcre2_substring_copy_bynumber() are a pointer to
 | |
|        the buffer and a pointer to a variable that contains its length in code
 | |
|        units.  This is updated to contain the actual number of code units used
 | |
|        for the extracted substring, excluding the terminating zero.
 | |
| 
 | |
|        For pcre2_substring_get_bynumber() the third and fourth arguments point
 | |
|        to variables that are updated with a pointer to the new memory and  the
 | |
|        number  of  code units that comprise the substring, again excluding the
 | |
|        terminating zero. When the substring is no longer  needed,  the  memory
 | |
|        should be freed by calling pcre2_substring_free().
 | |
| 
 | |
|        The  return  value  from  all these functions is zero for success, or a
 | |
|        negative error code. If the pattern match  failed,  the  match  failure
 | |
|        code  is  returned.   If  a  substring number greater than zero is used
 | |
|        after a partial match, PCRE2_ERROR_PARTIAL is returned. Other  possible
 | |
|        error codes are:
 | |
| 
 | |
|          PCRE2_ERROR_NOMEMORY
 | |
| 
 | |
|        The  buffer  was  too small for pcre2_substring_copy_bynumber(), or the
 | |
|        attempt to get memory failed for pcre2_substring_get_bynumber().
 | |
| 
 | |
|          PCRE2_ERROR_NOSUBSTRING
 | |
| 
 | |
|        There is no substring with that number in the  pattern,  that  is,  the
 | |
|        number is greater than the number of capturing parentheses.
 | |
| 
 | |
|          PCRE2_ERROR_UNAVAILABLE
 | |
| 
 | |
|        The substring number, though not greater than the number of captures in
 | |
|        the pattern, is greater than the number of slots in the ovector, so the
 | |
|        substring could not be captured.
 | |
| 
 | |
|          PCRE2_ERROR_UNSET
 | |
| 
 | |
|        The  substring  did  not  participate in the match. For example, if the
 | |
|        pattern is (abc)|(def) and the subject is "def", and the  ovector  con-
 | |
|        tains at least two capturing slots, substring number 1 is unset.
 | |
| 
 | |
| 
 | |
| EXTRACTING A LIST OF ALL CAPTURED SUBSTRINGS
 | |
| 
 | |
|        int pcre2_substring_list_get(pcre2_match_data *match_data,
 | |
|          PCRE2_UCHAR ***listptr, PCRE2_SIZE **lengthsptr);
 | |
| 
 | |
|        void pcre2_substring_list_free(PCRE2_SPTR *list);
 | |
| 
 | |
|        The  pcre2_substring_list_get()  function  extracts  all available sub-
 | |
|        strings and builds a list of pointers to  them.  It  also  (optionally)
 | |
|        builds  a  second  list  that  contains  their lengths (in code units),
 | |
|        excluding a terminating zero that is added to each of them. All this is
 | |
|        done in a single block of memory that is obtained using the same memory
 | |
|        allocation function that was used to get the match data block.
 | |
| 
 | |
|        This function must be called only after a successful match.  If  called
 | |
|        after a partial match, the error code PCRE2_ERROR_PARTIAL is returned.
 | |
| 
 | |
|        The  address of the memory block is returned via listptr, which is also
 | |
|        the start of the list of string pointers. The end of the list is marked
 | |
|        by  a  NULL pointer. The address of the list of lengths is returned via
 | |
|        lengthsptr. If your strings do not contain binary zeros and you do  not
 | |
|        therefore need the lengths, you may supply NULL as the lengthsptr argu-
 | |
|        ment to disable the creation of a list of lengths.  The  yield  of  the
 | |
|        function  is zero if all went well, or PCRE2_ERROR_NOMEMORY if the mem-
 | |
|        ory block could not be obtained. When the list is no longer needed,  it
 | |
|        should be freed by calling pcre2_substring_list_free().
 | |
| 
 | |
|        If this function encounters a substring that is unset, which can happen
 | |
|        when capturing subpattern number n+1 matches some part of the  subject,
 | |
|        but  subpattern n has not been used at all, it returns an empty string.
 | |
|        This can be distinguished  from  a  genuine  zero-length  substring  by
 | |
|        inspecting  the  appropriate  offset  in  the  ovector,  which contains
 | |
|        PCRE2_UNSET  for   unset   substrings,   or   by   calling   pcre2_sub-
 | |
|        string_length_bynumber().
 | |
| 
 | |
| 
 | |
| EXTRACTING CAPTURED SUBSTRINGS BY NAME
 | |
| 
 | |
|        int pcre2_substring_number_from_name(const pcre2_code *code,
 | |
|          PCRE2_SPTR name);
 | |
| 
 | |
|        int pcre2_substring_length_byname(pcre2_match_data *match_data,
 | |
|          PCRE2_SPTR name, PCRE2_SIZE *length);
 | |
| 
 | |
|        int pcre2_substring_copy_byname(pcre2_match_data *match_data,
 | |
|          PCRE2_SPTR name, PCRE2_UCHAR *buffer, PCRE2_SIZE *bufflen);
 | |
| 
 | |
|        int pcre2_substring_get_byname(pcre2_match_data *match_data,
 | |
|          PCRE2_SPTR name, PCRE2_UCHAR **bufferptr, PCRE2_SIZE *bufflen);
 | |
| 
 | |
|        void pcre2_substring_free(PCRE2_UCHAR *buffer);
 | |
| 
 | |
|        To extract a substring by name, you first have to find  the  associated
 | |
|        number. For example, for this pattern:
 | |
| 
 | |
|          (a+)b(?<xxx>\d+)...
 | |
| 
 | |
|        the number of the subpattern called "xxx" is 2. If the name is known to
 | |
|        be  unique  (PCRE2_DUPNAMES  was not set), you can find the number from
 | |
|        the name by calling pcre2_substring_number_from_name(). The first argu-
 | |
|        ment  is the compiled pattern, and the second is the name. The yield of
 | |
|        the function is the subpattern number, PCRE2_ERROR_NOSUBSTRING if there
 | |
|        is  no  subpattern  of  that  name, or PCRE2_ERROR_NOUNIQUESUBSTRING if
 | |
|        there is more than one subpattern of that name. Given the  number,  you
 | |
|        can  extract  the  substring  directly,  or  use  one  of the functions
 | |
|        described above.
 | |
| 
 | |
|        For convenience, there are also "byname" functions that  correspond  to
 | |
|        the  "bynumber"  functions,  the  only difference being that the second
 | |
|        argument is a name instead of a number. If PCRE2_DUPNAMES  is  set  and
 | |
|        there are duplicate names, these functions scan all the groups with the
 | |
|        given name, and return the first named string that is set.
 | |
| 
 | |
|        If there are no groups with the given name, PCRE2_ERROR_NOSUBSTRING  is
 | |
|        returned.  If  all  groups  with the name have numbers that are greater
 | |
|        than the number of slots in  the  ovector,  PCRE2_ERROR_UNAVAILABLE  is
 | |
|        returned.  If  there  is at least one group with a slot in the ovector,
 | |
|        but no group is found to be set, PCRE2_ERROR_UNSET is returned.
 | |
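
       The selection rule for duplicate names can be modelled as below. This
       is a sketch under stated assumptions, not PCRE2's implementation: the
       caller is assumed to have already collected the group numbers that
       share the name, and first_set_group, SKETCH_UNSET and the SKETCH_ERR_*
       codes are hypothetical stand-ins for the library's constants.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for PCRE2_UNSET (assumption: same value as in pcre2.h). */
#define SKETCH_UNSET (~(size_t)0)

/* Illustrative codes; the real PCRE2_ERROR_* values differ. */
enum {
    SKETCH_ERR_NOSUBSTRING = -1,
    SKETCH_ERR_UNAVAILABLE = -2,
    SKETCH_ERR_UNSET = -3
};

/*
 * Given the `count` group numbers that carry one name, returns the first
 * group that is set in an ovector of `pairs` offset pairs, or a negative
 * code following the rules described in the text.
 */
int first_set_group(const unsigned *numbers, size_t count,
                    const size_t *ovector, size_t pairs)
{
    int have_slot = 0;
    assert(ovector != NULL);
    if (count == 0) return SKETCH_ERR_NOSUBSTRING;
    assert(numbers != NULL);
    for (size_t i = 0; i < count; i++) {
        unsigned n = numbers[i];
        if (n >= pairs) continue;          /* no slot for this group */
        have_slot = 1;
        if (ovector[2 * (size_t)n] != SKETCH_UNSET) return (int)n;
    }
    return have_slot ? SKETCH_ERR_UNSET : SKETCH_ERR_UNAVAILABLE;
}
```

       For a pattern such as (?<n>abc)|(?<n>def) compiled with PCRE2_DUPNAMES
       and matched against "xyzdef", groups 1 and 2 share the name and only
       group 2 is set, so the sketch selects group 2.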
| 
 | |
|        Warning: If the pattern uses the (?| feature to set up multiple subpat-
 | |
|        terns  with  the  same number, as described in the section on duplicate
 | |
|        subpattern numbers in the pcre2pattern page, you cannot  use  names  to
 | |
|        distinguish  the  different subpatterns, because names are not included
 | |
|        in the compiled code. The matching process uses only numbers. For  this
 | |
|        reason,  the  use of different names for subpatterns of the same number
 | |
|        causes an error at compile time.
 | |
| 
 | |
| 
 | |
| CREATING A NEW STRING WITH SUBSTITUTIONS
 | |
| 
 | |
|        int pcre2_substitute(const pcre2_code *code, PCRE2_SPTR subject,
 | |
|          PCRE2_SIZE length, PCRE2_SIZE startoffset,
 | |
|          uint32_t options, pcre2_match_data *match_data,
 | |
|          pcre2_match_context *mcontext, PCRE2_SPTR replacement,
 | |
|          PCRE2_SIZE rlength, PCRE2_UCHAR *outputbuffer,
 | |
|          PCRE2_SIZE *outlengthptr);
 | |
| 
 | |
|        This function calls pcre2_match() and then makes a copy of the  subject
 | |
|        string  in  outputbuffer,  replacing the part that was matched with the
 | |
|        replacement string, whose length is supplied in rlength.  This  can  be
 | |
|        given as PCRE2_ZERO_TERMINATED for a zero-terminated string.
 | |
| 
 | |
|        In  the replacement string, which is interpreted as a UTF string in UTF
 | |
|        mode, and is checked for UTF  validity  unless  the  PCRE2_NO_UTF_CHECK
 | |
|        option is set, a dollar character is an escape character that can spec-
 | |
|        ify the insertion of characters from capturing groups in  the  pattern.
 | |
|        The following forms are recognized:
 | |
| 
 | |
|          $$      insert a dollar character
 | |
|          $<n>    insert the contents of group <n>
 | |
|          ${<n>}  insert the contents of group <n>
 | |
| 
 | |
|        Either  a  group  number  or  a  group name can be given for <n>. Curly
 | |
|        brackets are required only if the following character would  be  inter-
 | |
|        preted as part of the number or name. The number may be zero to include
 | |
|        the entire matched string.   For  example,  if  the  pattern  a(b)c  is
 | |
|        matched  with "=abc=" and the replacement string "+$1$0$1+", the result
 | |
|        is "=+babcb+=". Group insertion is done by calling
 | |
|        pcre2_substring_copy_byname() or pcre2_substring_copy_bynumber() as
 | |
|        appropriate.
 | |
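
       To make the replacement syntax concrete, here is a simplified model of
       the expansion step. It is not how pcre2_substitute() is implemented:
       expand_replacement and its return codes are hypothetical, only numeric
       group references are handled, and the captured text is passed in as
       ordinary C strings.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/*
 * Expands `rep` into `out`, which has room for `outsize` bytes including
 * the terminating zero. groups[i] is the text of group i, with groups[0]
 * the entire match. Returns 0 on success, -1 for an unrecognized
 * sequence or unknown group, and -2 if `out` is too small.
 */
int expand_replacement(const char *rep, const char *const *groups,
                       size_t ngroups, char *out, size_t outsize)
{
    size_t used = 0;
    assert(rep != NULL && groups != NULL && out != NULL);
    while (*rep != '\0') {
        const char *piece;
        size_t piece_len;
        if (*rep != '$') {                 /* ordinary character */
            piece = rep++;
            piece_len = 1;
        } else if (rep[1] == '$') {        /* $$ inserts one dollar */
            piece = "$";
            piece_len = 1;
            rep += 2;
        } else {                           /* $<n> or ${<n>} */
            int braced = (rep[1] == '{');
            const char *p = rep + 1 + braced;
            size_t n = 0;
            if (!isdigit((unsigned char)*p)) return -1;
            while (isdigit((unsigned char)*p))
                n = n * 10 + (size_t)(*p++ - '0');
            if (braced) {
                if (*p != '}') return -1;
                p++;
            }
            if (n >= ngroups) return -1;
            piece = groups[n];
            piece_len = strlen(piece);
            rep = p;
        }
        if (used + piece_len + 1 > outsize) return -2;
        memcpy(out + used, piece, piece_len);
        used += piece_len;
    }
    if (outsize == 0) return -2;
    out[used] = '\0';
    return 0;
}
```

       With groups[0] = "abc" and groups[1] = "b", the replacement "+$1$0$1+"
       expands to "+babcb+". The surrounding "=" characters in the documented
       result come from the unmatched parts of the subject, which this sketch
       does not copy.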
| 
 | |
|        The  first  seven  arguments  of pcre2_substitute() are the same as for
 | |
|        pcre2_match(), except that the partial matching options are not permit-
 | |
|        ted,  and  match_data may be passed as NULL, in which case a match data
 | |
|        block is obtained and freed within this function, using memory  manage-
 | |
|        ment  functions from the match context, if provided, or else those that
 | |
|        were used to allocate memory for the compiled code.
 | |
| 
 | |
|        There is one additional option, PCRE2_SUBSTITUTE_GLOBAL,  which  causes
 | |
|        the function to iterate over the subject string, replacing every match-
 | |
|        ing substring. If this is not set, only the first matching substring is
 | |
|        replaced.
 | |
| 
 | |
|        The  outlengthptr  argument  must point to a variable that contains the
 | |
|        length, in code units, of the output buffer. It is updated  to  contain
 | |
|        the length of the new string, excluding the trailing zero that is auto-
 | |
|        matically added.
 | |
| 
 | |
|        The function returns the number of replacements that  were  made.  This
 | |
|        may  be  zero  if  no  matches  were found, and is never greater than 1
 | |
|        unless PCRE2_SUBSTITUTE_GLOBAL is set. In the event of an error, a neg-
 | |
|        ative  error code is returned. Except for PCRE2_ERROR_NOMATCH (which is
 | |
|        never returned), any errors from pcre2_match() or the substring copying
 | |
|        functions  are  passed  straight  back.  PCRE2_ERROR_BADREPLACEMENT  is
 | |
|        returned for an invalid replacement string (unrecognized sequence  fol-
 | |
|        lowing a dollar sign), and PCRE2_ERROR_NOMEMORY is returned if the out-
 | |
|        put buffer is not big enough.
 | |
| 
 | |
| 
 | |
| DUPLICATE SUBPATTERN NAMES
 | |
| 
 | |
|        int pcre2_substring_nametable_scan(const pcre2_code *code,
 | |
|          PCRE2_SPTR name, PCRE2_SPTR *first, PCRE2_SPTR *last);
 | |
| 
 | |
|        When a pattern is compiled with the PCRE2_DUPNAMES  option,  names  for
 | |
|        subpatterns  are  not required to be unique. Duplicate names are always
 | |
|        allowed for subpatterns with the same number, created by using the  (?|
 | |
|        feature.  Indeed,  if  such subpatterns are named, they are required to
 | |
|        use the same names.
 | |
| 
 | |
|        Normally, patterns with duplicate names are such that in any one match,
 | |
|        only  one of the named subpatterns participates. An example is shown in
 | |
|        the pcre2pattern documentation.
 | |
| 
 | |
|        When  duplicates   are   present,   pcre2_substring_copy_byname()   and
 | |
|        pcre2_substring_get_byname()  return  the first substring corresponding
 | |
|        to  the  given  name  that  is  set.  Only   if   none   are   set   is
 | |
|        PCRE2_ERROR_UNSET  returned.   The  pcre2_substring_number_from_name()
 | |
|        function returns the error PCRE2_ERROR_NOUNIQUESUBSTRING when there are
 | |
|        duplicate names.
 | |
| 
 | |
|        If  you want to get full details of all captured substrings for a given
 | |
|        name, you must use the pcre2_substring_nametable_scan()  function.  The
 | |
|        first  argument is the compiled pattern, and the second is the name. If
 | |
|        the third and fourth arguments are NULL, the function returns  a  group
 | |
|        number for a unique name, or PCRE2_ERROR_NOUNIQUESUBSTRING otherwise.
 | |
| 
 | |
|        When the third and fourth arguments are not NULL, they must be pointers
 | |
|        to variables that are updated by the function. After it has  run,  they
 | |
|        point to the first and last entries in the name-to-number table for the
 | |
|        given name, and the function returns the length of each entry  in  code
 | |
|        units.  In both cases, PCRE2_ERROR_NOSUBSTRING is returned if there are
 | |
|        no entries for the given name.
 | |
| 
 | |
|        The format of the name table is described above in the section entitled
 | |
|        Information  about  a  pattern.  Given all the relevant entries for
 | |
|        the name, you can extract each of their numbers, and hence the captured
 | |
|        data.
 | |
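
       Walking the entries between first and last might look like the sketch
       below. It assumes the 8-bit library, where each name table entry
       starts with the group number stored in two bytes, most significant
       first, followed by the zero-terminated name. The function name
       collect_group_numbers is hypothetical, and entry_size is the value
       returned by pcre2_substring_nametable_scan().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Reads the group number from every name table entry in the inclusive
 * range [first, last], stepping by `entry_size` code units. Stores at
 * most `max` numbers in `numbers` and returns how many were stored.
 */
size_t collect_group_numbers(const uint8_t *first, const uint8_t *last,
                             size_t entry_size, unsigned *numbers,
                             size_t max)
{
    size_t count = 0;
    assert(first != NULL && last != NULL && numbers != NULL);
    assert(entry_size > 2);   /* two number bytes plus at least a zero */
    for (const uint8_t *e = first; e <= last && count < max;
         e += entry_size) {
        /* Big-endian 16-bit group number at the start of the entry. */
        numbers[count++] = ((unsigned)e[0] << 8) | e[1];
    }
    return count;
}
```

       Each number collected this way can then be used to index the ovector
       or passed to one of the "bynumber" functions.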
| 
 | |
| 
 | |
| FINDING ALL POSSIBLE MATCHES AT ONE POSITION
 | |
| 
 | |
|        The  traditional  matching  function  uses a similar algorithm to Perl,
 | |
|        which stops when it finds the first match at a given point in the  sub-
 | |
|        ject. If you want to find all possible matches, or the longest possible
 | |
|        match at a given position,  consider  using  the  alternative  matching
 | |
|        function  (see  below) instead. If you cannot use the alternative func-
 | |
|        tion, you can kludge it up by making use of the callout facility, which
 | |
|        is described in the pcre2callout documentation.
 | |
| 
 | |
|        What you have to do is to insert a callout right at the end of the pat-
 | |
|        tern.  When your callout function is called, extract and save the  cur-
 | |
|        rent  matched  substring.  Then return 1, which forces pcre2_match() to
 | |
|        backtrack and try other alternatives. Ultimately, when it runs  out  of
 | |
|        matches, pcre2_match() will yield PCRE2_ERROR_NOMATCH.
 | |
| 
 | |
| 
 | |
| MATCHING A PATTERN: THE ALTERNATIVE FUNCTION
 | |
| 
 | |
|        int pcre2_dfa_match(const pcre2_code *code, PCRE2_SPTR subject,
 | |
|          PCRE2_SIZE length, PCRE2_SIZE startoffset,
 | |
|          uint32_t options, pcre2_match_data *match_data,
 | |
|          pcre2_match_context *mcontext,
 | |
|          int *workspace, PCRE2_SIZE wscount);
 | |
| 
 | |
|        The  function  pcre2_dfa_match()  is  called  to match a subject string
 | |
|        against a compiled pattern, using a matching algorithm that  scans  the
 | |
|        subject  string  just  once, and does not backtrack. This has different
 | |
|        characteristics to the normal algorithm, and  is  not  compatible  with
 | |
|        Perl.  Some of the features of PCRE2 patterns are not supported. Never-
 | |
|        theless, there are times when this kind of matching can be useful.  For
 | |
|        a  discussion  of  the  two matching algorithms, and a list of features
 | |
|        that pcre2_dfa_match() does not support, see the pcre2matching documen-
 | |
|        tation.
 | |
| 
 | |
|        The  arguments  for  the pcre2_dfa_match() function are the same as for
 | |
|        pcre2_match(), plus two extras. The ovector within the match data block
 | |
|        is used in a different way, and this is described below. The other com-
 | |
|        mon arguments are used in the same way as for pcre2_match(),  so  their
 | |
|        description is not repeated here.
 | |
| 
 | |
|        The  two  additional  arguments provide workspace for the function. The
 | |
|        workspace vector should contain at least 20 elements. It  is  used  for
 | |
|        keeping  track  of  multiple  paths  through  the  pattern  tree.  More
 | |
|        workspace is needed for patterns and subjects where there are a lot  of
 | |
|        potential matches.
 | |
| 
 | |
|        Here is an example of a simple call to pcre2_dfa_match():
 | |
| 
 | |
|          int wspace[20];
 | |
|          pcre2_match_data *match_data = pcre2_match_data_create(4, NULL);
 | |
|          int rc = pcre2_dfa_match(
 | |
|            re,             /* result of pcre2_compile() */
 | |
|            "some string",  /* the subject string */
 | |
|            11,             /* the length of the subject string */
 | |
|            0,              /* start at offset 0 in the subject */
 | |
|            0,              /* default options */
 | |
|            match_data,     /* the match data block */
 | |
|            NULL,           /* a match context; NULL means use defaults */
 | |
|            wspace,         /* working space vector */
 | |
|            20);            /* number of elements (NOT size in bytes) */
 | |
| 
 | |
|    Option bits for pcre2_dfa_match()
 | |
| 
 | |
|        The  unused  bits of the options argument for pcre2_dfa_match() must be
 | |
|        zero. The only bits that may be set are  PCRE2_ANCHORED,  PCRE2_NOTBOL,
 | |
|        PCRE2_NOTEOL,          PCRE2_NOTEMPTY,          PCRE2_NOTEMPTY_ATSTART,
 | |
|        PCRE2_NO_UTF_CHECK,       PCRE2_PARTIAL_HARD,       PCRE2_PARTIAL_SOFT,
 | |
|        PCRE2_DFA_SHORTEST,  and  PCRE2_DFA_RESTART.  All  but the last four of
 | |
|        these are exactly the same as for pcre2_match(), so  their  description
 | |
|        is not repeated here.
 | |
| 
 | |
|          PCRE2_PARTIAL_HARD
 | |
|          PCRE2_PARTIAL_SOFT
 | |
| 
 | |
|        These  have  the  same general effect as they do for pcre2_match(), but
 | |
|        the details are slightly different. When PCRE2_PARTIAL_HARD is set  for
 | |
|        pcre2_dfa_match(),  it  returns  PCRE2_ERROR_PARTIAL  if the end of the
 | |
|        subject is reached and there is still at least one matching possibility
 | |
|        that requires additional characters. This happens even if some complete
 | |
|        matches have already been found. When PCRE2_PARTIAL_SOFT  is  set,  the
 | |
|        return  code  PCRE2_ERROR_NOMATCH is converted into PCRE2_ERROR_PARTIAL
 | |
|        if the end of the subject is  reached,  there  have  been  no  complete
 | |
|        matches, but there is still at least one matching possibility. The por-
 | |
|        tion of the string that was inspected when the  longest  partial  match
 | |
|        was found is set as the first matching string in both cases. There is a
 | |
|        more detailed discussion of partial and  multi-segment  matching,  with
 | |
|        examples, in the pcre2partial documentation.
 | |
| 
 | |
|          PCRE2_DFA_SHORTEST
 | |
| 
 | |
|        Setting  the PCRE2_DFA_SHORTEST option causes the matching algorithm to
 | |
|        stop as soon as it has found one match. Because of the way the alterna-
 | |
|        tive  algorithm  works, this is necessarily the shortest possible match
 | |
|        at the first possible matching point in the subject string.
 | |
| 
 | |
|          PCRE2_DFA_RESTART
 | |
| 
 | |
|        When pcre2_dfa_match() returns a partial match, it is possible to  call
 | |
|        it again, with additional subject characters, and have it continue with
 | |
|        the same match. The PCRE2_DFA_RESTART option requests this action; when
 | |
|        it is set, the workspace and wscount arguments must reference the  same
 | |
|        vector as before because data about the match so far is  left  in  them
 | |
|        after a partial match. There is more discussion of this facility in the
 | |
|        pcre2partial documentation.
 | |
| 
 | |
|    Successful returns from pcre2_dfa_match()
 | |
| 
 | |
|        When pcre2_dfa_match() succeeds, it may have matched more than one sub-
 | |
|        string in the subject. Note, however, that all the matches from one run
 | |
|        of the function start at the same point in  the  subject.  The  shorter
 | |
|        matches  are all initial substrings of the longer matches. For example,
 | |
|        if the pattern
 | |
| 
 | |
|          <.*>
 | |
| 
 | |
|        is matched against the string
 | |
| 
 | |
|          This is <something> <something else> <something further> no more
 | |
| 
 | |
|        the three matched strings are
 | |
| 
 | |
|          <something> <something else> <something further>
 | |
|          <something> <something else>
 | |
|          <something>
 | |
| 
 | |
|        On success, the yield of the function is a number  greater  than  zero,
 | |
|        which  is  the  number  of  matched substrings. The offsets of the sub-
 | |
|        strings are returned in the ovector, and can be extracted by number  in
 | |
|        the  same way as for pcre2_match(), but the numbers bear no relation to
 | |
|        any capturing groups that may exist in the pattern, because DFA  match-
 | |
|        ing does not support group capture.
 | |
| 
 | |
|        Calls  to  the  convenience  functions  that extract substrings by name
 | |
|        return the error PCRE2_ERROR_DFA_UFUNC (unsupported function)  if  used
 | |
|        after a DFA match. The convenience functions that extract substrings by
 | |
|        number never return PCRE2_ERROR_NOSUBSTRING, and the meanings  of  some
 | |
|        other errors are slightly different:
 | |
| 
 | |
|          PCRE2_ERROR_UNAVAILABLE
 | |
| 
 | |
|        The ovector is not big enough to include a slot for the given substring
 | |
|        number.
 | |
| 
 | |
|          PCRE2_ERROR_UNSET
 | |
| 
 | |
|        There is a slot in the ovector  for  this  substring,  but  there  were
 | |
|        insufficient matches to fill it.
 | |
| 
 | |
|        The  matched  strings  are  stored  in  the ovector in reverse order of
 | |
|        length; that is, the longest matching string is first.  If  there  were
 | |
|        too  many matches to fit into the ovector, the yield of the function is
 | |
|        zero, and the vector is filled with the longest matches.
 | |
| 
 | |
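       Because every match starts at the same point, the  ovector  after  a
       DFA match is a list of (start, end) offset pairs ordered from longest
       to shortest. The sketch below shows one way to copy a match out of an
       8-bit subject. It is an illustration only: dfa_copy_match() is a
       hypothetical helper, and size_t stands in for the  library's  offset
       type.

```c
#include <stddef.h>
#include <string.h>

/* Copies match number `index` out of `subject` into `buf` as a C string.
   `ovector` holds (start, end) offset pairs, longest match first.
   Returns the match length, or (size_t)-1 if `buf` is too small. */
static size_t dfa_copy_match(const char *subject, const size_t *ovector,
                             size_t index, char *buf, size_t bufsize)
{
    size_t start = ovector[2 * index];
    size_t end = ovector[2 * index + 1];
    size_t length = end - start;

    if (length + 1 > bufsize) return (size_t)-1;
    memcpy(buf, subject + start, length);
    buf[length] = '\0';
    return length;
}
```

       For the <.*> example earlier in this section,  the  pairs  would  be
       (8,56), (8,36) and (8,19). These offsets were worked out by hand from
       the subject string; they are not output captured from the library.
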
|        NOTE: PCRE2's "auto-possessification" optimization usually  applies  to
 | |
|        character  repeats at the end of a pattern (as well as internally). For
 | |
|        example, the pattern "a\d+" is compiled as if it were "a\d++". For  DFA
 | |
|        matching,  this  means  that  only  one possible match is found. If you
 | |
|        really do want multiple matches in such cases, either use  an  ungreedy
 | |
|        repeat  such  as  "a\d+?"  or set the PCRE2_NO_AUTO_POSSESS option when
 | |
|        compiling.
 | |
| 
 | |
|    Error returns from pcre2_dfa_match()
 | |
| 
 | |
|        The pcre2_dfa_match() function returns a negative number when it fails.
 | |
|        Many  of  the  errors  are  the same as for pcre2_match(), as described
 | |
|        above.  There are in addition the following errors that are specific to
 | |
|        pcre2_dfa_match():
 | |
| 
 | |
|          PCRE2_ERROR_DFA_UITEM
 | |
| 
 | |
|        This  return  is  given  if pcre2_dfa_match() encounters an item in the
 | |
|        pattern that it does not support, for instance, the use of \C or a back
 | |
|        reference.
 | |
| 
 | |
|          PCRE2_ERROR_DFA_UCOND
 | |
| 
 | |
|        This  return  is given if pcre2_dfa_match() encounters a condition item
 | |
|        that uses a back reference for the condition, or a test  for  recursion
 | |
|        in a specific group. These are not supported.
 | |
| 
 | |
|          PCRE2_ERROR_DFA_WSSIZE
 | |
| 
 | |
|        This  return  is  given  if  pcre2_dfa_match() runs out of space in the
 | |
|        workspace vector.
 | |
| 
 | |
|          PCRE2_ERROR_DFA_RECURSE
 | |
| 
 | |
|        When a recursive subpattern is processed, the matching  function  calls
 | |
|        itself recursively, using private memory for the ovector and workspace.
 | |
|        This error is given if the internal ovector is not large  enough.  This
 | |
|        should be extremely rare, as a vector of size 1000 is used.
 | |
| 
 | |
|          PCRE2_ERROR_DFA_BADRESTART
 | |
| 
 | |
|        When  pcre2_dfa_match()  is  called  with the PCRE2_DFA_RESTART option,
 | |
|        some plausibility checks are made on the  contents  of  the  workspace,
 | |
|        which  should  contain data about the previous partial match. If any of
 | |
|        these checks fail, this error is given.
 | |
| 
 | |
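       Of these errors, PCRE2_ERROR_DFA_WSSIZE is the one that a caller can
       often recover from, by retrying with a  larger  workspace.  What
       follows is a minimal sketch of such a retry loop, under  stated
       assumptions: the error constants are stand-ins with  made-up  values
       (not those in pcre2.h), the callback type dfa_runner is hypothetical,
       and fake_dfa_match() merely imitates a call to pcre2_dfa_match() so
       that the fragment is self-contained.

```c
#include <stdlib.h>

#define SKETCH_ERROR_WSSIZE   (-1)  /* stand-in, NOT the library's value */
#define SKETCH_ERROR_NOMEMORY (-2)  /* stand-in, NOT the library's value */

/* Hypothetical callback that wraps one matching attempt. */
typedef int (*dfa_runner)(int *workspace, size_t wscount, void *user);

/* Runs `run` with a workspace of `initial` ints (must be non-zero),
   doubling the size after each workspace failure until `maximum` would
   be exceeded. */
static int run_with_growing_workspace(dfa_runner run, void *user,
                                      size_t initial, size_t maximum)
{
    size_t wscount = initial;

    for (;;) {
        int rc;
        int *workspace = malloc(wscount * sizeof *workspace);

        if (workspace == NULL) return SKETCH_ERROR_NOMEMORY;
        rc = run(workspace, wscount, user);
        free(workspace);

        if (rc != SKETCH_ERROR_WSSIZE) return rc;  /* success or other error */
        if (wscount > maximum / 2) return rc;      /* give up */
        wscount *= 2;
    }
}

/* Imitation matcher for demonstration: "matches" with a yield of 1 only
   when the workspace has at least *(size_t *)user elements. */
static int fake_dfa_match(int *workspace, size_t wscount, void *user)
{
    size_t needed = *(const size_t *)user;

    (void)workspace;
    return wscount >= needed ? 1 : SKETCH_ERROR_WSSIZE;
}
```

       A retry of this kind starts the match again from scratch, so it  is
       not suitable in the middle of a sequence of PCRE2_DFA_RESTART calls,
       where the workspace contents must be preserved.
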
| 
 | |
| SEE ALSO
 | |
| 
 | |
|        pcre2build(3),   pcre2callout(3),    pcre2demo(3),    pcre2matching(3),
 | |
|        pcre2partial(3),    pcre2posix(3),    pcre2sample(3),    pcre2stack(3),
 | |
|        pcre2unicode(3).
 | |
| 
 | |
| 
 | |
| AUTHOR
 | |
| 
 | |
|        Philip Hazel
 | |
|        University Computing Service
 | |
|        Cambridge, England.
 | |
| 
 | |
| 
 | |
| REVISION
 | |
| 
 | |
|        Last updated: 22 April 2015
 | |
|        Copyright (c) 1997-2015 University of Cambridge.
 | |
| ------------------------------------------------------------------------------
 | |
| 
 | |
| 
 | |
| PCRE2BUILD(3)              Library Functions Manual              PCRE2BUILD(3)
 | |
| 
 | |
| 
 | |
| 
 | |
| NAME
 | |
|        PCRE2 - Perl-compatible regular expressions (revised API)
 | |
| 
 | |
| BUILDING PCRE2
 | |
| 
 | |
|        PCRE2  is distributed with a configure script that can be used to build
 | |
|        the library in Unix-like environments using the applications  known  as
 | |
|        Autotools. Also in the distribution are files to support building using
 | |
|        CMake instead of configure.  The  text  file  README  contains  general
 | |
|        information  about  building  with Autotools (some of which is repeated
 | |
|        below), and also has some comments about building on various  operating
 | |
|        systems.  There  is a lot more information about building PCRE2 without
 | |
|        using Autotools (including information about using CMake  and  building
 | |
|        "by  hand")  in  the  text file called NON-AUTOTOOLS-BUILD.  You should
 | |
|        consult this file as well as the README file if you are building  in  a
 | |
|        non-Unix-like environment.
 | |
| 
 | |
| 
 | |
| PCRE2 BUILD-TIME OPTIONS
 | |
| 
 | |
|        The rest of this document describes the optional features of PCRE2 that
 | |
|        can be selected when the library is compiled. It  assumes  use  of  the
 | |
|        configure  script,  where  the  optional features are selected or dese-
 | |
|        lected by providing options to configure before running the  make  com-
 | |
|        mand.  However,  the same options can be selected in both Unix-like and
 | |
|        non-Unix-like environments if you are using CMake instead of  configure
 | |
|        to build PCRE2.
 | |
| 
 | |
|        If  you  are not using Autotools or CMake, option selection can be done
 | |
|        by editing the config.h file, or by passing parameter settings  to  the
 | |
|        compiler, as described in NON-AUTOTOOLS-BUILD.
 | |
| 
 | |
|        The complete list of options for configure (which includes the standard
 | |
|        ones such as the  selection  of  the  installation  directory)  can  be
 | |
|        obtained by running
 | |
| 
 | |
|          ./configure --help
 | |
| 
 | |
|        The  following  sections  include  descriptions  of options whose names
 | |
|        begin with --enable or --disable. These settings specify changes to the
 | |
|        defaults  for  the configure command. Because of the way that configure
 | |
|        works, --enable and --disable always come in pairs, so  the  complemen-
 | |
|        tary  option always exists as well, but as it specifies the default, it
 | |
|        is not described.
 | |
| 
 | |
| 
 | |
| BUILDING 8-BIT, 16-BIT AND 32-BIT LIBRARIES
 | |
| 
 | |
|        By default, a library called libpcre2-8 is built, containing  functions
 | |
|        that  take  string arguments contained in vectors of bytes, interpreted
 | |
|        either as single-byte characters, or UTF-8 strings. You can also  build
 | |
|        two  other libraries, called libpcre2-16 and libpcre2-32, which process
 | |
|        strings that are contained in vectors of 16-bit and 32-bit code  units,
 | |
|        respectively. These can be interpreted either as single-unit characters
 | |
|        or UTF-16/UTF-32 strings. To build these additional libraries, add  one
 | |
|        or both of the following to the configure command:
 | |
| 
 | |
|          --enable-pcre2-16
 | |
|          --enable-pcre2-32
 | |
| 
 | |
|        If you do not want the 8-bit library, add
 | |
| 
 | |
|          --disable-pcre2-8
 | |
| 
 | |
|        as  well.  At least one of the three libraries must be built. Note that
 | |
|        the POSIX wrapper is for the 8-bit library only, and that pcre2grep  is
 | |
|        an  8-bit  program.  Neither  of  these is built if you select only the
 | |
|        16-bit or 32-bit libraries.
 | |
| 
 | |
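       The practical difference between the three libraries  is  the  width
       of a code unit, and therefore how many code units a single character
       occupies when strings are treated as UTF. The sketch below computes
       this from the standard UTF-8, UTF-16 and UTF-32 encoding rules. The
       function name code_units_needed() is hypothetical and is not part of
       the PCRE2 API.

```c
#include <stdint.h>

/* Number of code units needed to encode code point `cp` in a UTF string
   for the library of the given width (8, 16 or 32). Returns 0 for an
   unknown width, a surrogate, or a code point above 0x10ffff. */
static int code_units_needed(uint32_t cp, int width)
{
    if (cp > 0x10ffff || (cp >= 0xd800 && cp <= 0xdfff)) return 0;

    switch (width) {
    case 8:  return cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
    case 16: return cp < 0x10000 ? 1 : 2;
    case 32: return 1;
    default: return 0;
    }
}
```

       For example, U+20AC needs three code units in the 8-bit library  but
       only one in the 16-bit and 32-bit libraries.
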
| 
 | |
| BUILDING SHARED AND STATIC LIBRARIES
 | |
| 
 | |
|        The Autotools PCRE2 building process uses libtool to build both  shared
 | |
|        and  static  libraries by default. You can suppress an unwanted library
 | |
|        by adding one of
 | |
| 
 | |
|          --disable-shared
 | |
|          --disable-static
 | |
| 
 | |
|        to the configure command.
 | |
| 
 | |
| 
 | |
| UNICODE AND UTF SUPPORT
 | |
| 
 | |
|        By default, PCRE2 is built with support for Unicode and  UTF  character
 | |
|        strings.  To build it without Unicode support, add
 | |
| 
 | |
|          --disable-unicode
 | |
| 
 | |
|        to  the configure command. This setting applies to all three libraries.
 | |
|        It is not possible to build  one  library  with  Unicode  support,  and
 | |
|        another without, in the same configuration.
 | |
| 
 | |
|        Of  itself, Unicode support does not make PCRE2 treat strings as UTF-8,
 | |
|        UTF-16 or UTF-32. To do that, applications that use the library can set
 | |
|        the  PCRE2_UTF  option when they call pcre2_compile() to compile a pat-
 | |
|        tern.  Alternatively, patterns may be started with  (*UTF)  unless  the
 | |
|        application has locked this out by setting PCRE2_NEVER_UTF.
 | |
| 
 | |
|        UTF support allows the libraries to process character code points up to
 | |
|        0x10ffff in the strings that they handle. It also provides support  for
 | |
|        accessing  the  Unicode  properties  of  such characters, using pattern
 | |
|        escapes such as \P, \p, and \X. Only the  general  category  properties
 | |
|        such  as Lu and Nd are supported. Details are given in the pcre2pattern
 | |
|        documentation.
 | |
| 
 | |
|        Pattern escapes such as \d and \w do not by default make use of Unicode
 | |
|        properties.  The  application  can  request that they do by setting the
 | |
|        PCRE2_UCP option. Unless the application  has  set  PCRE2_NEVER_UCP,  a
 | |
|        pattern may also request this by starting with (*UCP).
 | |
| 
 | |
|        The \C escape sequence, which matches a single code unit, even in a UTF
 | |
|        mode, can cause unpredictable behaviour because it may leave  the  cur-
 | |
|        rent  matching  point  in the middle of a multi-code-unit character. It
 | |
|        can be locked out by setting the PCRE2_NEVER_BACKSLASH_C option.
 | |
| 
 | |
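       To see why \C can be troublesome, consider a UTF-8 subject in  which
       a character occupies more than one byte. After \C has matched  one
       byte, the current position may be a continuation byte rather than the
       start of a character. The sketch below tests for this, assuming that
       the subject is valid UTF-8; utf8_is_char_boundary() is a hypothetical
       helper, not a library function.

```c
#include <stdbool.h>
#include <stddef.h>

/* True if `offset` is at the start of a character in a valid UTF-8
   subject, or at its very end. Continuation bytes have the bit pattern
   10xxxxxx, so any other byte begins a character. */
static bool utf8_is_char_boundary(const unsigned char *subject,
                                  size_t length, size_t offset)
{
    if (offset > length) return false;
    if (offset == length) return true;
    return (subject[offset] & 0xc0) != 0x80;
}
```

       In the four-byte subject 'a', 0xc3, 0xa9, 'b' (the letter  e  with
       an acute accent between two ASCII letters), offset 2 is not a
       character boundary, which is where a single \C after "a" would leave
       the matching point.
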
| 
 | |
| JUST-IN-TIME COMPILER SUPPORT
 | |
| 
 | |
|        Just-in-time compiler support is included in the build by specifying
 | |
| 
 | |
|          --enable-jit
 | |
| 
 | |
|        This support is available only for certain hardware  architectures.  If
 | |
|        this  option  is  set for an unsupported architecture, a building error
 | |
|        occurs.  See the pcre2jit documentation for a discussion of JIT  usage.
 | |
|        When  JIT  support is enabled, pcre2grep automatically makes use of it,
 | |
|        unless you add
 | |
| 
 | |
|          --disable-pcre2grep-jit
 | |
| 
 | |
|        to the "configure" command.
 | |
| 
 | |
| 
 | |
| NEWLINE RECOGNITION
 | |
| 
 | |
|        By default, PCRE2 interprets the linefeed (LF) character as  indicating
 | |
|        the  end  of  a line. This is the normal newline character on Unix-like
 | |
|        systems. You can compile PCRE2 to use carriage return (CR) instead,  by
 | |
|        adding
 | |
| 
 | |
|          --enable-newline-is-cr
 | |
| 
 | |
|        to  the  configure  command.  There  is  also an --enable-newline-is-lf
 | |
|        option, which explicitly specifies linefeed as the newline character.
 | |
| 
 | |
|        Alternatively, you can specify that line endings are to be indicated by
 | |
|        the two-character sequence CRLF (CR immediately followed by LF). If you
 | |
|        want this, add
 | |
| 
 | |
|          --enable-newline-is-crlf
 | |
| 
 | |
|        to the configure command. There is a fourth option, specified by
 | |
| 
 | |
|          --enable-newline-is-anycrlf
 | |
| 
 | |
|        which causes PCRE2 to recognize any of the three sequences CR,  LF,  or
 | |
|        CRLF as indicating a line ending. Finally, a fifth option, specified by
 | |
| 
 | |
|          --enable-newline-is-any
 | |
| 
 | |
|        causes  PCRE2  to  recognize  any Unicode newline sequence. The Unicode
 | |
|        newline sequences are the three just mentioned, plus the single charac-
 | |
|        ters VT (vertical tab, U+000B), FF (form feed, U+000C), NEL (next line,
 | |
|        U+0085), LS (line separator,  U+2028),  and  PS  (paragraph  separator,
 | |
|        U+2029).
 | |
| 
 | |
|        Whatever default line ending convention is selected when PCRE2 is built
 | |
|        can be overridden by applications that use the library. At  build  time
 | |
|        it is conventional to use the standard for your operating system.
 | |
| 
 | |
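       The five conventions can be summarized in code. The following  is  a
       sketch of the recognition rule as described in this section, not the
       library's implementation. The enum, the function newline_length(),
       and the choice to represent the subject as an array of code points
       are all assumptions made for the illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names for the five build-time conventions. */
enum newline_kind { NL_CR, NL_LF, NL_CRLF, NL_ANYCRLF, NL_ANY };

/* Returns the number of characters in the newline that starts at s[pos]
   under the given convention, or 0 if no newline starts there. */
static size_t newline_length(const uint32_t *s, size_t len, size_t pos,
                             enum newline_kind kind)
{
    uint32_t c;
    int crlf;

    if (pos >= len) return 0;
    c = s[pos];
    crlf = (c == 0x0d && pos + 1 < len && s[pos + 1] == 0x0a);

    switch (kind) {
    case NL_CR:   return c == 0x0d ? 1 : 0;
    case NL_LF:   return c == 0x0a ? 1 : 0;
    case NL_CRLF: return crlf ? 2 : 0;
    case NL_ANYCRLF:
        if (crlf) return 2;
        return (c == 0x0d || c == 0x0a) ? 1 : 0;
    case NL_ANY:
        if (crlf) return 2;
        switch (c) {
        case 0x0a: case 0x0b: case 0x0c: case 0x0d:   /* LF VT FF CR */
        case 0x85: case 0x2028: case 0x2029:          /* NEL LS PS */
            return 1;
        default:
            return 0;
        }
    }
    return 0;
}
```

       Note that a lone CR is not a line ending under  the  CRLF  convention,
       and that CRLF counts as a single two-character newline wherever it is
       recognized at all.
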
| 
 | |
| WHAT \R MATCHES
 | |
| 
 | |
|        By  default,  the  sequence \R in a pattern matches any Unicode newline
 | |
|        sequence, independently of what has been selected as  the  line  ending
 | |
|        sequence. If you specify
 | |
| 
 | |
|          --enable-bsr-anycrlf
 | |
| 
 | |
|        the  default  is changed so that \R matches only CR, LF, or CRLF. What-
 | |
|        ever is selected when PCRE2 is built can be overridden by  applications
 | |
|        that use the library.
 | |
| 
 | |
| 
 | |
| HANDLING VERY LARGE PATTERNS
 | |
| 
 | |
|        Within  a  compiled  pattern,  offset values are used to point from one
 | |
|        part to another (for example, from an opening parenthesis to an  alter-
 | |
|        nation  metacharacter).  By default, in the 8-bit and 16-bit libraries,
 | |
|        two-byte values are used for these offsets, leading to a  maximum  size
 | |
|        for  a compiled pattern of around 64K code units. This is sufficient to
 | |
|        handle all but the most gigantic patterns. Nevertheless, some people do
 | |
|        want  to  process truly enormous patterns, so it is possible to compile
 | |
|        PCRE2 to use three-byte or four-byte offsets by adding a  setting  such
 | |
|        as
 | |
| 
 | |
|          --with-link-size=3
 | |
| 
 | |
|        to  the  configure command. The value given must be 2, 3, or 4. For the
 | |
|        16-bit library, a value of 3 is rounded up to 4.  In  these  libraries,
 | |
|        using  longer  offsets slows down the operation of PCRE2 because it has
 | |
|        to load additional data when handling them. For the 32-bit library  the
 | |
|        value  is  always 4 and cannot be overridden; the value of --with-link-
 | |
|        size is ignored.
 | |
| 
 | |
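       The rules for how a requested link size is applied  in  each  library
       can be written down as a small function. This is a sketch that
       restates the paragraph above; effective_link_size() is a hypothetical
       name and does not exist in the PCRE2 sources.

```c
/* Returns the link size in bytes that is actually used by the library of
   the given code unit width (8, 16 or 32) when --with-link-size is set to
   `requested`. Returns 0 if either argument is not a permitted value. */
static int effective_link_size(int width, int requested)
{
    if (requested < 2 || requested > 4) return 0;

    switch (width) {
    case 8:  return requested;                       /* used as given */
    case 16: return requested == 3 ? 4 : requested;  /* 3 is rounded up */
    case 32: return 4;                               /* setting ignored */
    default: return 0;
    }
}
```

       A build configured with --with-link-size=3 that includes  all  three
       libraries therefore uses 3-byte offsets in the 8-bit library and
       4-byte offsets in the other two.
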
| 
 | |
| AVOIDING EXCESSIVE STACK USAGE
 | |
| 
 | |
|        When matching with the pcre2_match() function, PCRE2  implements  back-
 | |
|        tracking  by  making  recursive  calls  to  an internal function called
 | |
|        match(). In environments where the size of the stack is  limited,  this
 | |
|        can  severely  limit  PCRE2's operation. (The Unix environment does not
 | |
|        usually suffer from this problem, but it may sometimes be necessary  to
 | |
|        increase  the  maximum  stack  size.  There  is  a  discussion  in  the
 | |
|        pcre2stack documentation.) An alternative approach  to  recursion  that
 | |
|        uses  memory from the heap to remember data, instead of using recursive
 | |
|        function calls, has been implemented to work round the problem of  lim-
 | |
|        ited  stack  size.  If  you want to build a version of PCRE2 that works
 | |
|        this way, add
 | |
| 
 | |
|          --disable-stack-for-recursion
 | |
| 
 | |
|        to the configure command. By default, the system functions malloc() and
 | |
|        free()  are called to manage the heap memory that is required, but cus-
 | |
|        tom memory management functions  can  be  called  instead.  PCRE2  runs
 | |
|        noticeably more slowly when built in this way. This option affects only
 | |
|        the pcre2_match() function; it is not relevant for pcre2_dfa_match().
 | |
| 
 | |
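       The difference between the two approaches can be illustrated with  a
       toy matcher. The sketch below is not PCRE2's match()  function;  it
       handles a made-up pattern language in which "*" matches any sequence
       of characters and every other character matches itself. The first
       function backtracks by recursive calls; the second remembers  its
       backtracking points in a block of heap memory instead. All the names
       are hypothetical.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Backtracking by recursion: each alternative costs a stack frame. */
static bool match_recursive(const char *p, const char *s)
{
    if (*p == '\0') return *s == '\0';
    if (*p == '*')
        return match_recursive(p + 1, s) ||
               (*s != '\0' && match_recursive(p, s + 1));
    return *s == *p && match_recursive(p + 1, s + 1);
}

/* A saved backtracking point: where to resume in pattern and subject. */
typedef struct { const char *p, *s; } frame;

/* The same search, with saved points held on the heap. */
static bool match_heap(const char *p, const char *s)
{
    size_t cap = 16, top = 0;
    bool matched = false;
    frame *stack = malloc(cap * sizeof *stack);

    if (stack == NULL) return false;
    stack[top++] = (frame){ p, s };

    while (top > 0 && !matched) {
        frame f = stack[--top];
        for (;;) {
            if (*f.p == '\0') { matched = (*f.s == '\0'); break; }
            if (*f.p == '*') {
                if (*f.s != '\0') {          /* save "let * eat one more" */
                    if (top == cap) {
                        frame *bigger = realloc(stack, 2 * cap * sizeof *stack);
                        if (bigger == NULL) { free(stack); return false; }
                        stack = bigger;
                        cap *= 2;
                    }
                    stack[top++] = (frame){ f.p, f.s + 1 };
                }
                f.p++;                       /* first try: * matches nothing */
                continue;
            }
            if (*f.s != *f.p) break;         /* dead end: pop another point */
            f.p++;
            f.s++;
        }
    }
    free(stack);
    return matched;
}
```

       Both functions give the same answers. The second needs only  a  small,
       fixed amount of system stack however long the subject is, at the cost
       of explicit memory management, which is the trade-off described above.
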
| 
 | |
| LIMITING PCRE2 RESOURCE USAGE
 | |
| 
 | |
|        Internally, PCRE2 has a function called match(), which it calls repeat-
 | |
|        edly   (sometimes   recursively)  when  matching  a  pattern  with  the
 | |
|        pcre2_match() function. By controlling the maximum number of times this
 | |
|        function  may be called during a single matching operation, a limit can
 | |
|        be placed on the resources used by a single call to pcre2_match().  The
 | |
|        limit can be changed at run time, as described in the pcre2api documen-
 | |
|        tation. The default is 10 million, but this can be changed by adding  a
 | |
|        setting such as
 | |
| 
 | |
|          --with-match-limit=500000
 | |
| 
 | |
|        to   the   configure  command.  This  setting  has  no  effect  on  the
 | |
|        pcre2_dfa_match() matching function.
 | |
| 
 | |
|        In some environments it is desirable to limit the  depth  of  recursive
 | |
|        calls of match() more strictly than the total number of calls, in order
 | |
|        to restrict the maximum amount of stack (or heap,  if  --disable-stack-
 | |
|        for-recursion is specified) that is used. A second limit controls this;
 | |
|        it defaults to the value that  is  set  for  --with-match-limit,  which
 | |
|        imposes  no  additional constraints. However, you can set a lower limit
 | |
|        by adding, for example,
 | |
| 
 | |
|          --with-match-limit-recursion=10000
 | |
| 
 | |
|        to the configure command. This value can  also  be  overridden  at  run
 | |
|        time.
 | |
| 
 | |
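       The effect of the two limits can be shown with a toy recursive
       matcher. The sketch below is not PCRE2's match() function and  does
       not use the library's error codes. It handles a made-up  pattern
       language in which "*" matches any sequence of characters, counts
       every call against one limit, and checks the nesting depth  against
       the other. All names and return values are hypothetical.

```c
/* Hypothetical result codes for this sketch. */
enum {
    TOY_NOMATCH = 0,
    TOY_MATCH = 1,
    TOY_ERR_MATCHLIMIT = -1,   /* too many calls in total */
    TOY_ERR_DEPTHLIMIT = -2    /* calls nested too deeply */
};

typedef struct {
    unsigned long match_limit;  /* maximum total number of calls */
    unsigned long depth_limit;  /* maximum nesting depth */
    unsigned long calls;        /* running total, start at zero */
} toy_limits;

/* Call with depth 1 for the outermost invocation. */
static int toy_match(const char *p, const char *s, unsigned long depth,
                     toy_limits *lim)
{
    int rc;

    if (++lim->calls > lim->match_limit) return TOY_ERR_MATCHLIMIT;
    if (depth > lim->depth_limit) return TOY_ERR_DEPTHLIMIT;

    if (*p == '\0') return *s == '\0' ? TOY_MATCH : TOY_NOMATCH;
    if (*p == '*') {
        rc = toy_match(p + 1, s, depth + 1, lim);
        if (rc != TOY_NOMATCH) return rc;   /* a match, or an error */
        if (*s == '\0') return TOY_NOMATCH;
        return toy_match(p, s + 1, depth + 1, lim);
    }
    if (*s != *p) return TOY_NOMATCH;
    return toy_match(p + 1, s + 1, depth + 1, lim);
}
```

       The total-calls limit bounds the time spent on a match, whereas  the
       depth limit bounds the memory in use at any one moment. Setting both
       to the same value, as is the default, means that the depth limit can
       never be reached before the total-calls limit.
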
| 
 | |
| CREATING CHARACTER TABLES AT BUILD TIME
 | |
| 
 | |
|        PCRE2 uses fixed tables for processing characters whose code points are
 | |
|        less than 256. By default, PCRE2 is built with a set of tables that are
 | |
|        distributed  in  the file src/pcre2_chartables.c.dist. These tables are
 | |
|        for ASCII codes only. If you add
 | |
| 
 | |
|          --enable-rebuild-chartables
 | |
| 
 | |
|        to the configure command, the distributed tables are  no  longer  used.
 | |
|        Instead,  a  program  called dftables is compiled and run. This outputs
 | |
|        the source for a new set of tables, created in the default locale of your
 | |
|        C  run-time  system. (This method of replacing the tables does not work
 | |
|        if you are cross compiling, because dftables is run on the local  host.
 | |
|        If you need to create alternative tables when cross compiling, you will
 | |
|        have to do so "by hand".)
 | |
| 
 | |
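       The idea of locale-dependent tables can be illustrated as follows.
       This is a sketch in the spirit of the description above, not the code
       of dftables: it builds only two 256-entry tables with the standard
       <ctype.h> functions, and the function and table names are
       hypothetical.

```c
#include <ctype.h>

/* Fills two 256-entry tables using the current C locale:
   lower[c] is the lower-case form of c, and flip[c] is c with its case
   reversed. Characters without case map to themselves. */
static void build_case_tables(unsigned char lower[256],
                              unsigned char flip[256])
{
    int c;

    for (c = 0; c < 256; c++) {
        lower[c] = (unsigned char)tolower(c);
        flip[c] = (unsigned char)(islower(c) ? toupper(c) : tolower(c));
    }
}
```

       Because the <ctype.h> functions consult the locale in force when they
       are called, running such code on the build host captures  the  build
       host's locale, which is why the approach does not  suit  cross
       compilation.
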
| 
 | |
| USING EBCDIC CODE
 | |
| 
 | |
|        PCRE2 assumes by default that it will run in an environment  where  the
 | |
|        character  code is ASCII or Unicode, which is a superset of ASCII. This
 | |
|        is the case for most computer operating systems. PCRE2 can, however, be
 | |
|        compiled to run in an 8-bit EBCDIC environment by adding
 | |
| 
 | |
|          --enable-ebcdic --disable-unicode
 | |
| 
 | |
|        to the configure command. This setting implies --enable-rebuild-charta-
 | |
|        bles. You should only use it if you know that  you  are  in  an  EBCDIC
 | |
|        environment (for example, an IBM mainframe operating system).
 | |
| 
 | |
|        It  is  not possible to support both EBCDIC and UTF-8 codes in the same
 | |
|        version of the library. Consequently,  --enable-unicode  and  --enable-
 | |
|        ebcdic are mutually exclusive.
 | |
| 
 | |
|        The EBCDIC character that corresponds to an ASCII LF is assumed to have
 | |
|        the value 0x15 by default. However, in some EBCDIC  environments,  0x25
 | |
|        is used. In such an environment you should use
 | |
| 
 | |
|          --enable-ebcdic-nl25
 | |
| 
 | |
|        as well as, or instead of, --enable-ebcdic. The EBCDIC character for CR
 | |
|        has the same value as in ASCII, namely, 0x0d.  Whichever  of  0x15  and
 | |
|        0x25 is not chosen as LF is made to correspond to the Unicode NEL char-
 | |
|        acter (which, in Unicode, is 0x85).
 | |
| 
 | |
|        The options that select newline behaviour, such as --enable-newline-is-
 | |
|        cr, and equivalent run-time options, refer to these character values in
 | |
|        an EBCDIC environment.
 | |
| 
 | |
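       The assignment of EBCDIC values to the newline characters  can  be
       restated in code. The sketch below encodes only what is said in this
       section; the structure and function names are hypothetical.

```c
#include <stdbool.h>

/* EBCDIC code values used for the three newline-related characters. */
typedef struct { unsigned char cr, lf, nel; } ebcdic_newlines;

/* `nl25` is true when --enable-ebcdic-nl25 was given. */
static ebcdic_newlines ebcdic_newline_values(bool nl25)
{
    ebcdic_newlines v;

    v.cr = 0x0d;                  /* same value as in ASCII */
    v.lf = nl25 ? 0x25 : 0x15;    /* the value chosen for LF */
    v.nel = nl25 ? 0x15 : 0x25;   /* the other value becomes NEL */
    return v;
}
```

       Whichever option is chosen, 0x15 and 0x25 are both treated as newline
       characters of some kind; the option only decides which one plays the
       part of LF.
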
| 
 | |
| PCRE2GREP OPTIONS FOR COMPRESSED FILE SUPPORT
 | |
| 
 | |
|        By default, pcre2grep reads all files as plain text. You can  build  it
 | |
|        so  that  it recognizes files whose names end in .gz or .bz2, and reads
 | |
|        them with libz or libbz2, respectively, by adding one or both of
 | |
| 
 | |
|          --enable-pcre2grep-libz
 | |
|          --enable-pcre2grep-libbz2
 | |
| 
 | |
|        to the configure command. These options naturally require that the rel-
 | |
|        evant  libraries  are installed on your system. Configuration will fail
 | |
|        if they are not.
 | |
| 
 | |
| 
 | |
| PCRE2GREP BUFFER SIZE
 | |
| 
 | |
|        pcre2grep uses an internal buffer to hold a "window" on the file it  is
 | |
|        scanning, in order to be able to output "before" and "after" lines when
 | |
|        it finds a match. The size of the buffer is controlled by  a  parameter
 | |
|        whose default value is 20K. The buffer itself is three times this size,
 | |
|        but because of the way it is used for holding "before" lines, the long-
 | |
|        est  line  that  is guaranteed to be processable is the parameter size.
 | |
|        You can change the default parameter value by adding, for example,
 | |
| 
 | |
|          --with-pcre2grep-bufsize=50K
 | |
| 
 | |
|        to the configure command. The caller of  pcre2grep  can  override  this
 | |
|        value by using --buffer-size on the command line.
 | |
| 
 | |
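       The relationship between the parameter and the memory actually  used
       is simple arithmetic, shown in the sketch below. It assumes that "K"
       means 1024 bytes; the macro and function names are hypothetical  and
       are not taken from the pcre2grep source.

```c
#include <stddef.h>

/* Default parameter value, assuming that "20K" means 20 * 1024 bytes. */
#define DEFAULT_BUFSIZE_PARAM ((size_t)20 * 1024)

/* The buffer itself is three times the size of the parameter. */
static size_t grep_total_buffer(size_t param)
{
    return 3 * param;
}

/* Only lines no longer than the parameter are guaranteed to be
   processable; longer lines may or may not fit. */
static int line_is_guaranteed(size_t line_length, size_t param)
{
    return line_length <= param;
}
```

       Under that assumption, the default setting allocates 61440 bytes and
       guarantees lines of up to 20480 bytes, while the 50K example above
       allocates 153600 bytes.
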
| 
 | |
| PCRE2TEST OPTION FOR LIBREADLINE SUPPORT
 | |
| 
 | |
|        If you add one of
 | |
| 
 | |
|          --enable-pcre2test-libreadline
 | |
|          --enable-pcre2test-libedit
 | |
| 
 | |
|        to  the  configure  command,  pcre2test  is linked with the libreadline
 | |
|        or libedit library, respectively, and when its input is from a terminal,
 | |
|        it  reads  it using the readline() function. This provides line-editing
 | |
|        and history facilities. Note that libreadline is  GPL-licensed,  so  if
 | |
|        you  distribute  a binary of pcre2test linked in this way, there may be
 | |
|        licensing issues. These can be avoided by linking instead with libedit,
 | |
|        which has a BSD licence.
 | |
| 
 | |
|        Setting  --enable-pcre2test-libreadline causes the -lreadline option to
 | |
|        be added to the pcre2test build. In many operating environments with  a
 | |
|        system-installed readline  library this is sufficient. However, in some
 | |
|        environments (e.g. if an unmodified distribution version of readline is
 | |
|        in  use),  some  extra configuration may be necessary. The INSTALL file
 | |
|        for libreadline says this:
 | |
| 
 | |
|          "Readline uses the termcap functions, but does not link with
 | |
|          the termcap or curses library itself, allowing applications
 | |
|          which link with readline the to choose an appropriate library."
 | |
| 
 | |
|        If your environment has not been set up so that an appropriate  library
 | |
|        is automatically included, you may need to add something like
 | |
| 
 | |
|          LIBS="-lncurses"
 | |
| 
 | |
|        immediately before the configure command.
 | |
| 
 | |
| 
 | |
| INCLUDING DEBUGGING CODE
 | |
| 
 | |
|        If you add
 | |
| 
 | |
|          --enable-debug
 | |
| 
 | |
|        to  the configure command, additional debugging code is included in the
 | |
|        build. This feature is intended for use by the PCRE2 maintainers.
 | |
| 
 | |
| 
 | |
| DEBUGGING WITH VALGRIND SUPPORT
 | |
| 
 | |
|        If you add
 | |
| 
 | |
|          --enable-valgrind
 | |
| 
 | |
|        to the configure command, PCRE2 will use valgrind annotations  to  mark
 | |
|        certain  memory  regions  as  unaddressable.  This  allows it to detect
 | |
|        invalid memory accesses, and  is  mostly  useful  for  debugging  PCRE2
 | |
|        itself.
 | |
| 
 | |
| 
 | |
| CODE COVERAGE REPORTING
 | |
| 
 | |
|        If  your  C  compiler is gcc, you can build a version of PCRE2 that can
 | |
|        generate a code coverage report for its test suite. To enable this, you
 | |
|        must install lcov version 1.6 or above. Then specify
 | |
| 
 | |
|          --enable-coverage
 | |
| 
 | |
|        to the configure command and build PCRE2 in the usual way.
 | |
| 
 | |
|        Note that using ccache (a caching C compiler) is incompatible with code
 | |
|        coverage reporting. If you have configured ccache to run  automatically
 | |
|        on your system, you must set the environment variable
 | |
| 
 | |
|          CCACHE_DISABLE=1
 | |
| 
 | |
|        before running make to build PCRE2, so that ccache is not used.
 | |
| 
 | |
|        When  --enable-coverage  is  used, the following additional targets are
 | |
|        added to the Makefile:
 | |
| 
 | |
|          make coverage
 | |
| 
 | |
|        This creates a fresh coverage report for the PCRE2 test  suite.  It  is
 | |
|        equivalent  to running "make coverage-reset", "make coverage-baseline",
 | |
|        "make check", and then "make coverage-report".
 | |
| 
 | |
|          make coverage-reset
 | |
| 
 | |
|        This zeroes the coverage counters, but does nothing else.
 | |
| 
 | |
|          make coverage-baseline
 | |
| 
 | |
|        This captures baseline coverage information.
 | |
| 
 | |
|          make coverage-report
 | |
| 
 | |
|        This creates the coverage report.
 | |
| 
 | |
|          make coverage-clean-report
 | |
| 
 | |
|        This removes the generated coverage report without cleaning the  cover-
 | |
|        age data itself.
 | |
| 
 | |
|          make coverage-clean-data
 | |
| 
 | |
|        This  removes  the captured coverage data without removing the coverage
 | |
|        files created at compile time (*.gcno).
 | |
| 
 | |
|          make coverage-clean
 | |
| 
 | |
|        This cleans all coverage data including the generated coverage  report.
 | |
|        For  more  information about code coverage, see the gcov and lcov docu-
 | |
|        mentation.
 | |
| 
 | |
| 
 | |
| SEE ALSO
 | |
| 
 | |
|        pcre2api(3), pcre2-config(3).
 | |
| 
 | |
| 
 | |
| AUTHOR
 | |
| 
 | |
|        Philip Hazel
 | |
|        University Computing Service
 | |
|        Cambridge, England.
 | |
| 
 | |
| 
 | |
| REVISION
 | |
| 
 | |
|        Last updated: 24 April 2015
 | |
|        Copyright (c) 1997-2015 University of Cambridge.
 | |
| ------------------------------------------------------------------------------
 | |
| 
 | |
| 
 | |
| PCRE2CALLOUT(3)            Library Functions Manual            PCRE2CALLOUT(3)
 | |
| 
 | |
| 
 | |
| 
 | |
| NAME
 | |
|        PCRE2 - Perl-compatible regular expressions (revised API)
 | |
| 
 | |
| SYNOPSIS
 | |
| 
 | |
|        #include <pcre2.h>
 | |
| 
 | |
|        int (*pcre2_callout)(pcre2_callout_block *, void *);
 | |
| 
 | |
|        int pcre2_callout_enumerate(const pcre2_code *code,
 | |
|          int (*callback)(pcre2_callout_enumerate_block *, void *),
 | |
|          void *user_data);
 | |
| 
 | |
| 
 | |
| DESCRIPTION
 | |
| 
 | |
|        PCRE2  provides  a feature called "callout", which is a means of tempo-
 | |
|        rarily passing control to the caller of PCRE2 in the middle of  pattern
 | |
|        matching.  The caller of PCRE2 provides an external function by putting
 | |
|        its entry point in a match  context  (see  pcre2_set_callout()  in  the
 | |
|        pcre2api documentation).
 | |
| 
 | |
|        Within  a  regular expression, (?C<arg>) indicates a point at which the
 | |
|        external function is to be called.  Different  callout  points  can  be
 | |
|        identified  by  putting  a number less than 256 after the letter C. The
 | |
|        default value is zero.  Alternatively, the argument may be a  delimited
 | |
|        string.  The  starting delimiter must be one of ` ' " ^ % # $ { and the
 | |
|        ending delimiter is the same as the start, except for {, where the end-
 | |
|        ing  delimiter  is  }.  If  the  ending  delimiter is needed within the
 | |
|        string, it must be doubled. For example, this pattern has  two  callout
 | |
|        points:
 | |
| 
 | |
|          (?C1)abc(?C"some ""arbitrary"" text")def
 | |
| 
 | |
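       The delimiter rules for a string argument can be made  concrete  with
       a small decoder. The sketch below follows the rules stated above; it
       is not the parser used by pcre2_compile(), and the function name
       decode_callout_string() is hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Decodes the string argument of (?C...), where arg[0] is the opening
   delimiter. Writes the decoded text to `out` as a C string and returns
   its length, or (size_t)-1 for an invalid delimiter, a missing ending
   delimiter, or insufficient space in `out`. */
static size_t decode_callout_string(const char *arg, char *out,
                                    size_t outsize)
{
    char open = arg[0];
    char close;
    size_t n = 0;
    const char *p;

    if (open == '\0' || strchr("`'\"^%#${", open) == NULL)
        return (size_t)-1;
    close = (open == '{') ? '}' : open;

    for (p = arg + 1; *p != '\0'; p++) {
        if (*p == close) {
            if (p[1] != close) {        /* single delimiter: end of string */
                if (n >= outsize) return (size_t)-1;
                out[n] = '\0';
                return n;
            }
            p++;                        /* doubled: one literal delimiter */
        }
        if (n + 1 >= outsize) return (size_t)-1;
        out[n++] = *p;
    }
    return (size_t)-1;                  /* ran off the end of the pattern */
}
```

       Applied to the second callout in the pattern above, the  decoder
       yields the 21-character string: some "arbitrary" text
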
|        If the PCRE2_AUTO_CALLOUT option bit is set when a pattern is compiled,
 | |
|        PCRE2 automatically inserts callouts, all with number 255, before  each
 | |
|        item  in  the  pattern. For example, if PCRE2_AUTO_CALLOUT is used with
 | |
|        the pattern
 | |
| 
 | |
|          A(\d{2}|--)
 | |
| 
 | |
|        it is processed as if it were
 | |
| 
 | |
|        (?C255)A(?C255)((?C255)\d{2}(?C255)|(?C255)-(?C255)-(?C255))(?C255)
 | |
| 
 | |
|        Notice that there is a callout before and after  each  parenthesis  and
 | |
|        alternation bar. If the pattern contains a conditional group whose con-
 | |
|        dition is an assertion, an automatic callout  is  inserted  immediately
 | |
|        before  the  condition. Such a callout may also be inserted explicitly,
 | |
|        for example:
 | |
| 
 | |
|          (?(?C9)(?=a)ab|de)  (?(?C%text%)(?!=d)ab|de)
 | |
| 
 | |
|        This applies only to assertion conditions (because they are  themselves
 | |
|        independent groups).
 | |
| 
 | |
|        Callouts  can  be useful for tracking the progress of pattern matching.
 | |
|        The pcre2test program has a pattern qualifier (/auto_callout) that sets
 | |
|        automatic  callouts.   When  any  callouts are present, the output from
 | |
|        pcre2test indicates how the pattern is being matched.  This  is  useful
 | |
|        information  when  you are trying to optimize the performance of a par-
 | |
|        ticular pattern.
 | |
| 
 | |
| 
 | |
| MISSING CALLOUTS
 | |
| 
 | |
|        You should be aware that, because of optimizations  in  the  way  PCRE2
 | |
|        compiles and matches patterns, callouts sometimes do not happen exactly
 | |
|        as you might expect.
 | |
| 
 | |
|    Auto-possessification
 | |
| 
 | |
|        At compile time, PCRE2 "auto-possessifies" repeated items when it knows
 | |
|        that  what follows cannot be part of the repeat. For example, a+[bc] is
 | |
|        compiled as if it were a++[bc]. The pcre2test output when this  pattern
 | |
|        is compiled with PCRE2_ANCHORED and PCRE2_AUTO_CALLOUT and then applied
 | |
|        to the string "aaaa" is:
 | |
| 
 | |
|          --->aaaa
 | |
|           +0 ^        a+
 | |
|           +2 ^   ^    [bc]
 | |
|          No match
 | |
| 
 | |
|        This indicates that when matching [bc] fails, there is no  backtracking
 | |
|        into  a+  and  therefore the callouts that would be taken for the back-
 | |
|        tracks do not occur.  You can disable the  auto-possessify  feature  by
 | |
|        passing  PCRE2_NO_AUTO_POSSESS to pcre2_compile(), or starting the pat-
 | |
|        tern with (*NO_AUTO_POSSESS). In this case, the output changes to this:
 | |
| 
 | |
|          --->aaaa
 | |
|           +0 ^        a+
 | |
|           +2 ^   ^    [bc]
 | |
|           +2 ^  ^     [bc]
 | |
|           +2 ^ ^      [bc]
 | |
|           +2 ^^       [bc]
 | |
|          No match
 | |
| 
 | |
|        This time, when matching [bc] fails, the matcher backtracks into a+ and
 | |
|        tries again, repeatedly, until a+ itself fails.
 | |
| 
 | |
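       The two traces can be reproduced by a small model. The sketch  below
       is hard-wired to the anchored pattern a+[bc] and simply counts  how
       often [bc] is tried, which corresponds to the number of "+2" lines in
       the pcre2test output. It is an illustration of the backtracking
       behaviour only; followup_attempts() is a hypothetical name.

```c
#include <stdbool.h>
#include <stddef.h>

/* Models matching the anchored pattern a+[bc] against `subject`.
   Returns how many times [bc] is attempted and sets *matched.
   If `possessive` is true, a+ never gives back what it has matched. */
static size_t followup_attempts(const char *subject, bool possessive,
                                bool *matched)
{
    size_t n = 0, attempts = 0, k;

    *matched = false;
    while (subject[n] == 'a') n++;       /* greedy a+ takes every 'a' */

    for (k = n; k >= 1; k--) {           /* a+ must keep at least one 'a' */
        attempts++;
        if (subject[k] == 'b' || subject[k] == 'c') {
            *matched = true;
            break;
        }
        if (possessive) break;           /* no backtracking into a+ */
    }
    return attempts;
}
```

       For the subject "aaaa" the model makes one attempt at [bc] in  the
       possessive case and four otherwise, in line with the two traces.
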
|    Automatic .* anchoring
 | |
| 
 | |
|        By default, an optimization is applied when .* is the first significant
 | |
|        item in a pattern. If PCRE2_DOTALL is set, so that the  dot  can  match
 | |
|        any  character,  the pattern is automatically anchored. If PCRE2_DOTALL
 | |
|        is not set, a match can start only after an internal newline or at  the
 | |
|        beginning  of  the  subject,  and  pcre2_compile() remembers this. This
 | |
|        optimization is disabled, however, if .* is in an atomic  group  or  if
 | |
|        there  is  a back reference to the capturing group in which it appears.
 | |
|        It is also disabled if the pattern contains (*PRUNE) or  (*SKIP).  How-
 | |
|        ever, the presence of callouts does not affect it.
 | |
| 
 | |
|        For  example,  if  the pattern .*\d is compiled with PCRE2_AUTO_CALLOUT
 | |
|        and applied to the string "aa", the pcre2test output is:
 | |
| 
 | |
|          --->aa
 | |
|           +0 ^      .*
 | |
|           +2 ^ ^    \d
 | |
|           +2 ^^     \d
 | |
|           +2 ^      \d
 | |
|          No match
 | |
| 
 | |
|        This shows that all match attempts start at the beginning of  the  sub-
 | |
|        ject.  In  other  words,  the pattern is anchored. You can disable this
 | |
|        optimization by passing PCRE2_NO_DOTSTAR_ANCHOR to pcre2_compile(),  or
 | |
|        starting  the pattern with (*NO_DOTSTAR_ANCHOR). In this case, the out-
 | |
|        put changes to:
 | |
| 
 | |
|          --->aa
 | |
|           +0 ^      .*
 | |
|           +2 ^ ^    \d
 | |
|           +2 ^^     \d
 | |
|           +2 ^      \d
 | |
|           +0  ^     .*
 | |
|           +2  ^^    \d
 | |
|           +2  ^     \d
 | |
|          No match
 | |
| 
 | |
|        This shows more match attempts, starting at the second subject  charac-
 | |
|        ter.   Another  optimization, described in the next section, means that
 | |
|        there is no subsequent attempt to match with an empty subject.
 | |
| 
 | |
|        If a pattern has more than one top-level  branch,  automatic  anchoring
 | |
|        occurs if all branches are anchorable.
 | |
| 
 | |
|    Other optimizations
 | |
| 
 | |
|        Other  optimizations  that  provide fast "no match" results also affect
 | |
|        callouts.  For example, if the pattern is
 | |
| 
 | |
|          ab(?C4)cd
 | |
| 
 | |
|        PCRE2 knows that any matching string must contain the  letter  "d".  If
 | |
|        the  subject  string  is  "abyz",  the  lack of "d" means that matching
 | |
|        doesn't ever start, and the callout is  never  reached.  However,  with
 | |
|        "abyd", though the result is still no match, the callout is obeyed.
 | |
| 
 | |
|        PCRE2  also  knows  the  minimum  length of a matching string, and will
 | |
|        immediately give a "no match" return without actually running  a  match
 | |
|        if  the  subject is not long enough, or, for unanchored patterns, if it
 | |
|        has been scanned far enough.
 | |
| 
 | |
|        You can disable these optimizations by passing the PCRE2_NO_START_OPTI-
 | |
|        MIZE  option  to  pcre2_compile(),  or  by  starting  the  pattern with
 | |
|        (*NO_START_OPT). This slows down the matching process, but does  ensure
 | |
|        that callouts such as the example above are obeyed.
 | |
| 
 | |
| 
 | |
| THE CALLOUT INTERFACE
 | |
| 
 | |
|        During  matching,  when  PCRE2  reaches a callout point, if an external
 | |
|        function is set in the match context, it is  called.  This  applies  to
 | |
|        both  normal  and DFA matching. The first argument to the callout func-
 | |
|        tion is a pointer to a pcre2_callout block. The second argument is  the
 | |
|        void  *  callout  data that was supplied when the callout was set up by
 | |
|        calling pcre2_set_callout() (see the pcre2api documentation). The call-
 | |
|        out block structure contains the following fields:
 | |
| 
 | |
|          uint32_t      version;
 | |
|          uint32_t      callout_number;
 | |
|          uint32_t      capture_top;
 | |
|          uint32_t      capture_last;
 | |
|          PCRE2_SIZE   *offset_vector;
 | |
|          PCRE2_SPTR    mark;
 | |
|          PCRE2_SPTR    subject;
 | |
|          PCRE2_SIZE    subject_length;
 | |
|          PCRE2_SIZE    start_match;
 | |
|          PCRE2_SIZE    current_position;
 | |
|          PCRE2_SIZE    pattern_position;
 | |
|          PCRE2_SIZE    next_item_length;
 | |
|          PCRE2_SIZE    callout_string_offset;
 | |
|          PCRE2_SIZE    callout_string_length;
 | |
|          PCRE2_SPTR    callout_string;
 | |
| 
 | |
|        The  version field contains the version number of the block format. The
 | |
|        current version is 1; the three callout string fields  were  added  for
 | |
|        this  version. If you are writing an application that might use an ear-
 | |
|        lier release of PCRE2, you  should  check  the  version  number  before
 | |
|        accessing  any  of  these  fields.  The version number will increase in
 | |
|        future if more fields are added, but the intention is never  to  remove
 | |
|        any of the existing fields.
 | |
| 
 | |
|    Fields for numerical callouts
 | |
| 
 | |
|        For  a  numerical  callout,  callout_string is NULL, and callout_number
 | |
|        contains the number of the callout, in the range  0-255.  This  is  the
 | |
|        number  that  follows  (?C for manual callouts; it is 255 for automati-
 | |
|        cally generated callouts.
 | |
| 
 | |
|    Fields for string callouts
 | |
| 
 | |
|        For callouts with string arguments, callout_number is always zero,  and
 | |
|        callout_string  points  to the string that is contained within the com-
 | |
|        piled pattern. Its length is given by callout_string_length. Duplicated
 | |
|        ending delimiters that were present in the original pattern string have
 | |
|        been turned into single characters, but there is no other processing of
 | |
|        the  callout string argument. An additional code unit containing binary
 | |
|        zero is present after the string, but is not included  in  the  length.
 | |
|        The  delimiter  that was used to start the string is also stored within
 | |
|        the pattern, immediately before the string itself. You can access  this
 | |
|        delimiter as callout_string[-1] if you need it.
 | |
| 
 | |
|        The callout_string_offset field is the code unit offset to the start of
 | |
|        the callout argument string within the original pattern string. This is
 | |
|        provided  for the benefit of applications such as script languages that
 | |
|        might need to report errors in the callout string within the pattern.
 | |
| 
 | |
|    Fields for all callouts
 | |
| 
 | |
|        The remaining fields in the callout block are the same for  both  kinds
 | |
|        of callout.
 | |
| 
 | |
|        The offset_vector field is a pointer to the vector of capturing offsets
 | |
|        (the "ovector") that was passed to the matching function in  the  match
 | |
|        data  block.  When pcre2_match() is used, the contents can be inspected
 | |
|        in order to extract substrings that have been matched so  far,  in  the
 | |
|        same  way as for extracting substrings after a match has completed. For
 | |
|        the DFA matching function, this field is not useful.
 | |
| 
 | |
|        The subject and subject_length fields contain copies of the values that
 | |
|        were passed to the matching function.
 | |
| 
 | |
|        The  start_match  field normally contains the offset within the subject
 | |
|        at which the current match attempt  started.  However,  if  the  escape
 | |
|        sequence  \K has been encountered, this value is changed to reflect the
 | |
|        modified starting point. If the pattern is not  anchored,  the  callout
 | |
|        function may be called several times from the same point in the pattern
 | |
|        for different starting points in the subject.
 | |
| 
 | |
|        The current_position field contains the offset within  the  subject  of
 | |
|        the current match pointer.
 | |
| 
 | |
|        When pcre2_match() is used, the capture_top field contains one more
 | |
|        than the number of the highest numbered captured substring so  far.  If
 | |
|        no substrings have been captured, the value of capture_top is one. This
 | |
|        is always the case when the DFA functions are used, because they do not
 | |
|        support captured substrings.
 | |
| 
 | |
|        The  capture_last  field  contains the number of the most recently cap-
 | |
|        tured substring. However, when a recursion exits, the value reverts  to
 | |
|        what  it  was  outside  the recursion, as do the values of all captured
 | |
|        substrings. If no substrings have been  captured,  the  value  of  cap-
 | |
|        ture_last is 0. This is always the case for the DFA matching functions.
 | |
| 
 | |
|        The pattern_position field contains the offset in the pattern string to
 | |
|        the next item to be matched.
 | |
| 
 | |
|        The next_item_length field contains the length of the next item  to  be
 | |
|        matched in the pattern string. When the callout immediately precedes an
 | |
|        alternation bar, a closing parenthesis, or the end of the pattern,  the
 | |
|        length  is  zero. When the callout precedes an opening parenthesis, the
 | |
|        length is that of the entire subpattern.
 | |
| 
 | |
|        The pattern_position and next_item_length fields are intended  to  help
 | |
|        in  distinguishing between different automatic callouts, which all have
 | |
|        the same callout number. However, they are set for  all  callouts,  and
 | |
|        are used by pcre2test to show the next item to be matched when display-
 | |
|        ing callout information.
 | |
| 
 | |
|        In callouts from pcre2_match() the mark field contains a pointer to the
 | |
|        zero-terminated  name of the most recently passed (*MARK), (*PRUNE), or
 | |
|        (*THEN) item in the match, or NULL if no such items have  been  passed.
 | |
|        Instances  of  (*PRUNE)  or  (*THEN) without a name do not obliterate a
 | |
|        previous (*MARK). In callouts from the DFA matching function this field
 | |
|        always contains NULL.
 | |
| 
 | |
| 
 | |
| RETURN VALUES FROM CALLOUTS
 | |
| 
 | |
|        The external callout function returns an integer to PCRE2. If the value
 | |
|        is zero, matching proceeds as normal. If  the  value  is  greater  than
 | |
|        zero,  matching  fails  at  the current point, but the testing of other
 | |
|        matching possibilities goes ahead, just as if a lookahead assertion had
 | |
|        failed. If the value is less than zero, the match is abandoned, and the
 | |
|        matching function returns the negative value.
 | |
| 
 | |
|        Negative  values  should  normally  be   chosen   from   the   set   of
 | |
|        PCRE2_ERROR_xxx  values.  In  particular,  PCRE2_ERROR_NOMATCH forces a
 | |
|        standard "no match" failure. The error  number  PCRE2_ERROR_CALLOUT  is
 | |
|        reserved  for  use by callout functions; it will never be used by PCRE2
 | |
|        itself.
 | |
| 
 | |
| 
 | |
| CALLOUT ENUMERATION
 | |
| 
 | |
|        int pcre2_callout_enumerate(const pcre2_code *code,
 | |
|          int (*callback)(pcre2_callout_enumerate_block *, void *),
 | |
|          void *user_data);
 | |
| 
 | |
|        A script language that supports the use of string arguments in callouts
 | |
|        might  like  to  scan  all the callouts in a pattern before running the
 | |
|        match. This can be done by calling pcre2_callout_enumerate(). The first
 | |
|        argument  is  a  pointer  to a compiled pattern, the second points to a
 | |
|        callback function, and the third is arbitrary user data.  The  callback
 | |
|        function  is  called  for  every callout in the pattern in the order in
 | |
|        which they appear. Its first argument is a pointer to a callout enumer-
 | |
|        ation  block,  and  its second argument is the user_data value that was
 | |
|        passed to pcre2_callout_enumerate(). The data block contains  the  fol-
 | |
|        lowing fields:
 | |
| 
 | |
|          version                Block version number
 | |
|          pattern_position       Offset to next item in pattern
 | |
|          next_item_length       Length of next item in pattern
 | |
|          callout_number         Number for numbered callouts
 | |
|          callout_string_offset  Offset to string within pattern
 | |
|          callout_string_length  Length of callout string
 | |
|          callout_string         Points to callout string or is NULL
 | |
| 
 | |
|        The  version  number is currently 0. It will increase if new fields are
 | |
|        ever added to the block. The remaining fields are  the  same  as  their
 | |
|        namesakes  in  the pcre2_callout block that is used for callouts during
 | |
|        matching, as described above.
 | |
| 
 | |
|        Note that the value of pattern_position is  unique  for  each  callout.
 | |
|        However,  if  a callout occurs inside a group that is quantified with a
 | |
|        non-zero minimum or a fixed maximum, the group is replicated inside the
 | |
|        compiled  pattern.  For example, a pattern such as /(a){2}/ is compiled
 | |
|        as if it were /(a)(a)/. This means that the callout will be  enumerated
 | |
|        more  than  once,  but with the same value for pattern_position in each
 | |
|        case.
 | |
| 
 | |
|        The callback function should normally return zero. If it returns a non-
 | |
|        zero value, scanning the pattern stops, and that value is returned from
 | |
|        pcre2_callout_enumerate().
 | |
| 
 | |
| 
 | |
| AUTHOR
 | |
| 
 | |
|        Philip Hazel
 | |
|        University Computing Service
 | |
|        Cambridge, England.
 | |
| 
 | |
| 
 | |
| REVISION
 | |
| 
 | |
|        Last updated: 23 March 2015
 | |
|        Copyright (c) 1997-2015 University of Cambridge.
 | |
| ------------------------------------------------------------------------------
 | |
| 
 | |
| 
 | |
| PCRE2COMPAT(3)             Library Functions Manual             PCRE2COMPAT(3)
 | |
| 
 | |
| 
 | |
| 
 | |
| NAME
 | |
|        PCRE2 - Perl-compatible regular expressions (revised API)
 | |
| 
 | |
| DIFFERENCES BETWEEN PCRE2 AND PERL
 | |
| 
 | |
|        This document describes the differences in the ways that PCRE2 and Perl
 | |
|        handle regular expressions. The differences  described  here  are  with
 | |
|        respect to Perl versions 5.10 and above.
 | |
| 
 | |
|        1.  PCRE2  has only a subset of Perl's Unicode support. Details of what
 | |
|        it does have are given in the pcre2unicode page.
 | |
| 
 | |
|        2. PCRE2 allows repeat quantifiers only  on  parenthesized  assertions,
 | |
|        but  they  do not mean what you might think. For example, (?!a){3} does
 | |
|        not assert that the next three characters are not "a". It just  asserts
 | |
|        that  the  next  character  is not "a" three times (in principle: PCRE2
 | |
|        optimizes this to run the assertion  just  once).  Perl  allows  repeat
 | |
|        quantifiers  on  other  assertions such as \b, but these do not seem to
 | |
|        have any use.
 | |
| 
 | |
|        3. Capturing subpatterns that occur inside  negative  lookahead  asser-
 | |
|        tions  are  counted,  but their entries in the offsets vector are never
 | |
|        set. Perl sometimes (but not always) sets its numerical variables  from
 | |
|        inside negative assertions.
 | |
| 
 | |
|        4.  The  following Perl escape sequences are not supported: \l, \u, \L,
 | |
|        \U, and \N when followed by a character name or Unicode value.  (\N  on
 | |
|        its own, matching a non-newline character, is supported.) In fact these
 | |
|        are implemented by Perl's general string-handling and are not  part  of
 | |
|        its  pattern matching engine. If any of these are encountered by PCRE2,
 | |
|        an error is generated by default. However, if the PCRE2_ALT_BSUX option
 | |
|        is set, \U and \u are interpreted as ECMAScript interprets them.
 | |
| 
 | |
|        5. The Perl escape sequences \p, \P, and \X are supported only if PCRE2
 | |
|        is built with Unicode support. The properties that can be  tested  with
 | |
|        \p and \P are limited to the general category properties such as Lu and
 | |
|        Nd, script names such as Greek or Han, and the derived  properties  Any
 | |
|        and L&. PCRE2 does support the Cs (surrogate) property, which Perl does
 | |
|        not; the Perl documentation says "Because Perl hides the need  for  the
 | |
|        user  to  understand the internal representation of Unicode characters,
 | |
|        there is no need to implement the  somewhat  messy  concept  of  surro-
 | |
|        gates."
 | |
| 
 | |
|        6.  PCRE2 does support the \Q...\E escape for quoting substrings. Char-
 | |
|        acters in between are treated as literals. This is  slightly  different
 | |
|        from  Perl  in  that  $  and  @ are also handled as literals inside the
 | |
|        quotes. In Perl, they cause variable interpolation (but of course PCRE2
 | |
|        does not have variables).  Note the following examples:
 | |
| 
 | |
|            Pattern            PCRE2 matches      Perl matches
 | |
| 
 | |
|            \Qabc$xyz\E        abc$xyz           abc followed by the
 | |
|                                                   contents of $xyz
 | |
|            \Qabc\$xyz\E       abc\$xyz          abc\$xyz
 | |
|            \Qabc\E\$\Qxyz\E   abc$xyz           abc$xyz
 | |
| 
 | |
|        The  \Q...\E  sequence  is recognized both inside and outside character
 | |
|        classes.
 | |
| 
 | |
|        7.  Fairly  obviously,  PCRE2  does  not  support  the  (?{code})   and
 | |
|        (??{code})  constructions. However, there is support for recursive pat-
 | |
|        terns. This is not available in Perl 5.8, but it is in Perl 5.10. Also,
 | |
|        the  PCRE2  "callout"  feature allows an external function to be called
 | |
|        during  pattern  matching.  See  the  pcre2callout  documentation   for
 | |
|        details.
 | |
| 
 | |
|        8.  Subroutine  calls  (whether recursive or not) are treated as atomic
 | |
|        groups.  Atomic recursion is like Python,  but  unlike  Perl.  Captured
 | |
|        values  that  are  set outside a subroutine call can be referenced from
 | |
|        inside in PCRE2, but not in Perl. There is a discussion  that  explains
 | |
|        these  differences  in  more detail in the section on recursion differ-
 | |
|        ences from Perl in the pcre2pattern page.
 | |
| 
 | |
|        9. If any of the backtracking control verbs are used  in  a  subpattern
 | |
|        that  is  called  as  a  subroutine (whether or not recursively), their
 | |
|        effect is confined to that subpattern; it does not extend to  the  sur-
 | |
|        rounding  pattern.  This is not always the case in Perl. In particular,
 | |
|        if (*THEN) is present in a group that is called as  a  subroutine,  its
 | |
|        action is limited to that group, even if the group does not contain any
 | |
|        | characters. Note that such subpatterns are processed as  anchored  at
 | |
|        the point where they are tested.
 | |
| 
 | |
|        10.  If a pattern contains more than one backtracking control verb, the
 | |
|        first one that is backtracked onto acts. For example,  in  the  pattern
 | |
|        A(*COMMIT)B(*PRUNE)C  a  failure in B triggers (*COMMIT), but a failure
 | |
|        in C triggers (*PRUNE). Perl's behaviour is more complex; in many cases
 | |
|        it is the same as PCRE2, but there are examples where it differs.
 | |
| 
 | |
|        11.  Most  backtracking  verbs in assertions have their normal actions.
 | |
|        They are not confined to the assertion.
 | |
| 
 | |
|        12. There are some differences that are concerned with the settings  of
 | |
|        captured  strings  when  part  of  a  pattern is repeated. For example,
 | |
|        matching "aba" against the  pattern  /^(a(b)?)+$/  in  Perl  leaves  $2
 | |
|        unset, but in PCRE2 it is set to "b".
 | |
| 
 | |
|        13. PCRE2's handling of duplicate subpattern numbers and duplicate sub-
 | |
|        pattern names is not as general as Perl's. This is a consequence of the
 | |
|        fact that PCRE2 works internally just with numbers, using  an  external
 | |
|        table to translate between numbers and names. In particular, a  pattern
 | |
|        such as (?|(?<a>A)|(?<b>B)), where the two capturing  parentheses  have
 | |
|        the same number but different names, is not supported,  and  causes  an
 | |
|        error  at compile time. If it were allowed, it would not be possible to
 | |
|        distinguish which parentheses matched, because both names map  to  cap-
 | |
|        turing subpattern number 1.
 | |
| 
 | |
|        14. Perl recognizes comments in some places that PCRE2  does  not,  for
 | |
|        example,  between  the  ( and ? at the start of a subpattern. If the /x
 | |
|        modifier is set, Perl allows white space between ( and ?  (though  cur-
 | |
|        rent  Perls warn that this is deprecated) but PCRE2 never does, even if
 | |
|        the PCRE2_EXTENDED option is set.
 | |
| 
 | |
|        15. Perl, when in warning mode, gives warnings  for  character  classes
 | |
|        such  as  [A-\d] or [a-[:digit:]]. It then treats the hyphens as liter-
 | |
|        als. PCRE2 has no warning features, so it gives an error in these cases
 | |
|        because they are almost certainly user mistakes.
 | |
| 
 | |
|        16.  In  PCRE2, the upper/lower case character properties Lu and Ll are
 | |
|        not affected when case-independent matching is specified. For  example,
 | |
|        \p{Lu} always matches an upper case letter. I think Perl has changed in
 | |
|        this respect; in the release at the time of writing (5.16), \p{Lu}  and
 | |
|        \p{Ll} match all letters, regardless of case, when case independence is
 | |
|        specified.
 | |
| 
 | |
|        17. PCRE2 provides some  extensions  to  the  Perl  regular  expression
 | |
|        facilities.   Perl  5.10  includes new features that are not in earlier
 | |
|        versions of Perl, some of which (such as named parentheses)  have  been
 | |
|        in PCRE2 for some time. This list is with respect to Perl 5.10:
 | |
| 
 | |
|        (a)  Although  lookbehind  assertions  in PCRE2 must match fixed length
 | |
|        strings, each alternative branch of a lookbehind assertion can match  a
 | |
|        different  length  of  string.  Perl requires them all to have the same
 | |
|        length.
 | |
| 
 | |
|        (b) If PCRE2_DOLLAR_ENDONLY is set and PCRE2_MULTILINE is not set,  the
 | |
|        $ meta-character matches only at the very end of the string.
 | |
| 
 | |
|        (c)  A  backslash  followed  by  a  letter  with  no special meaning is
 | |
|        faulted. (Perl can be made to issue a warning.)
 | |
| 
 | |
|        (d) If PCRE2_UNGREEDY is set, the greediness of the repetition  quanti-
 | |
|        fiers is inverted, that is, by default they are not greedy, but if fol-
 | |
|        lowed by a question mark they are.
 | |
| 
 | |
|        (e) PCRE2_ANCHORED can be used at matching time to force a  pattern  to
 | |
|        be tried only at the first matching position in the subject string.
 | |
| 
 | |
|        (f)      The      PCRE2_NOTBOL,      PCRE2_NOTEOL,      PCRE2_NOTEMPTY,
 | |
|        PCRE2_NOTEMPTY_ATSTART, and PCRE2_NO_AUTO_CAPTURE options have no  Perl
 | |
|        equivalents.
 | |
| 
 | |
|        (g)  The  \R escape sequence can be restricted to match only CR, LF, or
 | |
|        CRLF by the PCRE2_BSR_ANYCRLF option.
 | |
| 
 | |
|        (h) The callout facility is PCRE2-specific.
 | |
| 
 | |
|        (i) The partial matching facility is PCRE2-specific.
 | |
| 
 | |
|        (j) The alternative matching function (pcre2_dfa_match()) matches in  a
 | |
|        different way and is not Perl-compatible.
 | |
| 
 | |
|        (k)  PCRE2 recognizes some special sequences such as (*CR) at the start
 | |
|        of a pattern that set overall options that cannot be changed within the
 | |
|        pattern.
 | |
| 
 | |
| 
 | |
| AUTHOR
 | |
| 
 | |
|        Philip Hazel
 | |
|        University Computing Service
 | |
|        Cambridge, England.
 | |
| 
 | |
| 
 | |
| REVISION
 | |
| 
 | |
|        Last updated: 15 March 2015
 | |
|        Copyright (c) 1997-2015 University of Cambridge.
 | |
| ------------------------------------------------------------------------------
 | |
| 
 | |
| 
 | |
| PCRE2JIT(3)                Library Functions Manual                PCRE2JIT(3)
 | |
| 
 | |
| 
 | |
| 
 | |
| NAME
 | |
|        PCRE2 - Perl-compatible regular expressions (revised API)
 | |
| 
 | |
| PCRE2 JUST-IN-TIME COMPILER SUPPORT
 | |
| 
 | |
|        Just-in-time  compiling  is a heavyweight optimization that can greatly
 | |
|        speed up pattern matching. However, it comes at the cost of extra  pro-
 | |
|        cessing  before  the  match is performed, so it is of most benefit when
 | |
|        the same pattern is going to be matched many times. This does not  nec-
 | |
|        essarily  mean many calls of a matching function; if the pattern is not
 | |
|        anchored, matching attempts may take place many times at various  posi-
 | |
|        tions in the subject, even for a single call. Therefore, if the subject
 | |
|        string is very long, it may still pay  to  use  JIT  even  for  one-off
 | |
|        matches.  JIT  support  is  available  for all of the 8-bit, 16-bit and
 | |
|        32-bit PCRE2 libraries.
 | |
| 
 | |
|        JIT support applies only to the  traditional  Perl-compatible  matching
 | |
|        function.   It  does  not apply when the DFA matching function is being
 | |
|        used. The code for this support was written by Zoltan Herczeg.
 | |
| 
 | |
| 
 | |
| AVAILABILITY OF JIT SUPPORT
 | |
| 
 | |
|        JIT support is an optional feature of  PCRE2.  The  "configure"  option
 | |
|        --enable-jit  (or  equivalent  CMake  option) must be set when PCRE2 is
 | |
|        built if you want to use JIT. The support is limited to  the  following
 | |
|        hardware platforms:
 | |
| 
 | |
|          ARM 32-bit (v5, v7, and Thumb2)
 | |
|          ARM 64-bit
 | |
|          Intel x86 32-bit and 64-bit
 | |
|          MIPS 32-bit and 64-bit
 | |
|          Power PC 32-bit and 64-bit
 | |
|          SPARC 32-bit
 | |
| 
 | |
|        If --enable-jit is set on an unsupported platform, compilation fails.
 | |
| 
 | |
|        A  program  can  tell if JIT support is available by calling pcre2_con-
 | |
|        fig() with the PCRE2_CONFIG_JIT option. The result is  1  when  JIT  is
 | |
|        available,  and 0 otherwise. However, a simple program does not need to
 | |
|        check this in order to use JIT. The API is implemented in  a  way  that
 | |
|        falls  back  to the interpretive code if JIT is not available. For pro-
 | |
|        grams that need the best possible performance, there is  also  a  "fast
 | |
|        path" API that is JIT-specific.
 | |
| 
 | |
| 
 | |
| SIMPLE USE OF JIT
 | |
| 
 | |
|        To  make use of the JIT support in the simplest way, all you have to do
 | |
|        is to call pcre2_jit_compile() after successfully compiling  a  pattern
 | |
|        with pcre2_compile(). This function has two arguments: the first is the
 | |
|        compiled pattern pointer that was returned by pcre2_compile(), and  the
 | |
|        second  is  zero  or  more of the following option bits: PCRE2_JIT_COM-
 | |
|        PLETE, PCRE2_JIT_PARTIAL_HARD, or PCRE2_JIT_PARTIAL_SOFT.
 | |
| 
 | |
|        If JIT support is not available, a  call  to  pcre2_jit_compile()  does
 | |
|        nothing  and returns PCRE2_ERROR_JIT_BADOPTION. Otherwise, the compiled
 | |
|        pattern is passed to the JIT compiler, which turns it into machine code
 | |
|        that executes much faster than the normal interpretive code, but yields
 | |
|        exactly the same results. The returned value  from  pcre2_jit_compile()
 | |
|        is zero on success, or a negative error code.
 | |
| 
 | |
|        PCRE2_JIT_COMPLETE  requests the JIT compiler to generate code for com-
 | |
|        plete matches. If you want to run partial matches using the  PCRE2_PAR-
 | |
|        TIAL_HARD  or  PCRE2_PARTIAL_SOFT  options of pcre2_match(), you should
 | |
|        set one or both of the other  options  as  well  as,  or  instead  of,
 | |
|        PCRE2_JIT_COMPLETE. The JIT compiler generates different optimized code
 | |
|        for each of the three modes (normal, soft partial, hard partial).  When
 | |
|        pcre2_match()  is  called,  the appropriate code is run if it is avail-
 | |
|        able. Otherwise, the pattern is matched using interpretive code.
 | |
| 
 | |
|        You can call pcre2_jit_compile() multiple times for the  same  compiled
 | |
|        pattern.  It does nothing if it has previously compiled code for any of
 | |
|        the option bits. For example, you can call it once with  PCRE2_JIT_COM-
 | |
|        PLETE  and  (perhaps  later,  when  you find you need partial matching)
 | |
|        again with PCRE2_JIT_COMPLETE and PCRE2_JIT_PARTIAL_HARD. This time  it
 | |
|        will ignore PCRE2_JIT_COMPLETE and just compile code for partial match-
 | |
|        ing. If pcre2_jit_compile() is called with no option bits set, it imme-
 | |
|        diately returns zero. This is an alternative way of testing whether JIT
 | |
|        is available.
 | |
| 
 | |
|        At present, it is not possible to free JIT compiled  code  except  when
 | |
|        the entire compiled pattern is freed by calling pcre2_code_free().
 | |
| 
 | |
|        In  some circumstances you may need to call additional functions. These
 | |
|        are described in the  section  entitled  "Controlling  the  JIT  stack"
 | |
|        below.
 | |
| 
 | |
|        There are some pcre2_match() options that are not supported by JIT, and
 | |
|        there are also some pattern items that JIT cannot handle.  Details  are
 | |
|        given  below.  In  both cases, matching automatically falls back to the
 | |
|        interpretive code. If you want to know whether JIT  was  actually  used
 | |
|        for  a particular match, you should arrange for a JIT callback function
 | |
|        to be set up as described in the section entitled "Controlling the  JIT
 | |
|        stack"  below,  even  if  you  do  not need to supply a non-default JIT
 | |
|        stack. Such a callback function is called whenever JIT code is about to
 | |
|        be  obeyed.  If the match-time options are not right for JIT execution,
 | |
|        the callback function is not obeyed.
 | |
| 
 | |
|        If the JIT compiler finds an unsupported item, no JIT  data  is  gener-
 | |
|        ated.  You  can find out if JIT matching is available after compiling a
 | |
|        pattern by calling  pcre2_pattern_info()  with  the  PCRE2_INFO_JITSIZE
 | |
|        option.  A non-zero result means that JIT compilation was successful. A
 | |
|        result of 0 means that JIT support is not available, or the pattern was
 | |
|        not  processed by pcre2_jit_compile(), or the JIT compiler was not able
 | |
|        to handle the pattern.
 | |
| 
 | |
| 
 | |
| UNSUPPORTED OPTIONS AND PATTERN ITEMS
 | |
| 
 | |
|        The pcre2_match() options that  are  supported  for  JIT  matching  are
 | |
|        PCRE2_NOTBOL,   PCRE2_NOTEOL,  PCRE2_NOTEMPTY,  PCRE2_NOTEMPTY_ATSTART,
 | |
|        PCRE2_NO_UTF_CHECK,  PCRE2_PARTIAL_HARD,  and  PCRE2_PARTIAL_SOFT.  The
 | |
|        PCRE2_ANCHORED option is not supported at match time.
 | |
| 
 | |
|        The  only  unsupported  pattern items are \C (match a single data unit)
 | |
|        when running in a UTF mode, and a callout immediately before an  asser-
 | |
|        tion condition in a conditional group.
 | |
| 
 | |
| 
 | |
| RETURN VALUES FROM JIT MATCHING
 | |
| 
 | |
|        When a pattern is matched using JIT matching, the return values are the
 | |
|        same as those given by the interpretive pcre2_match()  code,  with  the
 | |
|        addition  of one new error code: PCRE2_ERROR_JIT_STACKLIMIT. This means
 | |
|        that the memory used for the JIT stack was insufficient. See  "Control-
 | |
|        ling the JIT stack" below for a discussion of JIT stack usage.
 | |
| 
 | |
|        The  error  code  PCRE2_ERROR_MATCHLIMIT is returned by the JIT code if
 | |
|        searching a very large pattern tree goes on for too long, as it  is  in
 | |
|        the  same circumstance when JIT is not used, but the details of exactly
 | |
|        what is counted are not the same. The PCRE2_ERROR_RECURSIONLIMIT  error
 | |
|        code is never returned when JIT matching is used.
 | |
| 
 | |
| 
 | |
| CONTROLLING THE JIT STACK
 | |
| 
 | |
|        When the compiled JIT code runs, it needs a block of memory to use as a
 | |
|        stack.  By default, it uses 32K on the  machine  stack.  However,  some
 | |
|        large   or   complicated  patterns  need  more  than  this.  The  error
 | |
|        PCRE2_ERROR_JIT_STACKLIMIT is given when there  is  not  enough  stack.
 | |
|        Three  functions  are provided for managing blocks of memory for use as
 | |
|        JIT stacks. There is further discussion about the use of JIT stacks  in
 | |
|        the section entitled "JIT stack FAQ" below.
 | |
| 
 | |
|        The  pcre2_jit_stack_create()  function  creates a JIT stack. Its argu-
 | |
|        ments are a starting size, a maximum size, and a general  context  (for
 | |
|        memory  allocation  functions, or NULL for standard memory allocation).
 | |
|        It returns a pointer to an opaque structure of type pcre2_jit_stack, or
 | |
|        NULL  if there is an error. The pcre2_jit_stack_free() function is used
 | |
|        to free a stack that is no longer needed. (For the technically  minded:
 | |
|        the address space is allocated by mmap or VirtualAlloc.)
 | |
| 
 | |
|        JIT  uses far less memory for recursion than the interpretive code, and
 | |
|        a maximum stack size of 512K to 1M should be more than enough  for  any
 | |
|        pattern.
 | |
| 
 | |
|        The  pcre2_jit_stack_assign()  function  specifies which stack JIT code
 | |
|        should use. Its arguments are as follows:
 | |
| 
 | |
|          pcre2_match_context  *mcontext
 | |
|          pcre2_jit_callback    callback
 | |
|          void                 *data
 | |
| 
 | |
|        The first argument is a pointer to a match context. When this is subse-
 | |
|        quently passed to a matching function, its information determines which
 | |
|        JIT stack is used. There are three cases for the values  of  the  other
 | |
|        two arguments:
 | |
| 
 | |
|          (1) If callback is NULL and data is NULL, an internal 32K block
 | |
|              on the machine stack is used. This is the default when a match
 | |
|              context is created.
 | |
| 
 | |
|          (2) If callback is NULL and data is not NULL, data must be
 | |
|              a pointer to a valid JIT stack, the result of calling
 | |
|              pcre2_jit_stack_create().
 | |
| 
 | |
|          (3) If callback is not NULL, it must point to a function that is
 | |
|              called with data as an argument at the start of matching, in
 | |
|              order to set up a JIT stack. If the return from the callback
 | |
|              function is NULL, the internal 32K stack is used; otherwise the
 | |
|              return value must be a valid JIT stack, the result of calling
 | |
|              pcre2_jit_stack_create().
 | |
| 
 | |
|        A  callback function is obeyed whenever JIT code is about to be run; it
 | |
|        is not obeyed when pcre2_match() is called with options that are incom-
 | |
|        patible with JIT matching. A callback function can therefore be used to
 | |
|        determine whether a match operation was  executed  by  JIT  or  by  the
 | |
|        interpreter.
 | |
| 
 | |
|        You may safely use the same JIT stack for more than one pattern (either
 | |
|        by assigning directly or by callback), as long as the patterns are  all
 | |
|        matched sequentially in the same thread. In a multithreaded application,
 | |
|        if you do not specify a JIT stack, or if you assign or pass  back  NULL
 | |
|        from  a  callback, that is thread-safe, because each thread has its own
 | |
|        machine stack. However, if you assign  or  pass  back  a  non-NULL  JIT
 | |
|        stack,  this  must  be  a  different  stack for each thread so that the
 | |
|        application is thread-safe.
 | |
| 
 | |
|        Strictly speaking, even more is allowed. You can assign the  same  non-
 | |
|        NULL  stack  to a match context that is used by any number of patterns,
 | |
|        as long as they are not used for matching by multiple  threads  at  the
 | |
|        same  time.  For  example, you could use the same stack in all compiled
 | |
|        patterns, with a global mutex in the callback to wait until  the  stack
 | |
|        is available for use. However, this is an inefficient solution, and not
 | |
|        recommended.
 | |
| 
 | |
|        This is a suggestion for how a multithreaded program that needs to  set
 | |
|        up non-default JIT stacks might operate:
 | |
| 
 | |
|          During thread initialization
 | |
|            thread_local_var = pcre2_jit_stack_create(...)
 | |
| 
 | |
|          During thread exit
 | |
|            pcre2_jit_stack_free(thread_local_var)
 | |
| 
 | |
|          Use a one-line callback function
 | |
|            return thread_local_var
 | |
| 
 | |
|        All  the  functions  described in this section do nothing if JIT is not
 | |
|        available.
 | |
| 
 | |
| 
 | |
| JIT STACK FAQ
 | |
| 
 | |
|        (1) Why do we need JIT stacks?
 | |
| 
 | |
|        PCRE2 (and JIT) is a recursive, depth-first engine, so it needs a stack
 | |
|        where  the local data of the current node is pushed before checking its
 | |
|        child nodes. Allocating real machine stack is difficult on some  plat-
 | |
|        forms.  For  example, on PowerPC the stack chain has to be updated every
 | |
|        time the stack is extended. Although this is possible, the overhead  of
 | |
|        updating it reduces performance. So we do the recursion in memory.
 | |
| 
 | |
|        (2) Why don't we simply allocate blocks of memory with malloc()?
 | |
| 
 | |
|        Modern  operating  systems  have  a  nice  feature: they can reserve an
 | |
|        address space instead of allocating memory. We can safely allocate mem-
 | |
|        ory  pages  inside  this address space, so the stack could grow without
 | |
|        moving memory data (this is important because of pointers). Thus we can
 | |
|        allocate  1M  address space, and use only a single memory page (usually
 | |
|        4K) if that is enough. However, we can still grow up to 1M  anytime  if
 | |
|        needed.
 | |
| 
 | |
|        (3) Who "owns" a JIT stack?
 | |
| 
 | |
|        The  owner of the stack is the user program, not the JIT-compiled pat-
 | |
|        tern or anything else. The user program must ensure that while a  stack
 | |
|        is  in  use by pcre2_match() (that is, it is assigned to a match context
 | |
|        that is passed to the pattern currently running), it is not used by any
 | |
|        other thread (to avoid overwriting the same memory area).
 | |
|        The best practice for multithreaded programs is to allocate a stack for
 | |
|        each thread, and return this stack through the JIT callback function.
 | |
| 
 | |
|        (4) When should a JIT stack be freed?
 | |
| 
 | |
|        You can free a JIT stack at any time, as long as it will not be used by
 | |
|        pcre2_match() again. When you assign the stack to a match context, only
 | |
|        a  pointer  is  set. There is no reference counting or any other magic.
 | |
|        You can free compiled patterns, contexts, and stacks in any  order,  at
 | |
|        any time. Just do not call pcre2_match() with a match context pointing
 | |
|        to an already freed stack, as that will  cause  a  segmentation  fault.
 | |
|        (Also,  do  not free a stack currently used by pcre2_match() in another
 | |
|        thread.) You can also replace the stack in a context at any  time  when
 | |
|        it is not in use. You should free the previous stack before assigning a
 | |
|        replacement.
 | |
| 
 | |
|        (5)  Should  I  allocate/free  a  stack every time before/after calling
 | |
|        pcre2_match()?
 | |
| 
 | |
|        No, because this is too costly in  terms  of  resources.  However,  you
 | |
|        could implement some clever scheme that releases the stack if it is not
 | |
|        used in let's say two minutes. The JIT callback  can  help  to  achieve
 | |
|        this without keeping a list of patterns.
 | |
| 
 | |
|        (6)  OK, the stack is for long term memory allocation. But what happens
 | |
|        if a pattern causes stack overflow with a stack of 1M? Is that 1M  kept
 | |
|        until the stack is freed?
 | |
| 
 | |
|        Especially on embedded systems, it might be a good idea to release mem-
 | |
|        ory sometimes without freeing the stack. There is no API  for  this  at
 | |
|        the  moment. A function that returns the memory currently allocated for
 | |
|        a stack, and another that releases memory (shrinks  the  stack),  would
 | |
|        probably be a good idea if someone needs this.
 | |
| 
 | |
|        (7) This is too much of a headache. Isn't there any better solution for
 | |
|        JIT stack handling?
 | |
| 
 | |
|        No, thanks to Windows. If POSIX threads were used everywhere, we  could
 | |
|        throw out this complicated API.
 | |
| 
 | |
| 
 | |
| FREEING JIT SPECULATIVE MEMORY
 | |
| 
 | |
|        void pcre2_jit_free_unused_memory(pcre2_general_context *gcontext);
 | |
| 
 | |
|        The JIT executable allocator does not free memory as soon as it is pos-
 | |
|        sible.  It expects new allocations, and keeps some free memory around to
 | |
|        improve  allocation  speed. However, in low memory conditions, it might
 | |
|        be better to free all possible memory. You can cause this to happen  by
 | |
|        calling  pcre2_jit_free_unused_memory(). Its argument is a general con-
 | |
|        text, for custom memory management, or NULL for standard memory manage-
 | |
|        ment.
 | |
| 
 | |
| 
 | |
| EXAMPLE CODE
 | |
| 
 | |
|        This  is  a  single-threaded example that specifies a JIT stack without
 | |
|        using a callback. A real program should include  error  checking  after
 | |
|        all the function calls.
 | |
| 
 | |
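|          /* pattern, subject, and length are supplied by the caller. */
 | |
|          int rc;
 | |
|          int errornumber;
 | |
|          PCRE2_SIZE erroffset;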
 | |
|          pcre2_code *re;
 | |
|          pcre2_match_data *match_data;
 | |
|          pcre2_match_context *mcontext;
 | |
|          pcre2_jit_stack *jit_stack;
 | |
| 
 | |
|          re = pcre2_compile(pattern, PCRE2_ZERO_TERMINATED, 0,
 | |
|            &errornumber, &erroffset, NULL);
 | |
|          rc = pcre2_jit_compile(re, PCRE2_JIT_COMPLETE);
 | |
|          mcontext = pcre2_match_context_create(NULL);
 | |
|          jit_stack = pcre2_jit_stack_create(32*1024, 512*1024, NULL);
 | |
|          pcre2_jit_stack_assign(mcontext, NULL, jit_stack);
 | |
|          match_data = pcre2_match_data_create(10, NULL);
 | |
|          rc = pcre2_match(re, subject, length, 0, 0, match_data, mcontext);
 | |
|          /* Process result */
 | |
| 
 | |
|          pcre2_code_free(re);
 | |
|          pcre2_match_data_free(match_data);
 | |
|          pcre2_match_context_free(mcontext);
 | |
|          pcre2_jit_stack_free(jit_stack);
 | |
| 
 | |
| 
 | |
| JIT FAST PATH API
 | |
| 
 | |
|        Because the API described above falls back to interpreted matching when
 | |
|        JIT is not available, it is convenient for programs  that  are  written
 | |
|        for  general  use  in  many  environments.  However,  calling  JIT  via
 | |
|        pcre2_match() does have a performance impact. Programs that are written
 | |
|        for  use  where  JIT  is known to be available, and which need the best
 | |
|        possible performance, can instead use a "fast path"  API  to  call  JIT
 | |
|        matching  directly instead of calling pcre2_match() (obviously only for
 | |
|        patterns that have been successfully processed by pcre2_jit_compile()).
 | |
| 
 | |
|        The fast path  function  is  called  pcre2_jit_match(),  and  it  takes
 | |
|        exactly the same arguments as pcre2_match(). The return values are also
 | |
|        the same, plus PCRE2_ERROR_JIT_BADOPTION if a matching mode (partial or
 | |
|        complete)  is  requested that was not compiled. Unsupported option bits
 | |
|        (for example, PCRE2_ANCHORED) are ignored.
 | |
| 
 | |
|        When you call pcre2_match(), as well as testing for invalid options,  a
 | |
|        number of other sanity checks are performed on the arguments. For exam-
 | |
|        ple, if the subject pointer is NULL, an immediate error is given. Also,
 | |
|        unless  PCRE2_NO_UTF_CHECK  is  set, a UTF subject string is tested for
 | |
|        validity. In the interests of speed, these checks do not happen on  the
 | |
|        JIT fast path, and if invalid data is passed, the result is undefined.
 | |
| 
 | |
|        Bypassing  the  sanity  checks  and the pcre2_match() wrapping can give
 | |
|        speedups of more than 10%.
 | |
| 
 | |
| 
 | |
| SEE ALSO
 | |
| 
 | |
|        pcre2api(3)
 | |
| 
 | |
| 
 | |
| AUTHOR
 | |
| 
 | |
|        Philip Hazel (FAQ by Zoltan Herczeg)
 | |
|        University Computing Service
 | |
|        Cambridge, England.
 | |
| 
 | |
| 
 | |
| REVISION
 | |
| 
 | |
|        Last updated: 27 November 2014
 | |
|        Copyright (c) 1997-2014 University of Cambridge.
 | |
| ------------------------------------------------------------------------------
 | |
| 
 | |
| 
 | |
| PCRE2LIMITS(3)             Library Functions Manual             PCRE2LIMITS(3)
 | |
| 
 | |
| 
 | |
| 
 | |
| NAME
 | |
|        PCRE2 - Perl-compatible regular expressions (revised API)
 | |
| 
 | |
| SIZE AND OTHER LIMITATIONS
 | |
| 
 | |
|        There are some size limitations in PCRE2 but it is hoped that they will
 | |
|        never in practice be relevant.
 | |
| 
 | |
|        The maximum size of a compiled pattern is approximately 64K code  units
 | |
|        for  the  8-bit  and  16-bit  libraries  if  PCRE2 is compiled with the
 | |
|        default internal linkage size, which is 2 bytes for these libraries. If
 | |
|        you  want  to  process regular expressions that are truly enormous, you
 | |
|        can compile PCRE2 with an internal linkage size of 3 or 4 (when  build-
 | |
|        ing  the  16-bit library, 3 is rounded up to 4). See the README file in
 | |
|        the source distribution and the pcre2build documentation  for  details.
 | |
|        In  these  cases the limit is substantially larger.  However, the speed
 | |
|        of execution is slower. In the 32-bit  library,  the  internal  linkage
 | |
|        size is always 4.
 | |
| 
 | |
|        The maximum length (in code units) of a subject string is one less than
 | |
|        the largest number a PCRE2_SIZE variable can  hold.  PCRE2_SIZE  is  an
 | |
|        unsigned  integer  type,  usually  defined as size_t. Its maximum value
 | |
|        (that is ~(PCRE2_SIZE)0) is reserved as a special indicator  for  zero-
 | |
|        terminated strings and unset offsets.
 | |
| 
 | |
|        Note  that  when  using  the  traditional matching function, PCRE2 uses
 | |
|        recursion to handle subpatterns and indefinite repetition.  This  means
 | |
|        that  the  available stack space may limit the size of a subject string
 | |
|        that can be processed by certain patterns. For a  discussion  of  stack
 | |
|        issues, see the pcre2stack documentation.
 | |
| 
 | |
|        All values in repeating quantifiers must be less than 65536.
 | |
| 
 | |
|        There is no limit to the number of parenthesized subpatterns, but there
 | |
|        can be no more than 65535 capturing subpatterns. There is,  however,  a
 | |
|        limit  to  the  depth  of  nesting  of parenthesized subpatterns of all
 | |
|        kinds. This is imposed in order to limit the  amount  of  system  stack
 | |
|        used  at  compile time. The limit can be specified when PCRE2 is built;
 | |
|        the default is 250.
 | |
| 
 | |
|        There is a limit to the number of forward references to subsequent sub-
 | |
|        patterns  of  around  200,000.  Repeated  forward references with fixed
 | |
|        upper limits, for example, (?2){0,100} when subpattern number 2  is  to
 | |
|        the  right,  are included in the count. There is no limit to the number
 | |
|        of backward references.
 | |
| 
 | |
|        The  maximum  length of a name for a named subpattern is 32 code units,
 | |
|        and
 | |
|        the maximum number of named subpatterns is 10000.
 | |
| 
 | |
|        The  maximum  length  of  a  name  in  a (*MARK), (*PRUNE), (*SKIP), or
 | |
|        (*THEN) verb is 255 for the 8-bit library and 65535 for the 16-bit  and
 | |
|        32-bit libraries.
 | |
| 
 | |
| 
 | |
| AUTHOR
 | |
| 
 | |
|        Philip Hazel
 | |
|        University Computing Service
 | |
|        Cambridge, England.
 | |
| 
 | |
| 
 | |
| REVISION
 | |
| 
 | |
|        Last updated: 25 November 2014
 | |
|        Copyright (c) 1997-2014 University of Cambridge.
 | |
| ------------------------------------------------------------------------------
 | |
| 
 | |
| 
 | |
| PCRE2MATCHING(3)           Library Functions Manual           PCRE2MATCHING(3)
 | |
| 
 | |
| 
 | |
| 
 | |
| NAME
 | |
|        PCRE2 - Perl-compatible regular expressions (revised API)
 | |
| 
 | |
| PCRE2 MATCHING ALGORITHMS
 | |
| 
 | |
|        This document describes the two different algorithms that are available
 | |
|        in PCRE2 for matching a compiled regular  expression  against  a  given
 | |
|        subject  string.  The  "standard"  algorithm is the one provided by the
 | |
|        pcre2_match() function. This works in the same way as  Perl's  matching
 | |
|        function,  and provides a Perl-compatible matching operation. The just-
 | |
|        in-time (JIT) optimization that is described in the pcre2jit documenta-
 | |
|        tion is compatible with this function.
 | |
| 
 | |
|        An alternative algorithm is provided by the pcre2_dfa_match() function;
 | |
|        it operates in a different way, and is not Perl-compatible. This alter-
 | |
|        native  has  advantages  and  disadvantages  compared with the standard
 | |
|        algorithm, and these are described below.
 | |
| 
 | |
|        When there is only one possible way in which a given subject string can
 | |
|        match  a pattern, the two algorithms give the same answer. A difference
 | |
|        arises, however, when there are multiple possibilities. For example, if
 | |
|        the pattern
 | |
| 
 | |
|          ^<.*>
 | |
| 
 | |
|        is matched against the string
 | |
| 
 | |
|          <something> <something else> <something further>
 | |
| 
 | |
|        there are three possible answers. The standard algorithm finds only one
 | |
|        of them, whereas the alternative algorithm finds all three.
 | |
| 
 | |
| 
 | |
| REGULAR EXPRESSIONS AS TREES
 | |
| 
 | |
|        The set of strings that are matched by a regular expression can be rep-
 | |
|        resented  as  a  tree structure. An unlimited repetition in the pattern
 | |
|        makes the tree of infinite size, but it is still a tree.  Matching  the
 | |
|        pattern  to a given subject string (from a given starting point) can be
 | |
|        thought of as a search of the tree.  There are two  ways  to  search  a
 | |
|        tree:  depth-first  and  breadth-first, and these correspond to the two
 | |
|        matching algorithms provided by PCRE2.
 | |
| 
 | |
| 
 | |
| THE STANDARD MATCHING ALGORITHM
 | |
| 
 | |
|        In the terminology of Jeffrey Friedl's book "Mastering Regular  Expres-
 | |
|        sions",  the  standard  algorithm  is an "NFA algorithm". It conducts a
 | |
|        depth-first search of the pattern tree. That is, it  proceeds  along  a
 | |
|        single path through the tree, checking that the subject matches what is
 | |
|        required. When there is a mismatch, the algorithm  tries  any  alterna-
 | |
|        tives  at  the  current point, and if they all fail, it backs up to the
 | |
|        previous branch point in the  tree,  and  tries  the  next  alternative
 | |
|        branch  at  that  level.  This often involves backing up (moving to the
 | |
|        left) in the subject string as well.  The  order  in  which  repetition
 | |
|        branches  are  tried  is controlled by the greedy or ungreedy nature of
 | |
|        the quantifier.
 | |
| 
 | |
|        If a leaf node is reached, a matching string has  been  found,  and  at
 | |
|        that  point the algorithm stops. Thus, if there is more than one possi-
 | |
|        ble match, this algorithm returns the first one that it finds.  Whether
 | |
|        this  is the shortest, the longest, or some intermediate length depends
 | |
|        on the way the greedy and ungreedy repetition quantifiers are specified
 | |
|        in the pattern.
 | |
| 
 | |
|        Because  it  ends  up  with a single path through the tree, it is rela-
 | |
|        tively straightforward for this algorithm to keep  track  of  the  sub-
 | |
|        strings  that  are  matched  by portions of the pattern in parentheses.
 | |
|        This provides support for capturing parentheses and back references.
 | |
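The depth-first search with backtracking described in this section can be illustrated with a toy matcher. This is not PCRE2 code and bears no relation to its implementation: it is a sketch that handles only literal characters, "." and a greedy "c*" repeat, and the name match_here is an illustrative assumption.

```c
/* Toy depth-first matcher, NOT part of PCRE2. Supports literal
   characters, '.' (any character), and a greedy 'c*' repeat. Returns the
   length of the match of pattern p at the start of text s, or -1 if
   there is no match. */
static int match_here(const char *p, const char *s)
{
    if (*p == '\0')
        return 0;                 /* leaf reached: a match has been found */

    if (p[1] == '*') {
        /* Greedy repeat: first take as many characters as possible... */
        int n = 0;
        while (s[n] != '\0' && (p[0] == '.' || p[0] == s[n]))
            n++;
        /* ...then back up one character at a time until the rest fits. */
        for (; n >= 0; n--) {
            int rest = match_here(p + 2, s + n);
            if (rest >= 0)
                return n + rest;
        }
        return -1;
    }

    if (*s != '\0' && (*p == '.' || *p == *s)) {
        int rest = match_here(p + 1, s + 1);
        return rest >= 0 ? 1 + rest : -1;
    }
    return -1;                    /* mismatch: the caller backtracks */
}
```

Because the repeat is tried longest-first, match_here("<.*>", "<a> <b>") stops at the first success it reaches, which is the whole seven-character string; trying the shortest repeat first would instead find "<a>".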
| 
 | |
| 
 | |
| THE ALTERNATIVE MATCHING ALGORITHM
 | |
| 
 | |
|        This algorithm conducts a breadth-first search of  the  tree.  Starting
 | |
|        from  the  first  matching  point  in the subject, it scans the subject
 | |
|        string from left to right, once, character by character, and as it does
 | |
|        this,  it remembers all the paths through the tree that represent valid
 | |
|        matches. In Friedl's terminology, this is a kind  of  "DFA  algorithm",
 | |
|        though  it is not implemented as a traditional finite state machine (it
 | |
|        keeps multiple states active simultaneously).
 | |
| 
 | |
|        Although the general principle of this matching algorithm  is  that  it
 | |
|        scans  the subject string only once, without backtracking, there is one
 | |
|        exception: when a lookaround assertion is encountered,  the  characters
 | |
|        following  or  preceding  the  current  point  have to be independently
 | |
|        inspected.
 | |
| 
 | |
|        The scan continues until either the end of the subject is  reached,  or
 | |
|        there  are  no more unterminated paths. At this point, terminated paths
 | |
|        represent the different matching possibilities (if there are none,  the
 | |
|        match  has  failed).   Thus,  if there is more than one possible match,
 | |
|        this algorithm finds all of them, and in particular, it finds the long-
 | |
|        est.  The  matches are returned in decreasing order of length. There is
 | |
|        an option to stop the algorithm after the first match (which is  neces-
 | |
|        sarily the shortest) is found.
 | |
| 
 | |
|        Note that all the matches that are found start at the same point in the
 | |
|        subject. If the pattern
 | |
| 
 | |
|          cat(er(pillar)?)?
 | |
| 
 | |
|        is matched against the string "the caterpillar catchment",  the  result
 | |
|        is  the  three  strings "caterpillar", "cater", and "cat" that start at
 | |
|        the fifth character of the subject. The algorithm  does  not  automati-
 | |
|        cally move on to find matches that start at later positions.
 | |
| 
 | |
|        PCRE2's "auto-possessification" optimization usually applies to charac-
 | |
|        ter repeats at the end of a pattern (as well as internally). For  exam-
 | |
|        ple, the pattern "a\d+" is compiled as if it were "a\d++" because there
 | |
|        is no point even considering the possibility of backtracking  into  the
 | |
|        repeated  digits.  For  DFA matching, this means that only one possible
 | |
|        match is found. If you really do want multiple matches in  such  cases,
 | |
|        either  use  an ungreedy repeat ("a\d+?") or set the PCRE2_NO_AUTO_POS-
 | |
|        SESS option when compiling.
 | |
| 
 | |
|        There are a number of features of PCRE2 regular  expressions  that  are
 | |
|        not  supported  by the alternative matching algorithm. They are as fol-
 | |
|        lows:
 | |
| 
 | |
|        1. Because the algorithm finds all  possible  matches,  the  greedy  or
 | |
|        ungreedy  nature  of  repetition quantifiers is not relevant (though it
 | |
|        may affect auto-possessification, as just described). During  matching,
 | |
|        greedy  and  ungreedy  quantifiers are treated in exactly the same way.
 | |
|        However, possessive quantifiers can make a difference when what follows
 | |
|        could  also  match  what  is  quantified, for example in a pattern like
 | |
|        this:
 | |
| 
 | |
|          ^a++\w!
 | |
| 
 | |
|        This pattern matches "aaab!" but not "aaa!", which would be matched  by
 | |
|        a  non-possessive quantifier. Similarly, if an atomic group is present,
 | |
|        it is matched as if it were a standalone pattern at the current  point,
 | |
|        and  the  longest match is then "locked in" for the rest of the overall
 | |
|        pattern.
 | |
| 
 | |
|        2. When dealing with multiple paths through the tree simultaneously, it
 | |
|        is  not  straightforward  to  keep track of captured substrings for the
 | |
|        different matching possibilities, and PCRE2's  implementation  of  this
 | |
|        algorithm does not attempt to do this. This means that no captured sub-
 | |
|        strings are available.
 | |
| 
 | |
|        3. Because no substrings are captured, back references within the  pat-
 | |
|        tern are not supported, and cause errors if encountered.
 | |
| 
 | |
|        4.  For  the same reason, conditional expressions that use a backrefer-
 | |
|        ence as the condition or test for a specific group  recursion  are  not
 | |
|        supported.
 | |
| 
 | |
|        5.  Because  many  paths  through the tree may be active, the \K escape
 | |
|        sequence, which resets the start of the match when encountered (but may
 | |
|        be  on  some  paths  and not on others), is not supported. It causes an
 | |
|        error if encountered.
 | |
| 
 | |
|        6. Callouts are supported, but the value of the  capture_top  field  is
 | |
|        always 1, and the value of the capture_last field is always 0.
 | |
| 
 | |
|        7.  The  \C  escape  sequence, which (in the standard algorithm) always
 | |
|        matches a single code unit, even in a UTF mode,  is  not  supported  in
 | |
|        these  modes,  because the alternative algorithm moves through the sub-
 | |
|        ject string one character (not code unit) at a  time,  for  all  active
 | |
|        paths through the tree.
 | |
| 
 | |
|        8.  Except for (*FAIL), the backtracking control verbs such as (*PRUNE)
 | |
|        are not supported. (*FAIL) is supported, and  behaves  like  a  failing
 | |
|        negative assertion.
 | |
| 
 | |
| 
 | |
| ADVANTAGES OF THE ALTERNATIVE ALGORITHM
 | |
| 
 | |
|        Using  the alternative matching algorithm provides the following advan-
 | |
|        tages:
 | |
| 
 | |
|        1. All possible matches (at a single point in the subject) are automat-
 | |
|        ically  found,  and  in particular, the longest match is found. To find
 | |
|        more than one match using the standard algorithm, you have to do kludgy
 | |
|        things with callouts.
 | |
| 
 | |
|        2.  Because  the  alternative  algorithm  scans the subject string just
 | |
|        once, and never needs to backtrack (except for lookbehinds), it is pos-
 | |
|        sible  to  pass  very  long subject strings to the matching function in
 | |
|        several pieces, checking for partial matching each time. Although it is
 | |
|        also  possible  to  do  multi-segment matching using the standard algo-
 | |
|        rithm, by retaining partially matched substrings, it  is  more  compli-
 | |
|        cated. The pcre2partial documentation gives details of partial matching
 | |
|        and discusses multi-segment matching.
 | |
| 
 | |
| 
 | |
| DISADVANTAGES OF THE ALTERNATIVE ALGORITHM
 | |
| 
 | |
|        The alternative algorithm suffers from a number of disadvantages:
 | |
| 
 | |
|        1. It is substantially slower than  the  standard  algorithm.  This  is
 | |
|        partly  because  it has to search for all possible matches, but is also
 | |
|        because it is less susceptible to optimization.
 | |
| 
 | |
|        2. Capturing parentheses and back references are not supported.
 | |
| 
 | |
|        3. Although atomic groups are supported, their use does not provide the
 | |
|        performance advantage that it does for the standard algorithm.
 | |
| 
 | |
| 
 | |
| AUTHOR
 | |
| 
 | |
|        Philip Hazel
 | |
|        University Computing Service
 | |
|        Cambridge, England.
 | |
| 
 | |
| 
 | |
| REVISION
 | |
| 
 | |
|        Last updated: 29 September 2014
 | |
|        Copyright (c) 1997-2014 University of Cambridge.
 | |
| ------------------------------------------------------------------------------
 | |
| 
 | |
| 
 | |
| PCRE2PARTIAL(3)            Library Functions Manual            PCRE2PARTIAL(3)
 | |
| 
 | |
| 
 | |
| 
 | |
| NAME
 | |
|        PCRE2 - Perl-compatible regular expressions
 | |
| 
 | |
| PARTIAL MATCHING IN PCRE2
 | |
| 
 | |
|        In  normal  use  of  PCRE2,  if  the subject string that is passed to a
 | |
|        matching function matches as far as it goes, but is too short to  match
 | |
|        the  entire pattern, PCRE2_ERROR_NOMATCH is returned. There are circum-
 | |
|        stances where it might be helpful to distinguish this case  from  other
 | |
|        cases in which there is no match.
 | |
| 
 | |
|        Consider, for example, an application where a human is required to type
 | |
|        in data for a field with specific formatting requirements.  An  example
 | |
|        might be a date in the form ddmmmyy, defined by this pattern:
 | |
| 
 | |
|          ^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$
 | |
| 
 | |
|        If the application sees the user's keystrokes one by one, and can check
 | |
|        that what has been typed so far is potentially valid,  it  is  able  to
 | |
|        raise  an  error  as  soon  as  a  mistake  is made, by beeping and not
 | |
|        reflecting the character that has been typed, for example. This immedi-
 | |
|        ate  feedback is likely to be a better user interface than a check that
 | |
|        is delayed until the entire string has been entered.  Partial  matching
 | |
|        can  also be useful when the subject string is very long and is not all
 | |
|        available at once.
 | |
| 
 | |
|        PCRE2 supports partial matching by means of the PCRE2_PARTIAL_SOFT  and
 | |
|        PCRE2_PARTIAL_HARD  options,  which  can be set when calling a matching
 | |
|        function.  The difference between the two options is whether or  not  a
 | |
|        partial match is preferred to an alternative complete match, though the
 | |
|        details differ between the two types  of  matching  function.  If  both
 | |
|        options are set, PCRE2_PARTIAL_HARD takes precedence.
 | |
| 
 | |
|        If  you  want to use partial matching with just-in-time optimized code,
 | |
|        you must call pcre2_jit_compile() with one or both of these options:
 | |
| 
 | |
|          PCRE2_JIT_PARTIAL_SOFT
 | |
|          PCRE2_JIT_PARTIAL_HARD
 | |
| 
 | |
|        PCRE2_JIT_COMPLETE should also be set if you are going to run  non-par-
 | |
|        tial  matches  on the same pattern. If the appropriate JIT mode has not
 | |
|        been compiled, interpretive matching code is used.
 | |
| 
 | |
|        Setting a partial matching option  disables  two  of  PCRE2's  standard
 | |
|        optimizations. PCRE2 remembers the last literal code unit in a pattern,
 | |
|        and abandons matching immediately if it is not present in  the  subject
 | |
|        string.  This  optimization  cannot  be  used for a subject string that
 | |
|        might match only partially. PCRE2 also knows the minimum  length  of  a
 | |
|        matching  string,  and  does not bother to run the matching function on
 | |
|        shorter strings. This optimization is also disabled for partial  match-
 | |
|        ing.
 | |
| 
 | |
| 
 | |
| PARTIAL MATCHING USING pcre2_match()
 | |
| 
 | |
|        A  partial  match occurs during a call to pcre2_match() when the end of
 | |
|        the subject string is reached successfully, but  matching  cannot  con-
 | |
|        tinue because more characters are needed. However, at least one charac-
 | |
|        ter in the subject must have been inspected. This  character  need  not
 | |
|        form part of the final matched string; lookbehind assertions and the \K
 | |
|        escape sequence provide ways of inspecting characters before the  start
 | |
|        of  a matched string. The requirement for inspecting at least one char-
 | |
|        acter exists because an empty string can  always  be  matched;  without
 | |
|        such  a  restriction  there would always be a partial match of an empty
 | |
|        string at the end of the subject.
 | |
| 
 | |
|        When a partial match is returned, the first two elements in the ovector
 | |
|        point to the portion of the subject that was matched, but the values in
 | |
|        the rest of the ovector are undefined. The appearance of \K in the pat-
 | |
|        tern has no effect for a partial match. Consider this pattern:
 | |
| 
 | |
|          /abc\K123/
 | |
| 
 | |
|        If it is matched against "456abc123xyz" the result is a complete match,
 | |
|        and the ovector defines the matched string as "123", because \K  resets
 | |
|        the  "start  of  match" point. However, if a partial match is requested
 | |
|        and the subject string is "456abc12", a partial match is found for  the
 | |
|        string  "abc12",  because  all these characters are needed for a subse-
 | |
|        quent re-match with additional characters.
 | |
| 
 | |
|        What happens when a partial match is identified depends on which of the
 | |
|        two partial matching options is set.
 | |
| 
 | |
|    PCRE2_PARTIAL_SOFT WITH pcre2_match()
 | |
| 
 | |
|        If  PCRE2_PARTIAL_SOFT  is  set when pcre2_match() identifies a partial
 | |
|        match, the partial match is remembered, but matching continues as  nor-
 | |
|        mal,  and  other  alternatives in the pattern are tried. If no complete
 | |
|        match  can  be  found,  PCRE2_ERROR_PARTIAL  is  returned  instead   of
 | |
|        PCRE2_ERROR_NOMATCH.
 | |
| 
 | |
|        This  option  is "soft" because it prefers a complete match over a par-
 | |
|        tial match.  All the various matching items in a pattern behave  as  if
 | |
|        the  subject string is potentially complete. For example, \z, \Z, and $
 | |
|        match at the end of the subject, as normal, and for \b and \B  the  end
 | |
|        of the subject is treated as a non-alphanumeric.
 | |
| 
 | |
|        If  there  is more than one partial match, the first one that was found
 | |
|        provides the data that is returned. Consider this pattern:
 | |
| 
 | |
|          /123\w+X|dogY/
 | |
| 
 | |
|        If this is matched against the subject string "abc123dog", both  alter-
 | |
|        natives  fail  to  match,  but the end of the subject is reached during
 | |
|        matching, so PCRE2_ERROR_PARTIAL is returned. The offsets are set to  3
 | |
|        and  9, identifying "123dog" as the first partial match that was found.
 | |
|        (In this example, there are two partial matches, because "dog"  on  its
 | |
|        own partially matches the second alternative.)
 | |
| 
 | |
|    PCRE2_PARTIAL_HARD WITH pcre2_match()
 | |
| 
 | |
|        If  PCRE2_PARTIAL_HARD is set for pcre2_match(), PCRE2_ERROR_PARTIAL is
 | |
|        returned as soon as a partial match is  found,  without  continuing  to
 | |
|        search  for possible complete matches. This option is "hard" because it
 | |
|        prefers an earlier partial match over a later complete match. For  this
 | |
|        reason,  the  assumption  is  made that the end of the supplied subject
 | |
|        string may not be the true end of the available data, and  so,  if  \z,
 | |
|        \Z,  \b, \B, or $ are encountered at the end of the subject, the result
 | |
|        is PCRE2_ERROR_PARTIAL, provided that at least  one  character  in  the
 | |
|        subject has been inspected.
 | |
| 
 | |
|    Comparing hard and soft partial matching
 | |
| 
 | |
|        The  difference  between the two partial matching options can be illus-
 | |
|        trated by a pattern such as:
 | |
| 
 | |
|          /dog(sbody)?/
 | |
| 
 | |
|        This matches either "dog" or "dogsbody", greedily (that is, it  prefers
 | |
|        the  longer  string  if  possible). If it is matched against the string
 | |
|        "dog" with PCRE2_PARTIAL_SOFT, it yields a complete  match  for  "dog".
 | |
|        However,  if  PCRE2_PARTIAL_HARD is set, the result is PCRE2_ERROR_PAR-
 | |
|        TIAL. On the other hand, if the pattern is made ungreedy the result  is
 | |
|        different:
 | |
| 
 | |
|          /dog(sbody)??/
 | |
| 
 | |
|        In  this  case  the  result  is always a complete match because that is
 | |
|        found first, and matching never  continues  after  finding  a  complete
 | |
|        match. It might be easier to follow this explanation by thinking of the
 | |
|        two patterns like this:
 | |
| 
 | |
|          /dog(sbody)?/    is the same as  /dogsbody|dog/
 | |
|          /dog(sbody)??/   is the same as  /dog|dogsbody/
 | |
| 
 | |
|        The second pattern will never match "dogsbody", because it will  always
 | |
|        find the shorter match first.
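
       The comparison above can be illustrated with a toy model in C. This is
       not PCRE2 code and does not call the library: toy_match() and the
       TOY_* names are hypothetical, and the model handles only literal
       alternatives tried in order from the start of the subject. It is a
       sketch of the "first complete match wins; soft remembers a partial
       match, hard returns it at once" rule, nothing more.

```c
#include <stddef.h>
#include <string.h>

enum toy_result { TOY_NOMATCH, TOY_PARTIAL, TOY_COMPLETE };

/* Toy model only: tries literal alternatives in order, anchored at the
   start of subject, the way /dogsbody|dog/ tries "dogsbody" before "dog".
   hard != 0 models PCRE2_PARTIAL_HARD; hard == 0 models PCRE2_PARTIAL_SOFT. */
enum toy_result toy_match(const char *const *alts, size_t nalts,
                          const char *subject, int hard)
{
    size_t slen = strlen(subject);
    int saw_partial = 0;

    for (size_t i = 0; i < nalts; i++) {
        size_t alen = strlen(alts[i]);

        /* The whole alternative is present: a complete match ends the search. */
        if (alen <= slen && strncmp(alts[i], subject, alen) == 0)
            return TOY_COMPLETE;

        /* The subject ran out while still matching, and at least one
           character was inspected: this is a partial match. */
        if (slen > 0 && slen < alen && strncmp(alts[i], subject, slen) == 0) {
            if (hard) return TOY_PARTIAL;  /* hard: stop at the first partial */
            saw_partial = 1;               /* soft: remember it, keep trying  */
        }
    }
    return saw_partial ? TOY_PARTIAL : TOY_NOMATCH;
}
```

       With the alternatives ordered as { "dogsbody", "dog" } and the subject
       "dog", the model gives a complete match in soft mode and a partial
       match in hard mode; with the order { "dog", "dogsbody" } it gives a
       complete match in both modes.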
 | |
| 
 | |
| 
 | |
| PARTIAL MATCHING USING pcre2_dfa_match()
 | |
| 
 | |
|        The DFA functions move along the subject string character by character,
 | |
|        without backtracking, searching for  all  possible  matches  simultane-
 | |
|        ously.  If the end of the subject is reached before the end of the pat-
 | |
|        tern, there is the possibility of a partial match, again provided  that
 | |
|        at least one character has been inspected.
 | |
| 
 | |
|        When PCRE2_PARTIAL_SOFT is set, PCRE2_ERROR_PARTIAL is returned only if
 | |
|        there have been no complete matches. Otherwise,  the  complete  matches
 | |
|        are  returned.   However, if PCRE2_PARTIAL_HARD is set, a partial match
 | |
|        takes precedence over any complete matches. The portion of  the  string
 | |
|        that was matched when the longest partial match was found is set as the
 | |
|        first matching string.
 | |
| 
 | |
|        Because the DFA functions always search for all possible  matches,  and
 | |
|        there  is  no  difference between greedy and ungreedy repetition, their
 | |
|        behaviour is different from  the  standard  functions  when  PCRE2_PAR-
 | |
|        TIAL_HARD  is  set.  Consider  the  string  "dog"  matched  against the
 | |
|        ungreedy pattern shown above:
 | |
| 
 | |
|          /dog(sbody)??/
 | |
| 
 | |
|        Whereas the standard function stops as soon as it  finds  the  complete
 | |
|        match  for  "dog",  the  DFA  function also finds the partial match for
 | |
|        "dogsbody", and so returns that when PCRE2_PARTIAL_HARD is set.
 | |
| 
 | |
| 
 | |
| PARTIAL MATCHING AND WORD BOUNDARIES
 | |
| 
 | |
|        If a pattern ends with one of the sequences \b or \B, which test for word
 | |
|        boundaries,  partial matching with PCRE2_PARTIAL_SOFT can give counter-
 | |
|        intuitive results. Consider this pattern:
 | |
| 
 | |
|          /\bcat\b/
 | |
| 
 | |
|        This matches "cat", provided there is a word boundary at either end. If
 | |
|        the subject string is "the cat", the comparison of the final "t" with a
 | |
|        following character cannot take place, so a  partial  match  is  found.
 | |
|        However,  normal  matching carries on, and \b matches at the end of the
 | |
|        subject when the last character is a letter, so  a  complete  match  is
 | |
|        found.   The  result,  therefore,  is  not  PCRE2_ERROR_PARTIAL.  Using
 | |
|        PCRE2_PARTIAL_HARD in this case does yield PCRE2_ERROR_PARTIAL, because
 | |
|        then the partial match takes precedence.
 | |
| 
 | |
| 
 | |
| EXAMPLE OF PARTIAL MATCHING USING PCRE2TEST
 | |
| 
 | |
|        If  the  partial_soft  (or  ps) modifier is present on a pcre2test data
 | |
|        line, the PCRE2_PARTIAL_SOFT option is used for the match.  Here  is  a
 | |
|        run of pcre2test that uses the date example quoted above:
 | |
| 
 | |
|            re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
 | |
|          data> 25jun04\=ps
 | |
|           0: 25jun04
 | |
|           1: jun
 | |
|          data> 25dec3\=ps
 | |
|          Partial match: 25dec3
 | |
|          data> 3ju\=ps
 | |
|          Partial match: 3ju
 | |
|          data> 3juj\=ps
 | |
|          No match
 | |
|          data> j\=ps
 | |
|          No match
 | |
| 
 | |
|        The  first  data  string  is matched completely, so pcre2test shows the
 | |
|        matched substrings. The remaining four strings do not  match  the  com-
 | |
|        plete pattern, but the first two are partial matches. Similar output is
 | |
|        obtained if DFA matching is used.
 | |
| 
 | |
|        If the partial_hard (or ph) modifier is present  on  a  pcre2test  data
 | |
|        line, the PCRE2_PARTIAL_HARD option is set for the match.
 | |
| 
 | |
| 
 | |
| MULTI-SEGMENT MATCHING WITH pcre2_dfa_match()
 | |
| 
 | |
|        When  a  partial match has been found using a DFA matching function, it
 | |
|        is possible to continue the match by providing additional subject  data
 | |
|        and  calling  the function again with the same compiled regular expres-
 | |
|        sion, this time setting the PCRE2_DFA_RESTART option. You must pass the
 | |
|        same working space as before, because this is where details of the pre-
 | |
|        vious partial match are stored. Here is an example using pcre2test:
 | |
| 
 | |
|            re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
 | |
|          data> 23ja\=dfa,ps
 | |
|          Partial match: 23ja
 | |
|          data> n05\=dfa,dfa_restart
 | |
|           0: n05
 | |
| 
 | |
|        The first call has "23ja" as the subject, and requests  partial  match-
 | |
|        ing;  the  second  call  has  "n05"  as  the  subject for the continued
 | |
|        (restarted) match.  Notice that when the match is  complete,  only  the
 | |
|        last  part  is  shown;  PCRE2 does not retain the previously partially-
 | |
|        matched string. It is up to the calling program to do that if it  needs
 | |
|        to.
 | |
| 
 | |
|        That means that, for an unanchored pattern, if a continued match fails,
 | |
|        it is not possible to try again at  a  new  starting  point.  All  this
 | |
|        facility  is  capable  of  doing  is continuing with the previous match
 | |
|        attempt. In the previous example, if the second set of data  is  "ug23"
 | |
|        the  result is no match, even though there would be a match for "aug23"
 | |
|        if the entire string were given at once. Depending on the  application,
 | |
|        this may or may not be what you want.  The only way to allow for start-
 | |
|        ing again at the next character is to retain the matched  part  of  the
 | |
|        subject and try a new complete match.
 | |
| 
 | |
|        You  can  set the PCRE2_PARTIAL_SOFT or PCRE2_PARTIAL_HARD options with
 | |
|        PCRE2_DFA_RESTART to continue partial matching over multiple  segments.
 | |
|        This  facility can be used to pass very long subject strings to the DFA
 | |
|        matching functions.
 | |
| 
 | |
| 
 | |
| MULTI-SEGMENT MATCHING WITH pcre2_match()
 | |
| 
 | |
|        Unlike the DFA function, it is not possible  to  restart  the  previous
 | |
|        match with a new segment of data when using pcre2_match(). Instead, new
 | |
|        data must be added to the previous subject string, and the entire match
 | |
|        re-run,  starting from the point where the partial match occurred. Ear-
 | |
|        lier data can be discarded.
 | |
| 
 | |
|        It is best to use PCRE2_PARTIAL_HARD in this situation, because it does
 | |
|        not  treat the end of a segment as the end of the subject when matching
 | |
|        \z, \Z, \b, \B, and $. Consider  an  unanchored  pattern  that  matches
 | |
|        dates:
 | |
| 
 | |
|            re> /\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d/
 | |
|          data> The date is 23ja\=ph
 | |
|          Partial match: 23ja
 | |
| 
 | |
|        At  this stage, an application could discard the text preceding "23ja",
 | |
|        add on text from the next  segment,  and  call  the  matching  function
 | |
|        again.  Unlike  the  DFA  matching function, the entire matching string
 | |
|        must always be available, and the complete matching process occurs  for
 | |
|        each call, so more memory and more processing time are needed.
 | |
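
       The buffer handling described above might be sketched as follows. The
       code does not call PCRE2; retain_and_append() is a hypothetical helper,
       and keep_from is assumed to be the start offset of the partial match
       (the first ovector element after PCRE2_ERROR_PARTIAL), which the caller
       has already obtained.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: discard buf[0..keep_from), slide the retained tail
   to the front of the buffer, then append the next segment.
   Returns the new length, or (size_t)-1 if the arguments are inconsistent
   or the result would not fit in cap bytes. */
size_t retain_and_append(char *buf, size_t len, size_t cap,
                         size_t keep_from,
                         const char *segment, size_t seglen)
{
    size_t kept;

    if (len > cap || keep_from > len) return (size_t)-1;
    kept = len - keep_from;
    if (seglen > cap - kept) return (size_t)-1;

    memmove(buf, buf + keep_from, kept);   /* regions may overlap */
    memcpy(buf + kept, segment, seglen);
    return kept + seglen;
}
```

       For the example above, a buffer holding "The date is 23ja" with
       keep_from set to 12 and a next segment of "n05 ok" ends up holding
       "23jan05 ok", which can then be passed to the matching function again.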
| 
 | |
| 
 | |
| ISSUES WITH MULTI-SEGMENT MATCHING
 | |
| 
 | |
|        Certain types of pattern may give problems with multi-segment matching,
 | |
|        whichever matching function is used.
 | |
| 
 | |
|        1. If the pattern contains a test for the beginning of a line, you need
 | |
|        to  pass  the  PCRE2_NOTBOL option when the subject string for any call
 | |
|        does not start at the beginning of a line. There is also a PCRE2_NOTEOL
 | |
|        option, but in practice when doing multi-segment matching you should be
 | |
|        using PCRE2_PARTIAL_HARD, which includes the effect of PCRE2_NOTEOL.
 | |
| 
 | |
|        2. If a pattern contains a lookbehind assertion, characters  that  pre-
 | |
|        cede  the start of the partial match may have been inspected during the
 | |
|        matching process.  When using pcre2_match(), sufficient characters must
 | |
|        be  retained  for  the  next  match attempt. You can ensure that enough
 | |
|        characters are retained by doing the following:
 | |
| 
 | |
|        Before doing any matching, find the length of the longest lookbehind in
 | |
|        the     pattern    by    calling    pcre2_pattern_info()    with    the
 | |
|        PCRE2_INFO_MAXLOOKBEHIND option. Note that the resulting  count  is  in
 | |
|        characters, not code units. After a partial match, moving back from the
 | |
|        ovector[0] offset in the subject by the number of characters given  for
 | |
|        the  maximum lookbehind gets you to the earliest character that must be
 | |
|        retained. In a non-UTF or a 32-bit situation, moving  back  is  just  a
 | |
|        subtraction,  but in UTF-8 or UTF-16 you have to count characters while
 | |
|        moving back through the code units.
 | |
| 
 | |
|        Characters before the point you have now reached can be discarded,  and
 | |
|        after  the  next segment has been added to what is retained, you should
 | |
|        run the next match with the startoffset argument set so that the  match
 | |
|        begins at the same point as before.
 | |
| 
 | |
|        For  example, if the pattern "(?<=123)abc" is partially matched against
 | |
|        the string "xx123ab", the ovector offsets are 5 and 7 ("ab"). The maxi-
 | |
|        mum  lookbehind  count  is  3, so all characters before offset 2 can be
 | |
|        discarded. The value of startoffset for the next  match  should  be  3.
 | |
|        When  pcre2test  displays  a partial match, it indicates the lookbehind
 | |
|        characters with '<' characters:
 | |
| 
 | |
|            re> "(?<=123)abc"
 | |
|          data> xx123ab\=ph
 | |
|          Partial match: 123ab
 | |
|                         <<<
 | |
| 
 | |
|        3. Because a partial match must always contain at least one  character,
 | |
|        what  might  be  considered a partial match of an empty string actually
 | |
|        gives a "no match" result. For example:
 | |
| 
 | |
|            re> /c(?<=abc)x/
 | |
|          data> ab\=ps
 | |
|          No match
 | |
| 
 | |
|        If the next segment begins "cx", a match should be found, but this will
 | |
|        only  happen  if characters from the previous segment are retained. For
 | |
|        this reason, a "no match" result  should  be  interpreted  as  "partial
 | |
|        match of an empty string" when the pattern contains lookbehinds.
 | |
| 
 | |
|        4.  Matching  a subject string that is split into multiple segments may
 | |
|        not always produce exactly the same result as matching over one  single
 | |
|        long  string,  especially  when PCRE2_PARTIAL_SOFT is used. The section
 | |
|        "Partial Matching and Word Boundaries" above describes  an  issue  that
 | |
|        arises  if  the  pattern ends with \b or \B. Another kind of difference
 | |
|        may occur when there are multiple matching possibilities, because  (for
 | |
|        PCRE2_PARTIAL_SOFT) a partial match result is given only when there are
 | |
|        no completed matches. This means that as soon as the shortest match has
 | |
|        been  found,  continuation to a new subject segment is no longer possi-
 | |
|        ble. Consider this pcre2test example:
 | |
| 
 | |
|            re> /dog(sbody)?/
 | |
|          data> dogsb\=ps
 | |
|           0: dog
 | |
|          data> do\=ps,dfa
 | |
|          Partial match: do
 | |
|          data> gsb\=ps,dfa,dfa_restart
 | |
|           0: g
 | |
|          data> dogsbody\=dfa
 | |
|           0: dogsbody
 | |
|           1: dog
 | |
| 
 | |
|        The first data line passes the string "dogsb" to  a  standard  matching
 | |
|        function, setting the PCRE2_PARTIAL_SOFT option. Although the string is
 | |
|        a partial match for "dogsbody", the result is not  PCRE2_ERROR_PARTIAL,
 | |
|        because  the  shorter string "dog" is a complete match. Similarly, when
 | |
|        the subject is presented to a DFA matching function  in  several  parts
 | |
|        ("do"  and  "gsb"  being  the first two) the match stops when "dog" has
 | |
|        been found, and it is not possible to continue.  On the other hand,  if
 | |
|        "dogsbody"  is  presented  as  a single string, a DFA matching function
 | |
|        finds both matches.
 | |
| 
 | |
|        Because of these problems, it is best to  use  PCRE2_PARTIAL_HARD  when
 | |
|        matching  multi-segment  data.  The  example above then behaves differ-
 | |
|        ently:
 | |
| 
 | |
|            re> /dog(sbody)?/
 | |
|          data> dogsb\=ph
 | |
|          Partial match: dogsb
 | |
|          data> do\=ph,dfa
 | |
|          Partial match: do
 | |
|          data> gsb\=ph,dfa,dfa_restart
 | |
|          Partial match: gsb
 | |
| 
 | |
|        5. Patterns that contain alternatives at the top level which do not all
 | |
|        start  with  the  same  pattern  item  may  not  work  as expected when
 | |
|        PCRE2_DFA_RESTART is used. For example, consider this pattern:
 | |
| 
 | |
|          1234|3789
 | |
| 
 | |
|        If the first part of the subject is "ABC123", a partial  match  of  the
 | |
|        first  alternative  is found at offset 3. There is no partial match for
 | |
|        the second alternative, because such a match does not start at the same
 | |
|        point  in  the  subject  string. Attempting to continue with the string
 | |
|        "7890" does not yield a match  because  only  those  alternatives  that
 | |
|        match  at  one  point in the subject are remembered. The problem arises
 | |
|        because the start of the second alternative matches  within  the  first
 | |
|        alternative.  There  is  no  problem with anchored patterns or patterns
 | |
|        such as:
 | |
| 
 | |
|          1234|ABCD
 | |
| 
 | |
|        where no string can be a partial match for both alternatives.  This  is
 | |
|        not  a  problem  if  a  standard matching function is used, because the
 | |
|        entire match has to be rerun each time:
 | |
| 
 | |
|            re> /1234|3789/
 | |
|          data> ABC123\=ph
 | |
|          Partial match: 123
 | |
|          data> 1237890
 | |
|           0: 3789
 | |
| 
 | |
|        Of course, instead of using PCRE2_DFA_RESTART, the  same  technique  of
 | |
|        re-running  the  entire  match  can  also be used with the DFA matching
 | |
|        function. Another possibility is to work with two buffers. If a partial
 | |
|        match  at  offset  n in the first buffer is followed by "no match" when
 | |
|        PCRE2_DFA_RESTART is used on the second buffer, you can then try a  new
 | |
|        match starting at offset n+1 in the first buffer.
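
       For item 2 above, the step of "moving back" by the maximum lookbehind
       in a UTF-8 subject might look like the sketch below. It does not call
       PCRE2: utf8_retain_offset() is a hypothetical helper, match_start is
       assumed to be ovector[0] from the partial match, max_lookbehind is
       assumed to be the value obtained with PCRE2_INFO_MAXLOOKBEHIND, and the
       subject is assumed to be valid UTF-8.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: starting at code unit offset match_start in a valid
   UTF-8 subject, move back by up to max_lookbehind characters and return
   the offset of the earliest code unit that must be retained. */
size_t utf8_retain_offset(const uint8_t *subject, size_t match_start,
                          uint32_t max_lookbehind)
{
    size_t pos = match_start;

    while (max_lookbehind > 0 && pos > 0) {
        pos--;
        /* Skip continuation bytes (0b10xxxxxx) to reach the lead byte. */
        while (pos > 0 && (subject[pos] & 0xC0) == 0x80) pos--;
        max_lookbehind--;
    }
    return pos;
}
```

       For the subject "xx123ab" with match_start 5 and a maximum lookbehind
       of 3, the function returns 2. The startoffset for the next match is
       then match_start minus the returned value, which is 3 in that example.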
 | |
| 
 | |
| 
 | |
| AUTHOR
 | |
| 
 | |
|        Philip Hazel
 | |
|        University Computing Service
 | |
|        Cambridge, England.
 | |
| 
 | |
| 
 | |
| REVISION
 | |
| 
 | |
|        Last updated: 22 December 2014
 | |
|        Copyright (c) 1997-2014 University of Cambridge.
 | |
| ------------------------------------------------------------------------------
 | |
| 
 | |
| 
 | |
| PCRE2UNICODE(3)            Library Functions Manual            PCRE2UNICODE(3)
 | |
| 
 | |
| 
 | |
| 
 | |
| NAME
 | |
|        PCRE2 - Perl-compatible regular expressions (revised API)
 | |
| 
 | |
| UNICODE AND UTF SUPPORT
 | |
| 
 | |
|        When PCRE2 is built with Unicode support (which is the default), it has
 | |
|        knowledge of Unicode character properties and can process text  strings
 | |
|        in  UTF-8, UTF-16, or UTF-32 format (depending on the code unit width).
 | |
|        However, by default, PCRE2 assumes that one code unit is one character.
 | |
|        To  process  a  pattern  as a UTF string, where a character may require
 | |
|        more than one  code  unit,  you  must  call  pcre2_compile()  with  the
 | |
|        PCRE2_UTF  option  flag,  or  the  pattern must start with the sequence
 | |
|        (*UTF). When either of these is the case, both the pattern and any sub-
 | |
|        ject  strings  that  are  matched against it are treated as UTF strings
 | |
|        instead of strings of individual one-code-unit characters.
 | |
| 
 | |
|        If you do not need Unicode support you can build PCRE2 without  it,  in
 | |
|        which case the library will be smaller.
 | |
| 
 | |
| 
 | |
| UNICODE PROPERTY SUPPORT
 | |
| 
 | |
|        When  PCRE2 is built with Unicode support, the escape sequences \p{..},
 | |
|        \P{..}, and \X can be used. The Unicode properties that can  be  tested
 | |
|        are  limited to the general category properties such as Lu for an upper
 | |
|        case letter or Nd for a decimal number, the Unicode script  names  such
 | |
|        as Arabic or Han, and the derived properties Any and L&. Full lists are
 | |
|        given in the pcre2pattern and pcre2syntax documentation. Only the short
 | |
|        names  for  properties are supported. For example, \p{L} matches a let-
 | |
|        ter. Its Perl synonym, \p{Letter}, is not supported.   Furthermore,  in
 | |
|        Perl, many properties may optionally be prefixed by "Is", for
 | |
|        compatibility with Perl 5.6. PCRE2 does not support this.
 | |
| 
 | |
| 
 | |
| WIDE CHARACTERS AND UTF MODES
 | |
| 
 | |
|        Codepoints less than 256 can be specified in patterns by either  braced
 | |
|        or unbraced hexadecimal escape sequences (for example, \x{b3} or \xb3).
 | |
|        Larger values have to use braced sequences. Unbraced octal code  points
 | |
|        up to \777 are also recognized; larger ones can be coded using \o{...}.
 | |
| 
 | |
|        In  UTF modes, repeat quantifiers apply to complete UTF characters, not
 | |
|        to individual code units.
 | |
| 
 | |
|        In UTF modes, the dot metacharacter matches one UTF  character  instead
 | |
|        of a single code unit.
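
       The distinction between characters and code units can be made concrete
       with a small C sketch. It does not call PCRE2; utf8_count_chars() is a
       hypothetical helper and assumes valid UTF-8 input.

```c
#include <stddef.h>
#include <stdint.h>

/* Count characters in a valid UTF-8 string of len code units by counting
   every byte that is not a continuation byte (0b10xxxxxx). */
size_t utf8_count_chars(const uint8_t *s, size_t len)
{
    size_t count = 0;

    for (size_t i = 0; i < len; i++)
        if ((s[i] & 0xC0) != 0x80) count++;   /* lead bytes only */
    return count;
}
```

       The five code units 0x63 0x61 0x66 0xc3 0xa9 form four characters. In
       a UTF mode a dot or a quantified item therefore sees four things to
       match in that string, whereas in non-UTF mode it sees five.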
 | |
| 
 | |
|        The  escape  sequence  \C can be used to match a single code unit, in a
 | |
|        UTF mode, but its use can lead  to  some  strange  effects  because  it
 | |
|        breaks  up  multi-unit  characters  (see  the  description of \C in the
 | |
|        pcre2pattern documentation). The use of \C  is  not  supported  in  the
 | |
|        alternative matching function pcre2_dfa_match(), nor is it supported in
 | |
|        UTF mode by the JIT optimization. If JIT optimization is requested  for
 | |
|        a  UTF pattern that contains \C, it will not succeed, and so the match-
 | |
|        ing will be carried out by the normal interpretive function.
 | |
| 
 | |
|        The character escapes \b, \B, \d, \D, \s, \S, \w, and \W correctly test
 | |
|        characters  of  any  code  value,  but, by default, the characters that
 | |
|        PCRE2 recognizes as digits, spaces, or word characters remain the  same
 | |
|        set  as  in  non-UTF  mode,  all  with  code points less than 256. This
 | |
|        remains true even when PCRE2  is  built  to  include  Unicode  support,
 | |
|        because  to do otherwise would slow down matching in many common cases.
 | |
|        Note that this also applies to \b and \B, because they are  defined  in
 | |
|        terms  of  \w  and  \W.  If you want to test for a wider sense of, say,
 | |
|        "digit", you can use explicit Unicode property tests  such  as  \p{Nd}.
 | |
|        Alternatively,  if you set the PCRE2_UCP option, the way that the char-
 | |
|        acter escapes work is changed so that Unicode properties  are  used  to
 | |
|        determine which characters match. There are more details in the section
 | |
|        on generic character types in the pcre2pattern documentation.
 | |
| 
 | |
|        Similarly, characters that match the POSIX named character classes  are
 | |
|        all low-valued characters, unless the PCRE2_UCP option is set.
 | |
| 
 | |
|        However,  the  special  horizontal  and  vertical  white space matching
 | |
|        escapes (\h, \H, \v, and \V) do match all the appropriate Unicode char-
 | |
|        acters, whether or not PCRE2_UCP is set.
 | |
| 
 | |
|        Case-insensitive  matching in UTF mode makes use of Unicode properties.
 | |
|        A few Unicode characters such as Greek sigma have more than  two  code-
 | |
|        points that are case-equivalent, and these are treated as such.
 | |
| 
 | |
| 
 | |
| VALIDITY OF UTF STRINGS
 | |
| 
 | |
|        When  the  PCRE2_UTF  option is set, the strings passed as patterns and
 | |
|        subjects are (by default) checked for validity on entry to the relevant
 | |
|        functions.  If an invalid UTF string is passed, a negative error code
 | |
|        is returned. The code unit offset to the  offending  character  can  be
 | |
|        extracted  from  the match data block by calling pcre2_get_startchar(),
 | |
|        which is used for this purpose after a UTF error.
 | |
| 
 | |
|        UTF-16 and UTF-32 strings can indicate their endianness by a special code
 | |
|        known as a byte-order mark (BOM). The PCRE2 functions do not handle
 | |
|        this, expecting strings to be in host byte order.
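
       Because BOM handling is left to the caller, an application might
       normalize a UTF-16 buffer before passing it on. The following is a
       minimal sketch, not part of PCRE2: utf16_normalize_bom() is a
       hypothetical helper, and it assumes that a buffer without a BOM is
       already in host byte order.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: inspect a possible BOM (U+FEFF) at the start of a
   UTF-16 buffer. If the first unit reads as 0xFFFE, the data is in the
   opposite byte order, so every code unit is byte-swapped in place.
   Returns the number of leading code units (0 or 1) the caller should
   skip before handing the string to a matching function. */
size_t utf16_normalize_bom(uint16_t *s, size_t len)
{
    if (len == 0) return 0;
    if (s[0] == 0xFEFF) return 1;          /* BOM, already in host order */
    if (s[0] != 0xFFFE) return 0;          /* no BOM: leave data as it is */

    for (size_t i = 0; i < len; i++)
        s[i] = (uint16_t)((s[i] << 8) | (s[i] >> 8));
    return 1;
}
```

       The caller would then pass s plus the returned count, and len minus
       the returned count, as the subject and its length.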
 | |
| 
 | |
|        The entire string is checked before any other processing  takes  place.
 | |
|        In  addition  to checking the format of the string, there is a check to
 | |
|        ensure that all code points lie in the range U+0 to U+10FFFF, excluding
 | |
|        the  surrogate area.  The so-called "non-character" code points are not
 | |
|        excluded because Unicode corrigendum #9 makes it clear that they should
 | |
|        not be.
 | |
| 
 | |
|        Characters  in  the "Surrogate Area" of Unicode are reserved for use by
 | |
|        UTF-16, where they are used in pairs to encode code points with  values
 | |
|        greater  than  0xFFFF. The code points that are encoded by UTF-16 pairs
 | |
|        are available independently in the  UTF-8  and  UTF-32  encodings.  (In
 | |
|        other  words,  the  whole  surrogate  thing is a fudge for UTF-16 which
 | |
|        unfortunately messes up UTF-8 and UTF-32.)
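
       The arithmetic behind a surrogate pair can be written out in a few
       lines of C. This is a sketch of the standard UTF-16 decoding rule, not
       PCRE2 code; utf16_decode_pair() is a hypothetical name, and the return
       value 0 is used here as a "not a valid pair" marker.

```c
#include <stdint.h>

/* Combine a UTF-16 high surrogate (0xD800-0xDBFF) and low surrogate
   (0xDC00-0xDFFF) into the code point they encode, which is always in
   the range 0x10000 to 0x10FFFF. Returns 0 if the pair is not valid. */
uint32_t utf16_decode_pair(uint16_t high, uint16_t low)
{
    if (high < 0xD800 || high > 0xDBFF) return 0;
    if (low < 0xDC00 || low > 0xDFFF) return 0;

    /* Ten payload bits come from each surrogate. */
    return 0x10000u + (((uint32_t)(high - 0xD800) << 10) |
                       (uint32_t)(low - 0xDC00));
}
```

       For example, the pair 0xD83D 0xDE00 decodes to 0x1F600.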
 | |
| 
 | |
|        In some situations, you may already know that your strings  are  valid,
 | |
|        and  therefore  want  to  skip these checks in order to improve perfor-
 | |
|        mance, for example in the case of a long subject string that  is  being
 | |
|        scanned  repeatedly.   If you set the PCRE2_NO_UTF_CHECK option at com-
 | |
|        pile time or at match time, PCRE2 assumes that the pattern  or  subject
 | |
|        it is given (respectively) contains only valid UTF code unit sequences.
 | |
| 
 | |
|        Passing  PCRE2_NO_UTF_CHECK  to pcre2_compile() just disables the check
 | |
|        for the pattern; it does not also apply to subject strings. If you want
 | |
|        to  disable the check for a subject string you must pass this option to
 | |
|        pcre2_match() or pcre2_dfa_match().
 | |
| 
 | |
|        If you pass an invalid UTF string when PCRE2_NO_UTF_CHECK is  set,  the
 | |
|        result is undefined and your program may crash or loop indefinitely.
 | |
| 
 | |
|    Errors in UTF-8 strings
 | |
| 
 | |
|        The following negative error codes are given for invalid UTF-8 strings:
 | |
| 
 | |
|          PCRE2_ERROR_UTF8_ERR1
 | |
|          PCRE2_ERROR_UTF8_ERR2
 | |
|          PCRE2_ERROR_UTF8_ERR3
 | |
|          PCRE2_ERROR_UTF8_ERR4
 | |
|          PCRE2_ERROR_UTF8_ERR5
 | |
| 
 | |
|        The  string  ends  with a truncated UTF-8 character; the code specifies
 | |
|        how many bytes are missing (1 to 5). Although RFC 3629 restricts  UTF-8
 | |
|        characters  to  be  no longer than 4 bytes, the encoding scheme (origi-
 | |
|        nally defined by RFC 2279) allows for  up  to  6  bytes,  and  this  is
 | |
|        checked first; hence the possibility of 4 or 5 missing bytes.
 | |
| 
 | |
|          PCRE2_ERROR_UTF8_ERR6
 | |
|          PCRE2_ERROR_UTF8_ERR7
 | |
|          PCRE2_ERROR_UTF8_ERR8
 | |
|          PCRE2_ERROR_UTF8_ERR9
 | |
|          PCRE2_ERROR_UTF8_ERR10
 | |
| 
 | |
|        The two most significant bits of the 2nd, 3rd, 4th, 5th, or 6th byte of
 | |
|        the character do not have the binary value 0b10 (that  is,  either  the
 | |
|        most significant bit is 0, or the next bit is 1).
 | |
| 
 | |
|          PCRE2_ERROR_UTF8_ERR11
 | |
|          PCRE2_ERROR_UTF8_ERR12
 | |
| 
 | |
|        A  character that is valid by the RFC 2279 rules is either 5 or 6 bytes
 | |
|        long; these code points are excluded by RFC 3629.
 | |
| 
 | |
|          PCRE2_ERROR_UTF8_ERR13
 | |
| 
 | |
|        A 4-byte character has a value greater than 0x10ffff; these code points
 | |
|        are excluded by RFC 3629.
 | |
| 
 | |
|          PCRE2_ERROR_UTF8_ERR14
 | |
| 
 | |
|        A  3-byte  character  has  a  value in the range 0xd800 to 0xdfff; this
 | |
|        range of code points is reserved by RFC 3629 for use with  UTF-16,  and
 | |
|        so are excluded from UTF-8.
 | |
| 
 | |
|          PCRE2_ERROR_UTF8_ERR15
 | |
|          PCRE2_ERROR_UTF8_ERR16
 | |
|          PCRE2_ERROR_UTF8_ERR17
 | |
|          PCRE2_ERROR_UTF8_ERR18
 | |
|          PCRE2_ERROR_UTF8_ERR19
 | |
| 
 | |
|        A  2-, 3-, 4-, 5-, or 6-byte character is "overlong", that is, it codes
 | |
|        for a value that can be represented by fewer bytes, which  is  invalid.
 | |
|        For  example,  the two bytes 0xc0, 0xae give the value 0x2e, whose cor-
 | |
|        rect coding uses just one byte.
 | |
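The overlong example above can be worked through in a short sketch. The functions utf8_decode2() and utf8_is_overlong() are hypothetical names chosen for illustration, not PCRE2 functions; the minimum values in the table follow from the number of payload bits that each shorter encoding can hold.

```c
#include <stdint.h>

/* Hypothetical helper: decode a 2-byte sequence (110xxxxx 10xxxxxx) by
   taking 5 payload bits from the first byte and 6 from the second. */
static uint32_t utf8_decode2(unsigned char b0, unsigned char b1)
{
    return ((uint32_t)(b0 & 0x1f) << 6) | (uint32_t)(b1 & 0x3f);
}

/* Hypothetical helper: an nbytes-long encoding is overlong when the value
   is smaller than the smallest value that needs that many bytes. */
static int utf8_is_overlong(uint32_t value, int nbytes)
{
    static const uint32_t min_value[7] = {
        0, 0, 0x80, 0x800, 0x10000, 0x200000, 0x4000000
    };
    if (nbytes < 2 || nbytes > 6) return 0;
    return value < min_value[nbytes];
}
```

With these definitions, utf8_decode2(0xc0, 0xae) yields 0x2e, and utf8_is_overlong(0x2e, 2) is true because any value below 0x80 fits in a single byte.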
| 
 | |
|          PCRE2_ERROR_UTF8_ERR20
 | |
| 
 | |
|        The two most significant bits of the first byte of a character have the
 | |
|        binary  value 0b10 (that is, the most significant bit is 1 and the sec-
 | |
|        ond is 0). Such a byte can only validly occur as the second  or  subse-
 | |
|        quent byte of a multi-byte character.
 | |
| 
 | |
|          PCRE2_ERROR_UTF8_ERR21
 | |
| 
 | |
|        The  first byte of a character has the value 0xfe or 0xff. These values
 | |
|        can never occur in a valid UTF-8 string.
 | |
| 
 | |
|    Errors in UTF-16 strings
 | |
| 
 | |
|        The following  negative  error  codes  are  given  for  invalid  UTF-16
 | |
|        strings:
 | |
| 
 | |
|          PCRE2_ERROR_UTF16_ERR1  Missing low surrogate at end of string
 | |
|          PCRE2_ERROR_UTF16_ERR2  Invalid low surrogate follows high surrogate
 | |
|          PCRE2_ERROR_UTF16_ERR3  Isolated low surrogate
 | |
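The three UTF-16 conditions listed above can be sketched as a single scan over the code units. The function utf16_check() is hypothetical and is not how PCRE2 itself is implemented; its return values 1, 2, and 3 are illustrative stand-ins that mirror the order of the list, not the library's actual negative error codes.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical checker (not a PCRE2 function). Returns 0 for a valid
   string, or an illustrative code:
     1  missing low surrogate at end of string
     2  invalid low surrogate follows high surrogate
     3  isolated low surrogate */
static int utf16_check(const uint16_t *s, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        uint16_t c = s[i];
        if (c >= 0xdc00 && c <= 0xdfff) return 3;   /* low with no high */
        if (c >= 0xd800 && c <= 0xdbff) {           /* high surrogate */
            if (i + 1 == len) return 1;
            if (s[i + 1] < 0xdc00 || s[i + 1] > 0xdfff) return 2;
            i++;                                    /* skip the valid low */
        }
    }
    return 0;
}
```

A low surrogate is reported as isolated only when it is met at the start of a character; one that correctly follows a high surrogate is consumed together with it.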
| 
 | |
| 
 | |
|    Errors in UTF-32 strings
 | |
| 
 | |
|        The  following  negative  error  codes  are  given  for  invalid UTF-32
 | |
|        strings:
 | |
| 
 | |
|          PCRE2_ERROR_UTF32_ERR1  Surrogate character (0xd800 to 0xdfff)
 | |
|          PCRE2_ERROR_UTF32_ERR2  Code point is greater than 0x10ffff
 | |
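Because every UTF-32 code unit is a complete code point, the two conditions above reduce to a range test on each unit. The following is a minimal sketch; utf32_check() is a hypothetical name, and its return values 1 and 2 are illustrative stand-ins that mirror the order of the list rather than the library's actual negative error codes.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical checker (not a PCRE2 function). Returns 0 for a valid
   string, 1 for a surrogate code point, 2 for a value above 0x10ffff. */
static int utf32_check(const uint32_t *s, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if (s[i] >= 0xd800 && s[i] <= 0xdfff) return 1;
        if (s[i] > 0x10ffff) return 2;
    }
    return 0;
}
```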
| 
 | |
| 
 | |
| AUTHOR
 | |
| 
 | |
|        Philip Hazel
 | |
|        University Computing Service
 | |
|        Cambridge, England.
 | |
| 
 | |
| 
 | |
| REVISION
 | |
| 
 | |
|        Last updated: 23 November 2014
 | |
|        Copyright (c) 1997-2014 University of Cambridge.
 | |
| ------------------------------------------------------------------------------
 | |
| 
 | |
| 
 | 
