Added bundled PCRE2 library.

2015-10-20 14:06:04 +03:00
parent 91bb3b288c
commit ee29e85016
333 changed files with 274569 additions and 0 deletions
--- a/pcre2/doc/pcre2syntax.3
+++ b/pcre2/doc/pcre2syntax.3
@ -0,0 +1,575 @@
+.TH PCRE2SYNTAX 3 "13 June 2015" "PCRE2 10.20"
+.SH NAME
+PCRE2 - Perl-compatible regular expressions (revised API)
+.SH "PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY"
+.rs
+.sp
+The full syntax and semantics of the regular expressions that are supported by
+PCRE2 are described in the
+.\" HREF
+\fBpcre2pattern\fP
+.\"
+documentation. This document contains a quick-reference summary of the syntax.
+.
+.
+.SH "QUOTING"
+.rs
+.sp
+  \ex         where x is non-alphanumeric is a literal x
+  \eQ...\eE    treat enclosed characters as literal
+.
+.
+.SH "ESCAPED CHARACTERS"
+.rs
+.sp
+This table applies to ASCII and Unicode environments.
+.sp
+  \ea         alarm, that is, the BEL character (hex 07)
+  \ecx        "control-x", where x is any ASCII printing character
+  \ee         escape (hex 1B)
+  \ef         form feed (hex 0C)
+  \en         newline (hex 0A)
+  \er         carriage return (hex 0D)
+  \et         tab (hex 09)
+  \e0dd       character with octal code 0dd
+  \eddd       character with octal code ddd, or backreference
+  \eo{ddd..}  character with octal code ddd..
+  \eU         "U" if PCRE2_ALT_BSUX is set (otherwise is an error)
+  \euhhhh     character with hex code hhhh (if PCRE2_ALT_BSUX is set)
+  \exhh       character with hex code hh
+  \ex{hhh..}  character with hex code hhh..
+.sp
+Note that \e0dd is always an octal code. The treatment of backslash followed by
+a non-zero digit is complicated; for details see the section
+.\" HTML <a href="pcre2pattern.html#digitsafterbackslash">
+.\" </a>
+"Non-printing characters"
+.\"
+in the
+.\" HREF
+\fBpcre2pattern\fP
+.\"
+documentation, where details of escape processing in EBCDIC environments are
+also given.
+.P
+When \ex is not followed by {, from zero to two hexadecimal digits are read,
+but if PCRE2_ALT_BSUX is set, \ex must be followed by two hexadecimal digits to
+be recognized as a hexadecimal escape; otherwise it matches a literal "x".
+Likewise, if \eu (in ALT_BSUX mode) is not followed by four hexadecimal digits,
+it matches a literal "u".
+.
+.
+.SH "CHARACTER TYPES"
+.rs
+.sp
+  .          any character except newline;
+               in dotall mode, any character whatsoever
+  \eC         one code unit, even in UTF mode (best avoided)
+  \ed         a decimal digit
+  \eD         a character that is not a decimal digit
+  \eh         a horizontal white space character
+  \eH         a character that is not a horizontal white space character
+  \eN         a character that is not a newline
+  \ep{\fIxx\fP}     a character with the \fIxx\fP property
+  \eP{\fIxx\fP}     a character without the \fIxx\fP property
+  \eR         a newline sequence
+  \es         a white space character
+  \eS         a character that is not a white space character
+  \ev         a vertical white space character
+  \eV         a character that is not a vertical white space character
+  \ew         a "word" character
+  \eW         a "non-word" character
+  \eX         a Unicode extended grapheme cluster
+.sp
+The application can lock out the use of \eC by setting the
+PCRE2_NEVER_BACKSLASH_C option. It is dangerous because it may leave the
+current matching point in the middle of a UTF-8 or UTF-16 character.
+.P
+By default, \ed, \es, and \ew match only ASCII characters, even in UTF-8 mode
+or in the 16-bit and 32-bit libraries. However, if locale-specific matching is
+happening, \es and \ew may also match characters with code points in the range
+128-255. If the PCRE2_UCP option is set, the behaviour of these escape
+sequences is changed to use Unicode properties and they match many more
+characters.
+.
+.
+.SH "GENERAL CATEGORY PROPERTIES FOR \ep and \eP"
+.rs
+.sp
+  C          Other
+  Cc         Control
+  Cf         Format
+  Cn         Unassigned
+  Co         Private use
+  Cs         Surrogate
+.sp
+  L          Letter
+  Ll         Lower case letter
+  Lm         Modifier letter
+  Lo         Other letter
+  Lt         Title case letter
+  Lu         Upper case letter
+  L&         Ll, Lu, or Lt
+.sp
+  M          Mark
+  Mc         Spacing mark
+  Me         Enclosing mark
+  Mn         Non-spacing mark
+.sp
+  N          Number
+  Nd         Decimal number
+  Nl         Letter number
+  No         Other number
+.sp
+  P          Punctuation
+  Pc         Connector punctuation
+  Pd         Dash punctuation
+  Pe         Close punctuation
+  Pf         Final punctuation
+  Pi         Initial punctuation
+  Po         Other punctuation
+  Ps         Open punctuation
+.sp
+  S          Symbol
+  Sc         Currency symbol
+  Sk         Modifier symbol
+  Sm         Mathematical symbol
+  So         Other symbol
+.sp
+  Z          Separator
+  Zl         Line separator
+  Zp         Paragraph separator
+  Zs         Space separator
+.
+.
+.SH "PCRE2 SPECIAL CATEGORY PROPERTIES FOR \ep and \eP"
+.rs
+.sp
+  Xan        Alphanumeric: union of properties L and N
+  Xps        POSIX space: property Z or tab, NL, VT, FF, CR
+  Xsp        Perl space: property Z or tab, NL, VT, FF, CR
+  Xuc        Univerally-named character: one that can be
+               represented by a Universal Character Name
+  Xwd        Perl word: property Xan or underscore
+.sp
+Perl and POSIX space are now the same. Perl added VT to its space character set
+at release 5.18.
+.
+.
+.SH "SCRIPT NAMES FOR \ep AND \eP"
+.rs
+.sp
+Arabic,
+Armenian,
+Avestan,
+Balinese,
+Bamum,
+Bassa_Vah,
+Batak,
+Bengali,
+Bopomofo,
+Brahmi,
+Braille,
+Buginese,
+Buhid,
+Canadian_Aboriginal,
+Carian,
+Caucasian_Albanian,
+Chakma,
+Cham,
+Cherokee,
+Common,
+Coptic,
+Cuneiform,
+Cypriot,
+Cyrillic,
+Deseret,
+Devanagari,
+Duployan,
+Egyptian_Hieroglyphs,
+Elbasan,
+Ethiopic,
+Georgian,
+Glagolitic,
+Gothic,
+Grantha,
+Greek,
+Gujarati,
+Gurmukhi,
+Han,
+Hangul,
+Hanunoo,
+Hebrew,
+Hiragana,
+Imperial_Aramaic,
+Inherited,
+Inscriptional_Pahlavi,
+Inscriptional_Parthian,
+Javanese,
+Kaithi,
+Kannada,
+Katakana,
+Kayah_Li,
+Kharoshthi,
+Khmer,
+Khojki,
+Khudawadi,
+Lao,
+Latin,
+Lepcha,
+Limbu,
+Linear_A,
+Linear_B,
+Lisu,
+Lycian,
+Lydian,
+Mahajani,
+Malayalam,
+Mandaic,
+Manichaean,
+Meetei_Mayek,
+Mende_Kikakui,
+Meroitic_Cursive,
+Meroitic_Hieroglyphs,
+Miao,
+Modi,
+Mongolian,
+Mro,
+Myanmar,
+Nabataean,
+New_Tai_Lue,
+Nko,
+Ogham,
+Ol_Chiki,
+Old_Italic,
+Old_North_Arabian,
+Old_Permic,
+Old_Persian,
+Old_South_Arabian,
+Old_Turkic,
+Oriya,
+Osmanya,
+Pahawh_Hmong,
+Palmyrene,
+Pau_Cin_Hau,
+Phags_Pa,
+Phoenician,
+Psalter_Pahlavi,
+Rejang,
+Runic,
+Samaritan,
+Saurashtra,
+Sharada,
+Shavian,
+Siddham,
+Sinhala,
+Sora_Sompeng,
+Sundanese,
+Syloti_Nagri,
+Syriac,
+Tagalog,
+Tagbanwa,
+Tai_Le,
+Tai_Tham,
+Tai_Viet,
+Takri,
+Tamil,
+Telugu,
+Thaana,
+Thai,
+Tibetan,
+Tifinagh,
+Tirhuta,
+Ugaritic,
+Vai,
+Warang_Citi,
+Yi.
+.
+.
+.SH "CHARACTER CLASSES"
+.rs
+.sp
+  [...]       positive character class
+  [^...]      negative character class
+  [x-y]       range (can be used for hex characters)
+  [[:xxx:]]   positive POSIX named set
+  [[:^xxx:]]  negative POSIX named set
+.sp
+  alnum       alphanumeric
+  alpha       alphabetic
+  ascii       0-127
+  blank       space or tab
+  cntrl       control character
+  digit       decimal digit
+  graph       printing, excluding space
+  lower       lower case letter
+  print       printing, including space
+  punct       printing, excluding alphanumeric
+  space       white space
+  upper       upper case letter
+  word        same as \ew
+  xdigit      hexadecimal digit
+.sp
+In PCRE2, POSIX character set names recognize only ASCII characters by default,
+but some of them use Unicode properties if PCRE2_UCP is set. You can use
+\eQ...\eE inside a character class.
+.
+.
+.SH "QUANTIFIERS"
+.rs
+.sp
+  ?           0 or 1, greedy
+  ?+          0 or 1, possessive
+  ??          0 or 1, lazy
+  *           0 or more, greedy
+  *+          0 or more, possessive
+  *?          0 or more, lazy
+  +           1 or more, greedy
+  ++          1 or more, possessive
+  +?          1 or more, lazy
+  {n}         exactly n
+  {n,m}       at least n, no more than m, greedy
+  {n,m}+      at least n, no more than m, possessive
+  {n,m}?      at least n, no more than m, lazy
+  {n,}        n or more, greedy
+  {n,}+       n or more, possessive
+  {n,}?       n or more, lazy
+.
+.
+.SH "ANCHORS AND SIMPLE ASSERTIONS"
+.rs
+.sp
+  \eb          word boundary
+  \eB          not a word boundary
+  ^           start of subject
+                also after an internal newline in multiline mode
+                (after any newline if PCRE2_ALT_CIRCUMFLEX is set)
+  \eA          start of subject
+  $           end of subject
+                also before newline at end of subject
+                also before internal newline in multiline mode
+  \eZ          end of subject
+                also before newline at end of subject
+  \ez          end of subject
+  \eG          first matching position in subject
+.
+.
+.SH "MATCH POINT RESET"
+.rs
+.sp
+  \eK          reset start of match
+.sp
+\eK is honoured in positive assertions, but ignored in negative ones.
+.
+.
+.SH "ALTERNATION"
+.rs
+.sp
+  expr|expr|expr...
+.
+.
+.SH "CAPTURING"
+.rs
+.sp
+  (...)           capturing group
+  (?<name>...)    named capturing group (Perl)
+  (?'name'...)    named capturing group (Perl)
+  (?P<name>...)   named capturing group (Python)
+  (?:...)         non-capturing group
+  (?|...)         non-capturing group; reset group numbers for
+                   capturing groups in each alternative
+.
+.
+.SH "ATOMIC GROUPS"
+.rs
+.sp
+  (?>...)         atomic, non-capturing group
+.
+.
+.
+.
+.SH "COMMENT"
+.rs
+.sp
+  (?#....)        comment (not nestable)
+.
+.
+.SH "OPTION SETTING"
+.rs
+.sp
+  (?i)            caseless
+  (?J)            allow duplicate names
+  (?m)            multiline
+  (?s)            single line (dotall)
+  (?U)            default ungreedy (lazy)
+  (?x)            extended (ignore white space)
+  (?-...)         unset option(s)
+.sp
+The following are recognized only at the very start of a pattern or after one
+of the newline or \eR options with similar syntax. More than one of them may
+appear.
+.sp
+  (*LIMIT_MATCH=d) set the match limit to d (decimal number)
+  (*LIMIT_RECURSION=d) set the recursion limit to d (decimal number)
+  (*NOTEMPTY)     set PCRE2_NOTEMPTY when matching
+  (*NOTEMPTY_ATSTART) set PCRE2_NOTEMPTY_ATSTART when matching
+  (*NO_AUTO_POSSESS) no auto-possessification (PCRE2_NO_AUTO_POSSESS)
+  (*NO_DOTSTAR_ANCHOR) no .* anchoring (PCRE2_NO_DOTSTAR_ANCHOR)
+  (*NO_JIT)       disable JIT optimization
+  (*NO_START_OPT) no start-match optimization (PCRE2_NO_START_OPTIMIZE)
+  (*UTF)          set appropriate UTF mode for the library in use
+  (*UCP)          set PCRE2_UCP (use Unicode properties for \ed etc)
+.sp
+Note that LIMIT_MATCH and LIMIT_RECURSION can only reduce the value of the
+limits set by the caller of pcre2_match(), not increase them. The application
+can lock out the use of (*UTF) and (*UCP) by setting the PCRE2_NEVER_UTF or
+PCRE2_NEVER_UCP options, respectively, at compile time.
+.
+.
+.SH "NEWLINE CONVENTION"
+.rs
+.sp
+These are recognized only at the very start of the pattern or after option
+settings with a similar syntax.
+.sp
+  (*CR)           carriage return only
+  (*LF)           linefeed only
+  (*CRLF)         carriage return followed by linefeed
+  (*ANYCRLF)      all three of the above
+  (*ANY)          any Unicode newline sequence
+.
+.
+.SH "WHAT \eR MATCHES"
+.rs
+.sp
+These are recognized only at the very start of the pattern or after option
+setting with a similar syntax.
+.sp
+  (*BSR_ANYCRLF)  CR, LF, or CRLF
+  (*BSR_UNICODE)  any Unicode newline sequence
+.
+.
+.SH "LOOKAHEAD AND LOOKBEHIND ASSERTIONS"
+.rs
+.sp
+  (?=...)         positive look ahead
+  (?!...)         negative look ahead
+  (?<=...)        positive look behind
+  (?<!...)        negative look behind
+.sp
+Each top-level branch of a look behind must be of a fixed length.
+.
+.
+.SH "BACKREFERENCES"
+.rs
+.sp
+  \en              reference by number (can be ambiguous)
+  \egn             reference by number
+  \eg{n}           reference by number
+  \eg{-n}          relative reference by number
+  \ek<name>        reference by name (Perl)
+  \ek'name'        reference by name (Perl)
+  \eg{name}        reference by name (Perl)
+  \ek{name}        reference by name (.NET)
+  (?P=name)       reference by name (Python)
+.
+.
+.SH "SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)"
+.rs
+.sp
+  (?R)            recurse whole pattern
+  (?n)            call subpattern by absolute number
+  (?+n)           call subpattern by relative number
+  (?-n)           call subpattern by relative number
+  (?&name)        call subpattern by name (Perl)
+  (?P>name)       call subpattern by name (Python)
+  \eg<name>        call subpattern by name (Oniguruma)
+  \eg'name'        call subpattern by name (Oniguruma)
+  \eg<n>           call subpattern by absolute number (Oniguruma)
+  \eg'n'           call subpattern by absolute number (Oniguruma)
+  \eg<+n>          call subpattern by relative number (PCRE2 extension)
+  \eg'+n'          call subpattern by relative number (PCRE2 extension)
+  \eg<-n>          call subpattern by relative number (PCRE2 extension)
+  \eg'-n'          call subpattern by relative number (PCRE2 extension)
+.
+.
+.SH "CONDITIONAL PATTERNS"
+.rs
+.sp
+  (?(condition)yes-pattern)
+  (?(condition)yes-pattern|no-pattern)
+.sp
+  (?(n)               absolute reference condition
+  (?(+n)              relative reference condition
+  (?(-n)              relative reference condition
+  (?(<name>)          named reference condition (Perl)
+  (?('name')          named reference condition (Perl)
+  (?(name)            named reference condition (PCRE2)
+  (?(R)               overall recursion condition
+  (?(Rn)              specific group recursion condition
+  (?(R&name)          specific recursion condition
+  (?(DEFINE)          define subpattern for reference
+  (?(VERSION[>]=n.m)  test PCRE2 version
+  (?(assert)          assertion condition
+.
+.
+.SH "BACKTRACKING CONTROL"
+.rs
+.sp
+The following act immediately they are reached:
+.sp
+  (*ACCEPT)       force successful match
+  (*FAIL)         force backtrack; synonym (*F)
+  (*MARK:NAME)    set name to be passed back; synonym (*:NAME)
+.sp
+The following act only when a subsequent match failure causes a backtrack to
+reach them. They all force a match failure, but they differ in what happens
+afterwards. Those that advance the start-of-match point do so only if the
+pattern is not anchored.
+.sp
+  (*COMMIT)       overall failure, no advance of starting point
+  (*PRUNE)        advance to next starting character
+  (*PRUNE:NAME)   equivalent to (*MARK:NAME)(*PRUNE)
+  (*SKIP)         advance to current matching position
+  (*SKIP:NAME)    advance to position corresponding to an earlier
+                  (*MARK:NAME); if not found, the (*SKIP) is ignored
+  (*THEN)         local failure, backtrack to next alternation
+  (*THEN:NAME)    equivalent to (*MARK:NAME)(*THEN)
+.
+.
+.SH "CALLOUTS"
+.rs
+.sp
+  (?C)            callout (assumed number 0)
+  (?Cn)           callout with numerical data n
+  (?C"text")      callout with string data
+.sp
+The allowed string delimiters are ` ' " ^ % # $ (which are the same for the
+start and the end), and the starting delimiter { matched with the ending
+delimiter }. To encode the ending delimiter within the string, double it.
+.
+.
+.SH "SEE ALSO"
+.rs
+.sp
+\fBpcre2pattern\fP(3), \fBpcre2api\fP(3), \fBpcre2callout\fP(3),
+\fBpcre2matching\fP(3), \fBpcre2\fP(3).
+.
+.
+.SH AUTHOR
+.rs
+.sp
+.nf
+Philip Hazel
+University Computing Service
+Cambridge, England.
+.fi
+.
+.
+.SH REVISION
+.rs
+.sp
+.nf
+Last updated: 13 June 2015
+Copyright (c) 1997-2015 University of Cambridge.
+.fi