Module Re2_internal_intf

module Re2_internal_intf: sig .. end
These are OCaml bindings for Google's re2 library. Quoting from the re2 homepage:

> RE2 is a fast, safe, thread-friendly alternative to backtracking regular expression engines like those used in PCRE, Perl, and Python. It is a C++ library.

> Unlike most automata-based engines, RE2 implements almost all the common Perl and PCRE features and syntactic sugars. It also finds the leftmost-first match, the same match that Perl would, and can return submatch information. The one significant exception is that RE2 drops support for backreferences¹ and generalized zero-width assertions, because they cannot be implemented efficiently. The syntax page gives full details.

Syntax reference: https://github.com/google/re2/wiki/Syntax


module type S = sig .. end

Although OCaml strings may legally contain internal null bytes, checking for them is expensive, so this library simply assumes that it will never see such a string. The failure mode is that the search stops early, which is acceptable given how rare internal null bytes are in practice.

Strings are treated as UTF-8 by default, or as ISO 8859-1 if Options.latin1 is used.
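Callers who cannot rule out embedded null bytes can check their inputs before matching. The following is a minimal sketch using only the OCaml standard library; has_internal_null is a hypothetical helper and not part of this library.

```ocaml
(* Hypothetical helper: true if [s] contains a null byte anywhere.
   This is a linear scan over the string, which is the cost the
   library chooses not to pay on every call. *)
let has_internal_null (s : string) : bool = String.contains s '\000'

let () =
  assert (has_internal_null "abc\000def");
  assert (not (has_internal_null "abcdef"))
```

Running such a check only on untrusted input keeps the common path free of the extra scan.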

Basic Types


Subpatterns are referenced by name if labelled with the /(?P<...>...)/ syntax, or else by counting open-parens, with subpattern zero referring to the whole regex.

index_of_id t id resolves subpattern names and indices into indices.

The sub keyword argument means: omit location information for subpatterns with index greater than sub.

Subpatterns are indexed by the number of opening parentheses preceding them:

~sub:(`Index 0) : only the whole match
~sub:(`Index 1) : the whole match and the first submatch, etc.

If you only care whether the pattern matches, you can request no location information at all by passing ~sub:(`Index -1).

With one exception, I quote from re2.h:443,

> Don't ask for more match information than you will use: runs much faster with nmatch == 1 than nmatch > 1, and runs even faster if nmatch == 0.

For sub > 1, re2 executes in three steps:
1. run a DFA over the entire input to get the end of the whole match
2. run a DFA backward from the end position to get the start position
3. run an NFA from the match start to match end to extract submatches

sub == 1 lets it stop after (2) and sub == 0 lets it stop after (1). (See re2.cc:692 or so.)

The one exception is for the functions get_matches, replace, and Iterator.next: Since they must iterate correctly through the whole string, they need at least the whole match (subpattern 0). These functions will silently rewrite ~sub to be non-negative.

num_submatches t returns 1 + the number of open-parens in the pattern.

N.B. num_submatches t == 1 + RE2::NumberOfCapturingGroups() because RE2::NumberOfCapturingGroups() ignores the whole match ("subpattern zero").

pattern t returns the pattern from which the regex was constructed.

find_all t input is a convenience function that returns all non-overlapping matches of t against input, in left-to-right order.
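As an illustration of the counting convention, here is a small sketch. It assumes a recent release of the re2 opam package in which create_exn, num_submatches and pattern are exposed directly under the Re2 module; the module path may differ in the release this page documents.

```ocaml
(* Assumption: the re2 opam package, with the API directly under [Re2]. *)
let re = Re2.create_exn "(a)(b)?c"

let () =
  (* Two open-parens, plus subpattern zero for the whole match. *)
  assert (Re2.num_submatches re = 3);
  (* The original pattern string is recoverable from the compiled regex. *)
  assert (String.equal (Re2.pattern re) "(a)(b)?c")
```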

If sub is given, and the requested subpattern did not capture, then no match is returned at that position even if other parts of the regex did match.

find_first ?sub pattern input finds the first match of pattern in input, and returns the subpattern specified by sub, or an error if the subpattern didn't capture.

find_submatches t input finds the first match and returns all submatches. Element 0 is the whole match and element 1 is the first parenthesized submatch, etc.
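The three search functions above can be compared side by side. This sketch assumes a recent release of the re2 opam package, where the exception-raising variants find_all_exn, find_first_exn and find_submatches_exn sit directly under Re2; in other releases the names or module path may differ.

```ocaml
(* Assumption: the re2 opam package, with the API directly under [Re2]. *)
let re = Re2.create_exn "(\\d+)-(\\d+)"
let input = "10-20 and 30-40"

let () =
  (* All non-overlapping whole matches, left to right. *)
  assert (Re2.find_all_exn re input = [ "10-20"; "30-40" ]);
  (* First match only, narrowed to subpattern 1. *)
  assert (Re2.find_first_exn ~sub:(`Index 1) re input = "10");
  (* Element 0 is the whole match; the rest are the parenthesized groups. *)
  assert (Re2.find_submatches_exn re input
          = [| Some "10-20"; Some "10"; Some "20" |])
```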

matches pattern input

split pattern input

rewrite pattern ~template input is a convenience function for replace: Instead of requiring an arbitrary transformation as a function, it accepts a template string with zero or more substrings of the form "\\n", each of which will be replaced by submatch n. For every match of pattern against input, the template will be specialized and then substituted for the matched substring.
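A short sketch of template substitution follows. It assumes a recent release of the re2 opam package that exposes rewrite_exn directly under Re2; note that each backslash in the template must itself be escaped in an OCaml string literal.

```ocaml
(* Assumption: the re2 opam package, with the API directly under [Re2]. *)
let re = Re2.create_exn "(\\d+)-(\\d+)"

let () =
  (* The template swaps submatch 1 and submatch 2 in every match. *)
  assert (Re2.rewrite_exn re ~template:"\\2-\\1" "10-20 and 30-40"
          = "20-10 and 40-30")
```

Text outside the matches ("and" with its surrounding spaces) is copied through unchanged.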

valid_rewrite_template pattern ~template

escape nonregex

Infix Operators


create_exn

input =~ pattern is an infix alias of matches.

input //~ pattern is an infix alias of find_first.

Complicated Interface


A Match.t is the result of applying a regex to an input string.

If location information has been omitted (e.g., via ~sub), the error returned is Regex_no_such_subpattern, just as though that subpattern were never defined.

get_all t returns all available matches as strings in an array. For the indexing convention, see the comment above regarding the sub parameter.

get_pos_exn ~sub t returns the start offset and length in bytes. Note that for variable-width encodings (e.g., UTF-8) this may not be the same as the character offset and character length.
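To make the byte-versus-character distinction concrete, the sketch below converts a byte offset into a character offset using only the standard library. utf8_chars_in_prefix is a hypothetical helper, and the offset pair is written by hand to stand in for a value that get_pos_exn might return.

```ocaml
(* Hypothetical helper: number of UTF-8 code points in the first
   [byte_len] bytes of [s], assuming [s] is valid UTF-8 and [byte_len]
   falls on a code point boundary. Continuation bytes have the bit
   pattern 10xxxxxx and are not counted. *)
let utf8_chars_in_prefix (s : string) (byte_len : int) : int =
  let count = ref 0 in
  for i = 0 to byte_len - 1 do
    if Char.code s.[i] land 0xC0 <> 0x80 then incr count
  done;
  !count

let () =
  (* The accented letter is encoded as the two bytes C3 A9. *)
  let input = "caf\xc3\xa9 au lait" in
  (* Hand-written stand-in for a reported (start, length) pair in bytes. *)
  let byte_pos, byte_len = 6, 2 in
  assert (String.sub input byte_pos byte_len = "au");
  (* The same position is character offset 5, one less than the byte offset. *)
  assert (utf8_chars_in_prefix input byte_pos = 5)
```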

get_matches pattern input returns all non-overlapping matches of pattern against input

replace ?sub ?max ~f pattern input

Regex_no_such_subpattern (n, max) means n was requested but only max subpatterns are defined (so max - 1 is the highest valid index)

Regex_no_such_named_subpattern (name, pattern)

Match_failed pattern

Regex_submatch_did_not_capture (s, i) means the ith subpattern in the regex compiled from s did not capture a substring.

The string is the C library's error message, generally in the form of "(human-readable error): (piece of pattern that did not compile)".

Regex_rewrite_template_invalid (template, error_msg)