mirror of
https://git.proxmox.com/git/rustc
synced 2025-08-13 06:00:22 +00:00
166 lines
7.9 KiB
Markdown
166 lines
7.9 KiB
Markdown
pattern_limit should not be defined inside nfa::thompson, but rather at the
|
|
top-level.
|
|
|
|
-----
|
|
|
|
Main problem right now is exemplified by the set60 and set70 failing tests. In
|
|
particular, when finding the starting position while matching multiple regexes
|
|
simultaneously, the reverse search is messed up. The reverse search doesn't
|
|
depend on which regex matched in the forward direction, which means it won't
|
|
always find the correcting starting location. Unfortunately, the only way to
|
|
fix this, as far as I can tell, is to add a group of start states for every
|
|
regex in the DFA. Then once we do the reverse search, we need to choose the
|
|
correct start state based on which regex matched in the forward direction.
|
|
|
|
This is a nasty change.
|
|
|
|
So it looks like this only applies when doing an overlapping search in reverse
|
|
to find the start of a match. That means we should make this configurable
|
|
but enable it by default for the reverse automata. It should be configurable
|
|
so that folks can construct a regex that doesn't have the ability to do
|
|
overlapping searches correctly. If an overlapping search is attempted with
|
|
a reverse automaton that lacks starting states for each pattern, then the
|
|
implementation should panic.
|
|
|
|
BUT! It is also convenient to provide this option in general for folks that
|
|
want a DFA that can match any pattern while also being able to match a specific
|
|
pattern.
|
|
|
|
Straw man:
|
|
|
|
* Update dense::Config to have a `starts_for_each_pattern` option. It should
|
|
be disabled by default.
|
|
* In `RegexBuilder::build_many_with_size` tweak the reverse DFA configuration
|
|
to have the aforementioned option enabled.
|
|
* It would be interesting to add new APIs to `Regex` that support matching
|
|
specific patterns, but I think this is a complication. If we did want to do
|
|
this, then we should just add it to the `_at` variants and leave the rest of
|
|
the API untouched.
|
|
* Add a `pattern_id: Option<PatternID>` parameter to each of the five
|
|
`*_at` methods on the `dfa::Automaton` trait. A value of `None` retains the
|
|
existing behavior. A `Some` value means that the starting state for that
|
|
specific pattern must be chosen, which in turn implies an anchored search.
|
|
(This means `starts_for_each_pattern` has utility for single-pattern DFAs
|
|
since it makes it possible to build a DFA that can do both unanchored and
|
|
anchored searches.)
|
|
* Thread this new parameter down into the various functions in `dfa::search`
|
|
all the way down into `init_fwd` and `init_rev`. These functions will then
|
|
pass it to `dfa.start_state_{forward,reverse}`.
|
|
* This is where things get gruesome since we now need to completely re-work how
|
|
start states are represented in dense and sparse DFAs _and_ it needs to be
|
|
configurable. It looks like the `Start` type from `dfa::automaton` can
|
|
basically remain unchanged, since it still represents one of the four
|
|
possible starting states that will need to be applied for every pattern.
|
|
* For `dfa::dense`, change `StartList` to `StartTable`. Currently, its only
|
|
header is the state ID count, which is always 4. We'll want to change this
|
|
to the stride and add a new header value that encodes the number of patterns.
|
|
When the number of patterns is zero, then existing behavior is preserved and
|
|
represents the case where `starts_for_each_pattern` is disabled (or in the
|
|
case of an empty DFA). When non-zero, a table of starting state IDs is
|
|
encoded with each row corresponding to the 4 starting states for each
|
|
pattern. Before this table (even if it's empty), the 4 starting states for
|
|
the entire DFA are encoded.
|
|
* For `dfa::sparse`, do the same as above. They are essentially the same right
|
|
now anyway, with the only difference being that sparse DFAs use `&[u8]`
|
|
instead of `&[S]` (because sparse DFAs don't have any alignment
|
|
requirements).
|
|
* Modify `DFA::empty` to accept a `starts_for_each_pattern` bool that, when
|
|
true, creates a start table with the header, the start states for the entire
|
|
DFA and a row of start states for each pattern. When false, no rows are
|
|
added.
|
|
* Expose whether there are starting states for each pattern via a predicate
|
|
on the DFA.
|
|
* Modify the determinizer's `add_starts` method to basically do what it does,
|
|
but also do it for each pattern when the DFA is configured for it. It should
|
|
continue to reuse states as appropriate or not generate new states if they
|
|
aren't needed. This will want to use the `NFA::start_pattern` method, which
|
|
provides the starting NFA state ID for the given pattern.
|
|
* Fix the dense->sparse conversion. At this point, this piece should be fairly
|
|
straight-forward since the sparse representation of starting states is
|
|
basically identical to the dense representation.
|
|
|
|
At this point, I think the bug should resolve itself.
|
|
|
|
^^^^ DONE! IT WORKS!
|
|
|
|
-----
|
|
|
|
|
|
Add top-level SyntaxConfig (or some such) that has all of the regex-syntax
|
|
options forwarded, but with automata oriented docs. Then use this for all of
|
|
the engines instead of having to repeat every option for every builder.
|
|
|
|
-----
|
|
|
|
These produce different results. PCRE2 looks correct. Basically, we should be
|
|
using the context around the `at` position correctly, which we aren't doing
|
|
right now. Seems tricky to get right, particularly when confirming the match
|
|
with a reverse DFA.
|
|
|
|
Maybe our 'at' functions need to take a full range... Sigh. This is indeed what
|
|
RE2 does. GAH.
|
|
|
|
fn main() {
|
|
let re = regex::Regex::new(r"(?-u)\b\sbar").unwrap();
|
|
let s = "foo bar baz";
|
|
println!("{:?}", re.find_at(s, 3).map(|m| m.as_str()));
|
|
|
|
let re = pcre2::bytes::Regex::new(r"\b\sbar").unwrap();
|
|
let s = "foo bar baz";
|
|
println!("{:?}", re.find_at(s.as_bytes(), 3).unwrap());
|
|
}
|
|
|
|
^^^^ This is fixed now, but we still need to find a way to add test coverage
|
|
for "context" searches. It'd be nice to do this automatically, but we'll
|
|
probably just added a new 'context = [start, end]' option.
|
|
|
|
-----
|
|
|
|
|
|
* Create regex-test crate, based on glob-test. Try to anticipate the needs for
|
|
the full regex test suite.
|
|
* See if we can clean up tests.
|
|
* Provide a way to mark a test as expensive.
|
|
* Provide a way to test is_match_at and find_at.
|
|
* Test shortest_match_at too? Huge pain. Add tests for it.
|
|
* Port ALL tests from the regex crate. Will probably need a way to mark a
|
|
test as skipped.
|
|
* Document tests better.
|
|
* Find a way to remove byteorder dependency.
|
|
* Reorganize crate API:
|
|
* Have errors contain `Box<Error+Send+Sync>` instead of `String`.
|
|
* Make errors non-exhaustive.
|
|
* Audit `StateID` trait for safety.
|
|
* Brainstorm hard about `DFA` trait and the fact that DenseDFA and SparseDFA
|
|
have inefficient implementations of some methods. Maybe use multiple
|
|
traits? Answer: get rid of premultiply/classes knobs and just enable
|
|
them by default. Should remove a huge amount of code.
|
|
* Check whether `unsafe` is really needed to eliminate bounds checks. Use
|
|
micro-benchmarks and bigger CLI workloads using `regex-automata-debug`.
|
|
* Re-write module docs for `dfa` as they are no longer top-level. Keep most.
|
|
* Retain any pertinent top-level crate docs, but don't rewrite yet.
|
|
* Clean up builders if we can. e.g., Determinizer, minimizer, it's all a mess
|
|
right now.
|
|
* Clean up and add 'always_match' and 'never_match' constructors for every
|
|
regex engine.
|
|
* See about supporting ^, $, \A, \z, \b and \B in DFAs. Do the non-Unicode
|
|
version of \b unfortunately. Carefully scrutinize how the regex crate's
|
|
lazy DFA does it and try to make it comprehensible. Done! Except for the
|
|
part about making it comprehensible.
|
|
* Rethink prefilters?
|
|
* Add `regex-automata-generate` CLI tool. This should just be a copy of
|
|
the `ucd-generate dfa` and `ucd-generate regex` commands.
|
|
|
|
Then build new public `nfa` sub-module.
|
|
* For Unicode \b, generate \w DFA (forwards and reverse) and embed it into
|
|
source for fast checking. That way, we don't need to ever do explicit UTF-8
|
|
decoding anywhere. Yay.
|
|
|
|
Then `lazy` sub-module.
|
|
|
|
Then `onepass`.
|
|
|
|
Then `jit`.
|
|
|
|
... and beyond? CRAZY. But it can be done! Build STRONG base layers.
|