Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consider transforming regexes for better finite automaton compatibility #91

Open
masklinn opened this issue Oct 26, 2024 · 0 comments
Open

Comments

@masklinn
Copy link

I understand that Go's regexp packages uses finite automaton. Working on uap-rust, I learned that uap-core's extensive use of large bounded repetition (.{x,y} with a large y) causes a state explosion in FA engines (which makes sense), you can see some of it at rust-lang/regex#1206 (reply in thread) as I asked the rust-regex author for assistance (I was investigating the large memory use of my rust-regex-based implementation compared to a C++ re2 implementation).

Assuming Go's regexp is inspired by or derived from re2, it should be much less sensible than rust-regexp, and should not have the performance traps / cliffs of jitted DFA, but I would still expect that the bounded repetitions significantly increase memory use for no value.

Experimentally, in uap-rust I found that converting repetitions with an upper bound of 100 or more (so 3+ digits) had no functional implication, with all of those having been added to mitigate catastrophic backtracking. Below 100 things are more risky, I found at least one regex which uses a functional repetition with an upper bound of 50.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant