Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Regexes sorting is incorrect #33

Open
gkalabin opened this issue Aug 10, 2016 · 7 comments
Open

Regexes sorting is incorrect #33

gkalabin opened this issue Aug 10, 2016 · 7 comments

Comments

@gkalabin
Copy link
Contributor

According to specification:

The list of regular-expressions regex shall be evaluated for a given user-agent string beginning with the first regex-item in the list to the last item. The first matching regex stops processing the list. Regex-matching shall be case sensitive.

Here is the proof that sorting of regexes will cause wrong detection results:

package main

import (
    "fmt"
    "log"

    "github.com/uap-go/uaparser"
)

const (
    // specificUA is matched by X and Y, X preceeds Y in regex list
    specificUA = "Opera/9.80 (VRE; Opera Mini/4.2/28.2794; U; en) Presto/2.8.119 Version/11.10"
    // broadUA is matched by Y
    broadUA    = "Opera/9.80 (Windows NT 5.1; U; ru) Presto/2.5.24 Version/10.53"
)

func main() {
    sortThreshold := 100001
    parser, err := uaparser.NewWithOptions("./uap-core/regexes.yaml", uaparser.EUserAgentLookUpMode, sortThreshold, 0, true, true)
    if err != nil {
        log.Fatal(err)
    }

    // specificUA is matched by X, everything is fine
    beforeSort := parser.Parse(specificUA).UserAgent

    // cause regexes sort by parsing broadUA many times: it will cause bubbling up of regex Y
    for i := 0; i < sortThreshold; i++ {
        parser.Parse(broadUA)
    }

    // specificUA is now matched by Y which bubbled up after sort. This causes wrong parsing results
    afterSort := parser.Parse(specificUA).UserAgent
    fmt.Printf("before sort:\t %#v\n after sort:\t %#v\n", beforeSort, afterSort)
}

Result:

$ go run test.go 
2016-08-10 18:21:37.284461949 +0300 MSK Sorting UserAgents slice
before sort:     &uaparser.UserAgent{Family:"Opera Mini", Major:"4", Minor:"2", Patch:""}
 after sort:     &uaparser.UserAgent{Family:"Opera", Major:"11", Minor:"10", Patch:""}
@evgenigourvitch
Copy link
Contributor

According to the regex list there are two (or may be even more) regexes that match the 'specificUA'.
At the beginning (beforeSort := parser.Parse(specificUA).UserAgent) you 'hit' the first one in the list of the regexes, but it doesn't mean there are no more regexes that can match, and the first one returned.

After it you hit some other regex, that is in "lower" index in the list with UA ,that doesn't hit the regex from the first case. This one cause the sort to "pop up" the regex which matched the broadUA to the top of the list.

After it, you are trying to find match for the specificUA once again and now the regex which matches both, specificUA and broadUA is at the top of the list and this one matches.

What's wrong here?
So far I can see the correct behaviour - the more "successful" regex was poped up to the top of the list, to reduce the lookup time.

To summarize it I can see the following scenario:
specificUA matches regexes A, B and C. At the beginning regex A was returned, because it was at the "higher" position than B and C. During the lookups B was popped up to the first position and now B is being returned as matches regex for specificUA.

Am I missing something?

@gkalabin
Copy link
Contributor Author

You're right, regex sorting works in this way.
The problem here is that Opera/9.80 (VRE; Opera Mini/4.2/28.2794; U; en) Presto/2.8.119 Version/11.10 corresponds to opera mini, not opera. Before sorting, detection is correct and this UA is detected as opera mini.
When list is sorted, then the detector says that it's opera, not opera mini, which in fact is incorrect

@evgenigourvitch
Copy link
Contributor

evgenigourvitch commented Aug 11, 2016

This is because the regex '(Opera Mini)(?:/att)?/(\d+).(\d+)' (which is being matched at the first time) located at line 109 in the regexes.yaml and the regex (Opera)/9.80.*Version/(\d+).(\d+)(?:.(\d+))? (which is being matched for the loopped calls, and which affects the scoring) located at line 110.
If you will swap them (regexes) you will get the wrong match at this call:
beforeSort := parser.Parse(specificUA).UserAgent

As I described in my pull request, my code was supposed to reduce lookup time by "popping up" the most "usable" UA. And I think it does it well (especially on the high loaded cluster (4-6M requests/min))

@gkalabin
Copy link
Contributor Author

yes, it's exactly what sorting means. The second call result is indeed incorrect, since input user agent is opera mini, but it's being detected as opera (opera != opera mini).

@evgenigourvitch
Copy link
Contributor

As a conclusion, sorting is correct, but in some cases, like described before, it can affect the regex match. Unfortunately, Golang doesn't support negative (and positive) lookbehind assertion, so it is not an option to create regex which will exclude (Opera Mini). So currently I can see two solutions:

  1. Not to use sort capability (pass false to the NewWithOptions method).
  2. Make some adjustments to the regexes to prevent this issue.

I'll try to implement some walkaround...

gkalabin, thank you for pointing at this issue.

@elsigh
Copy link
Contributor

elsigh commented Aug 11, 2016

Cool - thanks for diving in. In the past when we've discussed tradeoffs
with performance and accuracy, we've sided with accuracy and readability
(e.g. we have a regex-driven yaml file that's relatively easy to
edit/understand/consume in any language).

I'm open to sorting things if they don't affect readability (e.g. a change
of format is out, but a transpile is OK) and result accuracy / consistency
between language implementations.

:-)

On Thu, Aug 11, 2016 at 2:01 PM, evgenigourvitch [email protected]
wrote:

As a conclusion, sorting is correct, but in some cases, like described
before, it can affect the regex match. Unfortunately, Golang doesn't
support negative (and positive) lookbehind assertion, so it is not an
option to create regex which will exclude (Opera Mini). So currently I can
see two solutions:

  1. Not to use sort capability (pass false to the NewWithOptions
    method).
  2. Make some adjustments to the regexes to prevent this issue.

I'll try to implement some walkaround...

gkalabin, thank you for pointing at this issue.


You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
#33 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AAfkWvyfJGHXLYjx003nmLdIMB2qGcFTks5qe42ngaJpZM4JhRJe
.

@dgoldstein0
Copy link
Contributor

I skeptical of the whole sorting feature in the first place. We just merged caching #75 which should get a huge speedup for repeated queries of the same user agents - which is likely if used to parse UAs from web traffic. So that might fill the performance need a lot better than this sorting feature, without the downside of potentially incorrect detections.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants