Fix NuGetApiKey detection to not require Guid bytes #101
Conversation
This also simplifies the regex to just expect `oy2` and then 43 lowercase base32 characters.
```diff
@@ -318,7 +318,7 @@
     "DetectionMetadata": "Identifiable"
   },
   {
-    "Pattern": "(^|[^0-9a-z])(?<refine>oy2[a-p][0-9a-z]{15}[aq][0-9a-z]{11}[eu][bdfhjlnprtvxz357][a-p][0-9a-z]{11}[aeimquy4])([^aeimquy4]|$)",
+    "Pattern": "(^|[^a-zA-Z0-9])(?<refine>oy2[a-z2-7]{43})([^a-zA-Z0-9]|$)",
```
I'm not sure I'd jump through this hoop. I think I understand what's on your mind: you want to avoid flagging a sequence of characters that is actually in some other format, like base64, hex, or whatever. If that's correct, though, shouldn't you consider adding the special chars for both standard and URL-safe base64?

Maybe what you have in mind here is to prevent matches inside long sequences of all-alphabetic chars, which would be entirely fair; that seems like the most likely source of a false positive (though it's mathematically unlikely that a truly random 43-char sequence, drawn from an encoding that allows upper-case letters, would contain none). But the `oy2` prefix here is a pretty good preventive against that possibility.

I mention all the above because we haven't really established a good set of principles/conventions for specifying prefix and suffix delimiting regexes. Maybe you can help with that topic; any thoughts?
This is a good discussion. Generally, I am not sure we can know whether one solution is better than another without testing it on a large corpus, such as all public GitHub commits, and measuring the false positive rate.

The intent of the `[^a-zA-Z0-9]` group at the beginning and end is to find the 46-character base32 string in a config or code file, bounded by non-alphanumeric characters. Base64 is probably the most common false positive source, if I were to guess. Hex already can't match, since `oy2` is not valid hex. Perhaps the intent would be expressed better here if we used `[^\w]` (non-word characters).

I tried a simple regex on grep.app: https://grep.app/search?q=oy2%5Ba-z2-7%5D%7B43%7D&regexp=true (this is case insensitive and not exactly the same regex), but it shows that all the examples are base64 data.
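A side note on the `[^\w]` idea above (my illustration, not from the thread): in .NET, `[^\w]` is equivalent to `\W`, and it differs from `[^a-zA-Z0-9]` in that underscore (and, by default, other Unicode word characters) would no longer count as a boundary. A fabricated example:

```csharp
using System;
using System.Text.RegularExpressions;

class BoundaryClasses
{
    static void Main()
    {
        string fakeKey = "oy2" + new string('a', 43); // fabricated, non-secret
        string underscored = "_" + fakeKey + "_";     // key delimited by underscores

        var current = new Regex(
            "(^|[^a-zA-Z0-9])(?<refine>oy2[a-z2-7]{43})([^a-zA-Z0-9]|$)");
        var nonWord = new Regex(
            @"(^|\W)(?<refine>oy2[a-z2-7]{43})(\W|$)");

        Console.WriteLine(current.IsMatch(underscored)); // True:  '_' is not in [a-zA-Z0-9]
        Console.WriteLine(nonWord.IsMatch(underscored)); // False: '_' is a word character
    }
}
```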
If we include the missing base64 characters (`+`, `/`, `-`, `_`) we would potentially reduce our FP rate, but it's not clear to me that true positives wouldn't have, say, a leading hyphen character... I think I want to keep it as-is just to catch more potential true positives. LMK if I am misunderstanding you.

I am uncertain how much thought we should put into reducing false positives. To your point, perhaps the `oy2` prefix is already very strong and gets us "99% of the way there".
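To make that trade-off concrete, here is a hypothetical sketch (not from this PR) comparing the merged boundary class against a stricter variant that also excludes the base64 special characters; both sample strings are fabricated.

```csharp
using System;
using System.Text.RegularExpressions;

class BoundaryTradeoff
{
    static void Main()
    {
        string fakeKey = "oy2" + new string('a', 43); // fabricated, non-secret

        string inBase64Blob  = "QUJE+" + fakeKey + "/RUZH"; // key-shaped run inside base64 data
        string leadingHyphen = "-" + fakeKey + ";";         // plausible true positive after a hyphen

        var merged = new Regex(
            "(^|[^a-zA-Z0-9])(?<refine>oy2[a-z2-7]{43})([^a-zA-Z0-9]|$)");
        // Hypothetical stricter variant that also treats + / - _ as word-like,
        // so they no longer qualify as boundaries.
        var stricter = new Regex(
            @"(^|[^a-zA-Z0-9+/\-_])(?<refine>oy2[a-z2-7]{43})([^a-zA-Z0-9+/\-_]|$)");

        Console.WriteLine(merged.IsMatch(inBase64Blob));    // True:  potential FP slips through
        Console.WriteLine(stricter.IsMatch(inBase64Blob));  // False: base64 context rejected
        Console.WriteLine(merged.IsMatch(leadingHyphen));   // True
        Console.WriteLine(stricter.IsMatch(leadingHyphen)); // False: this TP would be missed
    }
}
```

Whether the missed true positive is worth the avoided false positive is exactly the corpus-measurement question raised above.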
> I mention all the above because we haven't really established a good set of principles/conventions for specifying prefix and suffix delimiting regexes. Maybe you can help with that topic; any thoughts?
I think my team would have benefited greatly from these principles/conventions if they existed. The only useful thoughts I have on it are:
- Having an identifiable substring for your secret is a must, unless you want to handle a huge volume of FPs. For GitHub Secret Scanning, an FP is just a matter of scaling your integration endpoint, but for client-side tooling it is a real problem. You don't want your users turning off client-side secret scanning due to too many FPs.
- This is more of a question than an idea, but what do you think about having multiple patterns? Perhaps one with a high FP rate for GitHub Secret Scanning and another with a low FP rate for local tooling (see the sketch after this list). I'm imagining IDE integration where you get a build warning if a secret is found in your code. This would put the high FP rate on the service, which can handle it with automation, without burning out users of client-side tooling.
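As a sketch of that two-tier idea (hypothetical; not this library's actual API or configuration), a scanner could register a broad and a strict variant of the same pattern and let each consumer pick by FP tolerance:

```csharp
using System;
using System.Text.RegularExpressions;

class TieredPatterns
{
    static void Main()
    {
        // Hypothetical registry: same secret type, two precision tiers.
        var patterns = new (string Name, Regex Regex, bool LowFpRate)[]
        {
            // Broad: no boundary anchors; suited to server-side scanning
            // where FPs are absorbed by automation.
            ("NuGetApiKey/broad",
             new Regex("oy2[a-z2-7]{43}"), false),
            // Strict: boundary-anchored; suited to IDE/build warnings
            // where FP noise drives users to disable scanning.
            ("NuGetApiKey/strict",
             new Regex("(^|[^a-zA-Z0-9])(?<refine>oy2[a-z2-7]{43})([^a-zA-Z0-9]|$)"), true),
        };

        bool isLocalTooling = true;                          // e.g. IDE integration
        string input = "token=oy2" + new string('a', 43);    // fabricated sample

        foreach (var p in patterns)
        {
            if (isLocalTooling && !p.LowFpRate) continue;    // local scans skip noisy tiers
            Console.WriteLine($"{p.Name}: {p.Regex.IsMatch(input)}");
        }
    }
}
```

The tier names and the `LowFpRate` flag are invented for illustration; the point is only that the precision/recall choice can live in configuration rather than in a single pattern.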
It's all good. In the end, we (nuget.org) handled the scale of both patterns with GitHub Secret Scanning, so the false positive concessions each pattern makes are not a problem for us. I think the part I am ignorant about is how other consumers of this library are impacted by a false positive on NuGet API keys.
That's great to know. I had it as an action item on my side to request an update to our GitHub pattern after this PR is merged. I have gotten traction with their support email in the past, so I anticipate no problem. I think we need to maintain that relationship anyway, since there is a live endpoint we have to maintain and keep GitHub in sync with (e.g. the correct URL if infra changes).
Ah, so GitHub forwards your exposures directly to you; I see that in the source. Do you auto-revoke on receipt of findings? Do you have any data on the scale of reporting to you?
The scale is relatively low. We have on the order of 100 notifications per week. I don't have data on the FP rate.
```
# Conflicts:
#   GeneratedRegexPatterns/HighConfidenceSecurityModels.json
#   GeneratedRegexPatterns/PreciselyClassifiedSecurityKeys.json
```
…keys which are non-standard but accepted
> This also simplifies the regex to just expect `oy2` and then 43 lowercase base32 characters.

Based on internal feedback, we can improve the implementation. I would like to update the detection pattern to handle the new pattern prior to making the change on NuGet.org.