Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix NuGetApiKey detection to not require Guid bytes #101

Merged
merged 7 commits into from
Dec 13, 2024

Conversation

joelverhagen
Copy link
Collaborator

@joelverhagen joelverhagen commented Oct 23, 2024

This also simplifies the regex to just expect oy2 and then 43 lowercase base32 characters.

Based on internal feedback we can improve the implementation to:

  • Not use Guid.NewGuid for CPRNG
  • Allow zero bytes in the ID portion

I would like to update the detection pattern to handle the new pattern, prior to making the change on NuGet.org

This also simplifies the regex to just expect oy2 and then 43 lowercase base32 characters.
@@ -318,7 +318,7 @@
"DetectionMetadata": "Identifiable"
},
{
"Pattern": "(^|[^0-9a-z])(?<refine>oy2[a-p][0-9a-z]{15}[aq][0-9a-z]{11}[eu][bdfhjlnprtvxz357][a-p][0-9a-z]{11}[aeimquy4])([^aeimquy4]|$)",
"Pattern": "(^|[^a-zA-Z0-9])(?<refine>oy2[a-z2-7]{43})([^a-zA-Z0-9]|$)",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A-Z

I'm not sure I'd jump through this hoop. I think I understand what's on your mind, you want to prevent identifying a sequence of characters in some other format, like base64, hex or whatever. If that's correct, though, shouldn't you consider adding the special chars for both standard and URL safe base64?

Maybe what you have in mind here is to prevent long sequences of all alphabetic chars, which would be entirely fair, that seems like the most likely source of a false positive (it's mathematically uncertain a truly randomized sequence of encoded chars for which n upper-case letter was feasible didn't have at least one for a 43 char sequence). But the oy2 prefix here is a pretty good preventive against that possibility.

I mention all the above because we haven't really established a good set of principles/conventions for specifying prefix and suffix delimiting regexes. Maybe you can help with that topic, any thoughts?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good discussion. Generally I am not sure if we can know if one solution is better than another without testing it on a large corpus such as all public GitHub commits and measuring false positive rate.

The intent of [^a-zA-Z0-9] group at the beginning and end are to find the 46 character base32 string in a config or code file, bounded by non-word characters. The base64 is probably the most common false positive source if I were to guess. Hex could already not match since oy2 is not hex. Perhaps the intent would be better here if we used [^\w] (non-word characters).

I tried a simple regex on grep.app:
https://grep.app/search?q=oy2%5Ba-z2-7%5D%7B43%7D&regexp=true (this is case insensitive and not exactly the same regex) but it shows all the examples are base64 data.

If we include the missing base64 characters (+, /, -, _) we would potentially reduce our FP rate but it's not clear to me that true positives wouldn't have, say, a leading hyphen character... I think I want to keep it as-is just to catch more potential true positives. LMK if I am misunderstanding you.

I am uncertain on how much thought we should put into reducing false positives. To your point, perhaps the oy2 prefix is already very strong and gets us "99% the way there".

I mention all the above because we haven't really established a good set of principles/conventions for specifying prefix and suffix delimiting regexes. Maybe you can help with that topic, any thoughts?

I think my team would have benefitted greatly by these principles/conventions if they existed. I think the only useful thoughts I have on it are:

  1. Having an identifiable substring for your secret is a must, unless you want to handle a huge volume of FPs. FP for GitHub Secret scanning is a matter of scaling your integration endpoint, but for client-side tooling it is a real problem. You don't want your users turning off client side secret scanning due to too many FPs.
  2. This is more of a question than an idea -- but what do you think about having multiple patterns? Perhaps one with a high FP rate for GitHub Secret Scanning and another with a low FP rate for local tooling. I'm imagining IDE integration where you get a build warning if a secret is found in your code. This would put the high FP rate on the service which can handle it with automation without burning out users on client side tooling with high FP.

@@ -318,7 +318,7 @@
"DetectionMetadata": "Identifiable"
},
{
"Pattern": "(^|[^0-9a-z])(?<refine>oy2[a-p][0-9a-z]{15}[aq][0-9a-z]{11}[eu][bdfhjlnprtvxz357][a-p][0-9a-z]{11}[aeimquy4])([^aeimquy4]|$)",
"Pattern": "(^|[^a-zA-Z0-9])(?<refine>oy2[a-z2-7]{43})([^a-zA-Z0-9]|$)",
Copy link
Member

@michaelcfanning michaelcfanning Dec 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2-7

What? So our original 'too clever' regex didn't properly reflect a 32 character encoding. Nuts. #Closed

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's all good. In the end, we (nuget.org) handled the scale of both patterns with GitHub Secret scanning so that false positive concessions each pattern has is not a problem for us. I think the part I am ignorant about is how other usages of this library are impacted by a false positive in NuGet API keys.

@@ -318,7 +318,7 @@
"DetectionMetadata": "Identifiable"
},
{
"Pattern": "(^|[^0-9a-z])(?<refine>oy2[a-p][0-9a-z]{15}[aq][0-9a-z]{11}[eu][bdfhjlnprtvxz357][a-p][0-9a-z]{11}[aeimquy4])([^aeimquy4]|$)",
"Pattern": "(^|[^a-zA-Z0-9])(?<refine>oy2[a-z2-7]{43})([^a-zA-Z0-9]|$)",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pattern

FYI, historically our team has maintained the GitHub patterns and we could contribute this update there if you don't see any action on their side. I will go see if my GitHub contributor perms are still functional, it's been a while since I made a change there.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's great to know. I had it as an action item on my side to request update on our GitHub pattern after this PR is merged. I have gotten traction with their support email in the past so I anticipate no problem. I think we need to maintain that relationship anyway since there is a live endpoint we have to maintain and keep GitHub in sync with (e.g. correct URL if infra changes).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, so GitHub forwards your exposures directly to you, I see that in source. Do you auto-revoke on receipt of findings? Do you have any data on scale of reporting to you?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The scale is relatively low. We have on the order of 100 notifications per week. I don't have data on the FP rate.

# Conflicts:
#	GeneratedRegexPatterns/HighConfidenceSecurityModels.json
#	GeneratedRegexPatterns/PreciselyClassifiedSecurityKeys.json
@joelverhagen joelverhagen marked this pull request as ready for review December 12, 2024 17:01
Copy link
Member

@michaelcfanning michaelcfanning left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

@michaelcfanning michaelcfanning merged commit f77c941 into microsoft:main Dec 13, 2024
11 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants