Fix NuGetApiKey detection to not require Guid bytes #101
Conversation
This also simplifies the regex to just expect `oy2` and then 43 lowercase base32 characters.
```diff
@@ -318,7 +318,7 @@
     "DetectionMetadata": "Identifiable"
   },
   {
-    "Pattern": "(^|[^0-9a-z])(?<refine>oy2[a-p][0-9a-z]{15}[aq][0-9a-z]{11}[eu][bdfhjlnprtvxz357][a-p][0-9a-z]{11}[aeimquy4])([^aeimquy4]|$)",
+    "Pattern": "(^|[^a-zA-Z0-9])(?<refine>oy2[a-z2-7]{43})([^a-zA-Z0-9]|$)",
```
I'm not sure I'd jump through this hoop. I think I understand what's on your mind: you want to avoid flagging a sequence of characters that is actually in some other format, like base64, hex, or whatever. If that's correct, though, shouldn't you consider adding the special chars for both standard and URL-safe base64?

Maybe what you have in mind here is to prevent matches inside long sequences of all-alphabetic chars, which would be entirely fair; that seems like the most likely source of a false positive (though it's mathematically unlikely that a truly random 43-char sequence, drawn from an encoding that allows upper-case letters, would contain none). But the `oy2` prefix here is a pretty good preventive against that possibility.

I mention all the above because we haven't really established a good set of principles/conventions for specifying prefix and suffix delimiting regexes. Maybe you can help with that topic; any thoughts?
This is a good discussion. Generally, I am not sure we can know whether one solution is better than another without testing it on a large corpus, such as all public GitHub commits, and measuring the false positive rate.

The intent of the `[^a-zA-Z0-9]` group at the beginning and end is to find the 46-character base32 string in a config or code file, bounded by non-alphanumeric characters. Base64 is probably the most common false positive source, if I were to guess. Hex already can't match, since `oy2` is not valid hex. Perhaps the intent would be expressed better here if we used `[^\w]` (non-word characters).

I tried a simple regex on grep.app: https://grep.app/search?q=oy2%5Ba-z2-7%5D%7B43%7D&regexp=true (this is case insensitive and not exactly the same regex), but it shows that all the examples are base64 data.
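A side note on the `[^\w]` idea above (my illustration, not from the thread): in .NET, `[^\w]` is equivalent to `\W`, and it differs from `[^a-zA-Z0-9]` in that underscore (and, by default, other Unicode word characters) would no longer count as a boundary. A fabricated example:

```csharp
using System;
using System.Text.RegularExpressions;

class BoundaryClasses
{
    static void Main()
    {
        string fakeKey = "oy2" + new string('a', 43); // fabricated, non-secret
        string underscored = "_" + fakeKey + "_";     // key delimited by underscores

        var current = new Regex(
            "(^|[^a-zA-Z0-9])(?<refine>oy2[a-z2-7]{43})([^a-zA-Z0-9]|$)");
        var nonWord = new Regex(
            @"(^|\W)(?<refine>oy2[a-z2-7]{43})(\W|$)");

        Console.WriteLine(current.IsMatch(underscored)); // True:  '_' is not in [a-zA-Z0-9]
        Console.WriteLine(nonWord.IsMatch(underscored)); // False: '_' is a word character
    }
}
```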
If we include the missing base64 characters (`+`, `/`, `-`, `_`) we would potentially reduce our FP rate, but it's not clear to me that true positives wouldn't have, say, a leading hyphen character... I think I want to keep it as-is just to catch more potential true positives. LMK if I am misunderstanding you.

I am uncertain how much thought we should put into reducing false positives. To your point, perhaps the `oy2` prefix is already very strong and gets us "99% of the way there".
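To make that trade-off concrete, here is a hypothetical sketch (not from this PR) comparing the merged boundary class against a stricter variant that also excludes the base64 special characters; both sample strings are fabricated.

```csharp
using System;
using System.Text.RegularExpressions;

class BoundaryTradeoff
{
    static void Main()
    {
        string fakeKey = "oy2" + new string('a', 43); // fabricated, non-secret

        string inBase64Blob  = "QUJE+" + fakeKey + "/RUZH"; // key-shaped run inside base64 data
        string leadingHyphen = "-" + fakeKey + ";";         // plausible true positive after a hyphen

        var merged = new Regex(
            "(^|[^a-zA-Z0-9])(?<refine>oy2[a-z2-7]{43})([^a-zA-Z0-9]|$)");
        // Hypothetical stricter variant that also treats + / - _ as word-like,
        // so they no longer qualify as boundaries.
        var stricter = new Regex(
            @"(^|[^a-zA-Z0-9+/\-_])(?<refine>oy2[a-z2-7]{43})([^a-zA-Z0-9+/\-_]|$)");

        Console.WriteLine(merged.IsMatch(inBase64Blob));    // True:  potential FP slips through
        Console.WriteLine(stricter.IsMatch(inBase64Blob));  // False: base64 context rejected
        Console.WriteLine(merged.IsMatch(leadingHyphen));   // True
        Console.WriteLine(stricter.IsMatch(leadingHyphen)); // False: this TP would be missed
    }
}
```

Whether the missed true positive is worth the avoided false positive is exactly the corpus-measurement question raised above.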
> I mention all the above because we haven't really established a good set of principles/conventions for specifying prefix and suffix delimiting regexes. Maybe you can help with that topic; any thoughts?
I think my team would have benefited greatly from these principles/conventions if they existed. The only useful thoughts I have on it are:
- Having an identifiable substring for your secret is a must, unless you want to handle a huge volume of FPs. For GitHub Secret Scanning, an FP is just a matter of scaling your integration endpoint, but for client-side tooling it is a real problem. You don't want your users turning off client-side secret scanning due to too many FPs.
- This is more of a question than an idea, but what do you think about having multiple patterns? Perhaps one with a high FP rate for GitHub Secret Scanning and another with a low FP rate for local tooling (see the sketch after this list). I'm imagining IDE integration where you get a build warning if a secret is found in your code. This would put the high FP rate on the service, which can handle it with automation, without burning out users of client-side tooling.
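As a sketch of that two-tier idea (hypothetical; not this library's actual API or configuration), a scanner could register a broad and a strict variant of the same pattern and let each consumer pick by FP tolerance:

```csharp
using System;
using System.Text.RegularExpressions;

class TieredPatterns
{
    static void Main()
    {
        // Hypothetical registry: same secret type, two precision tiers.
        var patterns = new (string Name, Regex Regex, bool LowFpRate)[]
        {
            // Broad: no boundary anchors; suited to server-side scanning
            // where FPs are absorbed by automation.
            ("NuGetApiKey/broad",
             new Regex("oy2[a-z2-7]{43}"), false),
            // Strict: boundary-anchored; suited to IDE/build warnings
            // where FP noise drives users to disable scanning.
            ("NuGetApiKey/strict",
             new Regex("(^|[^a-zA-Z0-9])(?<refine>oy2[a-z2-7]{43})([^a-zA-Z0-9]|$)"), true),
        };

        bool isLocalTooling = true;                          // e.g. IDE integration
        string input = "token=oy2" + new string('a', 43);    // fabricated sample

        foreach (var p in patterns)
        {
            if (isLocalTooling && !p.LowFpRate) continue;    // local scans skip noisy tiers
            Console.WriteLine($"{p.Name}: {p.Regex.IsMatch(input)}");
        }
    }
}
```

The tier names and the `LowFpRate` flag are invented for illustration; the point is only that the precision/recall choice can live in configuration rather than in a single pattern.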
It's all good. In the end, we (nuget.org) handled the scale of both patterns with GitHub Secret Scanning, so the false positive concessions each pattern makes are not a problem for us. I think the part I am ignorant about is how other consumers of this library are impacted by a false positive on NuGet API keys.
That's great to know. I had it as an action item on my side to request an update to our GitHub pattern after this PR is merged. I have gotten traction with their support email in the past, so I anticipate no problem. I think we need to maintain that relationship anyway, since there is a live endpoint we have to maintain and keep GitHub in sync with (e.g. the correct URL if infra changes).
Ah, so GitHub forwards your exposures directly to you; I see that in the source. Do you auto-revoke on receipt of findings? Do you have any data on the scale of reporting to you?
The scale is relatively low. We have on the order of 100 notifications per week. I don't have data on the FP rate.
```
# Conflicts:
#   GeneratedRegexPatterns/HighConfidenceSecurityModels.json
#   GeneratedRegexPatterns/PreciselyClassifiedSecurityKeys.json
```
…keys which are non-standard but accepted
> This also simplifies the regex to just expect `oy2` and then 43 lowercase base32 characters.

Based on internal feedback, we can improve the implementation. I would like to update the detection pattern to handle the new pattern prior to making the change on NuGet.org.