I have defined the majority of the regex that I need, but I'm having difficulty with one. The issue is finding an expression that will catch a string of numbers with a certain number of digits when there are no characters or spaces before or after that string. I am set as long as the string has a space or any non-digit character before and after, but I run into trouble if the string is the only thing on a line. Any recommendations are appreciated!
Unfortunately I have not had any success in finding a resolution. It would be GREAT if McAfee adopted standard regex rather than creating their own, which are somewhat ineffective.
I've had some success today, actually. I'm working with the SSN concept, and I found that using these expressions instead of the defaults solved my issue.
So, if I were to offer a suggestion, it'd be to work with digits and non-digits exclusively.
I have NOT yet made this change in production, so I don't have data on False Positives yet.
That is the pattern I've found most helpful thus far. My issue is what happens when there is nothing else on a particular line, for example:
there are no spaces or characters before or after the number, only a carriage return after. When I try to validate, it comes up as false. I have to put either a space or another non-digit character before and after for it to be recognized as matching their "regex".
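For anyone who wants to reproduce this outside the product, here is a rough sketch in Python's standard `re` module. The patterns are simplified stand-ins that I made up for illustration (a bare 9-digit run), not the actual McAfee expressions, and I can't say whether the DLP validator accepts the same anchor syntax. It shows why a pattern that must consume a character on each side fails on a lone line, and how allowing start or end of line as a boundary gets around it.

```python
import re

# Hypothetical stand-in: requires a whitespace character before the digits
# and a non-digit character after them (both are consumed by the match).
CONSUMING = re.compile(r"\s\d{9}\D")

# Alternative: accept either a non-digit OR the start/end of a line
# as the boundary on each side. MULTILINE makes ^ and $ work per line.
ANCHORED = re.compile(r"(?:^|\D)(\d{9})(?:\D|$)", re.MULTILINE)


def find_nine_digits(text):
    """Return every 9-digit run found by the anchored pattern."""
    return [m.group(1) for m in ANCHORED.finditer(text)]


lone_line = "123456789\r\n"              # number alone on a line, CR/LF after
print(CONSUMING.search(lone_line))       # None: nothing precedes the digits
print(find_nine_digits(lone_line))       # ['123456789']
```

The catch is that the boundary characters are still part of the match, so two numbers separated by a single space can shadow each other.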
Assuming you are using the default concept for "SOCIAL-SECURITY-NUMBER", that expression requires that the string begin with a whitespace character (\s) and end with a non-digit character (\D). If you remove those items from the default expression, the pattern will validate without the spaces. Here is the default concept:
I removed the leading \s and trailing \D, and now when I enter the pattern into the validate window, the pattern does not require the spaces, as shown below:
The consideration for production needs to be how likely it is that the number pattern you seek will appear without a space or other boundary. For example, if there is a product serial number with the string
SESD12345678934333, then the expression modified above will flag the 9 digits inside (123456789) as a match, which is clearly a false positive.
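To make that trade-off concrete, here is a small sketch in Python's `re` module (standard regex, not McAfee's syntax; the patterns are simplified 9-digit stand-ins that I am assuming for illustration). In engines that support lookarounds, the usual way to get "no digit on either side" without requiring an actual character there is a negative lookbehind and lookahead.

```python
import re

# Hypothetical stand-in for the modified expression: no boundary checks at all.
UNBOUNDED = re.compile(r"\d{9}")

# Zero-width checks: the run must not be preceded or followed by another digit.
# Nothing is consumed, so start of line and end of line both pass.
BOUNDED = re.compile(r"(?<!\d)\d{9}(?!\d)")


def matches(pattern, text):
    """Return all non-overlapping matches of pattern in text."""
    return pattern.findall(text)


serial = "SESD12345678934333"
print(matches(UNBOUNDED, serial))              # ['123456789'] (false positive)
print(matches(BOUNDED, serial))                # []
print(matches(BOUNDED, "123456789\r\n"))       # ['123456789']
```

This only rules out longer digit runs. A standalone 9-digit order number would still match, so it narrows the false positives rather than removing them.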
rtrezza- That is the ONLY way I've found for DLP to detect SSNs when they're the only text in a line. Unfortunately, this causes a large number of false positives due to things like you stated, serial numbers, order numbers, foreign phone numbers, etc. I really wish this product was a more viable DLP solution.
Unfortunately you're coming up against the problem of machine learning - how do you tell that 1234567890 is a social security number, a telephone number, or a part number? Even you don't know if this is my social security number or not.
If I wrote it as 123-45-6789 you (and DLP) might make an inference that it's an SSN, just because of the tradition of putting the "-" in certain places, but what if I wrote it like 9876-54-321 - it's more vague.
You're going to find that DLP as a product category, regardless of which vendor you choose, has this same limitation. Unless there's a way of definitively describing a concept in a mathematically defined way, you're always going to be balancing accuracy against false positives.
I wish there was a good answer for you, but it's not a problem technology can solve.
There are regex patterns that can be fine-tuned to target SSNs. There are certain numbers that SSNs do not start with, and there are certain strings of numbers that are not used by the Social Security Administration. Regex patterns that take these into account do exist, just not in this product.
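As a sketch of what such a pattern can look like in a standard engine (Python's `re` here, not McAfee's syntax), the version below encodes the commonly cited rules: the first three digits are never 000, 666, or 900-999, the middle two are never 00, and the last four are never 0000. Allowing optional hyphens and requiring no adjacent digits are my own assumptions for the example.

```python
import re

SSN = re.compile(
    r"(?<!\d)"               # not preceded by a digit
    r"(?!000|666|9\d\d)"     # area number: not 000, 666, or 900-999
    r"\d{3}-?"
    r"(?!00)\d{2}-?"         # group number: not 00
    r"(?!0000)\d{4}"         # serial number: not 0000
    r"(?!\d)"                # not followed by a digit
)


def looks_like_ssn(text):
    """True if text contains a candidate that passes the structural rules."""
    return SSN.search(text) is not None


for sample in ["123-45-6789", "123456789", "000-12-3456",
               "666-12-3456", "900-12-3456", "123-00-6789", "123-45-0000"]:
    print(sample, looks_like_ssn(sample))
```

This only filters out numbers that can never be SSNs. A 9-digit part number that happens to pass the rules still matches, so the false-positive problem shrinks but does not go away.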