I've had some success today, actually. I'm working with the SSN concept, and I found that using these expressions instead of the defaults solved my issue.
So, if I were to offer a suggestion, it'd be to work with digits and non-digits exclusively.
I have NOT yet made this change in production, so I don't have data on false positives yet.
That is the pattern I've found most helpful thus far. My issue is when there is nothing else on a particular line, for example:
there are no spaces or characters before or after the number, only a carriage return after. When I try to validate, it comes up as false. I have to put either a space or another non-digit character before and after for it to be recognized as matching their "regex".
Assuming you are using the default concept for "SOCIAL-SECURITY-NUMBER", that expression requires that the string begin with a whitespace character (\s) and end with a non-digit character (\D). If you remove those items from the default expression, the pattern will validate without the spaces. Here is the default concept:
I removed the leading \s and trailing \D, and now when I enter the pattern into the validate window, it no longer requires the spaces, as shown below:
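For anyone who wants to reproduce this behavior outside the product, here is a minimal Python sketch. The patterns are my own simplified stand-ins, not the vendor's actual default concept: `DEFAULT_LIKE` only mimics the leading \s and trailing \D requirement described above, and `STRIPPED` is the same expression with both removed. The product's regex engine may also behave differently from Python's `re` module.

```python
import re

# Hypothetical stand-in for the default concept (NOT the vendor's real expression):
# it requires a whitespace character before the number and a non-digit after it.
DEFAULT_LIKE = re.compile(r"\s\d{3}-?\d{2}-?\d{4}\D")

# The same stand-in with the leading \s and trailing \D removed.
STRIPPED = re.compile(r"\d{3}-?\d{2}-?\d{4}")


def matches(pattern, text):
    """Return True if the pattern is found anywhere in the text."""
    return pattern.search(text) is not None


lone_line = "123456789\r"   # number alone on a line, only a carriage return after
padded = " 123456789 "      # same number with a space before and after

print(matches(DEFAULT_LIKE, lone_line))  # no whitespace before the number
print(matches(DEFAULT_LIKE, padded))
print(matches(STRIPPED, lone_line))
```

The first check fails because nothing precedes the number for \s to consume, which matches what the validate window was reporting.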
The consideration for production is how likely it is that the number pattern you seek will appear without a space or other boundary. For example, if there is a product serial number with the string
SESD12345678934333, then the expression modified above will flag the 9 digits inside (123456789) as a match, which is clearly a false positive.
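To make the serial number case concrete, here is a small Python sketch. `STRIPPED` is a simplified stand-in for the modified expression, and `BOUNDED` is one possible variant I am assuming for illustration: it uses lookarounds to say "not preceded or followed by a digit" instead of consuming a \s or \D. I have not confirmed that the DLP product's regex engine supports lookarounds, so treat this as an idea to test rather than a drop-in fix.

```python
import re

# Simplified stand-in for the modified expression: nine digits, no boundary checks.
STRIPPED = re.compile(r"\d{9}")

# Assumed variant: nine digits that are not part of a longer run of digits.
# Lookarounds assert the boundary without needing a character to be present,
# so the start or end of a line also counts as a boundary.
BOUNDED = re.compile(r"(?<!\d)\d{9}(?!\d)")

serial = "SESD12345678934333"   # serial number from the example above
lone_line = "123456789\r"       # number alone on a line

stripped_hit = STRIPPED.search(serial)
print(stripped_hit.group() if stripped_hit else None)  # digits flagged inside the serial
print(BOUNDED.search(serial))                           # longer digit run is skipped
print(BOUNDED.search(lone_line) is not None)            # lone number still detected
```

This only removes the "nine digits inside a longer number" class of false positive. A standalone nine-digit order number or part number still matches, so the trade-off discussed in this thread remains.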
rtrezza- That is the ONLY way I've found for DLP to detect SSNs when they're the only text in a line. Unfortunately, this causes a large number of false positives due to things like you stated, serial numbers, order numbers, foreign phone numbers, etc. I really wish this product was a more viable DLP solution.
Unfortunately, you're coming up against the problem of machine learning: how do you tell whether 1234567890 is a social security number, a telephone number, or a part number? Even you don't know if this is my social security number or not.
If I wrote it as 123-45-6789 you (and DLP) might make an inference that it's an SSN, just because of the tradition of putting the "-" in certain places, but what if I wrote it like 9876-54-321 - it's more vague.
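As a rough illustration of that inference, here is a Python sketch of a format-only check. The pattern and the function name `looks_like_ssn_format` are hypothetical, and the check says nothing about whether the digits are actually anyone's SSN. It only recognizes the traditional 3-2-4 hyphen placement.

```python
import re

# Hypothetical format-only pattern: three digits, hyphen, two digits, hyphen,
# four digits, not embedded in a longer run of digits.
HYPHENATED = re.compile(r"(?<!\d)\d{3}-\d{2}-\d{4}(?!\d)")


def looks_like_ssn_format(text):
    """Return True if the text contains digits grouped in the 3-2-4 hyphen layout."""
    return HYPHENATED.search(text) is not None


for candidate in ["123-45-6789", "9876-54-321", "1234567890"]:
    print(candidate, looks_like_ssn_format(candidate))
```

Only the first candidate has the expected layout. The other two are rejected on formatting alone, which is exactly the kind of inference that produces misses when an SSN is written without hyphens and false positives when something else happens to use the same layout.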
You're going to find that DLP as a product category, regardless of which vendor you choose, has this same limitation: unless there's a way of definitively describing a concept in a mathematically defined way, you're always going to be balancing accuracy against false positives.
I wish there was a good answer for you, but it's not a problem technology can solve.