To create a custom data identifier programmatically, use the CreateCustomDataIdentifier operation of the Amazon Macie API. If you're using the AWS Command Line Interface (AWS CLI), run the create-custom-data-identifier command instead.
Before you create a custom data identifier, we strongly recommend that you test and
refine its detection criteria with sample data. Because custom data
identifiers are used by sensitive data discovery jobs, you can't change
a custom data identifier after you create it. This helps ensure that you have an immutable history of sensitive data findings and discovery results.
To test the criteria programmatically, you can use the TestCustomDataIdentifier operation of the Amazon Macie API.
This operation provides an environment for evaluating sample data with
detection criteria. If you're using the AWS CLI, you can run the test-custom-data-identifier command to test the
criteria.
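As one way to call this operation from code, the following sketch assumes the AWS SDK for Python (boto3), which this topic doesn't otherwise cover. The helper names `build_test_request` and `count_matches` and the sample text are hypothetical. The parameter names mirror the TestCustomDataIdentifier request fields, and the call to Macie only runs if you invoke `count_matches` with credentials and a Region configured.

```python
import json


def build_test_request(regex, sample_text, keywords=None,
                       ignore_words=None, maximum_match_distance=None):
    """Assemble keyword arguments for TestCustomDataIdentifier.

    Optional settings are left out when not supplied so that Macie
    applies its own defaults.
    """
    request = {"regex": regex, "sampleText": sample_text}
    if keywords:
        request["keywords"] = list(keywords)
    if ignore_words:
        request["ignoreWords"] = list(ignore_words)
    if maximum_match_distance is not None:
        request["maximumMatchDistance"] = maximum_match_distance
    return request


def count_matches(request):
    """Send the request to Macie and return the reported match count.

    Assumption: boto3 is installed and AWS credentials are configured.
    """
    import boto3  # imported lazily so the builder works without boto3

    client = boto3.client("macie2")
    response = client.test_custom_data_identifier(**request)
    return response["matchCount"]


# Hypothetical sample data for illustration only.
request = build_test_request(
    regex=r"[A-Z]-\d{8}",
    sample_text="employee ID: A-12345678",
    keywords=["employee", "employee ID"],
    maximum_match_distance=20,
)
print(json.dumps(request, indent=2))
```

Keeping the request builder separate from the network call makes it easier to review the criteria before any request is sent.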
When you're ready to create the custom data identifier, use the following
parameters to define its detection criteria:
- regex – Specify the regular expression (regex) that defines the text pattern to match. The regex can contain as many as 512 characters.
  Macie supports a subset of the pattern syntax provided by the Perl Compatible Regular Expressions (PCRE) library. For additional details and tips, see Detection criteria for custom data identifiers.
- keywords – Optionally specify 1–50 character sequences (keywords) that must be in proximity of text that matches the regex pattern.
  Macie includes an occurrence in results only if the text matches the regex pattern and the text is within the maximum match distance of one of these keywords. Each keyword can contain 3–90 UTF-8 characters. Keywords aren't case sensitive.
- maximumMatchDistance – Optionally specify the maximum number of characters that can exist between the end of a keyword and the end of text that matches the regex pattern. If you're using the AWS CLI, use the maximum-match-distance parameter to specify this value.
  Macie includes an occurrence in results only if the text matches the regex pattern and the text is within this distance of a complete keyword. The distance can be 1–300 characters. The default distance is 50 characters.
- ignoreWords – Optionally specify 1–10 character sequences (ignore words) to exclude from results. If you're using the AWS CLI, use the ignore-words parameter to specify these character sequences.
  Macie excludes an occurrence from results if the text matches the regex pattern but it contains one of these ignore words. Each ignore word can contain 4–90 UTF-8 characters. Ignore words are case sensitive.
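The way these parameters combine can be approximated locally. The following is a rough sketch, not Macie's implementation: it uses Python's `re` module, whose syntax differs from the PCRE subset that Macie supports, and it assumes that a keyword must end at or before the end of the matching text. The function name `find_occurrences` and the sample strings are hypothetical.

```python
import re


def find_occurrences(text, regex, keywords=(), max_distance=50, ignore_words=()):
    """Return regex matches that pass the keyword and ignore-word checks."""
    # Collect the end offset of every keyword occurrence.
    # Keywords aren't case sensitive, so match them with IGNORECASE.
    keyword_ends = []
    for keyword in keywords:
        for found in re.finditer(re.escape(keyword), text, re.IGNORECASE):
            keyword_ends.append(found.end())

    results = []
    for match in re.finditer(regex, text):
        value = match.group(0)
        # Ignore words are case sensitive: plain substring comparison.
        if any(word in value for word in ignore_words):
            continue
        if keywords:
            # Assumption: distance is measured from the end of a keyword
            # to the end of the matching text, with the keyword first.
            near = any(0 <= match.end() - end <= max_distance
                       for end in keyword_ends)
            if not near:
                continue
        results.append(value)
    return results


sample = "employee ID: A-12345678"
print(find_occurrences(sample, r"[A-Z]-\d{8}",
                       keywords=["employee", "employee ID"],
                       max_distance=20))
```

Treat results from a sketch like this as a first pass only. The TestCustomDataIdentifier operation remains the way to confirm how Macie evaluates the criteria.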
To specify the severity of sensitive data findings that the custom data identifier produces, use the severityLevels parameter or, if you're using the AWS CLI, the severity-levels parameter:
- To automatically assign the MEDIUM severity to all the findings, omit this parameter. Macie then uses the default setting. By default, Macie assigns the MEDIUM severity to a finding if the affected S3 object contains one or more occurrences of text that match the detection criteria.
- To assign severity based on occurrences thresholds that you specify, specify the minimum number of matches that must exist in an S3 object to produce a finding with a specified severity.
  You can specify as many as three occurrences thresholds, one for each severity level that Macie supports: LOW (least severe), MEDIUM, or HIGH (most severe). If you specify more than one, the thresholds must be in ascending order by severity, moving from LOW to HIGH. If an S3 object contains fewer occurrences than the lowest threshold, Macie doesn't create a finding.
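To make the threshold rules concrete, the following sketch maps an occurrence count to a severity. It is an illustration of the rules described above, not Macie's code. The names `check_thresholds` and `severity_for` are hypothetical, and the check assumes that thresholds must strictly increase from LOW to HIGH.

```python
SEVERITY_ORDER = ["LOW", "MEDIUM", "HIGH"]


def check_thresholds(levels):
    """Return levels sorted by severity, or raise ValueError if invalid."""
    ordered = sorted(levels, key=lambda lv: SEVERITY_ORDER.index(lv["severity"]))
    severities = [lv["severity"] for lv in ordered]
    if len(set(severities)) != len(severities):
        raise ValueError("each severity can appear only once")
    values = [lv["occurrencesThreshold"] for lv in ordered]
    # Assumption: "ascending order" means strictly increasing thresholds.
    if any(a >= b for a, b in zip(values, values[1:])):
        raise ValueError("thresholds must increase from LOW to HIGH")
    return ordered


def severity_for(occurrences, levels=None):
    """Return the severity for an occurrence count, or None for no finding."""
    if levels is None:
        # Default setting: MEDIUM for one or more occurrences.
        levels = [{"occurrencesThreshold": 1, "severity": "MEDIUM"}]
    result = None
    for level in check_thresholds(levels):
        # Keep the most severe level whose threshold has been reached.
        if occurrences >= level["occurrencesThreshold"]:
            result = level["severity"]
    return result


levels = [
    {"occurrencesThreshold": 1, "severity": "LOW"},
    {"occurrencesThreshold": 50, "severity": "MEDIUM"},
    {"occurrencesThreshold": 100, "severity": "HIGH"},
]
for count in (0, 1, 49, 50, 100):
    print(count, severity_for(count, levels))
```

A count below the lowest threshold returns `None`, which corresponds to Macie not creating a finding.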
Use additional parameters to specify a name and other settings, such as
tags, for the custom data identifier. Avoid including sensitive data in
these settings. Other users of your account might be able to access these
values, depending on the actions that they're allowed to perform in
Macie.
When you submit your request, Macie tests the settings and verifies that
it can compile the regex. If there's an issue with a setting or the regex,
the request fails and Macie returns a message that describes the issue. If
the request succeeds, you receive output similar to the following:
{
  "customDataIdentifierId": "393950aa-82ea-4bdc-8f7b-e5be3example"
}
Where customDataIdentifierId specifies the unique identifier (ID) for the custom data identifier that was created.
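Because Macie rejects a request whose settings are out of range, it can help to check the documented limits locally before you submit. The following sketch does only that. The function name `check_criteria` is hypothetical, it counts Python string characters as an approximation of UTF-8 characters, and it doesn't verify that Macie can compile the regex.

```python
def check_criteria(regex, keywords=None, maximum_match_distance=None,
                   ignore_words=None):
    """Return a list of issues found against the documented limits."""
    issues = []
    if not regex:
        issues.append("regex is required")
    elif len(regex) > 512:
        issues.append("regex exceeds 512 characters")

    if keywords is not None:
        if not 1 <= len(keywords) <= 50:
            issues.append("keywords must contain 1-50 entries")
        for keyword in keywords:
            if not 3 <= len(keyword) <= 90:
                issues.append(f"keyword {keyword!r} must be 3-90 characters")

    if maximum_match_distance is not None:
        if not 1 <= maximum_match_distance <= 300:
            issues.append("maximumMatchDistance must be 1-300")

    if ignore_words is not None:
        if not 1 <= len(ignore_words) <= 10:
            issues.append("ignoreWords must contain 1-10 entries")
        for word in ignore_words:
            if not 4 <= len(word) <= 90:
                issues.append(f"ignore word {word!r} must be 4-90 characters")
    return issues


print(check_criteria(r"[A-Z]-\d{8}", ["employee", "employee ID"], 20))
print(check_criteria(r"[A-Z]-\d{8}", ["id"], 301))
```

An empty list means that no limit was exceeded. Macie can still reject the request for other reasons, such as unsupported regex syntax.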
To subsequently retrieve and review the settings for the custom data identifier, use the GetCustomDataIdentifier operation or, if you're using the AWS CLI, run the get-custom-data-identifier command. For the id parameter, specify the custom data identifier's ID.
The following examples show how to use the AWS CLI to create a custom data
identifier. The examples create a custom data identifier that's designed to
detect employee IDs that use a specific syntax and are within proximity of a
specified keyword. The examples also define custom severity settings for
findings that the identifier produces.
This example is formatted for Linux, macOS, or Unix, and it uses the backslash (\) line-continuation character to improve readability.
$ aws macie2 create-custom-data-identifier \
--name "EmployeeIDs" \
--regex "[A-Z]-\d{8}" \
--keywords '["employee","employee ID"]' \
--maximum-match-distance 20 \
--severity-levels '[{"occurrencesThreshold":1,"severity":"LOW"},{"occurrencesThreshold":50,"severity":"MEDIUM"},{"occurrencesThreshold":100,"severity":"HIGH"}]' \
--description "Detects employee IDs in proximity of a keyword." \
--tags '{"Stack":"Production"}'
This example is formatted for Microsoft Windows, and it uses the caret (^) line-continuation character to improve readability.
C:\> aws macie2 create-custom-data-identifier ^
--name "EmployeeIDs" ^
--regex "[A-Z]-\d{8}" ^
--keywords "[\"employee\",\"employee ID\"]" ^
--maximum-match-distance 20 ^
--severity-levels "[{\"occurrencesThreshold\":1,\"severity\":\"LOW\"},{\"occurrencesThreshold\":50,\"severity\":\"MEDIUM\"},{\"occurrencesThreshold\":100,\"severity\":\"HIGH\"}]" ^
--description "Detects employee IDs in proximity of a keyword." ^
--tags={\"Stack\":\"Production\"}
Where:
- EmployeeIDs is the name of the custom data identifier.
- [A-Z]-\d{8} is the regex for the text pattern to match.
- employee and employee ID are keywords that must be in proximity of text that matches the regex pattern.
- 20 is the maximum number of characters that can exist between the end of a keyword and the end of text that matches the regex pattern.
- description specifies a brief description of the custom data identifier.
- severity-levels defines custom occurrences thresholds for the severity of findings that the custom data identifier produces: LOW for 1–49 occurrences; MEDIUM for 50–99 occurrences; and HIGH for 100 or more occurrences.
- Stack is the tag key of the tag to assign to the custom data identifier. Production is the tag value for the specified tag key.
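For comparison, the same request can be expressed with the AWS SDK for Python (boto3). This is a sketch under the assumption that boto3 is installed and credentials are configured; the SDK isn't otherwise covered here, and the helper names `build_create_request` and `create_identifier` are hypothetical. The values are the ones used in the preceding AWS CLI examples.

```python
import json


def build_create_request():
    """Return CreateCustomDataIdentifier arguments matching the CLI example."""
    return {
        "name": "EmployeeIDs",
        "regex": r"[A-Z]-\d{8}",
        "keywords": ["employee", "employee ID"],
        "maximumMatchDistance": 20,
        "severityLevels": [
            {"occurrencesThreshold": 1, "severity": "LOW"},
            {"occurrencesThreshold": 50, "severity": "MEDIUM"},
            {"occurrencesThreshold": 100, "severity": "HIGH"},
        ],
        "description": "Detects employee IDs in proximity of a keyword.",
        "tags": {"Stack": "Production"},
    }


def create_identifier(request):
    """Submit the request and return the new custom data identifier's ID.

    Assumption: boto3 is installed and AWS credentials are configured.
    """
    import boto3  # imported lazily so the builder works without boto3

    client = boto3.client("macie2")
    response = client.create_custom_data_identifier(**request)
    return response["customDataIdentifierId"]


create_request = build_create_request()
print(json.dumps(create_request, indent=2))
```

Calling `create_identifier(create_request)` would send the request. Because a custom data identifier can't be changed after it's created, review the printed settings first.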