[DRAFT] rg-9-26-23-Exact Data Match Data Indexer

The EDM Data Indexer creates irreversible hash fingerprints of your critical data records and uploads them to Umbrella, into the template of the configured Exact Data Match (EDM) identifier.

Before generating the hash fingerprints, the data indexer validates that the submitted records and their values conform to the field types defined and supported in the exact data match template.

Prerequisites
Run the Initial Data Index
Update the Indexed Data Set Periodically

Prerequisites

Full Admin role in Secure Access. See Manage User Roles.
JVM version 17+
The machine where the data indexer is downloaded must be able to connect to the following endpoints:
- POST https://api.umbrella.com/auth/v2/token
- GET https://api.umbrella.com/policies/v2/edm/<edm_template_id>
- POST https://api.umbrella.com/policies/v2/edm/<edm_template_id>/data

Note: <edm_template_id> is the ID of the EDM identifier, which you can retrieve from the Umbrella UI.
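The connectivity requirement above can be checked before downloading the indexer. The following is a minimal sketch, not part of the EDM tooling: the function names `edm_endpoints` and `is_reachable` are hypothetical, and the check assumes that any HTTP response, including an error status such as 401, shows that the network path to the host is open. It sends an unauthenticated GET regardless of the listed method, so it tests reachability only, not credentials.

```python
import urllib.error
import urllib.request

API_BASE = "https://api.umbrella.com"


def edm_endpoints(edm_template_id: str) -> list[tuple[str, str]]:
    """Return the (method, URL) pairs listed in the prerequisites."""
    return [
        ("POST", f"{API_BASE}/auth/v2/token"),
        ("GET", f"{API_BASE}/policies/v2/edm/{edm_template_id}"),
        ("POST", f"{API_BASE}/policies/v2/edm/{edm_template_id}/data"),
    ]


def is_reachable(url: str, timeout: float = 10.0) -> bool:
    """Return True if the host answers at all (assumption: any HTTP status counts)."""
    try:
        urllib.request.urlopen(url, timeout=timeout)
        return True
    except urllib.error.HTTPError:
        # The server responded with an error status, so the network path works.
        return True
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, TLS failure, or timeout.
        return False
```

Calling `is_reachable(url)` for each pair returned by `edm_endpoints("<edm_template_id>")` gives a quick yes or no per endpoint. A `False` result usually points to a firewall or proxy rule rather than a problem with the indexer itself.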

The EDM Data Indexer must be downloaded after the template for the EDM identifier is created. See Steps 1-7 in Create an Exact Data Match Identifier.
The API Key and Secret must be generated for the EDM data indexer. See step 6 in Create an Exact Data Match Identifier.
The indexer supports files with up to 55 million records. The exact record limit depends on the total number of columns and on how many of those columns are of the Alphanumeric type. The indexer displays the exact limit when you attempt to load a file that exceeds it. If your data set is larger than the limit, split the records into multiple files.
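Splitting an oversized data set can be scripted. The sketch below is one possible approach under stated assumptions: `split_csv` and its parameters are hypothetical names, `max_records` must be taken from the limit that the indexer reports for your file, and `has_header` exists only because this page does not say whether the source file carries a header row.

```python
import csv
from pathlib import Path


def split_csv(source: str, max_records: int, has_header: bool = False) -> list[Path]:
    """Write `source` into numbered part files of at most `max_records` records each."""
    if max_records < 1:
        raise ValueError("max_records must be at least 1")
    src = Path(source)
    parts: list[Path] = []
    out = None
    writer = None
    count = 0
    with src.open(newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        # Assumption: a header row, if present, is repeated in every part file.
        header = next(reader, None) if has_header else None
        for row in reader:
            if writer is None or count == max_records:
                if out is not None:
                    out.close()
                part = src.with_name(f"{src.stem}_part{len(parts) + 1}{src.suffix}")
                out = part.open("w", newline="", encoding="utf-8")
                writer = csv.writer(out)
                if header is not None:
                    writer.writerow(header)
                parts.append(part)
                count = 0
            writer.writerow(row)
            count += 1
    if out is not None:
        out.close()
    return parts
```

For a file named `records.csv`, the parts are written next to it as `records_part1.csv`, `records_part2.csv`, and so on.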
The CSV file can contain the following values:
- Alphanumeric record values can contain words composed of 1-byte and 2-byte UTF-8 encoded characters, such as words in European alphabets, separated by spaces.
- Record values must qualify as one of the specific EDM types. For more information, see Exact Data Matcher Types.
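The byte-width condition on Alphanumeric values can be illustrated with a short check. This is a sketch only: `fits_alphanumeric_encoding` is a hypothetical helper that tests character width in UTF-8 and nothing else, so it does not replace the indexer's own type validation.

```python
def fits_alphanumeric_encoding(value: str) -> bool:
    """Return True if every character encodes to 1 or 2 bytes in UTF-8."""
    return all(len(ch.encode("utf-8")) <= 2 for ch in value)
```

For example, "Zoë Müller" passes because "ë" and "ü" each take 2 bytes, while a value containing CJK characters does not, because those characters take 3 bytes each.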

Note: If any value in the source file fails validation against the supported format, the data indexer skips that record and continues indexing the remaining records. The same applies to records that contain more values than the fields defined in the template, to empty rows, and to records with empty primary values. The output of the data indexer lists the positions of the skipped records in the file.
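Some of the skip conditions in the note above can be checked locally before a run. The following sketch covers only empty rows, empty primary values, and records with more values than template fields. The function name and the `field_count` and `primary_index` parameters are hypothetical and must be set to match your own template; the check does not reproduce the indexer's EDM type validation, and its positions count CSV records starting at 1, which may differ from how the indexer reports positions.

```python
import csv


def find_skippable_records(
    path: str, field_count: int, primary_index: int = 0
) -> list[tuple[int, str]]:
    """Return (position, reason) for records the note above says would be skipped."""
    problems: list[tuple[int, str]] = []
    with open(path, newline="", encoding="utf-8") as handle:
        for position, row in enumerate(csv.reader(handle), start=1):
            if not any(value.strip() for value in row):
                problems.append((position, "empty row"))
            elif len(row) > field_count:
                problems.append((position, "more values than template fields"))
            elif primary_index >= len(row) or not row[primary_index].strip():
                problems.append((position, "empty primary value"))
    return problems
```

An empty result does not guarantee that the indexer accepts every record, since type validation happens only inside the indexer.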

Run the Initial Data Index

When you create a new EDM identifier, you need to run the EDM Data Indexer for the first time to upload the first set of data records. For the full procedure on creating an EDM identifier, see Create an Exact Data Match Identifier.

Run the indexer in a terminal window with the following command:
java -jar edm-lander.jar -i <source_file.csv> -e <edm_template_id> -k <authKey> -s <authSecret>
where:

<source_file.csv>: the relative path to the CSV spreadsheet with the actual data records.
<edm_template_id>: the ID of the EDM identifier, retrievable from the Umbrella UI.
<authKey>: the API Key generated at Step 6d in Create an Exact Data Match Identifier.
<authSecret>: the API Secret generated at Step 6d in Create an Exact Data Match Identifier.

The exact data matcher now has a status of Data Indexed.
Note: When the EDM has a status of Data Indexed, you can add the EDM to a data classification, but you cannot edit the field types, primary field selection, or matching condition.

Update the Indexed Data Set Periodically

When your source CSV file is updated with new records, the existing EDM data index on your configured policy must be updated to reflect the new data fingerprints. This procedure lets you rerun the indexer periodically to upload your updated source data to Secure Access without repeating the initial procedure. After you rerun the data indexer with the updated source file against the EDM ID of your EDM data identifier, the DLP policy configured with this EDM data identifier accounts for the most recent updates to your critical records.

In a terminal window, set the environment variables EDM_AUTH_KEY and EDM_AUTH_SECRET to the API Key and Secret generated in Step 6d of the Create an Exact Data Match Identifier procedure.
Run the following command as part of a periodically executed script or as needed:
java -jar edm-indexer.jar -i source_file.csv -e template-id
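The two steps above can be combined in a small script suitable for a scheduler. This is a minimal sketch under stated assumptions: the function names are hypothetical, the jar is assumed to sit in the working directory, and the meaning of the indexer's exit code is not documented on this page, so the script only passes it through.

```python
import os
import subprocess
from typing import Optional


def update_command(source_file: str, edm_template_id: str,
                   jar: str = "edm-indexer.jar") -> list[str]:
    """Build the argument list for the update run shown above."""
    return ["java", "-jar", jar, "-i", source_file, "-e", edm_template_id]


def indexer_env(auth_key: str, auth_secret: str,
                base: Optional[dict] = None) -> dict:
    """Copy the environment and add the two variables the indexer reads."""
    env = dict(os.environ if base is None else base)
    env["EDM_AUTH_KEY"] = auth_key
    env["EDM_AUTH_SECRET"] = auth_secret
    return env


def run_update(source_file: str, edm_template_id: str,
               auth_key: str, auth_secret: str) -> int:
    """Run the indexer and return its exit code (semantics not documented here)."""
    result = subprocess.run(
        update_command(source_file, edm_template_id),
        env=indexer_env(auth_key, auth_secret),
        check=False,
    )
    return result.returncode
```

Setting the variables only for the child process keeps the key and secret out of the command line and out of the parent shell's environment. How the script itself obtains the credentials, for example from a secrets manager, is left open.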

Note: If the data indexer fails to process the input file and returns a base64-encoded error code, provide that code to Umbrella Support to assist with troubleshooting.
