Thursday, January 24, 2013

PII De-Identification using 11g Data Pump Export

Sometime back, I was asked to de-identify sensitive production data for our testing lab. Our lab personnel are not allowed to test new code or use cases with Personally Identifiable Information (PII) but desperately needed to debug data integrity and performance issues with a representative volume of data from production.

Requirements
R1: The Data Masking option for Enterprise Manager cannot be used (no license).
R2: New data or index segments cannot be created to facilitate the operation.
R3: The process must be easily repeatable and scalable.
R4: For the development and quality assurance teams to effectively work  from the lab domain with specific problematic data found in production, the primary and foreign keys (all surrogate values - thankfully they are never updated or reused) must be preserved in the de-identified data set so that authorized personnel can easily identify individuals with masked PII.

My example PII tables are as follows:
Table: PERSON
 Columns: PERSONID (primary key), FIRSTNAME, LASTNAME, BIRTHDATE, GENDER

Table: ADDRESS
 Columns: ADDRESSID (primary key), PERSONID (foreign key), ADDRESSTYPE, STREET, CITY, STATE, ZIP

Analysis
Given the requirements, I researched my available options and took the opportunity to explore what the new REMAP_DATA parameter in 11g Data Pump export might be able to do for us. The REMAP_DATA parameter allows for the transformation or masking of columnar data through custom PL/SQL functions which I will further refer to here as "remap functions". Each REMAP_DATA parameter must map a table column to a remap function, and each remap function must specify a single input parameter declared with the same data type as the column. During execution for each row exported, the REMAP_DATA parameter passes the column value to the input parameter of the remap function where the data transformation logic can mask it. For example, a remap function might take an SSN and strip all but the last four digits; the stripped value would be returned and substituted in the exporting row. Furthermore, the R1 requirement can be met since no extra license is required to use the REMAP_DATA parameter with Data Pump which, in turn, is covered under the Oracle Database license.

To meet the second requirement, I used PL/SQL arrays of fake name and address data to feed my remap functions. I used the plural "functions" since I have multiple PII elements in as many fields (i.e. first name, last name, street address, etc.) that require masking. However, the same function can be used to mask different fields (i.e., middle name) or just return a single arbitrary value for one type of PII data. Also, I only want to mask the identity data not the demographic information. For instance, I want to preserve gender and birth dates but mask personal identifiers and contact information; our business analysts certainly appreciate having their production metrics maintained in the final product instead of looking at skewed reports from scrambled age data!

The R3 requirement can be easily summed up in that Data Pump export is a server-side utility that can leverage more powerful system resources that come with server-class hardware. So, given the same associated scripts and parameter files for this task, any other DBA who has permissions to export the same data should generally observe similar performance results. Also, the custom PL/SQL functions that are compiled in the database (further supporting the prior postulate) and required for data masking through the REMAP_DATA parameter are relatively straightforward for this task of just substituting PII for random name and address data.

The fourth and final requirement - key preservation - allows us to accurately debug specific data issues found in production without unnecessarily exposing a person's real identity to the entire lab. For example, someone can be Joe Smith in production but masked as Jack Summers in the lab while still having PERSONID 123 in both places.  By having a secure method to match PII with their masked counterparts through their common unique key values, specific data issues found in production can be targeted for resolution in the lab.

Implementation
Besides the remap functions, the package - we'll call it pkgDeIdentification - will have declared a "sample data" array for each type of PII to hold the first names, last names, or street names containing the contrived seed data from such external sources as fictional characters, plants/animals/minerals, state names and their capitals, etc. Aggregate data from online resources such as the Census Bureau can also be used to populate the sample data arrays. Each item in these sample data arrays is hard-coded at a particular position upon package instantiation. For each remap function, another data array, further referred to here as a "linked queue", will be declared and also populated upon package instantiation by a procedure declared in the package body that selects, at random, items from the sample data array associated with that particular element of PII.

Now, for those of you familiar with the use of the REMAP_DATA parameter, you might be asking why I'm using these "linked queues" to return the masked data through the remap functions instead of just masking the value passed through the input parameter for each exported row. Well, I've got three reasons all of which have to do with data validity and the fact that nothing of the exporting record is known to the remap function besides the column value passed to it:
1) I don't want masked male first names to end up in female PERSON records or vice versa. Without knowing the gender of the exporting PERSON record, I can't determine whether a first name of John or Jennifer is appropriate.
2) Familial coherence and an accurate distribution of distinct last names in your de-identified data set will save you the headache of explaining to your business analysts why you were so lazy as to give each member of a nuclear family of six a different last name.
3) I want to mask home and mailing addresses but allow the real work addresses in the de-identified data set. 

So, as you'll see in the code below, I never do anything with the inString input parameter specified in the remap functions since the entire de-identified data set has already been conditionally generated and stored in these linked queues; nonetheless, the REMAP_DATA parameter will not work with a remap function that doesn't have an input parameter specified and of the same data type as the column being masked.

For each REMAP_DATA parameter, the remap function will return the first item from its associated linked queue but not before first deleting that item from the array; developers refer to this technique as "popping the stack". Popping the stack - or in ths case, linked queue - ensures that the next value is fetched out of the array and removed with each row exported rather than pulling the same first item over and over again; this is also why, for the data validity reasons noted before, the queues must be ordered appropriately by using some unique key in the looping structures and in concert with corresponding export QUERY parameters for the PII tables. As an alternative to these queues, pipelined table functions and object types may also work here for this effort although I have not tried due to the R2 requirement.

Package Specification and Body
CREATE OR REPLACE PACKAGE pkgDeIdentification
AS
-- header declarations for remap functions
 FUNCTION fncFirstNameMasked(inString VARCHAR2)
  RETURN VARCHAR2;

 FUNCTION fncLastNameMasked(inString VARCHAR2)
  RETURN VARCHAR2;

 FUNCTION fncStreetMasked(inString VARCHAR2)
  RETURN VARCHAR2;
END pkgDeIdentification;
/
CREATE OR REPLACE PACKAGE BODY pkgDeIdentification
AS
 TYPE tFirstNames IS
  TABLE OF VARCHAR2(20)
  INDEX BY PLS_INTEGER;

 TYPE tLastNames IS
  TABLE OF VARCHAR2(26)
  INDEX BY PLS_INTEGER;

 TYPE tAddresses IS
  TABLE OF VARCHAR2(40)
  INDEX BY PLS_INTEGER;

 mSampleMaleFirstNames    tFirstNames;
 mSampleFemaleFirstNames tFirstNames;
 mSampleLastNames            tLastNames;
 mSampleStreetNames         tAddresses;

 mFirstNamesQueue    tFirstNames;
 mLastNamesQueue     tLastNames;
 mAddressesQueue     tAddresses;

 PROCEDURE prcInitializeRandomizer
 AS
  lRandomBytes  RAW(64);
 BEGIN
  lRandomBytes := DBMS_CRYPTO.randomBytes(64);
  DBMS_RANDOM.seed(UTL_RAW.CAST_TO_VARCHAR2(lRandomBytes));
 END prcInitializeRandomizer;

 PROCEDURE prcPopulateSampleArrays
 AS
 BEGIN
 -- populate male first names array (1000)
  mSampleMaleFirstNames(1) := 'James';
  mSampleMaleFirstNames(2) := 'John';
  mSampleMaleFirstNames(3) := 'Robert';
  mSampleMaleFirstNames(4) := 'Michael';
...
...
... 
 -- populate female first names array (1000)
  mSampleFemaleFirstNames(1) := 'Mary';
  mSampleFemaleFirstNames(2) := 'Patricia';
  mSampleFemaleFirstNames(3) := 'Linda';
  mSampleFemaleFirstNames(4) := 'Barbara';
...
...
...
 -- populate last names array (5000)
  mSampleLastNames(1) := 'Smith';
  mSampleLastNames(2) := 'Johnson';
  mSampleLastNames(3) := 'Williams';
  mSampleLastNames(4) := 'Jones';
...
...
...
 -- populate street name array (100)
  mSampleStreetNames(1) := 'Alabama';
  mSampleStreetNames(2) := 'Alaska';
  mSampleStreetNames(3) := 'Arizona';
  mSampleStreetNames(4) := 'Arkansas';
...
...
...
 END prcPopulateSampleArrays;

 PROCEDURE prcPopulateMaskedArrays
 AS
  lRandomNumber       NUMBER;
  lRandomFirstName    VARCHAR2(20);
  lMaskedLastName     VARCHAR2(26);
  lRandomAddress      VARCHAR2(40);
  
  TYPE tMaskedLastNames IS TABLE OF VARCHAR2(26) INDEX BY VARCHAR2(26);
  lMaskedLastNames tMaskedLastNames;
  
  FUNCTION lfRandomNumber(inLowerBound  NUMBER,
                                                 inUpperBound  NUMBER)
   RETURN NUMBER
  IS
  BEGIN
   RETURN TRUNC(DBMS_RANDOM.value(inLowerBound, inUpperBound));
  END lfRandomNumber;
 BEGIN
  -- substitute distinct last names to maintain distribution and familial coherence
  FOR rec IN(SELECT DISTINCT UPPER(TRIM(LASTNAME)) AS LASTNAME FROM PERSON)
  LOOP
   lMaskedLastNames(rec.LASTNAME) := mSampleLastNames(lfRandomNumber(1, mSampleLastNames.COUNT));
  END LOOP;
  
  -- fetch last name and generate first name 
  FOR rec IN(SELECT UPPER(TRIM(LASTNAME)) AS LASTNAME, GENDER
                     FROM PERSON ORDER BY PERSONID)
  LOOP
   -- fetch masked last name indexed by real last name
   lMaskedLastName := lMaskedLastNames(rec.LASTNAME);

   IF (rec.GENDER='MALE') THEN
    lRandomFirstName  := mSampleMaleFirstNames(lfRandomNumber(1, mSampleMaleFirstNames.COUNT));
   ELSE
    lRandomFirstName  := mSampleFemaleFirstNames(lfRandomNumber(1, mSampleFemaleFirstNames.COUNT));
   END IF;
   
   -- add to name arrays
   mFirstNamesQueue(mFirstNamesQueue.COUNT+1)   := lRandomFirstName;
   mLastNamesQueue(mLastNamesQueue.COUNT+1)     := lMaskedLastName;
  END LOOP;

  -- generate random street address
  FOR rec IN(SELECT STREET, ADDRESSTYPE FROM ADDRESS ORDER BY ADDRESSID)
  LOOP
   IF (rec.ADDRESSTYPE IN('HOME','MAIL')) THEN
    -- generate random house number
    lRandomAddress := lfRandomNumber(10,99999);
    -- fetch random street name
    lRandomAddress := lRandomAddress || ' ' || mSampleStreetNames(lfRandomNumber(1, mSampleStreetNames.COUNT));

    -- select random type of thoroughfare
    CASE lfRandomNumber(1,9)
     WHEN 1 THEN
      lRandomAddress := lRandomAddress || ' Avenue';
     WHEN 2 THEN
      lRandomAddress := lRandomAddress || ' Boulevard';
     WHEN 3 THEN
      lRandomAddress := lRandomAddress || ' Court';
     WHEN 4 THEN
      lRandomAddress := lRandomAddress || ' Drive';
     WHEN 5 THEN
      lRandomAddress := lRandomAddress || ' Lane';
     WHEN 6 THEN
      lRandomAddress := lRandomAddress || ' Parkway';
     WHEN 7 THEN
      lRandomAddress := lRandomAddress || ' Road';
     WHEN 8 THEN
      lRandomAddress := lRandomAddress || ' Street';
     ELSE
      lRandomAddress := lRandomAddress || ' Way';
    END CASE;

    mAddressesQueue(mAddressesQueue.COUNT+1) := lRandomAddress;
   ELSIF (rec.ADDRESSTYPE='WORK') THEN
    mAddressesQueue(mAddressesQueue.COUNT+1) := rec.STREET;
   END IF;
  END LOOP;
 END prcPopulateMaskedArrays;

 FUNCTION fncFirstNameMasked(inString VARCHAR2)
  RETURN VARCHAR2
 IS
  lReturnValue VARCHAR2(20);
 BEGIN
  --return the first item in the array and pop it off the queue
  lReturnValue := mFirstNamesQueue(mFirstNamesQueue.FIRST);
  mFirstNamesQueue.DELETE(mFirstNamesQueue.FIRST);
  RETURN lReturnValue;
 END fncFirstNameMasked;

 FUNCTION fncLastNameMasked(inString VARCHAR2)
  RETURN VARCHAR2
 IS
  lReturnValue VARCHAR2(26);
 BEGIN
  --return the first item in the array and pop it off the queue
  lReturnValue := mLastNamesQueue(mLastNamesQueue.FIRST);
  mLastNamesQueue.DELETE(mLastNamesQueue.FIRST);
  RETURN lReturnValue;  
 END fncLastNameMasked;

 FUNCTION fncStreetMasked(inString VARCHAR2)
  RETURN VARCHAR2
 IS
  lReturnValue VARCHAR2(40);
 BEGIN
  --return the first item in the array and pop it off the queue
  lReturnValue := mAddressesQueue(mAddressesQueue.FIRST);
  mAddressesQueue.DELETE(mAddressesQueue.FIRST);
  RETURN lReturnValue;
 END fncStreetMasked;

 BEGIN
  prcInitializeRandomizer;
  prcPopulateSampleArrays;
  prcPopulateMaskedArrays;
 END pkgDeIdentification;
/

Export parameters
remap_data=PERSON.LASTNAME:pkgDeIdentification.fncLastNameMasked
remap_data=PERSON.FIRSTNAME:pkgDeIdentification.fncFirstNameMasked
remap_data=ADDRESS.STREET:pkgDeIdentification.fncStreetMasked
query=PERSON:" ORDER BY PERSONID"
query=ADDRESS:" ORDER BY ADDRESSID"

Results
I have not collected any formal performance metrics on the execution of the procedures and functions in this package together with the REMAP_DATA parameter, however, it takes 15-20 minutes to export about 1,000,000 PERSON rows and about 1,500,000 ADDRESS rows - 20% of which are WORK addresses that are not masked - along with all of the other tables (about 90 other tables having a combined 15,000,000 rows). Please note that I do not use the PARALLEL parameter with these exports (discussed below).

Special Considerations
Another issue arises if the PARALLEL parameter is used in the export invoking multiple Data Pump worker processes when the production database has been configured for dedicated server rather than shared server. With the database in dedicated server mode, each worker process will invoke its own session and therefore its own instantiation of the pkgDeIdentification package. Since each package instantiation randomly selects from the sample data arrays to populate the linked queues used by the remap functions, this situation can result in a skewed distribution of masked last names which may not be desirable if it is important to maintain familial coherence representative of the production data. Also, depending on the estimated size of the de-identified data set and the degree of parallelism, the amount of memory consumed by each worker process for the allocation of the package arrays may deplete the PGA, but using a sizable large pool may help offset this impact.

Conclusion
The REMAP_DATA parameter and a little PL/SQL know-how have been great aids in the task of masking our PII for the lab personnel without having to re-key our tables or skewing our statistical and demographic data. With Data Pump doing all of the work on the server end, this new de-identification process has saved us time and money, and with a little customization, passing these scripts and parameters off to other departments for use in their PII databases would serve the entire organization with greater positive returns on the initial investment.

No comments:

Post a Comment