Thursday, May 30, 2013

Killing Sessions from a Rogue Program with Oracle Resource Manager


I thought this might be an interesting approach to share with those DBAs who have been tasked with preventing certain applications from accessing the databases they maintain.

Here's the scenario: a legacy helper application that should have been uninstalled from - what was then - an indeterminate number of our remote sites started reverting server data to a state that was not compliant with the latest software release of our primary application. Of course, chaos ensued shortly thereafter when some of the compliant sites began automatically syncing those errant data changes to their local caches through a separate background replication process. We were facing rampant data corruption and imminent systemic collapse while the system analysts feverishly tracked down which machines to target for immediate rebuilds. In the meantime, I had to prevent this "rogue program" from wreaking further havoc on our server data and remote sites.

I chose to use the Oracle Resource Manager for a few reasons. First of all, it didn't require any production downtime and only very little scripting to implement the changes immediately. Secondly, the changes required would not affect any compliant sites connecting to the same database accounts whether they were connected now or in the future, but any current or future sessions established by the rogue program through those same accounts would be killed instantly by ORM. Thirdly, I wanted to pursue an Oracle built-in solution rather than a custom solution such as a database logon trigger with application context detection that may have required thorough testing, an obstacle in a critical production situation like this.

The script below is a sanitized version of what was implemented and assumes that Oracle Resource Manager has never been configured on the affected server instance. Given this perceived advantage, I decided to use the built-in Oracle resource plan DEFAULT_PLAN which is already configured for use with the built-in consumer groups DEFAULT_CONSUMER_GROUP, SYS_GROUP, and OTHER_GROUPS. Moreover, the Oracle accounts used by the applications were assigned initially to DEFAULT_CONSUMER_GROUP upon login, so any sessions connecting through those accounts would follow the established plan directives of that consumer group configured under the DEFAULT_PLAN. Any concerns regarding the built-in resource plan and consumer groups should be reviewed prior to implementation.

/* execute all under a dba-enabled account */
BEGIN
 -- clear the plan schema scratchpad
 DBMS_RESOURCE_MANAGER.CLEAR_PENDING_AREA();

 -- initialize the plan schema scratchpad
 DBMS_RESOURCE_MANAGER.CREATE_PENDING_AREA();

 -- create a consumer group to assign unwanted sessions for immediate termination
 DBMS_RESOURCE_MANAGER.CREATE_CONSUMER_GROUP (
  consumer_group => 'KILL_GROUP',
  comment        => 'The group assigned to unwanted sessions for immediate termination.');

 -- validate any pending plans, groups, and directives 
 DBMS_RESOURCE_MANAGER.VALIDATE_PENDING_AREA();

 -- commit any pending plans, groups, and directives
 DBMS_RESOURCE_MANAGER.SUBMIT_PENDING_AREA();
END;
/

BEGIN
 -- clear the plan schema scratchpad
 DBMS_RESOURCE_MANAGER.CLEAR_PENDING_AREA();

 -- initialize the plan schema scratchpad
 DBMS_RESOURCE_MANAGER.CREATE_PENDING_AREA();

 -- add a plan directive to immediately move sessions assigned to the KILL_GROUP
 -- consumer group to the built-in KILL_SESSION consumer group;
 -- note that KILL_SESSION is a special built-in “switch group” instructing
 -- ORM to terminate the session immediately and cannot be assigned consumer
 -- group mappings directly
 DBMS_RESOURCE_MANAGER.CREATE_PLAN_DIRECTIVE (
  plan             => 'DEFAULT_PLAN',
  comment          => 'Switch conditions for KILL_GROUP',
  group_or_subplan => 'KILL_GROUP',
  switch_group     => 'KILL_SESSION',
  max_idle_time    => 0,
  switch_time      => 0); 

 -- automatically assign the Rogue Program to the KILL_GROUP for
 -- immediate termination;
 -- use the value in the PROGRAM field from the V$SESSION view for the second
 -- parameter below, not the actual OS process name 
 DBMS_RESOURCE_MANAGER.SET_CONSUMER_GROUP_MAPPING (
  attribute      => 'CLIENT_PROGRAM',
  value          => 'Rogue Program',
  consumer_group => 'KILL_GROUP');

 -- allow the affected Oracle account “APP” to be assigned to the KILL_GROUP;
 -- if any other accounts are accessed by the Rogue Program, those accounts will
 -- also need to be granted switching to the KILL_GROUP 
 DBMS_RESOURCE_MANAGER_PRIVS.GRANT_SWITCH_CONSUMER_GROUP ( 
  grantee_name   => 'APP', 
  consumer_group => 'KILL_GROUP',
  grant_option   => FALSE); 

 -- validate any pending plans, groups, and directives 
 DBMS_RESOURCE_MANAGER.VALIDATE_PENDING_AREA();

 -- commit any pending plans, groups, and directives
 DBMS_RESOURCE_MANAGER.SUBMIT_PENDING_AREA();
END;
/

-- enable Oracle Resource Manager and use the DEFAULT_PLAN
ALTER SYSTEM SET resource_manager_plan='DEFAULT_PLAN' scope=BOTH;


Tuesday, April 9, 2013

Easy Sequence Value Reset

I've found that there are times when an Oracle sequence has either been advanced just short of or well beyond its intended value that would be more accurately aligned to its associated primary key. In common practice, the former scenario would likely produce an ORA-00001 error, i.e. unique constraint violation, when an attempt to add a new record with a primary key value fetched from a sequence conflicts with a value found in the key's index. However, for the latter situation when the next value of the sequence is well beyond the expected value, it seems like more developers than DBAs that I know have been bothered by inexplicable gaps between the maximum primary key value and a much higher value fetched from its serving sequence. I have found that the concern that these developers have expressed is well founded in certain circumstances. For example, what if that particular sequence was serving primary key values for an access audit table? A table like that with a heavy volume of insert-only activity that doesn't get a lot of attention until something like a major security incident occurs should greatly concern every developer and DBA if they saw large gaps (if any really) between the primary key values of those records. For all they know, a hacker may have covered their tracks by deleting a large number of audit records where those key value gaps exist.

So, how should the next sequence value be adjusted accordingly? The sequence object could be dropped and recreated with an adjusted minimum value, but this option would invalidate any dependent source (triggers, packages, functions, procedures, etc.) and also require any privileges from the old sequence to be applied to the new sequence. All of those steps introduce a degree of unwarranted risk compared to the script that I've developed just for these occasions.

In a nutshell, the script temporarily alters the "maximum value", "cycle", and "increment" properties of the sequence to wrap the next value fetched to the desired value. This script takes three parameters: 1) the desired sequence value, 2) the sequence name, and 3) the sequence owner which defaults to the connected user if omitted or null. Note that the minimum value is never altered and the maximum value allowed by the sequence must be greater than the desired value before execution.

SET SERVEROUTPUT ON
SET VERIFY OFF

DECLARE
 lWorkingValue NUMBER;
 lDesiredValue CONSTANT NUMBER := &1;

 lSequenceName       CONSTANT VARCHAR2(30) := UPPER('&2');
 lSequenceOwner      CONSTANT VARCHAR2(30) := UPPER(NVL('&3', USER));
 lSequenceProperties ALL_SEQUENCES%ROWTYPE;
BEGIN
 -- capture the current sequence properties
 SELECT *
 INTO lSequenceProperties
 FROM ALL_SEQUENCES
 WHERE SEQUENCE_OWNER=lSequenceOwner
   AND SEQUENCE_NAME=lSequenceName;

 -- make sure the the increment is set to 1
 EXECUTE IMMEDIATE 'ALTER SEQUENCE '|| lSequenceOwner || '.' || lSequenceName ||' INCREMENT BY 1';

 -- make sure the sequence is allowed to wrap
 EXECUTE IMMEDIATE 'ALTER SEQUENCE '|| lSequenceOwner || '.' || lSequenceName ||' CYCLE';

 -- fetch a value from the sequence
 EXECUTE IMMEDIATE 'SELECT '|| lSequenceOwner || '.' || lSequenceName ||'.NEXTVAL FROM DUAL'
 INTO lWorkingValue;

 -- set the max value of the sequence to the last value fetched
 EXECUTE IMMEDIATE 'ALTER SEQUENCE '|| lSequenceOwner || '.' || lSequenceName ||' MAXVALUE '|| lWorkingValue;

 -- fetch the next value to wrap the sequence which will reset it to the minimum value
 EXECUTE IMMEDIATE 'SELECT '|| lSequenceOwner || '.' || lSequenceName ||'.NEXTVAL FROM DUAL'
 INTO lWorkingValue;

 -- reset the original max value
 EXECUTE IMMEDIATE 'ALTER SEQUENCE '|| lSequenceOwner || '.' || lSequenceName ||' MAXVALUE '|| lSequenceProperties.MAX_VALUE;

 -- if the desired value is in fact the sequence minimum value, 
 -- then the sequence should be at the desired value at this point
 -- since the sequence wrapped (or cycled) with the last fetch;
 -- otherwise, advance the sequence with a single fetch by first changing the 
 -- increment-by value to the difference between the desired value and the
 -- sequence minimum value
 IF (lDesiredValue <> lSequenceProperties.MIN_VALUE) THEN
  EXECUTE IMMEDIATE 'ALTER SEQUENCE '|| lSequenceOwner || '.' || lSequenceName ||' INCREMENT BY ' || TO_CHAR(lDesiredValue - lSequenceProperties.MIN_VALUE);

  -- fetch a value from the sequence to set the sequence to the desired value
  EXECUTE IMMEDIATE 'SELECT '|| lSequenceOwner || '.' || lSequenceName ||'.NEXTVAL FROM DUAL'
  INTO lWorkingValue;
 END IF;

 -- reset the original increment value
 EXECUTE IMMEDIATE 'ALTER SEQUENCE '|| lSequenceOwner || '.' || lSequenceName ||' INCREMENT BY ' || lSequenceProperties.INCREMENT_BY;

 -- reset the original cycle flag
 IF (lSequenceProperties.CYCLE_FLAG='N') THEN
  EXECUTE IMMEDIATE 'ALTER SEQUENCE '|| lSequenceOwner || '.' || lSequenceName ||' NOCYCLE';
 END IF;

 DBMS_OUTPUT.PUT_LINE('Last value fetched from the ' || lSequenceOwner || '.' || lSequenceName || ' sequence was ' || lWorkingValue);
END;
/


This version of the script below has been very useful in the lab environments where Flashback Database is being used to reset test data. Sequences are not affected by flashback operations and need to be reset to the subsequent values of their associated primary keys manually or, as demonstrated here, programmatically. So, three new parameters have been added: 1) the table name, 2) the column name of the primary key, and 3) the table owner which defaults to the current user if omitted or null. Note that this script will derive the desired sequence value from the highest value in the primary key column, so the desired value parameter has been omitted here. I've highlighted the main edit to this script which is just after the BEGIN statement.


SET SERVEROUTPUT ON
SET VERIFY OFF

DECLARE
 lWorkingValue NUMBER;
 lDesiredValue NUMBER;

 lSequenceName       CONSTANT VARCHAR2(30) := '&1';
 lSequenceOwner      CONSTANT VARCHAR2(30) := NVL('&2', USER);
 lSequenceProperties ALL_SEQUENCES%ROWTYPE;

 lTableName          CONSTANT VARCHAR2(30) := '&3';
 lColumnName         CONSTANT VARCHAR2(30) := '&4';
 lTableOwner         CONSTANT VARCHAR2(30) := NVL('&5', USER);
BEGIN
 -- get the desired value of the sequence which should be the highest value found in the primary key column
 EXECUTE IMMEDIATE 'SELECT MAX(' || lColumnName || ') FROM ' || lTableOwner || '.' || lTableName
 INTO lDesiredValue; 

 -- capture the current sequence properties
 SELECT *
 INTO lSequenceProperties
 FROM ALL_SEQUENCES
 WHERE SEQUENCE_OWNER=lSequenceOwner
   AND SEQUENCE_NAME=lSequenceName;

 -- make sure the the increment is set to 1
 EXECUTE IMMEDIATE 'ALTER SEQUENCE '|| lSequenceOwner || '.' || lSequenceName ||' INCREMENT BY 1';

 -- make sure the sequence is allowed to wrap
 EXECUTE IMMEDIATE 'ALTER SEQUENCE '|| lSequenceOwner || '.' || lSequenceName ||' CYCLE';

 -- fetch a value from the sequence
 EXECUTE IMMEDIATE 'SELECT '|| lSequenceOwner || '.' || lSequenceName ||'.NEXTVAL FROM DUAL'
 INTO lWorkingValue;

 -- set the max value of the sequence to the last value fetched
 EXECUTE IMMEDIATE 'ALTER SEQUENCE '|| lSequenceOwner || '.' || lSequenceName ||' MAXVALUE '|| lWorkingValue;

 -- fetch the next value to wrap the sequence which will reset it to the minimum value
 EXECUTE IMMEDIATE 'SELECT '|| lSequenceOwner || '.' || lSequenceName ||'.NEXTVAL FROM DUAL'
 INTO lWorkingValue;

 -- reset the original max value
 EXECUTE IMMEDIATE 'ALTER SEQUENCE '|| lSequenceOwner || '.' || lSequenceName ||' MAXVALUE '|| lSequenceProperties.MAX_VALUE;

 -- if the desired value is in fact the sequence minimum value, 
 -- then the sequence should be at the desired value at this point
 -- since the sequence wrapped (or cycled) with the last fetch;
 -- otherwise, advance the sequence with a single fetch by first changing the 
 -- increment-by value to the difference between the desired value and the
 -- sequence minimum value
 IF (lDesiredValue <> lSequenceProperties.MIN_VALUE) THEN
  EXECUTE IMMEDIATE 'ALTER SEQUENCE '|| lSequenceOwner || '.' || lSequenceName ||' INCREMENT BY ' || TO_CHAR(lDesiredValue - lSequenceProperties.MIN_VALUE);

  -- fetch a value from the sequence to set the sequence to the desired value
  EXECUTE IMMEDIATE 'SELECT '|| lSequenceOwner || '.' || lSequenceName ||'.NEXTVAL FROM DUAL'
  INTO lWorkingValue;
 END IF;

 -- reset the original increment value
 EXECUTE IMMEDIATE 'ALTER SEQUENCE '|| lSequenceOwner || '.' || lSequenceName ||' INCREMENT BY ' || lSequenceProperties.INCREMENT_BY;

 -- reset the original cycle flag
 IF (lSequenceProperties.CYCLE_FLAG='N') THEN
  EXECUTE IMMEDIATE 'ALTER SEQUENCE '|| lSequenceOwner || '.' || lSequenceName ||' NOCYCLE';
 END IF;

 DBMS_OUTPUT.PUT_LINE('Last value fetched from the ' || lSequenceOwner || '.' || lSequenceName || ' sequence was ' || lWorkingValue);
END;
/

Thursday, February 28, 2013

Hexadecimal Encoding an Image into PL/SQL Array

A while ago, I was asked to obfuscate our photo data stored as BLOBs in our database with a single silhouetted image:

I was not allowed to store the silhouette image in a production database table or reference it from the production file system with a BFILE. I also wanted to avoid the extra dependencies and supporting structures required for those options. Basically, the only practical option left was to break down the image into its binary representation and store the hexadecimal value in a string array declared in a PL/SQL function that decodes the hexadecimal array and returns the silhouette image as a BLOB.

Although there exists a number of free online tools to encode the image and determine its hexadecimal equivalent, I decided to use our development database and explore what could be done with the Oracle built-in utilities like the UTL_RAW and DBMS_LOB packages and the BFILENAME, RAWTOHEX, and HEXTORAW functions. Please excuse the lack of proper exception handling around the file I/O operations; this example is just that and not meant for production use.

/* 11g script to encode the image as hexadecimal from a BFILE */
SET SERVEROUTPUT ON SIZE 1000000

DECLARE
 cChunkSize CONSTANT INTEGER := 2000;

 -- assumption here is that the file has already been placed in the file system directory mapped
 -- through the Oracle directory object specified by cDirectory, that the file is named as specified by cFilename, 
 -- and that the necessary file system permissions and Oracle privileges have been granted 
 cDirectory CONSTANT VARCHAR2(15) := 'DATA_PUMP_DIR';
 cFilename  CONSTANT VARCHAR2(40) := 'rodin-thinker-silhouette_21022921.jpg';

 lAmount BINARY_INTEGER; 
 lOffset    INTEGER;
 lBuffer    RAW(2000); -- size should match cChunkSize

 TYPE tRawArray  IS TABLE OF RAW(2000); ---- size should match lBuffer
 lRawArray tRawArray;

 lBFile     BFILE;
 lFileSize  INTEGER;
BEGIN
 -- open the file and get the file size
 lBFile := BFILENAME(directory => cDirectory,filename => cFilename);
 DBMS_LOB.FILEOPEN(file_loc => lBFile);
 lFileSize := NVL(DBMS_LOB.GETLENGTH(file_loc => lBFile),0);

 IF (lFileSize > 0) THEN
  -- initialize collection
  lRawArray := tRawArray();

  --set the offset to the first byte in the file
  lOffset := 1;

  -- read and feed chunks of the file into the raw array
  LOOP
   lAmount := cChunkSize; 

   DBMS_LOB.READ(file_loc => lBFile,
                 amount   => lAmount,
                 offset   => lOffset,
                 buffer   => lBuffer);

   -- add collected buffer data as a new element in the raw array
   lRawArray.EXTEND;
   lRawArray(lRawArray.COUNT) := lBuffer; 

   -- increment the offset by the number of bytes specified by cChunkSize
   lOffset := lOffset + cChunkSize;

   -- exit loop when offset has been set to or beyond the size of the file
   EXIT WHEN lOffset >= lFileSize;
  END LOOP;

  -- each line of output will be a maximum of cChunkSize*2 
  -- since hex values are 2 characters per byte and as such
  -- be aware that cChunkSize should not exceed a size of 16383 since
  -- the maximum line size in the output is 32767 bytes
  FOR i IN 1..lRawArray.COUNT
  LOOP
   DBMS_OUTPUT.PUT_LINE(lRawArray(i));
  END LOOP;
 END IF;

 -- close the file
 DBMS_LOB.FILECLOSE(file_loc => lBFile);
END;
/

I could have used the DBMS_LOB.LOADBLOBFROMFILE procedure to upload the image in one fell swoop, but I still would have had to chunk the binary data into the RAW array from the target BLOB variable using the DBMS_LOB.SUBSTR function. I instead chose to "read and feed" each chunk directly from the file into the array. I also could have used a larger chunk size up to 16383 bytes (and would have if the file were much larger) but wanted to stay within the bounds of SQL size constraints in case I ever want to use these scripts to store chunks of RAW data in a table. 

An output of 12 lines of hexadecimal will be generated, and as I found out, the RAWTOHEX function isn't required to explicitly convert the RAW data into printable hexadecimal strings. The first 11 lines contain 4000 characters each, and the 12th line will be 3870 characters long (truncated here for brevity):

FFD8F.....E0874
DD512.....EC77B
B1639.....08076
FF00B.....6AD73
8ACAF.....7A501
D9410.....F00C1
5FAC9.....8F251
9D2CE.....6FC98
1FE95.....2C4A9
29A0F.....ACD22
C7BE4.....61821
02248.....FFFD9

For the decoding function that reconstructs the silhouette image from these chunks, the hexadecimal output will be copied and pasted into 12 elements of a VARCHAR2 array with one line per element in proper order:

/* 11g function to decode the hexadecimal chunks and return a BLOB of the silhouette image */
CREATE OR REPLACE FUNCTION fncSilhouetteImage
 RETURN BLOB 
IS
 lRetVal BLOB;

 -- size should match cChunkSize from hex encoding script
 lRawChunk RAW(2000);

 -- size should match max line size of hex output
 TYPE tHexArray  IS TABLE OF VARCHAR2(4000); 
 lHexArray tHexArray := tHexArray(); -- initialized
BEGIN
 -- file is just under 24kb in size, so I only need 12 elements in the hex array
 lHexArray.EXTEND(12);

 -- the hex output from the encoded raw array
  lHexArray(1) := 'FFD8F.....E0874';
 lHexArray(2) := 'DD512.....EC77B';
 lHexArray(3) := 'B1639.....08076';
 lHexArray(4) := 'FF00B.....6AD73';
 lHexArray(5) := '8ACAF.....7A501';
 lHexArray(6) := 'D9410.....F00C1';
 lHexArray(7) := '5FAC9.....8F251';
 lHexArray(8) := '9D2CE.....6FC98';
 lHexArray(9) := '1FE95.....2C4A9';
 lHexArray(10) := '29A0F.....ACD22';
 lHexArray(11) := 'C7BE4.....61821';
 lHexArray(12) := '02248.....FFFD9';

 -- initialize lob
 DBMS_LOB.CREATETEMPORARY(lob_loc => lRetVal,
                          cache => FALSE,
                          dur => DBMS_LOB.CALL);

 -- decode the byte array and reconstruct its original LOB form
 FOR i IN 1..lHexArray.COUNT
 LOOP
  DBMS_LOB.WRITEAPPEND(lob_loc => lRetVal,
                       amount => UTL_RAW.LENGTH(HEXTORAW(lHexArray(i))),
                       buffer => lHexArray(i));
 END LOOP;

 -- OR try the UTL_RAW.CONCAT function for kicks
 lRetVal := UTL_RAW.CONCAT(r1 => lHexArray(1),
                                  r2 => lHexArray(2),
                                  r3 => lHexArray(3),
                                  r4 => lHexArray(4),
                                  r5 => lHexArray(5),
                                  r6 => lHexArray(6),
                                  r7 => lHexArray(7),
                                  r8 => lHexArray(8),
                                  r9 => lHexArray(9),
                                  r10 => lHexArray(10),
                                  r11 => lHexArray(11),
                                  r12 => lHexArray(12));

 RETURN lRetVal;
END;
/

In the function, I demonstrate two methods for decoding the hexadecimal data into the original image. Both methods take advantage of the implicit conversion between the RAW data type and hexadecimal strings, so explicit conversions using the HEXTORAW function were removed; some purists out there may want to chastise me for that, so be it.

So, the first method loops over the array appending each element to a BLOB variable using the DBMS_LOB.WRITEAPPEND procedure. In the call to that procedure, it should be noted that the standard LENGTH function instead of the UTL_RAW.LENGTH function could have been used on the hex array element in the amount argument; however, a division of the standard LENGTH result by 2 would be required since the input parameter of the standard function is a VARCHAR2 whereas the input parameter of the UTL_RAW function is a RAW. Both data types are implicitly converted by the two functions, but only the decoded length of the RAW value will work for the DBMS_LOB.WRITEAPPEND procedure for BLOBs (pat yourselves on the back, purists).

In the second method, I use the UTL_RAW.CONCAT function returning a RAW value which is the concatenation of the RAW input arguments provided, the hex array elements in this case. Again for the purists out there, I originally had passed the RAW concatenation to the TO_BLOB function for explicit conversion, but the implicit conversion between RAW and BLOB data types was just too convenient for this example. Coincidentally, the UTL_RAW.CONCAT function has only twelve parameters, so its practical use may be limited by the size of your array. 

Thursday, January 24, 2013

PII De-Identification using 11g Data Pump Export

Sometime back, I was asked to de-identify sensitive production data for our testing lab. Our lab personnel are not allowed to test new code or use cases with Personally Identifiable Information (PII) but desperately needed to debug data integrity and performance issues with a representative volume of data from production.

Requirements
R1: The Data Masking option for Enterprise Manager cannot be used (no license).
R2: New data or index segments cannot be created to facilitate the operation.
R3: The process must be easily repeatable and scalable.
R4: For the development and quality assurance teams to effectively work  from the lab domain with specific problematic data found in production, the primary and foreign keys (all surrogate values - thankfully they are never updated or reused) must be preserved in the de-identified data set so that authorized personnel can easily identify individuals with masked PII.

My example PII tables are as follows:
Table: PERSON
 Columns: PERSONID (primary key), FIRSTNAME, LASTNAME, BIRTHDATE, GENDER

Table: ADDRESS
 Columns: ADDRESSID (primary key), PERSONID (foreign key), ADDRESSTYPE, STREET, CITY, STATE, ZIP

Analysis
Given the requirements, I researched my available options and took the opportunity to explore what the new REMAP_DATA parameter in 11g Data Pump export might be able to do for us. The REMAP_DATA parameter allows for the transformation or masking of columnar data through custom PL/SQL functions which I will further refer to here as "remap functions". Each REMAP_DATA parameter must map a table column to a remap function, and each remap function must specify a single input parameter declared with the same data type as the column. During execution for each row exported, the REMAP_DATA parameter passes the column value to the input parameter of the remap function where the data transformation logic can mask it. For example, a remap function might take an SSN and strip all but the last four digits; the stripped value would be returned and substituted in the exporting row. Furthermore, the R1 requirement can be met since no extra license is required to use the REMAP_DATA parameter with Data Pump which, in turn, is covered under the Oracle Database license.

To meet the second requirement, I used PL/SQL arrays of fake name and address data to feed my remap functions. I used the plural "functions" since I have multiple PII elements in as many fields (i.e. first name, last name, street address, etc.) that require masking. However, the same function can be used to mask different fields (i.e., middle name) or just return a single arbitrary value for one type of PII data. Also, I only want to mask the identity data not the demographic information. For instance, I want to preserve gender and birth dates but mask personal identifiers and contact information; our business analysts certainly appreciate having their production metrics maintained in the final product instead of looking at skewed reports from scrambled age data!

The R3 requirement can be easily summed up in that Data Pump export is a server-side utility that can leverage more powerful system resources that come with server-class hardware. So, given the same associated scripts and parameter files for this task, any other DBA who has permissions to export the same data should generally observe similar performance results. Also, the custom PL/SQL functions that are compiled in the database (further supporting the prior postulate) and required for data masking through the REMAP_DATA parameter are relatively straightforward for this task of just substituting PII for random name and address data.

The fourth and final requirement - key preservation - allows us to accurately debug specific data issues found in production without unnecessarily exposing a person's real identity to the entire lab. For example, someone can be Joe Smith in production but masked as Jack Summers in the lab while still having PERSONID 123 in both places.  By having a secure method to match PII with their masked counterparts through their common unique key values, specific data issues found in production can be targeted for resolution in the lab.

Implementation
Besides the remap functions, the package - we'll call it pkgDeIdentification - will have declared a "sample data" array for each type of PII to hold the first names, last names, or street names containing the contrived seed data from such external sources as fictional characters, plants/animals/minerals, state names and their capitals, etc. Aggregate data from online resources such as the Census Bureau can also be used to populate the sample data arrays. Each item in these sample data arrays is hard-coded at a particular position upon package instantiation. For each remap function, another data array, further referred to here as a "linked queue", will be declared and also populated upon package instantiation by a procedure declared in the package body that selects, at random, items from the sample data array associated with that particular element of PII.

Now, for those of you familiar with the use of the REMAP_DATA parameter, you might be asking why I'm using these "linked queues" to return the masked data through the remap functions instead of just masking the value passed through the input parameter for each exported row. Well, I've got three reasons all of which have to do with data validity and the fact that nothing of the exporting record is known to the remap function besides the column value passed to it:
1) I don't want masked male first names to end up in female PERSON records or vice versa. Without knowing the gender of the exporting PERSON record, I can't determine whether a first name of John or Jennifer is appropriate.
2) Familial coherence and an accurate distribution of distinct last names in your de-identified data set will save you the headache of explaining to your business analysts why you were so lazy as to give each member of a nuclear family of six a different last name.
3) I want to mask home and mailing addresses but allow the real work addresses in the de-identified data set. 

So, as you'll see in the code below, I never do anything with the inString input parameter specified in the remap functions since the entire de-identified data set has already been conditionally generated and stored in these linked queues; nonetheless, the REMAP_DATA parameter will not work with a remap function that doesn't have an input parameter specified and of the same data type as the column being masked.

For each REMAP_DATA parameter, the remap function will return the first item from its associated linked queue but not before first deleting that item from the array; developers refer to this technique as "popping the stack". Popping the stack - or in ths case, linked queue - ensures that the next value is fetched out of the array and removed with each row exported rather than pulling the same first item over and over again; this is also why, for the data validity reasons noted before, the queues must be ordered appropriately by using some unique key in the looping structures and in concert with corresponding export QUERY parameters for the PII tables. As an alternative to these queues, pipelined table functions and object types may also work here for this effort although I have not tried due to the R2 requirement.

Package Specification and Body
CREATE OR REPLACE PACKAGE pkgDeIdentification
AS
-- header declarations for remap functions
 FUNCTION fncFirstNameMasked(inString VARCHAR2)
  RETURN VARCHAR2;

 FUNCTION fncLastNameMasked(inString VARCHAR2)
  RETURN VARCHAR2;

 FUNCTION fncStreetMasked(inString VARCHAR2)
  RETURN VARCHAR2;
END pkgDeIdentification;
/
CREATE OR REPLACE PACKAGE BODY pkgDeIdentification
AS
 TYPE tFirstNames IS
  TABLE OF VARCHAR2(20)
  INDEX BY PLS_INTEGER;

 TYPE tLastNames IS
  TABLE OF VARCHAR2(26)
  INDEX BY PLS_INTEGER;

 TYPE tAddresses IS
  TABLE OF VARCHAR2(40)
  INDEX BY PLS_INTEGER;

 mSampleMaleFirstNames    tFirstNames;
 mSampleFemaleFirstNames tFirstNames;
 mSampleLastNames            tLastNames;
 mSampleStreetNames         tAddresses;

 mFirstNamesQueue    tFirstNames;
 mLastNamesQueue     tLastNames;
 mAddressesQueue     tAddresses;

 PROCEDURE prcInitializeRandomizer
 AS
  lRandomBytes  RAW(64);
 BEGIN
  lRandomBytes := DBMS_CRYPTO.randomBytes(64);
  DBMS_RANDOM.seed(UTL_RAW.CAST_TO_VARCHAR2(lRandomBytes));
 END prcInitializeRandomizer;

 PROCEDURE prcPopulateSampleArrays
 AS
 BEGIN
 -- populate male first names array (1000)
  mSampleMaleFirstNames(1) := 'James';
  mSampleMaleFirstNames(2) := 'John';
  mSampleMaleFirstNames(3) := 'Robert';
  mSampleMaleFirstNames(4) := 'Michael';
...
...
... 
 -- populate female first names array (1000)
  mSampleFemaleFirstNames(1) := 'Mary';
  mSampleFemaleFirstNames(2) := 'Patricia';
  mSampleFemaleFirstNames(3) := 'Linda';
  mSampleFemaleFirstNames(4) := 'Barbara';
...
...
...
 -- populate last names array (5000)
  mSampleLastNames(1) := 'Smith';
  mSampleLastNames(2) := 'Johnson';
  mSampleLastNames(3) := 'Williams';
  mSampleLastNames(4) := 'Jones';
...
...
...
 -- populate street name array (100)
  mSampleStreetNames(1) := 'Alabama';
  mSampleStreetNames(2) := 'Alaska';
  mSampleStreetNames(3) := 'Arizona';
  mSampleStreetNames(4) := 'Arkansas';
...
...
...
 END prcPopulateSampleArrays;

 PROCEDURE prcPopulateMaskedArrays
 AS
  lRandomNumber       NUMBER;
  lRandomFirstName    VARCHAR2(20);
  lMaskedLastName     VARCHAR2(26);
  lRandomAddress      VARCHAR2(40);
  
  TYPE tMaskedLastNames IS TABLE OF VARCHAR2(26) INDEX BY VARCHAR2(26);
  lMaskedLastNames tMaskedLastNames;
  
  FUNCTION lfRandomNumber(inLowerBound  NUMBER,
                                                 inUpperBound  NUMBER)
   RETURN NUMBER
  IS
  BEGIN
   RETURN TRUNC(DBMS_RANDOM.value(inLowerBound, inUpperBound));
  END lfRandomNumber;
 BEGIN
  -- substitute distinct last names to maintain distribution and familial coherence
  FOR rec IN(SELECT DISTINCT UPPER(TRIM(LASTNAME)) AS LASTNAME FROM PERSON)
  LOOP
   lMaskedLastNames(rec.LASTNAME) := mSampleLastNames(lfRandomNumber(1, mSampleLastNames.COUNT));
  END LOOP;
  
  -- fetch last name and generate first name 
  FOR rec IN(SELECT UPPER(TRIM(LASTNAME)) AS LASTNAME, GENDER
                     FROM PERSON ORDER BY PERSONID)
  LOOP
   -- fetch masked last name indexed by real last name
   lMaskedLastName := lMaskedLastNames(rec.LASTNAME);

   IF (rec.GENDER='MALE') THEN
    lRandomFirstName  := mSampleMaleFirstNames(lfRandomNumber(1, mSampleMaleFirstNames.COUNT));
   ELSE
    lRandomFirstName  := mSampleFemaleFirstNames(lfRandomNumber(1, mSampleFemaleFirstNames.COUNT));
   END IF;
   
   -- add to name arrays
   mFirstNamesQueue(mFirstNamesQueue.COUNT+1)   := lRandomFirstName;
   mLastNamesQueue(mLastNamesQueue.COUNT+1)     := lMaskedLastName;
  END LOOP;

  -- generate random street address
  FOR rec IN(SELECT STREET, ADDRESSTYPE FROM ADDRESS ORDER BY ADDRESSID)
  LOOP
   IF (rec.ADDRESSTYPE IN('HOME','MAIL')) THEN
    -- generate random house number
    lRandomAddress := lfRandomNumber(10,99999);
    -- fetch random street name
    lRandomAddress := lRandomAddress || ' ' || mSampleStreetNames(lfRandomNumber(1, mSampleStreetNames.COUNT));

    -- select random type of thoroughfare
    CASE lfRandomNumber(1,9)
     WHEN 1 THEN
      lRandomAddress := lRandomAddress || ' Avenue';
     WHEN 2 THEN
      lRandomAddress := lRandomAddress || ' Boulevard';
     WHEN 3 THEN
      lRandomAddress := lRandomAddress || ' Court';
     WHEN 4 THEN
      lRandomAddress := lRandomAddress || ' Drive';
     WHEN 5 THEN
      lRandomAddress := lRandomAddress || ' Lane';
     WHEN 6 THEN
      lRandomAddress := lRandomAddress || ' Parkway';
     WHEN 7 THEN
      lRandomAddress := lRandomAddress || ' Road';
     WHEN 8 THEN
      lRandomAddress := lRandomAddress || ' Street';
     ELSE
      lRandomAddress := lRandomAddress || ' Way';
    END CASE;

    mAddressesQueue(mAddressesQueue.COUNT+1) := lRandomAddress;
   ELSIF (rec.ADDRESSTYPE='WORK') THEN
    mAddressesQueue(mAddressesQueue.COUNT+1) := rec.STREET;
   END IF;
  END LOOP;
 END prcPopulateMaskedArrays;

 FUNCTION fncFirstNameMasked(inString VARCHAR2)
  RETURN VARCHAR2
 IS
  lReturnValue VARCHAR2(20);
 BEGIN
  --return the first item in the array and pop it off the queue
  lReturnValue := mFirstNamesQueue(mFirstNamesQueue.FIRST);
  mFirstNamesQueue.DELETE(mFirstNamesQueue.FIRST);
  RETURN lReturnValue;
 END fncFirstNameMasked;

 FUNCTION fncLastNameMasked(inString VARCHAR2)
  RETURN VARCHAR2
 IS
  lReturnValue VARCHAR2(26);
 BEGIN
  --return the first item in the array and pop it off the queue
  lReturnValue := mLastNamesQueue(mLastNamesQueue.FIRST);
  mLastNamesQueue.DELETE(mLastNamesQueue.FIRST);
  RETURN lReturnValue;  
 END fncLastNameMasked;

 FUNCTION fncStreetMasked(inString VARCHAR2)
  RETURN VARCHAR2
 IS
  lReturnValue VARCHAR2(40);
 BEGIN
  --return the first item in the array and pop it off the queue
  lReturnValue := mAddressesQueue(mAddressesQueue.FIRST);
  mAddressesQueue.DELETE(mAddressesQueue.FIRST);
  RETURN lReturnValue;
 END fncStreetMasked;

 BEGIN
  prcInitializeRandomizer;
  prcPopulateSampleArrays;
  prcPopulateMaskedArrays;
 END pkgDeIdentification;
/

Export parameters
remap_data=PERSON.LASTNAME:pkgDeIdentification.fncLastNameMasked
remap_data=PERSON.FIRSTNAME:pkgDeIdentification.fncFirstNameMasked
remap_data=ADDRESS.STREET:pkgDeIdentification.fncStreetMasked
query=PERSON:" ORDER BY PERSONID"
query=ADDRESS:" ORDER BY ADDRESSID"

Results
I have not collected any formal performance metrics on the execution of the procedures and functions in this package together with the REMAP_DATA parameter, however, it takes 15-20 minutes to export about 1,000,000 PERSON rows and about 1,500,000 ADDRESS rows - 20% of which are WORK addresses that are not masked - along with all of the other tables (about 90 other tables having a combined 15,000,000 rows). Please note that I do not use the PARALLEL parameter with these exports (discussed below).

Special Considerations
Another issue arises if the PARALLEL parameter is used in the export invoking multiple Data Pump worker processes when the production database has been configured for dedicated server rather than shared server. With the database in dedicated server mode, each worker process will invoke its own session and therefore its own instantiation of the pkgDeIdentification package. Since each package instantiation randomly selects from the sample data arrays to populate the linked queues used by the remap functions, this situation can result in a skewed distribution of masked last names which may not be desirable if it is important to maintain familial coherence representative of the production data. Also, depending on the estimated size of the de-identified data set and the degree of parallelism, the amount of memory consumed by each worker process for the allocation of the package arrays may deplete the PGA, but using a sizable large pool may help offset this impact.

Conclusion
The REMAP_DATA parameter and a little PL/SQL know-how have been great aids in the task of masking our PII for the lab personnel without having to re-key our tables or skewing our statistical and demographic data. With Data Pump doing all of the work on the server end, this new de-identification process has saved us time and money, and with a little customization, passing these scripts and parameters off to other departments for use in their PII databases would serve the entire organization with greater positive returns on the initial investment.