Marking Text – Using Lists, Patterns, and Regular Expressions

In this chapter…

…you will learn how to mark up a document for redaction using three efficient methods: list files, pattern files, and regular expression list files. The following sections describe each markup process.

ADVISORY: The quality and internal structure of PDF files can vary greatly. Consequently, no warranty is offered with respect to the accuracy with which lists, patterns or regular expression lists used with Redax Enterprise Server will locate and mark text. Appligent Document Solutions always recommends a visual review of marked-up documents prior to redaction and release.

Using a list file to mark text

To mark the text in a PDF document that matches the words and phrases in a list file, enter the following command:

$redaxserver -o <output> -flist <listfile.txt> [other options] <input.pdf>

Refer to Creating list files to learn how to create a list file.

Note: If you want the search to be case-insensitive, specify the -ignorecase option.

Redax Enterprise Server searches through the document. Each time it finds text specified for redaction in the list file, it draws a Redax box around the text and overlays the box with the corresponding exemption code from the list file.

Example: Mark text in sample_base.pdf that matches the words and phrases in sample_find_list.txt (both files are in the samples directory), and apply the exemption codes specified in sample_find_list.txt. Perform a case-insensitive search to find all matching text, regardless of capitalization.

In Windows:

>redaxserver -o samples\mark_listed.pdf -flist samples\sample_find_list.txt -ignorecase samples\sample_base.pdf

In UNIX:

$redaxserver -o ./samples/mark_listed.pdf -flist ./samples/sample_find_list.txt -ignorecase ./samples/sample_base.pdf

A segment of the output for this example is displayed in the figure below.

Words found by Find Using List with exemption codes applied
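The same command can be repeated over a whole directory of PDF files. The following is a minimal sketch, not part of the product: the function name mark_dir_with_list, the REDAX variable, and the _marked.pdf output suffix are hypothetical, and it assumes redaxserver is on your PATH. It uses only the -o, -flist, and -ignorecase options shown above.

```shell
#!/bin/sh
# Hypothetical helper: mark every PDF in a directory against one list file.
# REDAX defaults to "redaxserver"; set REDAX="echo redaxserver" for a dry run
# that prints each command instead of running it.
REDAX=${REDAX:-redaxserver}

mark_dir_with_list() {
    in_dir=$1      # directory holding the input PDFs
    list_file=$2   # list file passed to -flist
    out_dir=$3     # directory that receives the marked-up copies

    mkdir -p "$out_dir" || return 1
    for pdf in "$in_dir"/*.pdf; do
        [ -e "$pdf" ] || continue            # the glob matched nothing
        name=$(basename "$pdf" .pdf)
        # Same options as the example above: -o, -flist, -ignorecase.
        $REDAX -o "$out_dir/${name}_marked.pdf" -flist "$list_file" -ignorecase "$pdf"
    done
}
```

For example, mark_dir_with_list ./samples ./samples/sample_find_list.txt ./marked would write one marked-up copy per input file into ./marked. Each copy still needs the visual review recommended in the advisory at the start of this chapter.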

Using a pattern file to mark text

Pattern files are plain text files that contain items from the Available patterns list. To develop your own patterns, use a regular expression list file instead.

To set an exemption code for a specific -fpattern operation, use a preferences file with your desired exemption code set as the default. See Appendix B: RedaxESconfig for more information on creating preferences files for Redax Enterprise Server.

To mark the text in a PDF document that matches the patterns in a pattern file, enter the following command:

$redaxserver -o <output> -fpattern <pattern_listfile.txt> [other options] <input.pdf>

Redax Enterprise Server searches through the document. Each time it finds text that matches a pattern in the pattern file, it draws a Redax box around the text and overlays the box with the default exemption code defined in the preferences XML file for the current process. See Appendix B: RedaxESconfig for more information on creating preferences files.

Example: Mark text in sample_base.pdf that matches the patterns in sample_pattern_list.txt (both files are in the samples directory), and apply the default exemption code from the preferences file. The commands below also specify -ignorecase for a case-insensitive search.

In Windows:

>redaxserver -o samples\mark_listed.pdf -fpattern samples\sample_pattern_list.txt -ignorecase samples\sample_base.pdf

In UNIX:

$redaxserver -o ./samples/mark_listed.pdf -fpattern ./samples/sample_pattern_list.txt -ignorecase ./samples/sample_base.pdf

A segment of the output for this example is displayed in the figure below. The pattern file is set up to find “Date”. To search on any other built-in pattern, remove the # sign in front of the pattern name in the sample_pattern_list.txt file.

Marked text area from Find Using Pattern run

Available patterns

The patterns provided with Redax Enterprise Server are:

  • Credit Card
  • Date
  • Date numeric period-separated   (12.08.2010)
  • Date numeric space-separated    (12 08 2010)
  • Email
  • Postal Code Australia
  • Postal Code Brazil
  • Postal Code Canada
  • Postal Code Denmark
  • Postal Code France
  • Postal Code Germany
  • Postal Code India
  • Postal Code Netherlands
  • Postal Code Russia
  • Postal Code Spain
  • Postal Code USA
  • Postal Code United Kingdom
  • Social Security number
  • Telephone # Australia
  • Telephone # NA – 7 digit   (555-1212)
  • Telephone # North American   (888) 555-1212
  • Telephone # United Kingdom
  • URL

Note: To remove a pattern from the active pattern list, place a # at the beginning of the line.
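Activating a pattern by hand means deleting the # in front of its name. The sketch below does the same edit from a script. It is an illustration under assumptions, not a documented tool: the function name enable_pattern is hypothetical, and it assumes the pattern file holds one pattern name per line, with inactive entries written as # followed by the name. The name you pass must match the text in the file exactly.

```shell
#!/bin/sh
# Hypothetical helper: activate one built-in pattern by removing its leading '#'.
# Assumes one pattern name per line, with inactive entries starting with '#'.
enable_pattern() {
    pattern_name=$1   # e.g. "Email"; must match the line text exactly
    pattern_file=$2   # e.g. a working copy of sample_pattern_list.txt
    tmp_file=$(mktemp) || return 1

    # Compare whole lines so that "Date" does not also enable "Date numeric ...".
    awk -v name="$pattern_name" '
        {
            line = $0
            sub(/^#[ \t]*/, "", line)    # text without the comment marker
            sub(/[ \t]+$/, "", line)     # ignore trailing blanks
            if ($0 ~ /^#/ && line == name) print line
            else print $0
        }
    ' "$pattern_file" > "$tmp_file" || { rm -f "$tmp_file"; return 1; }

    # Copy the result back so the original file keeps its permissions.
    cat "$tmp_file" > "$pattern_file"
    rm -f "$tmp_file"
}
```

Running enable_pattern "Email" on a copy of the pattern file leaves every other line untouched, including lines that are already active. Working on a copy keeps the shipped sample file available for comparison.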

Using a regular expression list file to mark text

If you are already comfortable with regular expressions, then simply refer to Creating list files to learn how to create a regular expression list file. Otherwise, review Appendix A: Regular Expressions.

To mark the text in a PDF document that matches a list of one or more regular expressions, enter the following command:

$redaxserver -o <output> -fregex <regex_listfile.txt> [other options] <input.pdf>

Redax Enterprise Server searches through the document. Each time it finds a match for one of the regular expressions defined in the regular expression list file, it draws a Redax box around the text and overlays the box with the corresponding exemption code from the list file.

Example: Mark text in sample_base.pdf that matches the regular expressions in sample_regex_list.txt (both files are in the samples directory), and apply the exemption codes specified in sample_regex_list.txt. Perform a case-insensitive search to find all matching text, regardless of capitalization.

In Windows:

>redaxserver -o samples\mark_listed.pdf -fregex samples\sample_regex_list.txt -ignorecase samples\sample_base.pdf

In UNIX:

$redaxserver -o ./samples/mark_listed.pdf -fregex ./samples/sample_regex_list.txt -ignorecase ./samples/sample_base.pdf

A segment of the output for this example is displayed in the figure below.

Marked text area from Find Using Regular Expression run
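The three commands in this chapter differ only in the search option: -flist, -fpattern, or -fregex. A small wrapper can make that parallel explicit. This is a sketch only: the function name mark_text, the method keywords, and the REDAX variable are hypothetical, and it assumes redaxserver is on your PATH.

```shell
#!/bin/sh
# Hypothetical wrapper: choose the search option from a method keyword.
# Set REDAX="echo redaxserver" for a dry run that only prints the command.
REDAX=${REDAX:-redaxserver}

mark_text() {
    method=$1        # list, pattern, or regex
    search_file=$2   # list file, pattern file, or regular expression list file
    input_pdf=$3
    output_pdf=$4

    case $method in
        list)    flag=-flist ;;
        pattern) flag=-fpattern ;;
        regex)   flag=-fregex ;;
        *)
            echo "unknown method: $method (use list, pattern, or regex)" >&2
            return 2
            ;;
    esac

    # Same argument order as the commands in this chapter.
    $REDAX -o "$output_pdf" "$flag" "$search_file" "$input_pdf"
}
```

For example, mark_text regex ./samples/sample_regex_list.txt ./samples/sample_base.pdf ./samples/mark_listed.pdf builds the same command as the UNIX regular expression example, without the -ignorecase option.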