2.6 Textual Disambiguation
The process of contextualizing nonrepetitive unstructured data is accomplished by a technology known as “textual disambiguation” (or “textual ETL”). Textual disambiguation has an analog in structured processing known as “ETL,” which stands for “extract/transform/load.” The difference is that classical ETL transforms legacy system data, whereas textual ETL transforms text. At a very high level the two processes are analogous, but in the actual details of processing they are very different.
From Narrative into an Analytical Database
The purpose of textual disambiguation is to read raw text (narrative) and to turn that text into an analytical database. Figure 2.6.1 shows the general flow of data in textual disambiguation.
Once raw text is transformed, it arrives in the analytical database in a normalized form, and the resulting database looks like any other analytical database. Typically the data is “normalized,” with a unique key and dependent elements of data. The analytical database can be joined with other analytical databases, achieving the effect of analyzing structured data and unstructured data in the same query.
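To make that idea concrete, here is a minimal sketch using SQLite. The table names, columns, and values are hypothetical, invented only to illustrate a normalized textual-ETL output joined with structured data; any standard DBMS would serve.

```python
import sqlite3

# A sketch of joining textual-ETL output with structured data.
# Table and column names are hypothetical, chosen only to illustrate
# the normalized form: a unique key plus dependent elements of data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Structured analytical data: one row per customer.
    CREATE TABLE customer (customer_id TEXT PRIMARY KEY, region TEXT);

    -- Textual-ETL output: one row per word in context, keyed back to
    -- the customer mentioned in the source document.
    CREATE TABLE text_fact (
        doc_id      TEXT,
        byte_offset INTEGER,
        customer_id TEXT,
        word        TEXT,
        context     TEXT
    );
""")
conn.execute("INSERT INTO customer VALUES ('C001', 'West')")
conn.execute("INSERT INTO text_fact VALUES ('doc17.txt', 482, 'C001', 'leak', 'complaint')")

# One query spans structured and (formerly) unstructured data.
rows = conn.execute("""
    SELECT c.region, t.word, t.context
    FROM customer c
    JOIN text_fact t ON t.customer_id = c.customer_id
""").fetchall()
print(rows)  # [('West', 'leak', 'complaint')]
```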
Each element in the analytical database can be tied back directly to the originating source document. This feature is needed if there is ever any question as to the accuracy of the processing that has occurred in textual disambiguation. In addition, if there is ever any question as to the context of the data found in the analytical database, it can be easily and quickly verified. Note that the originating source document is not touched or altered in any way. Figure 2.6.2 shows that each element of data in the analytical database can be tied back to its originating source.
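As an illustration of that traceability, the sketch below assumes the analytical database records a document path and byte offset for each element; this lineage scheme is an assumption for illustration, not a prescribed design. Verification only ever reads the source file; it never writes to it.

```python
def show_source(doc_path: str, byte_offset: int, window: int = 40) -> str:
    """Return the passage surrounding a byte offset in the untouched
    source document, so an analyst can verify context and accuracy.
    doc_path and byte_offset are assumed lineage columns carried in
    the analytical database (a hypothetical scheme)."""
    with open(doc_path, "rb") as f:          # read-only: source is never altered
        f.seek(max(0, byte_offset - window))
        return f.read(2 * window).decode("utf-8", errors="replace")
```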


Figure 2.6.1
Figure 2.6.2
Input into Textual Disambiguation
The input into textual disambiguation comes from many different places. The most obvious source of input is the electronic text of the document that is to be disambiguated. Another important source of input is taxonomies, which are essential to the process of disambiguation; an entire chapter is devoted to them. And there are many other types of parameters based on the document being disambiguated.
Figure 2.6.3 shows some of the typical input into the process
of textual disambiguation.
Figure 2.6.3
Mapping
To execute textual disambiguation, it is necessary to “map” a document to the appropriate parameters that can be specified inside textual disambiguation. The mapping directs textual disambiguation as to how the document needs to be interpreted. The mapping process is akin to designing how a system will operate, and each type of document has its own mapping process.
Once the mapping parameters are specified and the mapping process is complete, documents can be processed. All documents of the same type can be served by the same mapping. For example, there may be one mapping for oil and gas contracts, another for human resource resume management, another for call center analysis, and so forth. Figure 2.6.4 shows the mapping process.
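Commercial textual ETL tools each express mappings in their own way, so the following is only a hypothetical sketch of the kinds of parameters a mapping for one document type might carry: the applicable taxonomies, word-level resolutions, and delimiter-based named values.

```python
# A hypothetical mapping for one document type, expressed as a plain
# Python dict. Real textual ETL products have their own formats; this
# only illustrates the kinds of parameters a mapping might direct:
# which taxonomies apply, which words to drop or resolve, and which
# delimited named values to capture.
OIL_AND_GAS_CONTRACT_MAPPING = {
    "document_type": "oil_and_gas_contract",
    "taxonomies": ["contract_terms", "petroleum_geology"],  # assumed names
    "stop_words": ["a", "an", "the", "of"],
    "alternate_spellings": {"colour": "color"},
    "named_values": [
        # capture text between a beginning and an ending delimiter
        {"name": "lease_number", "begin": "Lease No.", "end": "\n"},
        {"name": "royalty_rate", "begin": "Royalty:", "end": "%"},
    ],
}
```

Since all documents of the same type are served by one mapping, a specification like this is written once per document type and then refined iteratively.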
In almost every case the mapping process is done in an iterative manner. The first mapping of a document is created, a few documents are processed, and the analyst examines the results. The analyst decides to make a few changes and reruns the documents through textual disambiguation with the new mapping specifications. The process of gradually refining the mapping continues until the analyst is satisfied.
The iterative approach to the creation of a mapping is used because documents are notoriously complex and have many nuances that are not immediately apparent. Even for an experienced analyst, the creation of the mapping is an iterative process.
Because of the iterative nature of the creation of the mapping,
it never makes sense to create a mapping and then process thousands of documents using the initial mapping. Such a practice is
wasteful because it is almost guaranteed that the initial mapping
will need to be refined.
Figure 2.6.5 shows the iterative nature of the mapping process.
Figure 2.6.4
Figure 2.6.5
Input/Output
The input to the process of textual disambiguation is electronic text, which can come from almost anywhere. Electronic text can take the form of proper language, slang, shorthand, comments, database entries, and many other forms, and textual disambiguation needs to be able to handle all of them. In addition, electronic text can be in different languages.
Textual disambiguation can handle nonelectronic text after the
nonelectronic text passes through an automated capture mechanism such as optical character recognition (OCR) processing.
The output of textual disambiguation can take many forms, but it is always created in a “flat file format.” As such, the output can be sent to any standard database management system (DBMS) or to Hadoop. Figure 2.6.6 shows the types of output that can be created from textual disambiguation.
Figure 2.6.6
Figure 2.6.7
The output from textual disambiguation is placed into a work table area. From there, the data can be loaded into a standard DBMS using the load utility of the DBMS. Figure 2.6.7 shows that data is fed to the DBMS load utility from the work area created and managed by textual disambiguation.
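A minimal sketch of this flow follows, with a CSV file standing in for the flat-file work table and SQLite standing in for a production DBMS and its load utility; file names, table names, and columns are invented for illustration.

```python
import csv
import sqlite3

# Sketch of the flow in Figure 2.6.7: textual ETL writes its output as
# a flat file in a work area; the DBMS load utility then loads it.
# Here csv + sqlite3 stand in for the work table and the load utility.
work_rows = [
    ("doc17.txt", 482, "leak", "complaint"),
    ("doc17.txt", 511, "refund", "request"),
]
with open("work_table.csv", "w", newline="") as f:
    csv.writer(f).writerows(work_rows)      # the flat-file work table

conn = sqlite3.connect("analytics.db")
conn.execute("CREATE TABLE IF NOT EXISTS text_fact "
             "(doc_id TEXT, byte_offset INTEGER, word TEXT, context TEXT)")
with open("work_table.csv", newline="") as f:
    # bulk insert plays the role of the DBMS load utility
    conn.executemany("INSERT INTO text_fact VALUES (?,?,?,?)", csv.reader(f))
conn.commit()
```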
Document Fracturing/Named Value Processing
There are many features to the actual processing done by textual disambiguation, but there are two primary paths for processing a document. These paths are called “document fracturing” and “named value processing.”
Document fracturing is the process by which a document is processed word by word, with operations such as stop word processing, alternate spelling and acronym resolution, homographic resolution, and the like. The effect of document fracturing is that after processing, the document still has a recognizable shape, albeit in a modified form. For all practical purposes it appears as if the document has been fractured.
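A minimal sketch of document fracturing follows. The stop word list and the spelling, acronym, and homograph tables are tiny illustrative stand-ins for the tables a real mapping would supply.

```python
# Minimal sketch of document fracturing: word-by-word processing with
# stop word removal and alternate spelling, acronym, and homograph
# resolution. All lookup tables here are hypothetical examples.
STOP_WORDS = {"a", "an", "the", "of", "and"}
SPELLINGS  = {"colour": "color"}
ACRONYMS   = {"ha": "heart attack"}    # assumed context: cardiology notes
HOMOGRAPHS = {"bore": "drilled well"}  # assumed context: oil and gas

def fracture(text: str) -> list[str]:
    out = []
    for word in text.lower().split():
        if word in STOP_WORDS:
            continue                          # stop word processing
        word = SPELLINGS.get(word, word)      # alternate spelling resolution
        word = ACRONYMS.get(word, word)       # acronym resolution
        word = HOMOGRAPHS.get(word, word)     # homographic resolution
        out.append(word)
    return out

# The result still has the recognizable shape of the document,
# albeit in a modified, "fractured" form.
print(fracture("The colour of the bore"))  # ['color', 'drilled well']
```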
The second major type of processing is named value processing, which occurs when inline contextualization needs to be done. Inline contextualization applies where the text is repetitive, as sometimes happens. When text is repetitive, it can be processed by looking for unique beginning and ending delimiters.
There are other types of processing that can be done by textual
disambiguation, but document fracturing and named value processing are the two primary analytical processing paths.
Figure 2.6.8 depicts the two primary forms of processing that
occur in textual disambiguation.
Preprocessing a Document
On occasion it is necessary to preprocess a document because its text cannot be processed in a standard fashion by textual disambiguation. In these circumstances the text is passed through a preprocessor, where it is edited to the point that it can be processed in a normal manner by textual disambiguation.
Figure 2.6.8
Figure 2.6.9
As a rule, you don’t want to preprocess text unless you absolutely have to, because preprocessing automatically doubles the machine cycles required to process the text. Figure 2.6.9 shows that, if necessary, electronic text can be preprocessed.
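What a preprocessor actually does depends on what is wrong with the text. The sketch below shows a few plausible repairs (Unicode normalization, stripping unprintable characters, collapsing whitespace) purely as an illustration; the preprocessor is a full extra pass over the text, which is where the extra machine cycles go.

```python
import unicodedata

# Sketch of a preprocessor: text that textual disambiguation cannot
# handle directly is edited into a processable form first. The repairs
# below are illustrative examples, not a fixed recipe.
def preprocess(raw: str) -> str:
    text = unicodedata.normalize("NFKC", raw)            # unify lookalike chars
    text = "".join(ch for ch in text if ch.isprintable() or ch in "\n\t")
    return " ".join(text.split())                        # collapse whitespace
```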
Emails – A Special Case
Emails are a special case of nonrepetitive unstructured data. They are special because everybody has them and because there are so many of them. Another reason emails are special is that they carry an enormous amount of system overhead that is useful to the system and to no one else. At the same time, emails carry a lot of valuable information about customers’ attitudes and activities.
It is possible to simply send emails into textual disambiguation.
But such an exercise is fruitless because of the spam and blather that
is found in emails. Spam is the nonbusiness-relevant information
that is generated outside the corporation. Blather is the internally
generated correspondence that is nonbusiness related. For example,
blather contains the jokes that are sent throughout the corporation.
In order to use textual disambiguation effectively, the spam, blather, and system information need to be filtered out. Otherwise, the system becomes overwhelmed by meaningless information. Figure 2.6.10 shows a filter to remove unnecessary information from the stream of emails before the emails are processed by textual disambiguation.
Figure 2.6.10
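A sketch of such a filter appears below. The internal domain, the blather markers, and the spam heuristic are all invented for illustration; a production filter would be considerably more elaborate.

```python
# Sketch of the email filter in Figure 2.6.10: spam, blather, and
# system overhead are screened out before textual disambiguation runs.
# The domain name and heuristics are assumptions for illustration only.
INTERNAL_DOMAIN = "example.com"
BLATHER_MARKERS = {"joke", "fantasy football", "happy hour"}

def keep_email(sender: str, subject: str, body: str) -> bool:
    external = not sender.endswith("@" + INTERNAL_DOMAIN)
    if external and "unsubscribe" in body.lower():
        return False                     # likely spam from outside
    if any(m in subject.lower() for m in BLATHER_MARKERS):
        return False                     # internally generated blather
    return True                          # business-relevant: pass through

emails = [("ads@promo.net", "Win big", "click to unsubscribe"),
          ("pat@example.com", "Customer complaint", "unit leaking")]
relevant = [e for e in emails if keep_email(*e)]  # only the complaint survives
```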
Spreadsheets
Another special case is that of spreadsheets. Spreadsheets are ubiquitous. Sometimes the information on a spreadsheet is purely numerical, but on other occasions there is character-based information as well. As a rule, textual disambiguation does not process numerical information from a spreadsheet, because there is no metadata to accurately describe the numeric values. (There is formulaic information for the numbers found on a spreadsheet, but spreadsheet formulae are almost worthless as metadata descriptions of the meaning of the numbers.) For this reason, the only spreadsheet data that makes its way into textual ETL is the character-based descriptive data.
To this end, there is an interface that takes the useful data from the spreadsheet and formats it into a working database. From the working database, the data is then sent into textual disambiguation, as shown in Figure 2.6.11.
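As an illustration, the sketch below assumes the spreadsheet has been exported to a CSV file (a simplification; a real interface would read the native format) and carries only the character-based cells into a working database, skipping anything that parses as a number.

```python
import csv
import sqlite3

# Sketch of the spreadsheet interface: only character-based cells are
# carried into the working database, since numeric cells lack reliable
# metadata. The file name "budget.csv" is a hypothetical CSV export.
def looks_numeric(cell: str) -> bool:
    try:
        float(cell.replace(",", ""))
        return True
    except ValueError:
        return False

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE working (row INTEGER, col INTEGER, text TEXT)")
with open("budget.csv", newline="") as f:
    for r, row in enumerate(csv.reader(f)):
        for c, cell in enumerate(row):
            if cell and not looks_numeric(cell):   # keep descriptive text only
                conn.execute("INSERT INTO working VALUES (?,?,?)", (r, c, cell))
# From the working database, the rows are then fed into textual ETL.
```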
Report Decompilation
Most textual information is found in the form of a document.
And when text is on a document it is processed linearly by textual
disambiguation. Figure 2.6.12 shows that textual disambiguation
operates in a linear fashion.
Figure 2.6.11
Figure 2.6.12
But text in a document is not the only form of nonrepetitive unstructured data. Another common form is the table. Tables are found everywhere, including bank statements, research papers, and corporate invoices. On some occasions it is necessary to read the table in as input, just as text is read in from a document. To this end a specialized form of textual disambiguation is required, called “report decompilation.”
In report decompilation the contents of the report are handled very differently than the contents of text, because in a report the information cannot be handled in a linear format. Figure 2.6.13 shows that there are different elements of a report that must be brought together in a normalized format. The problem is that those elements appear in a decidedly nonlinear format. Therefore an entirely different form of textual disambiguation is required.
Figure 2.6.13
Figure 2.6.14
Figure 2.6.14 shows that reports can be sent to report decompilation for reduction to a normalized format. The end result of report decompilation is exactly the same as the end result of textual disambiguation, but the processing and the logic that arrive at that end result are very different in content and substance.
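As a sketch of the idea, the code below pulls scattered elements from a toy bank statement and brings them together into normalized rows. The layout, field positions, and regular expressions are hypothetical; real report decompilation requires layout-specific logic for each type of report.

```python
import re

# Sketch of report decompilation: elements scattered nonlinearly across
# a report (here, an invented bank statement) are located individually
# and brought together into normalized rows.
statement = """ACME BANK            Account: 00-4471
Date: 2024-03-31
  03-02  DEPOSIT        500.00
  03-15  CHECK #881     -75.25
"""

# Header elements live in one place, detail lines in another; the
# decompiler must gather them into a single normalized shape.
account = re.search(r"Account:\s*(\S+)", statement).group(1)
date    = re.search(r"Date:\s*(\S+)", statement).group(1)
rows = [(account, date, d, desc.strip(), float(amt))
        for d, desc, amt in re.findall(
            r"^\s+(\d\d-\d\d)\s+(.+?)\s+(-?\d+\.\d\d)\s*$",
            statement, flags=re.M)]
print(rows)
# [('00-4471', '2024-03-31', '03-02', 'DEPOSIT', 500.0),
#  ('00-4471', '2024-03-31', '03-15', 'CHECK #881', -75.25)]
```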