Sample Projects IBM InfoSphere Discovery Version 4 Release 5.1

IBM InfoSphere Discovery
Version 4 Release 5.1
Sample Projects
SC23-9880-04
© Copyright IBM Corporation 2006, 2011.
US Government Users Restricted Rights – Use, duplication or disclosure restricted by GSA ADP Schedule Contract
with IBM Corp.
Contents
Chapter 1. Installing IBM InfoSphere Discovery
  Prerequisites
  Supported Data and Database
  Automatic Database Configuration
  Using IBM InfoSphere Discovery with a Different DB2 Database
  Installing Discovery with IBM DB2 Express Edition
  Uninstalling Discovery
Chapter 2. Introduction to demonstrations about IBM InfoSphere Discovery
  Demonstration Project: Overlaps and the Unified Schema Builder
    Start Discovery Studio and create a project
    Create and populate the data sets
    Import tables from the JDBC connection into the data set
    Create the Region data set
    Create the CRM data set
    Run and review column analysis
    Identifying critical elements
    Discover and review PF Keys
    Discover and review data objects
    Overlaps
    Creating a unified customer model
    Unified column analysis
    Perform match and merge analysis
    Create a report to include in your development specifications
  Demonstration Project: Archiving tables by defining business objects
    Start InfoSphere Discovery
    Create the project and the data sets
    Import CIS tables
    Defining option sets for your analysis
    Analyzing and reviewing discovered relationships
    Adjusting the data object
    Export artifacts
Contacting IBM
Product accessibility
Accessing product documentation
Links to non-IBM Web sites
Notices and trademarks
Index
Chapter 1. Installing IBM InfoSphere Discovery
Discovery consists of three components: Discovery Server, Discovery Engine
Service, and Discovery Studio.
When you choose to install IBM® DB2® Express® Edition during installation, all
three Discovery components must be installed on a single host.
The Discovery installer installs IBM DB2 Express Edition and all database tables
necessary to run the demo project, along with a completed version of the project
that you can use for reference. If you do not install IBM DB2 Express Edition, the
demo project will not be installed and you cannot use this IBM InfoSphere
Discovery Sample Projects guide or run the demo project.
Prerequisites
To run the demo project, you can either install IBM DB2 9.7 Express Edition along
with Discovery or use an existing installation of IBM DB2 9.7 Express Edition on a
Windows platform.
Prerequisites are described in the IBM InfoSphere Discovery Installation Guide.
Supported Data and Database
The tutorials and demos used here are pre-configured for IBM DB2 9.7 on
Windows. The bundled demo project uses a DB2 source, a DB2 repository, and DB2
staging databases.
The IBM InfoSphere Discovery Installation Guide lists the operating system
requirements, supported databases, and supported ODBC or JDBC drivers for your
production environment.
Automatic Database Configuration
You have the option of installing IBM DB2 Express Edition along with Discovery. If
you do, the Discovery installer automatically preconfigures IBM DB2 Express
Edition and IBM InfoSphere Discovery for the demo projects by performing the
following actions:
• creating the data sources and loading the tables into the database
• creating the required users and JDBC connections
• creating a default staging data source in Discovery Studio
• importing a completed version of one project into Discovery Studio
The Discovery installer catalogs the system JDBC data sources with the same
names as the databases.
Note: IBM DB2 Express Edition cannot be installed on a host that already has any
existing DB2 version installed (including other versions of DB2 clients or servers).
To install the bundled IBM DB2 Express Edition version and the pre-configured
demo project along with Discovery, make sure any previous DB2 packages are
completely uninstalled from the host.
Using IBM InfoSphere Discovery with a Different DB2
Database
You may use a different DB2 database with IBM InfoSphere Discovery, but the
installer will not automatically preconfigure it or Discovery Studio.
Installing Discovery with IBM DB2 Express Edition
About this task
The following instructions are for installing IBM InfoSphere Discovery with IBM
DB2 Express Edition.
Note: If IBM InfoSphere Discovery cannot be installed using these steps, install the
product using the instructions in the IBM InfoSphere Discovery Installation Guide.
Procedure
1. Make sure the host meets the hardware and software prerequisites.
2. Close all applications and windows on the machine.
3. In a file explorer window, open the <installation_disk>/CD directory, then
double-click the file install.exe.
The installation package starts extracting, which can take up to one minute.
When it is finished, the installer's Introduction screen appears.
4. Click Next to start the installation.
5. Accept the license agreement and click Next.
6. On the remaining screens, click Next to accept the default options.
7. If any of the following situations occurs during installation, take action as
noted below.
• If an error message states that IBM DB2 Express Edition cannot be installed,
you have the following options:
– Quit installation and completely uninstall any existing DB2 product from
the machine (including deleting the DB2 directory), then start Discovery
installation again.
– Uncheck the IBM DB2 Express Edition option in the installer, then
continue installation. IBM DB2 Express Edition will not be installed and
you will not be able to use this IBM InfoSphere Discovery Sample Projects
guide or run the demo project.
• If the Discovery Server Port screen states that some or all of the required ports
are unavailable, change the ports as prompted. Contact your system
administrator if needed.
• If a security notice about blocking Java 2 Platform Standard Edition Binary
appears, click Unblock to allow Windows to access Java.
• If the installer asks to install Microsoft Visual J# 2.0 Redistributable Package,
click Next to accept the installation.
• If an error message states that the installer did not successfully preconfigure
IBM DB2 Express Edition or Discovery Studio, you will not be able to use
this IBM InfoSphere Discovery Sample Projects guide or run the demo project.
8. In the Discovery Server Host screen, enter the following value:
• Discovery Server Hostname: localhost
9. When the Start Both Services screen appears, click Next and then Done to close
the Discovery installer.
Results
IBM InfoSphere Discovery and IBM DB2 Express Edition are now installed. The
appropriate JDBC connections, users, and databases are created, the demo tables
are loaded, and you are ready to start Discovery Studio.
Uninstalling Discovery
About this task
The uninstaller automatically uninstalls Discovery Server, Discovery Engine
Service, and Discovery Studio.
To uninstall IBM InfoSphere Discovery:
Procedure
1. Stop Discovery Studio, Discovery Server, and Discovery Engine Service. Make
sure no Discovery Studio tasks are queued or running.
2. From the Start menu, select Programs>IBM InfoSphere>Discovery>Uninstall
IBM InfoSphere Discovery.
3. Accept the default, Full, by clicking Next.
4. The uninstaller stops the selected components, if they are running, and
uninstalls them from the machine.
5. If any components or files could not be uninstalled, a message appears. In most
cases these are logs, configuration files, or user-created files. These files do not
contain any project data and can be deleted.
Results
IBM InfoSphere Discovery is now uninstalled.
Chapter 2. Introduction to demonstrations about IBM
InfoSphere Discovery
By using InfoSphere® Discovery, you can find and manipulate relationships in your
data. These demonstrations illustrate some of the basic principles of Discovery.
These instructions assume the following things:
• You installed IBM DB2 Express Edition with IBM InfoSphere Discovery.
• IBM DB2 Express Edition and InfoSphere Discovery Studio are successfully
preconfigured with the following objects:
– The necessary data sources are created.
– The tables are loaded in IBM DB2 Express Edition.
– The required users and JDBC connections are created.
– A default staging server is created in IBM InfoSphere Discovery Studio.
As part of this preconfiguration, four completed demonstration projects are
imported into Discovery Studio. You can review the completed projects before you
run these learning modules, and use them for reference as you work.
Important: If you did not install IBM DB2 Express Edition, your demonstration
projects will not be automatically configured. See the IBM InfoSphere Discovery User
Guide for instructions on creating projects and executing tasks.
Learning objectives
The objective of the demonstrations is to help you understand how to use
InfoSphere Discovery to analyze data. Specifically you will be able to do the
following:
• Create a project.
• Create and populate data sets.
• Run and review column analysis.
• Discover and review primary and foreign keys.
• Discover and review data objects.
• Discover and review overlaps and unified schemas.
Time required
Each demonstration should take approximately 60 minutes to finish. If you explore
other concepts related to the demonstrations, it could take longer to complete.
Demonstration Project: Overlaps and the Unified Schema Builder
Consolidating data from multiple systems can be difficult. IBM InfoSphere
Discovery enables a four-step methodology for prototyping the artifacts for the
final solution.
The four steps are:
1. Inventory the data landscape.
2. Model the target.
3. Map to and analyze the target.
4. Perform match and merge analysis.
The Discover_Data_Consolidation sample project contains three data sets already
defined and configured for you:
• CRM
• Region
• Community
Each of the data sets appears as a tab in the Data Sets view. You can view the
connection information for any of the data sets by using the following procedure:
1. Click on a data set tab.
2. Right-click on the data set in Database Connections & Tables.
3. Select Edit the selected connection.
You can also view the data content by clicking the Column Analysis tab.
Learning objectives
After completing the lessons in this module you will be able to consolidate data
from multiple source systems.
This module should take approximately 60 minutes to complete.
Start Discovery Studio and create a project
You can create your own project to start learning to consolidate data from multiple
systems.
All work in IBM InfoSphere Discovery is done in projects. Begin the lesson by
opening Discovery Studio and then creating a project.
1. From the Windows Start menu, select Programs > IBM InfoSphere >
Discovery > Discovery Studio. Discovery Studio opens and automatically connects to
the Discovery Server. The sample projects that were loaded during installation,
Discover_Data_Consolidation, Discover_PFKey_DataObject, and
Discover_Sensitive_Critical_Data, appear in the project list of the Source Data
Discovery tab. There is also a sample project in the Transformation Discovery
tab called Discover_Transformation.
To hide the Error List and Output pane, click the button in the upper right
corner of the pane.
2. In the Source Data Discovery tab, click New Project. You can create as many
projects as necessary, but only one project can be open at a time.
3. In the Name field, type the name of the project. In this example, type Training
- Overlaps and Unified Schema Builder.
4. Clear the Use Password checkbox, or enter a password to protect your
project from unauthorized access. Use the default settings for the other
fields.
5. Click OK.
The project Training - Overlaps and Unified Schema Builder is now created. Discovery
automatically opens the next tab, Data Sets. You can click the Home tab to see the
Training - Overlaps and Unified Schema Builder project in the Source Data Discovery
project list.
Create and populate the data sets
The new project requires three data sets. Create and name the data sets, specify a
JDBC connection for each one, and import tables into each one.
1. In the Data Sets tab, click Rename. In the dialog box type Community and click
OK.
2. Click the Click here to add a new connection link.
3. In the Create Connection window, complete the following fields using the
values shown:
• Connection Name: Community Source
• Database Server Name: localhost
• Database Name: ISD_SRC
• User Name: ISD_MDM
• Password: ISD_user1
4. In the Create Connection window, click Test Connection to verify the
connection parameters.
5. Click OK to save the connection. The Community Source connection is added
to the Import Objects list under the Database Connections & Tables section.
You have created a new data set and specified JDBC connection information for
that data set.
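The connection values above correspond to a DB2 JDBC URL of the general form jdbc:db2://host:port/database. The following is a minimal sketch of that correspondence, not a description of how Discovery Studio builds its connection. The ConnectionInfo class and db2_jdbc_url function are hypothetical names, and port 50000 is an assumption (a common DB2 default) because this guide does not state a port.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConnectionInfo:
    """Values entered in the Create Connection window (hypothetical holder)."""
    name: str
    server: str
    database: str
    user: str
    port: int = 50000  # assumption: common DB2 default, not stated in this guide


def db2_jdbc_url(info: ConnectionInfo) -> str:
    """Build a URL of the form jdbc:db2://host:port/database."""
    return f"jdbc:db2://{info.server}:{info.port}/{info.database}"


community = ConnectionInfo(
    name="Community Source",
    server="localhost",
    database="ISD_SRC",
    user="ISD_MDM",
)
url = db2_jdbc_url(community)
print(url)
```

The password is deliberately left out of the sketch; in practice it would be supplied separately when the connection is opened.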
Import tables from the JDBC connection into the data set
After creating the data set, you need to add tables to begin working with the data.
1. In the Import Objects list of the Data Sets tab, right-click the JDBC connection,
Community Source, that you created in the previous lesson. Select Import
Tables/File Formats from the drop-down menu.
2. In the Import Table Wizard, click Search Tables.
3. In the Table Name field, type COMMUNITY_ to search for tables that have names
that begin with that string.
4. Click Next.
5. The search finds three tables whose names begin with the string
COMMUNITY_. Click Select All and then click Finish to import all three
tables.
The tables are imported into the Community data set. The physical tables are
listed in the Database Connections & Tables list and are appended with _PT. One
logical table is created for each physical table, and is listed in Logical Tables.
Create the Region data set
You can now create the second data set that is needed for this demonstration and
import tables into it. You follow the same steps that you did for the Community
data set.
1. Right-click on the Community tab and select Add Data Set. A second, blank
data set is added to the project.
2. Rename the second data set by clicking Rename and changing the name to
Region.
3. Click the Click here to add a new connection link.
4. In the Create Connection window, complete the connection information by
using the following values:
• Connection Name: Region Source
• Database Server Name: localhost (same as previous connection)
• Database Name: ISD_SRC (same as previous connection)
• User Name: ISD_MDM (same as previous connection)
• Password: ISD_user1 (same as previous connection)
5. Click OK.
6. In the Import Objects list of the Data Sets tab, right-click the JDBC connection,
Region Source that you just created. Select Import Tables/File Formats from
the drop-down menu.
7. In the Import Table Wizard, click Search Tables.
8. In the Table Name field, type Region_ to search for tables that have names
that begin with that string.
9. Click Next.
10. The search finds three tables whose names begin with the string Region_.
Click Select All and then click Finish to import all three tables. These
are the tables to import:
• ISD_MDM.REGION_ACCT_NAMES
• ISD_MDM.REGION_ADDR_TYPE
• ISD_MDM.REGION_BRCH
The tables are imported into the Region data set. The physical tables are listed in
the Database Connections & Tables list and are appended with _PT. One logical
table is created for each physical table, and is listed in Logical Tables.
Create the CRM data set
Create the third of the three data sets and populate the data set. You follow the
same steps that you did for the Region and Community data sets.
1. Right-click the Region tab and select Add Data Set. A third blank data set is
added to the project.
2. Rename the third data set by clicking Rename and changing the name to CRM.
3. Click the Click here to add a new connection link.
4. In the Create Connection window, complete the connection information by
using the following values:
• Connection Name: CRM Source
• Database Server Name: localhost (same as previous connection)
• Database Name: ISD_SRC (same as previous connection)
• User Name: ISD_MDM (same as previous connection)
• Password: ISD_user1 (same as previous connection)
5. In the Import Objects list of the Data Sets tab, right-click the JDBC connection,
CRM Source, that you just created. Select Import Tables/File Formats from the
drop-down menu.
6. In the Import Table Wizard, click Search Tables.
7. In the Table Name field, type CRM_ to search for tables that have names that
begin with that string.
8. Click Next.
9. The search finds three tables whose names begin with the string CRM_. Click
Select All > Finish to import all three tables. These are the tables to
import:
• ISD_MDM.CRM_ACCT_TYPE
• ISD_MDM.CRM_ADDRESS_TYPE
• ISD_MDM.CRM_BRCH_1A
The tables are imported into the CRM data set. The physical tables are listed in the
Database Connections & Tables list and are appended with _PT. One logical table
is created for each physical table, and is listed in Logical Tables.
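The prefix search and the _PT naming used in the three import procedures can be sketched as follows. This is an illustration only: the search_tables and physical_name functions are hypothetical, the case-insensitive comparison is an assumption, and the exact form in which Discovery Studio displays physical table names (for example, with or without the schema) is not confirmed by this guide. The table names are the Region and CRM tables listed above.

```python
# Table names listed in this guide for the ISD_MDM schema.
CATALOG = [
    "ISD_MDM.REGION_ACCT_NAMES",
    "ISD_MDM.REGION_ADDR_TYPE",
    "ISD_MDM.REGION_BRCH",
    "ISD_MDM.CRM_ACCT_TYPE",
    "ISD_MDM.CRM_ADDRESS_TYPE",
    "ISD_MDM.CRM_BRCH_1A",
]


def search_tables(catalog, prefix):
    """Return tables whose unqualified name starts with the prefix.

    Assumption: the comparison ignores case.
    """
    wanted = prefix.upper()
    return [t for t in catalog if t.split(".", 1)[1].upper().startswith(wanted)]


def physical_name(qualified):
    """Unqualified table name with the _PT suffix appended (assumed display form)."""
    return qualified.split(".", 1)[1] + "_PT"


crm_tables = search_tables(CATALOG, "CRM_")
crm_physical = [physical_name(t) for t in crm_tables]
print(crm_physical)
```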
Run and review column analysis
Column analysis is performed individually on each table within each data set.
The Column Analysis tab displays information about all columns in the data sets,
such as the following information:
• Metadata
• Data types from physical or logical tables (Native Type)
• Data types used in the staging database
• Formats for textual data discovered as number (NUMBERSTRING) or date-time
data (DATETIMESTRING)
• Statistics gathered during the profiling step
If necessary, you can manually change a data type of a column and some other
metadata. You can use data preview to verify the actual data.
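To make the NUMBERSTRING and DATETIMESTRING formats concrete, here is a minimal sketch of how a textual column might be classified by the shape of its values. The rules below (all digits, or parseable with a single date format) and the classify_text_column function are assumptions for illustration; the guide does not describe the rules that Discovery actually applies.

```python
from datetime import datetime


def classify_text_column(values, date_format="%Y-%m-%d"):
    """Label a column of strings by what its values look like.

    Returns "NUMBERSTRING" if every non-empty value is all digits,
    "DATETIMESTRING" if every non-empty value parses with date_format,
    and "STRING" otherwise. Both rules and the default date_format are
    illustrative assumptions.
    """
    present = [v.strip() for v in values if v and v.strip()]
    if not present:
        return "STRING"
    if all(v.isdigit() for v in present):
        return "NUMBERSTRING"
    try:
        for v in present:
            datetime.strptime(v, date_format)
    except ValueError:
        return "STRING"
    return "DATETIMESTRING"


# Made-up sample values.
ssn_like = classify_text_column(["123456789", "987654321", ""])
date_like = classify_text_column(["2011-01-31", "2010-12-01"])
mixed = classify_text_column(["abc", "123"])
print(ssn_like, date_like, mixed)
```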
Tip: Always re-run Discovery, including Column Analysis, after importing tables
or text files, changing a primary sample set, reloading or reimporting tables, or
performing any other action that affects the contents of a table, file, or data set.
1. In the Data Sets tab, click Run Next Steps.
2. In the Processing Options window, click Run to accept the defaults and queue
the Column Analysis task for processing on the tables in the project.
As soon as you queue the task, the Column Analysis tab appears. Imported
metadata and other information available without discovery is displayed.
While the task is queued or running, the project is locked. You can click on
other tabs but you cannot make any changes that affect the project, such as
adding data sets or tables, while a project is locked. Notice in the following
Column Analysis figure that Discovery can include textual data types (the SSN
column is a NumberString).
3. When processing is complete, review the results in the Column Analysis tab.
Review the tables in the data sets by clicking on each data set tab to display its
tables in the Tables list, and then clicking each table to display the column
information in the center grid.
To display the actual data in a selected table, click Preview Data. You can sort,
filter, and export the data from the preview.
4. Verify that all of the metadata is correct. Imported or discovered metadata is
shown in the first nine columns of the center grid in the Metadata category.
The Data Type, Length, Precision, Scale, and Formats fields are editable, if
necessary. If you change any of those values, click Re-Run Step to re-run
Column Analysis.
5. Review the discovered statistics in the remaining columns of the center grid.
Scroll to the right if necessary to view all columns in the center grid. These
values can help you identify which columns or tables might be related.
Statistics cannot be manually changed.
For example, Cardinality and Selectivity are used together to identify how
unique the values are in a column. Click Value Frequencies in the menu bar
for a list of each value in the column and how often it appears.
Min and Max display the actual smallest and largest values in the column.
Mode is the most common value in the column.
You have done a very basic column analysis to get an understanding of the data in
these sample data sets.
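The statistics reviewed in this lesson can be reproduced on a small scale to see how they relate to each other. The sketch below computes them for one column held as a Python list. The column_statistics function is hypothetical, the sample values are made up, and treating None as a missing value is an assumption. Selectivity is computed as cardinality divided by the number of rows.

```python
from collections import Counter


def column_statistics(values):
    """Compute basic profile statistics for one column.

    Assumption: None marks a missing value and is ignored for every
    statistic except the row count.
    """
    rows = len(values)
    present = [v for v in values if v is not None]
    frequencies = Counter(present)
    cardinality = len(frequencies)  # number of distinct values
    return {
        "rows": rows,
        "cardinality": cardinality,
        "selectivity": cardinality / rows if rows else 0.0,
        "min": min(present) if present else None,
        "max": max(present) if present else None,
        "mode": frequencies.most_common(1)[0][0] if present else None,
        "value_frequencies": dict(frequencies),
    }


states = ["NY", "CA", "NY", "TX", "NY", "CA"]  # made-up sample values
stats = column_statistics(states)
print(stats)
```

A selectivity close to 1 means that almost every row holds a different value, which is typical of identifiers; a low selectivity is typical of codes and categories.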
Identifying critical elements
In most projects you understand at least one data source better than the others.
For the purposes of this lesson, assume that you know the CRM data better than
the other sources.
Because you know the CRM source, first mark up its known critical data
elements (CDEs).
1. Click the Column Analysis tab.
2. Select the CRM data set.
3. Select the table CRM_BRCH_1A.
4. Select the following boxes in the CDE column:
• FIRST_NAME
• LAST_NAME
• TAX_ID
• ADDRESS_LINE_1
• CITY
• STATE
This process marks these columns as attributes that you want to include in
your new target schema.
5. Click Run Next Steps to process these attributes.
You can go into the other data sets to mark any data elements that you recognize
as critical to retain in the consolidated project. You can also use the Value
Frequencies, Pattern Frequencies, or Length Frequencies views to examine the data
content of a column that you think might be critical. These CDEs help you focus
on the relationships in later discovery steps.
Discover and review PF Keys
PF Key discovery is performed across all columns within each data set.
PF Keys are primary-foreign key pairs. InfoSphere Discovery discovers column
matches, which are relationships between the data in two or more columns in
different tables in the same data set. Based on the statistics and additional
calculations not shown, Discovery promotes certain column matches to the status
of PF Keys. The PF Key with the best statistics for each table pair is selected as
the primary PF Key for that table pair.
1. In the Column Analysis tab, click Run Next Steps.
2. In the Processing Options window, ensure that the slider is next to PF Keys.
Click Run to run PF Key discovery. Processing will take a minute or two to
complete.
3. When processing is complete, review the discovered primary-foreign keys by
clicking on each data set tab.
The Connected Tables and Unconnected Tables in each data set are listed on
the left of the screen and are also shown graphically in the center pane. Expand
the list and each table in the list to view its PF Keys and column matches.
Scroll the display until the PF Keys and column matches of interest are visible
in the center pane, dragging tables (boxes) and relationships (lines) to rearrange
them as necessary.
The Display Mode allows you to filter the center panel to show only column
matches, only PF Keys, or only the selected item. Zoom is also useful.
4. Review the statistics for each PF Key by clicking on its connection in the
Connected Tables list or on the connecting line in the center pane. The SQL for
the selected PF Key and its discovered statistics are displayed in the grid below
the center pane.
You now know something about the primary and foreign key relationships and
have a better understanding of the data.
The statistics for each relationship are based on the join expression, shown in the
Foreign Keys tab. There might be several join expressions discovered for each
relationship, each with different statistics.
• Row Hit Rate (RHR) is the total number of table rows that satisfy the PF Key
expression.
• Value Hit Rate (VHR) is the number of unique values that satisfy the PF Key
expression.
• Cardinality is the number of unique value combinations involved in the PF Key
expression.
• Selectivity is the Cardinality divided by the total number of rows.
A strong PF Key relationship has a high RHR, a high VHR on both the primary and
foreign sides, and a high Selectivity on the primary side.
In some cases, especially when the statistics for all discovered relationships are
similar, you might need to investigate further to determine which relationships are
valid and which join expression is the best. The Show Hits, Show Misses, or
Show Duplicates drop-down button allows you to preview the actual data in the
tables.
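The PF Key statistics and the Show Hits and Show Misses previews can be illustrated with a small sketch for a single-column join. This is not Discovery's implementation: the pf_key_statistics function is hypothetical, the sample columns are made up, and expressing the hit rates as fractions between 0 and 1 (rather than as raw counts) is an assumption made here for readability.

```python
def pf_key_statistics(primary_values, foreign_values):
    """Statistics for a single-column join: foreign value = primary value.

    Assumption: hit rates are reported as fractions between 0 and 1.
    """
    primary_set = set(primary_values)
    foreign_set = set(foreign_values)
    hit_rows = [v for v in foreign_values if v in primary_set]
    miss_rows = [v for v in foreign_values if v not in primary_set]
    return {
        # Share of foreign-side rows that find a match on the primary side.
        "row_hit_rate": len(hit_rows) / len(foreign_values) if foreign_values else 0.0,
        # Share of distinct foreign-side values that find a match.
        "value_hit_rate": len(foreign_set & primary_set) / len(foreign_set) if foreign_set else 0.0,
        "primary_cardinality": len(primary_set),
        # Cardinality divided by the total number of rows on the primary side.
        "primary_selectivity": len(primary_set) / len(primary_values) if primary_values else 0.0,
        "misses": miss_rows,
    }


branch_ids = [1, 2, 3, 4]            # hypothetical primary key column
account_branch = [1, 1, 2, 3, 3, 9]  # hypothetical foreign key column
pf_stats = pf_key_statistics(branch_ids, account_branch)
print(pf_stats)
```

In this made-up example the primary side is fully selective and only the value 9 fails to match, which is the kind of row that a misses preview would surface for review.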
Discover and review data objects
Data object discovery is performed across all tables within each data set.
A data object is a logical cluster of all tables in a data set that have one or more
columns that contain data that is related to the same business entity. Data objects
are not maps, but instead represent an object view of related tables. By grouping
tables in this way, InfoSphere Discovery can narrow the focus of the analysis to
only the tables that are known to be related.
Each table in the data set is represented in at least one data object, and a data
object can contain as many tables as necessary. If more than one PF Key was found
between a pair of tables, Discovery creates one data object for the tables based on
the primary PF Key. A data object with only one table means that no other tables
in the data set contain data that is related to that table.
Tip: A table that is not related to any others within its own data set may still be
related to a table in another data set. Discovery across data sets is performed in the
Target Matches step, which is not included in these lessons.
For example, assume a data set contains three tables. In the PF Keys step,
Discovery found several primary-foreign keys between two of the tables and
selected one PF Key as primary. In the Data Objects step, Discovery creates two
data objects: one for the two tables related by the primary PF Key, and one for the
unrelated third table.
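The three-table example can be sketched by treating tables as nodes and primary PF Keys as links, then grouping linked tables together. This is a deliberate simplification that uses connected components; the way Discovery actually builds data objects may differ. The table names and the group_into_data_objects function are hypothetical.

```python
def group_into_data_objects(tables, primary_pf_keys):
    """Group tables so that tables linked by a primary PF Key share a group.

    Uses a small union-find structure; each resulting group stands in
    for one data object in this simplified model.
    """
    parent = {t: t for t in tables}

    def find(t):
        while parent[t] != t:
            parent[t] = parent[parent[t]]  # path halving
            t = parent[t]
        return t

    for left, right in primary_pf_keys:
        parent[find(left)] = find(right)

    groups = {}
    for t in tables:
        groups.setdefault(find(t), []).append(t)
    return sorted(sorted(g) for g in groups.values())


tables = ["ORDERS", "CUSTOMERS", "HOLIDAYS"]  # hypothetical table names
primary_keys = [("ORDERS", "CUSTOMERS")]      # one primary PF Key
data_objects = group_into_data_objects(tables, primary_keys)
print(data_objects)
```

As in the example above, the two related tables form one group and the unrelated third table forms a group of its own.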
1. In the PF Keys tab, click Run Next Steps.
2. In the Processing Options dialog, click Run to execute Data Object processing.
3. When processing is complete, verify that the data objects are sensible and
accurate, as measured by the statistics and your knowledge of the data.
The data objects discovered within each data set are shown in the Data Objects
list on the left of the screen. Expand each data object in the list to display the
tables in it. When you click on a data object or one of its tables, the data object
is displayed in the center pane.
Scroll the center pane display, if necessary, to see all of the tables and
relationships within a data object, dragging tables (boxes) and relationships
(lines) to rearrange them as necessary.
Click on a connecting line in the diagram to display statistics about the PF Key
relationships between the two tables.
You have reviewed the PF Key relationships in the data objects and are satisfied
with the validity of the relationships.
Overlaps
The main task in the Overlaps tab is to review the discovered overlaps. This
includes viewing the column data to verify that the overlaps are useful and valid,
deleting incorrect overlaps, and adding overlaps that you know exist but were not
discovered. Accurate results provide a clear picture of overlapping data in your
data sources.
1. In the Data Objects tab, click Run Next Steps.
2. In the Processing Options window, click Run.
3. When processing is complete, review the overlaps. Results are provided
separately for each data set, but are combined into Data Set Summary and
Data Set Overlaps pages.
The graphic on the top-level Data Set Summary page provides a visual
summary of the overlap statistics. Each group of columns corresponds to a row
in the grid above the graphic.
4. Review the results by clicking on the statistics to drill down into the data. The
Data Set Summary pages and Data Set Overlaps pages each have three levels:
Data Set Summary, Table Summary, and Column Summary.
When you examine the CRM data set, you can see that 22 out of the 33
columns overlap with columns from other data sets. These overlaps can
provide important insight into the relationships between the CRM data set and
other data sets.
a. Click on the 22 in the table. You see a list of all CRM columns whose data
values overlap (as exact matches) with columns in Region and Community.
Where you see zeroes, the overlap is low, which means that the two sets of
data do not have much in common.
b. Examine the overlap on a critical data element such as LAST_NAME, which
is an important natural key. A high degree of overlap on LAST_NAME is a
good indicator of overlapping customers. In this case, out of 77 last names
in CRM, 34 of them appear in Region, and 35 of them appear in
Community.
c. Confirm your conjecture that CRM has common customers with Region by
clicking the 34 hits to see the Region overlap details. Now you can review
the overlapping names in the Value Overlap Details window.
d. In the Value Overlap Details window, click the data row and then click
Show Hits to see the actual overlapping last names. You can also select
Show Misses to see the names that are not common. The Preview Criteria
window opens. Click OK to close it and open the Matches Data Preview
window. Click Close until you return to the Overlaps view.
Tip: Overlap displays can help you find more critical data elements. For
example, if a column has a very high cardinality and strong overlap, it is likely
to be an important natural key that exists in all sources that contain customer
data. You can use data views and data profiles, including all types of
frequencies, to investigate the nature of these columns. If they are indeed
meaningful, you can mark them up as CDEs on the Overlaps tab or on the
Column Analysis tab.
5. When you have reviewed all overlaps and deleted any incorrect overlaps, select
Project > Save to save the project.
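The value overlap reviewed in step 4 (for example, how many CRM last names also appear in Region) amounts to comparing the distinct values of two columns. The sketch below shows that comparison with made-up names. The value_overlap function is hypothetical, and exact equality of values is assumed.

```python
def value_overlap(source_values, other_values):
    """Compare the distinct values of a source column with another column.

    Assumption: values match only when they are exactly equal.
    """
    source = set(source_values)
    other = set(other_values)
    return {
        "distinct": len(source),            # distinct values in the source column
        "hits": sorted(source & other),     # values found in both columns
        "misses": sorted(source - other),   # source values absent from the other column
    }


crm_last_names = ["Smith", "Jones", "Lee", "Smith", "Patel"]  # made-up data
region_last_names = ["Lee", "Smith", "Garcia"]                # made-up data
overlap = value_overlap(crm_last_names, region_last_names)
print(overlap)
```

The hits and misses lists play the role of the Show Hits and Show Misses previews for this pair of columns.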
Lesson checkpoint
You have used the Overlaps information to help you further understand the
customer information.
This is what you have accomplished to this point in the lessons:
• You have marked CDEs and you might have discovered a few CDEs that you
were not aware of previously.
• You know that the CRM, Community, and Region data sets have overlapping
customer populations.
• You understand the table relationships within each of these data sets.
• You are ready to prototype a canonical customer table.
Creating a unified customer model
Now that you have an inventory of the data sources on hand, you are ready to
prototype a table that contains customer data from all relevant sources. You want
your consolidated table to account for all the critical data elements that you have
marked in previous steps, so that it models critical customer properties.
1. Click the Unified Schema tab.
2. Click the plus (+) symbol in the menu bar under Target Table Navigator to add
a new table.
3. Click on the new table to edit the name, and type ALL_CUSTOMERS. You now
have a target table to model customers, except that it does not yet contain any
columns.
4. Select the new table and then select the Target Table Schema tab. There are
currently no columns in the ALL_CUSTOMERS table.
5. On the right source tree, click on the drop-down next to the CDE header and
click Checked. The source tree now only displays the CDE elements that you
selected earlier.
6. Drag and drop the Table:CRM_BRCH_1A into the empty middle pane. You
have adopted all of the CDEs in table CRM_BRCH_1A into the target model.
You can modify these definitions. For example, change TAX_ID to SSN by
clicking on TAX_ID and typing SSN.
7. Create the source maps. You want to map all three sets of source data to the
new target table, ALL_CUSTOMERS. To do this, create a map for each data set.
a. Click the Source Mapping tab.
b. Select the CRM data source from the drop-down list. The CRM map is
already filled out, because you adopted data elements into the target
schema in a prior lesson. Therefore, the CRM source map is complete.
c. Select the Region data source from the drop-down list.
d. Click Suggest Transformations to display any suggestions that Discovery
might provide.
e. Select all of the suggestions that are good. Then click OK.
f. Click Preview Data to review the results of the mapping.
g. Select the Community data source, click Suggest Transformations, and
map the data source as you did for the Region data source.
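Conceptually, each source map says how to derive every target column from one source. The following sketch is only an illustration of that idea, not of how Discovery stores its maps: the row data is invented, and the dictionary layout and the apply_map helper are assumptions. It uses the TAX_ID to SSN rename from step 6.

```python
# Hypothetical source rows; only TAX_ID, SSN, and LAST_NAME come from the lesson.
crm_rows = [{"TAX_ID": "123-45-6789", "LAST_NAME": "Smith"}]
region_rows = [{"SSN": "987654321", "LAST_NAME": "Lee"}]

# One map per data set: target column -> function that derives it from a source row.
source_maps = {
    "CRM": {"SSN": lambda r: r["TAX_ID"], "LAST_NAME": lambda r: r["LAST_NAME"]},
    "Region": {"SSN": lambda r: r["SSN"], "LAST_NAME": lambda r: r["LAST_NAME"]},
}


def apply_map(source_name, rows):
    """Project source rows into the target schema and record where they came from."""
    mapping = source_maps[source_name]
    return [
        {"SOURCE": source_name, **{col: expr(row) for col, expr in mapping.items()}}
        for row in rows
    ]


# The target table is the union of the mapped rows from every source.
all_customers = apply_map("CRM", crm_rows) + apply_map("Region", region_rows)
print(all_customers)
```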
Unified column analysis
You want to test the combined source maps by using unified column analysis.
Profile all of the maps and the combined results.
1. From the Unified Schema view in your new ALL_CUSTOMERS table, select
Run Next Steps.
2. In the Processing Options window, click Run.
3. Select the Unified Column Analysis tab to refine the source maps.
4. Expand the SSN column to see the unified profile of this target column.
Observe that the data seem to appear in 2 formats.
5. With SSN still selected, click on the drop-down menu that is labeled Value
Frequency and click Pattern Frequency. The frequency result shows that the
combined SSN (from 3 maps) contains 51 social security numbers in a no-dash
format, all of them coming from the Region map. The other two sources
contribute social security numbers in a dashed format. You want to fix this
inconsistency. Click OK to close the Pattern Frequency view.
6. Click the Source Mapping tab.
7. Select Region from the drop-down menu.
8. Click on the SSN row. In the Expression editor in the lower pane type a
transformation instruction that inserts dashes into the social security numbers,
as in the following example:
substr(REGION_BRCH.SSN,1,3) || '-'
|| substr(REGION_BRCH.SSN,4,2) || '-'
|| substr(REGION_BRCH.SSN,6,4)
9. Click Preview Data to verify that the transformation instruction is correct.
10. Click Run Next Steps to run unified column analysis.
11. Ensure that the SSN target column is now mapped consistently by all of the
maps.
12. Click on Preview Data again. Click on LAST_NAME to sort all records on last
name, so that you can see the mixture of records in this view. This table has
all of the customer information in the correct format, but there are duplicates
and discrepancies, which you can analyze in the next lesson.
You have profiled three source maps side by side, and you have used the unified
column statistics and unified pattern frequency to identify problems with the
diverse maps and to bring the data to consistency. You can use the Preview Data
button on the Unified Column Analysis page to see the combined mapping results
in the target format.
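The pattern check and the dash transformation in this lesson can be approximated outside the product. The sketch below is a rough Python equivalent under stated assumptions: the pattern notation (9 for a digit, A for a letter) is a guess at how a pattern frequency might be displayed, the sample values are invented, and the guard in add_dashes is an addition that the substr expression itself does not have.

```python
from collections import Counter


def pattern(value):
    """Reduce a value to its character pattern: digits become 9, letters become A."""
    return "".join(
        "9" if ch.isdigit() else "A" if ch.isalpha() else ch
        for ch in value
    )


def add_dashes(ssn):
    """Mirror the substr expression; the 1-based substr(s, 1, 3) is s[0:3] in Python."""
    if len(ssn) != 9 or not ssn.isdigit():
        return ssn  # leave dashed or malformed values untouched
    return ssn[0:3] + "-" + ssn[3:5] + "-" + ssn[5:9]


# Hypothetical combined SSN column before the Region map is fixed.
ssns = ["123-45-6789", "987654321", "111223333"]

before = Counter(pattern(s) for s in ssns)
after = Counter(pattern(add_dashes(s)) for s in ssns)
print(before)
print(after)
```

When a single pattern remains after the transformation, the target column is mapped consistently, which is the condition that step 11 asks you to confirm.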
Perform match and merge analysis
In this lesson, you determine the best keys to use for matching duplicate
rows, and you analyze the potential data conflicts.
Match and merge analysis is performed on target table schemas.
1. Click the Match and Merge Analysis tab under the Unified Schema tab.
2. Click the plus (+) symbol on the menu bar in the middle section to add a new
matching condition.
3. Highlight the Matching Condition cell, and enter this condition in the window
at the bottom of the view: DM_ROW1.SSN = DM_ROW2.SSN
This matching condition means that, given any two rows in the table,
DM_ROW1 and DM_ROW2, if they have the same SSN value, then they might
represent the same customer. If the condition is true, Discovery adds them to
the same group and tries to merge all of the records in that group into a
single record.
For this lesson, you used a simple matching condition, matching on a single
column, SSN. You could also use a more complex matching condition. For
example:
DM_ROW1.SSN = DM_ROW2.SSN and DM_ROW1.LAST_NAME = DM_ROW2.LAST_NAME
Discovery allows you to take advantage of the power of SQL expressions,
including User Defined Functions. For instance, you can use fuzzy matching
functions that are provided with the product installation, such as
DMCOMPARE_LCS or DMCOMPARE_EDITDST, or any DB2 UDF you created.
4. Click Re-Run Step.
5. In the Processing Options window, click the checkboxes for both Match and
Merge, then click Run.
6. After the run completes, look at the statistics that Discovery produces for the
matching condition. You want to assess the correctness of the matching
condition by using the Groups views.
Now that you have entered and processed a match condition, determine its
accuracy. Does it match what you think it should match? Does it accurately
group all records for the same customer into one group?
To assess the semantics and strength of a matching condition, use the Groups
views to see the following:
• All Groups
• Consistent Groups
• Groups with Discrepancies
• Exclusive Groups
• Groups with Source Duplicates
• Groups without Source Duplicates
7. Select Groups with Discrepancies to view matches with conflicting data. If you
grouped the name John with the name Betty, you might need to reconsider the
matching condition. If the discrepancy level is low and you can explain the
existing discrepancies, then the matching condition is good.
8. In the data view that opens, click on the Conflict Count column header to sort
the values. Find the group with the greatest number of conflicts and select this
group.
9. Click the Details icon (the small binoculars) to see the conflicts in this most
troubled group.
The conflicts are due to non-standard data and other issues that do not invalidate
the matching condition by SSN. Therefore, matching by SSN is a good match key.
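To make the idea of groups and discrepancies concrete, here is a minimal sketch of grouping rows on a single equality key and counting the columns on which the rows in a group disagree. The sample rows, the helper names, and the way conflicts are counted are all assumptions; Discovery's own conflict statistics might be defined differently.

```python
from collections import defaultdict

# Hypothetical rows in the target format, with the source recorded per row.
rows = [
    {"SOURCE": "CRM", "SSN": "123-45-6789", "LAST_NAME": "Smith"},
    {"SOURCE": "Region", "SSN": "123-45-6789", "LAST_NAME": "SMITH"},
    {"SOURCE": "Community", "SSN": "555-11-2222", "LAST_NAME": "Lee"},
    {"SOURCE": "CRM", "SSN": "555-11-2222", "LAST_NAME": "Lee"},
]


def group_by_key(rows, key):
    """Rows with the same key value land in the same group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    return dict(groups)


def conflict_count(group, ignore=("SOURCE",)):
    """Number of columns on which the rows of one group disagree."""
    columns = [c for c in group[0] if c not in ignore]
    return sum(1 for c in columns if len({row[c] for row in group}) > 1)


groups = group_by_key(rows, "SSN")
conflicts = {ssn: conflict_count(group) for ssn, group in groups.items()}
print(conflicts)
```

In this sample, the only conflict is a difference in letter case on LAST_NAME, which is the kind of non-standard data that does not invalidate matching by SSN.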
Create a report to include in your development specifications
Your goal was to prototype the consolidation of three data sources into one
customer data set. You now know the guidelines that should be followed when
developing the Customer Master. You have also identified potential areas for data
cleanup prior to the consolidation. You now need to develop specifications that
will allow you to create the Customer Master data set.
1. Click the Match and Merge Analysis tab under the Unified Schema tab.
2. Click Merge Summary in the menu bar.
3. The Merge Summary view summarizes the results of your prototyping. Export
this information for your developers by clicking the export icon and selecting
Export All Rows.
4. A file browser window opens and you can select a target folder and file name
to save the report. You can save the report in several formats, such as csv, html,
xls, xml, or tsv. Click Save.
You can now share the report with others.
Lesson checkpoint
In this lesson, you have seen that even good matching conditions do not remove
conflicts. When there are conflicts, that is, when different data sets provide
different information about the same customer, you can use the set of facilities
in InfoSphere Discovery that discover a trust index and help you prototype
conflict resolution logistics.
You have now prototyped the consolidation of three data stores into a new
Customer Master.
• You have prototyped a matching condition.
• You have used InfoSphere Discovery’s features to assess the correctness of the
matching condition.
• You have built the following prototype artifacts:
– inventory
– unified model
– source maps, including profiling and testing across maps
– match and merge analysis
To accomplish these tasks you inventoried the data landscape, modeled the new
target schema, determined how to map each source into the new schema,
determined the best keys to use for matching duplicate rows, and analyzed the
potential data conflicts.
Demonstration Project: Archiving tables by defining business objects
Use IBM InfoSphere Discovery to create a complete business object for Optim
archiving.
To deploy an Optim archiving solution successfully, you must archive business
objects, or tables that are related to each other. Correctly identifying business
objects is often a complex task. A typical data set has a large number of tables and
might not have well declared or documented foreign or primary keys. It can be a
challenge to work with such a data set to establish the boundary of tables to be
archived together and maintain the correct relationships between them. IBM
InfoSphere Discovery can help you meet these challenges to create business objects.
After reviewing Optim solutions with IBM sales representatives, Company A
decides to pursue the strategy of archiving orders that are older than two years
from its Customer Information System, also known as CIS. Before Company A can
configure an Optim Access Definition, they need to find out the location and
content of the Orders table and the related tables by using InfoSphere Discovery.
Learning objectives
After completing the lessons in this module you will understand that InfoSphere
Discovery automatically discovers implicit relationships in a large schema and also
clusters tables into business objects. In this module, you complete these tasks:
• Start InfoSphere Discovery.
• Create data sets that contain the tables from which you want to create business
objects.
• Use Discovery to find and review foreign keys.
• Use Discovery to find and review business objects, also called data objects.
Time required
This module should take approximately 60 minutes to complete.
Start InfoSphere Discovery
You can use InfoSphere Discovery to look for meaningful primary and foreign key
relationships. A relationship is meaningful if the primary side has selectivity of one
(or close to one) and the foreign side is 100% (or close to 100%) matching a value
from the primary side.
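The two measures in this definition can be sketched in a few lines. The sketch assumes that selectivity means distinct values divided by row count and that the foreign-side measure is the fraction of foreign rows whose value exists on the primary side; the function names and the sample columns are hypothetical.

```python
def selectivity(values):
    """Distinct values divided by row count; 1.0 means every value is unique."""
    values = list(values)
    return len(set(values)) / len(values) if values else 0.0


def hit_rate(foreign_values, primary_values):
    """Fraction of foreign-side rows whose value appears on the primary side."""
    foreign_values = list(foreign_values)
    primary = set(primary_values)
    if not foreign_values:
        return 0.0
    return sum(1 for v in foreign_values if v in primary) / len(foreign_values)


# Hypothetical columns for illustration only.
customer_ids = [1, 2, 3, 4]           # primary side
order_customer_ids = [1, 1, 2, 4, 9]  # foreign side; 9 has no matching primary value

print(selectivity(customer_ids))
print(hit_rate(order_customer_ids, customer_ids))
```

Here the primary side has a selectivity of one, but only four of the five foreign rows match, so the relationship falls short of a perfect key.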
1. From the Windows Start menu, select Programs>IBM
InfoSphere>Discovery>Discovery Studio. Discovery Studio opens and
automatically connects to the Discovery Server. You see a sample project that
was loaded during installation, Discover_PFKey_DataObject, that you can refer to
during these lessons.
2. Click OK in the Server Connection dialog window.
Create the project and the data sets
Create the data sets that contain tables from which you want to create business
objects.
For this lesson, you work with a source data discovery project. Create
the project Training PFKey_DataObjects.
1. Click the Source Data Discovery tab.
2. Click New Project to create a new project.
3. Type Training PFKey_DataObjects in the Name field.
4. Ensure that the Use Default Staging checkbox is selected to use the default
staging database.
5. Clear the Use Password checkbox so that you do not require a password for
this project.
6. Click OK to create the new project.
7. In the Data Sets window, click Rename to rename the data set from its default
name to CIS.
8. In the Import Objects list, click the Click here to add a new connection link.
You can now connect to one or more relational databases or add text files into
the CIS data set. In this lesson, CIS exists as an Oracle database. You connect
to the CIS database by providing ODBC information to Discovery in the
Create Connection window.
9. In the Create Connection window, complete the following fields using the
values shown:
• Connection Name: CIS
• Database Server Name: localhost
• Database Name: ISD_SRC
• User Name: ISD_ASSETS
• Password: ISD_user1. This is case sensitive.
10. In the Create Connection window, click Test Connection to verify the
connection parameters.
11. Click OK to save the connection.
Tip: You can create more than one connection to the database. With multiple
connections, you can discover relationships between schemas or even between
different databases.
Import CIS tables
After you connect to the database, you need to import tables to prepare for
analysis. You can import all tables from this connection, or selectively import the
tables that you need.
1. From the Data Sets window, right-click the CIS connection in the Database
Connections & Tables view.
2. Click Import Tables/File Formats.
3. In the Import Table Wizard, specify Search Tables.
4. In the Table Name field, type ISD_ASSETS.. If you know that something is
common to all the relevant tables, such as a common prefix or a common user
name, you can search for these tables to import them. In this lesson, you search
for all tables owned by CIS.
5. Click Next. The wizard shows the result of the search for all tables with names
that begin with ISD_ASSETS.
6. Click Select All and then click Finish to import all of the selected tables.
For each table that is imported, Discovery displays both a physical table and a
logical table. The logical tables are identical to the physical tables. The rest of the
analysis in these lessons is performed on logical tables. For these lessons, do not
make changes to logical tables; just remember that they are the same as the actual
tables from the CIS database.
Tip: In Discovery, you can use logical tables to perform more advanced analysis
than the analysis that is used in this scenario.
Defining option sets for your analysis
Before you can analyze the tables that you imported in the previous lesson, you
need to set the options for the analysis.
InfoSphere Discovery performs sophisticated analysis of data to discover
relationships and other data properties. Option sets are a way for you to instruct
Discovery about whatever you know of the data, so that Discovery can deliver
more accurate results.
For example, in the CIS discovery, you are interested in perfect foreign key
relationships, where the primary keys are unique and foreign keys reference the
primary key values 100% of the time. You set the options so that Discovery only
looks for this type of relationship.
You can also specify that you want Discovery to find almost foreign keys, where
the primary key has selectivity of greater than 0.8 and more than 80% of the
foreign key values match the primary key values.
1. Click the Data Sets tab.
2. Click Run Next Steps.
3. In the Processing Options window, click New.
4. In the New Options window, in the Name field, type a name that is
meaningful, such as CIS options.
5. In the Step field, select PF Keys.
6. In the set of options under Generate PFKeys, change the following values to
1.0:
• Min foreign row hit rate to identify column as foreign key.
• Min selectivity value to identify column as primary key.
Figure 1. Options for PF Keys step
7. Click OK to save the options for the PF Keys step.
8. Click Edit in the Options block to modify the CIS options that you just
created.
9. In the Step field, specify Data Objects from the drop-down menu.
10. In the set of options under Generate Data Objects, change the following values
to True:
• Data Object generation includes Reference tables.
• Data Object generation includes attribute tables of reference tables.
Figure 2. Options for Data Objects step
11. Click OK to save the options for the Data Objects step.
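The effect of the two PF Keys thresholds can be pictured as a simple filter over candidate relationships. This is a sketch of the idea only: the class, its field names, and the use of an inclusive comparison are assumptions, and they do not describe how Discovery stores or evaluates option sets.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PFKeyOptions:
    """Hypothetical stand-in for the two PF Keys options changed in step 6."""
    min_selectivity: float = 1.0
    min_foreign_hit_rate: float = 1.0


def accepts(options, selectivity, hit_rate):
    """Keep a candidate only if both measures reach the configured minimums."""
    return (selectivity >= options.min_selectivity
            and hit_rate >= options.min_foreign_hit_rate)


perfect = PFKeyOptions()  # the CIS options: perfect keys only
almost = PFKeyOptions(min_selectivity=0.8, min_foreign_hit_rate=0.8)

# A candidate with a unique primary side and a 95% foreign hit rate.
print(accepts(perfect, 1.0, 0.95))
print(accepts(almost, 1.0, 0.95))
```

Raising both values to 1.0 narrows the result to perfect relationships, while lower values also admit the almost foreign keys that are described at the start of this lesson.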
Analyzing and reviewing discovered relationships
You are now ready to run the analysis to discover relationships and data objects.
With a large schema, the relationships could be numerous and complex. It is
important for an analyst to focus on the purpose of their analysis, in this case the
Orders table and related tables. The PF Keys tab provides several facilities for you
to work with a large schema with a large number of keys. Use these facilities to
review the relationships that matter to your immediate goal of archiving Orders
and Details data.
1. In the Processing Options window, which is still open from the previous
lesson, move the slider to Data Objects.
2. Click Run. After you submit the task, you can use the Activity Viewer to
monitor the progress of the task. When the task indicator on the upper right
corner of the Studio shows No Activity, processing is completed. While the
process is running, the project is locked. You can still browse while the process
is running, but you cannot modify anything. If you click the Activity Viewer
you can see which steps have completed and which steps are currently
running.
3. When the task is complete, select the PF Keys tab.
4. Click the view mode of the selected object, such as Show All PF Keys. There
are several display modes that help you focus on relevant tables in a large
diagram. For this scenario, you are only using some of the basic functions.
5. In the list of Connected Tables on the left, find the Orders table and
double-click it.
a. Review all of the relationships around the Orders table by selecting each
relationship either from the diagram or from the tree view. When you select
a relationship, you see the contents and statistics of that table.
Figure 3. Orders table relationships
b. Optional: If a relationship is not meaningful, delete the relationship by
clicking the X symbol.
c. Optional: To add an additional relationship, click the plus (+) symbol and
enter the join information. Validate the relationship by clicking Validate
Step. Use the statistics to validate the strength of the proposed relationship.
d. Optional: Change a relationship by clicking the down arrow and change the
join information. Click Validate Step to verify the strength of the proposed
change.
e. For any relationships that you determine to be good, approve the
relationship by clicking the Approved checkbox, and optionally, drag the
related tables close together.
Tip: Approving an object, such as a key relationship, ensures that if the
steps are run again, this relationship information is not updated.
This process of approving and dragging the tables closer together allows
you to explore the diagram and to cluster the related tables together for a
better view.
f. Repeat the review steps until you are satisfied with the valid relationships.
6. Review the data objects for completeness. You can add or change a data object
as needed.
Each discovered PF key is presented in the diagram with statistical properties as
well as hit or miss data views. For this scenario, you are looking for perfect keys,
so these facilities are not as useful as when you examine an imperfect key and try
to determine whether it is a real relationship or an accidental one. Discovery can
discover perfect keys as well as almost keys.
Adjusting the data object
InfoSphere Discovery analyzes overall relationships to find business objects.
Depending on how deep and wide you want an Access Definition to be, you might
need to expand or shrink the boundaries of the business objects. Even in those
cases, Discovery provides a critical starting point for you to work with business
objects.
The relationships shown in Figure 3 were discovered automatically by
InfoSphere Discovery. After you review the topology and all of the links between
the tables, you might want to change the data objects. In this lesson, it might make
more sense that SALES be included through its relationships with CUSTOMERS,
instead of with ORDERS.
1. In the Data Object window, select DO_ORDERS. In the diagram, right-click the
SALES object and click Delete.
2. Right-click CUSTOMERS and select Add child table.
3. In the Add Table window, select SALES from the table list. Discovery
automatically inserts the link between the CUSTOMERS table and the SALES
table.
4. Save your work by clicking Project > Save from the main menu bar.
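The adjustment in this lesson amounts to moving one table to a different parent inside the data object. The sketch below models that with a plain dictionary; the starting shape of DO_ORDERS and the move_table helper are hypothetical, because the actual topology is shown only in the product diagram.

```python
# Hypothetical shape of the DO_ORDERS data object: parent table -> child tables.
do_orders = {
    "ORDERS": ["CUSTOMERS", "SALES"],
    "CUSTOMERS": [],
}


def move_table(tree, table, new_parent):
    """Return a copy of the tree with the table detached and placed under new_parent."""
    updated = {
        parent: [child for child in children if child != table]
        for parent, children in tree.items()
    }
    updated.setdefault(new_parent, []).append(table)
    updated.setdefault(table, [])  # make sure the moved table has its own entry
    return updated


adjusted = move_table(do_orders, "SALES", "CUSTOMERS")
print(adjusted)
```

The function returns a new dictionary instead of changing the original, so the discovered topology stays available for comparison.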
Export artifacts
After reviewing the relevant data objects, in this case the Orders data object, you
have all that you need to generate the code to archive orders that are older than
two years from the Customer Information System. You have the access definitions
and the associated objects.
The next challenge is to make the data objects available to Optim Designer so
Optim Designer can generate the code. Use the export to Optim feature in
Discovery to generate a set of artifacts as an XML file.
1. Access the Optim Connector in Discovery by clicking Project > Export > Optim
Database Models from the main menu bar.
2. In the file browser, select a folder, or click Make New Folder and name it
appropriately.
3. Select the new directory and select OK. The export process begins.
4. When you see the export results window, click OK to close that window.
You can examine the XML files when the code generation is complete.
Discovery generates one Physical Data Model (PDM) file for each Discovery data
set, and one Logical Data Model (LDM) file for each data object from Discovery.
Optim Designer reads the generated files and turns them into access definitions.
What you have learned
You created a business object for Optim archiving.
• You created data sets that contain tables that you examined for relationships.
• You identified relationships between tables and clustered tables together into
business objects.
• You discovered and reviewed foreign keys that InfoSphere Discovery found for
you.
• You used InfoSphere Discovery to find and review business objects or data
objects.
• You exported the data objects that were needed to generate the code to archive
orders that are older than two years from the Customer Information System.
Contacting IBM
You can contact IBM for customer support, software services, product information,
and general information. You also can provide feedback to IBM about products
and documentation.
The following table lists resources for customer support, software services, training,
and product and solutions information.
Table 1. IBM resources
Resource: IBM Support Portal
Description and location: You can customize support information by choosing the
products and the topics that interest you at
www.ibm.com/support/entry/portal/Software/Information_Management/InfoSphere_Information_Server

Resource: Software services
Description and location: You can find information about software, IT, and
business consulting services on the solutions site at
www.ibm.com/businesssolutions/

Resource: My IBM
Description and location: You can manage links to IBM Web sites and information
that meet your specific technical support needs by creating an account on the
My IBM site at www.ibm.com/account/

Resource: Training and certification
Description and location: You can learn about technical training and education
services designed for individuals, companies, and public organizations to
acquire, maintain, and optimize their IT skills at
http://www.ibm.com/software/swtraining/

Resource: IBM representatives
Description and location: You can contact an IBM representative to learn about
solutions at www.ibm.com/connect/ibm/us/en/
Providing feedback
The following table describes how to provide feedback to IBM about products and
product documentation.
Table 2. Providing feedback to IBM
Type of feedback: Product feedback
Action: You can provide general product feedback through the Consumability
Survey at www.ibm.com/software/data/info/consumability-survey

Type of feedback: Documentation feedback
Action: To comment on the information center, click the Feedback link on the
top right side of any topic in the information center. You can also send
comments about PDF file books, the information center, or any other
documentation in the following ways:
• Online reader comment form: www.ibm.com/software/data/rcf/
• E-mail: [email protected]
Product accessibility
You can get information about the accessibility status of IBM products.
The IBM InfoSphere Information Server product modules and user interfaces are
not fully accessible. The installation program installs the following product
modules and components:
• IBM InfoSphere Business Glossary
• IBM InfoSphere Business Glossary Anywhere
• IBM InfoSphere DataStage®
• IBM InfoSphere FastTrack
• IBM InfoSphere Information Analyzer
• IBM InfoSphere Information Services Director
• IBM InfoSphere Metadata Workbench
• IBM InfoSphere QualityStage™
For information about the accessibility status of IBM products, see the IBM product
accessibility information at
http://www.ibm.com/able/product_accessibility/index.html.
Accessible documentation
Accessible documentation for InfoSphere Information Server products is provided
in an information center. The information center presents the documentation in
XHTML 1.0 format, which is viewable in most Web browsers. XHTML allows you
to set display preferences in your browser. It also allows you to use screen readers
and other assistive technologies to access the documentation.
For information about the accessibility features of the information center, see
Accessibility and keyboard shortcuts in the information center.
The documentation that is in the information center is also provided in PDF files,
which are not fully accessible.
IBM and accessibility
See the IBM Human Ability and Accessibility Center for more information about
the commitment that IBM has to accessibility:
Accessing product documentation
Documentation is provided in a variety of locations and formats, including in help
that is opened directly from the product client interfaces, in PDF files, and in
HTML files.
Obtaining the documentation
The documentation is distributed with the product, can be accessed from a Web
browser, and is orderable.
• PDF file books are available online and periodically refreshed at
www.ibm.com/support/docview.wss?uid=swg27020315
• You can also order IBM publications in hardcopy format online or through your
local IBM representative. To order publications online, go to the IBM
Publications Center at
http://www.ibm.com/e-business/linkweb/publications/servlet/pbi.wss.
Providing feedback about the documentation
You can send your comments about documentation in the following ways:
• Online reader comment form: www.ibm.com/software/data/rcf/
• E-mail: [email protected]
Links to non-IBM Web sites
This information center may provide links or references to non-IBM Web sites and
resources.
IBM makes no representations, warranties, or other commitments whatsoever
about any non-IBM Web sites or third-party resources (including any Lenovo Web
site) that may be referenced, accessible from, or linked to any IBM site. A link to a
non-IBM Web site does not mean that IBM endorses the content or use of such
Web site or its owner. In addition, IBM is not a party to or responsible for any
transactions you may enter into with third parties, even if you learn of such parties
(or use a link to such parties) from an IBM site. Accordingly, you acknowledge and
agree that IBM is not responsible for the availability of such external sites or
resources, and is not responsible or liable for any content, services, products or
other materials on or available from those sites or resources.
When you access a non-IBM Web site, even one that may contain the IBM-logo,
please understand that it is independent from IBM, and that IBM does not control
the content on that Web site. It is up to you to take precautions to protect yourself
from viruses, worms, trojan horses, and other potentially destructive programs,
and to protect your information as you deem appropriate.
Notices and trademarks
This information was developed for products and services offered in the U.S.A.
Notices
IBM may not offer the products, services, or features discussed in this document in
other countries. Consult your local IBM representative for information on the
products and services currently available in your area. Any reference to an IBM
product, program, or service is not intended to state or imply that only that IBM
product, program, or service may be used. Any functionally equivalent product,
program, or service that does not infringe any IBM intellectual property right may
be used instead. However, it is the user's responsibility to evaluate and verify the
operation of any non-IBM product, program, or service.
IBM may have patents or pending patent applications covering subject matter
described in this document. The furnishing of this document does not grant you
any license to these patents. You can send license inquiries, in writing, to:
IBM Director of Licensing
IBM Corporation
North Castle Drive
Armonk, NY 10504-1785 U.S.A.
For license inquiries regarding double-byte character set (DBCS) information,
contact the IBM Intellectual Property Department in your country or send
inquiries, in writing, to:
Intellectual Property Licensing
Legal and Intellectual Property Law
IBM Japan Ltd.
1623-14, Shimotsuruma, Yamato-shi
Kanagawa 242-8502 Japan
The following paragraph does not apply to the United Kingdom or any other
country where such provisions are inconsistent with local law:
INTERNATIONAL BUSINESS MACHINES CORPORATION PROVIDES THIS
PUBLICATION "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER
EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY OR FITNESS
FOR A PARTICULAR PURPOSE. Some states do not allow disclaimer of express or
implied warranties in certain transactions, therefore, this statement may not apply
to you.
This information could include technical inaccuracies or typographical errors.
Changes are periodically made to the information herein; these changes will be
incorporated in new editions of the publication. IBM may make improvements
and/or changes in the product(s) and/or the program(s) described in this
publication at any time without notice.
Any references in this information to non-IBM Web sites are provided for
convenience only and do not in any manner serve as an endorsement of those Web
sites. The materials at those Web sites are not part of the materials for this IBM
product and use of those Web sites is at your own risk.
IBM may use or distribute any of the information you supply in any way it
believes appropriate without incurring any obligation to you.
Licensees of this program who wish to have information about it for the purpose
of enabling: (i) the exchange of information between independently created
programs and other programs (including this one) and (ii) the mutual use of the
information which has been exchanged, should contact:
IBM Corporation
J46A/G4
555 Bailey Avenue
San Jose, CA 95141-1003 U.S.A.
Such information may be available, subject to appropriate terms and conditions,
including in some cases, payment of a fee.
The licensed program described in this document and all licensed material
available for it are provided by IBM under terms of the IBM Customer Agreement,
IBM International Program License Agreement or any equivalent agreement
between us.
Any performance data contained herein was determined in a controlled
environment. Therefore, the results obtained in other operating environments may
vary significantly. Some measurements may have been made on development-level
systems and there is no guarantee that these measurements will be the same on
generally available systems. Furthermore, some measurements may have been
estimated through extrapolation. Actual results may vary. Users of this document
should verify the applicable data for their specific environment.
Information concerning non-IBM products was obtained from the suppliers of
those products, their published announcements or other publicly available sources.
IBM has not tested those products and cannot confirm the accuracy of
performance, compatibility or any other claims related to non-IBM products.
Questions on the capabilities of non-IBM products should be addressed to the
suppliers of those products.
All statements regarding IBM's future direction or intent are subject to change or
withdrawal without notice, and represent goals and objectives only.
This information is for planning purposes only. The information herein is subject to
change before the products described become available.
This information contains examples of data and reports used in daily business
operations. To illustrate them as completely as possible, the examples include the
names of individuals, companies, brands, and products. All of these names are
fictitious and any similarity to the names and addresses used by an actual business
enterprise is entirely coincidental.
COPYRIGHT LICENSE:
This information contains sample application programs in source language, which
illustrate programming techniques on various operating platforms. You may copy,
modify, and distribute these sample programs in any form without payment to
IBM, for the purposes of developing, using, marketing or distributing application
programs conforming to the application programming interface for the operating
platform for which the sample programs are written. These examples have not
been thoroughly tested under all conditions. IBM, therefore, cannot guarantee or
imply reliability, serviceability, or function of these programs. The sample
programs are provided "AS IS", without warranty of any kind. IBM shall not be
liable for any damages arising out of your use of the sample programs.
Each copy or any portion of these sample programs or any derivative work, must
include a copyright notice as follows:
© (your company name) (year). Portions of this code are derived from IBM Corp.
Sample Programs. © Copyright IBM Corp. _enter the year or years_. All rights
reserved.
If you are viewing this information softcopy, the photographs and color
illustrations may not appear.
Trademarks
IBM, the IBM logo, and ibm.com are trademarks of International Business
Machines Corp., registered in many jurisdictions worldwide. Other product and
service names might be trademarks of IBM or other companies. A current list of
IBM trademarks is available on the Web at www.ibm.com/legal/copytrade.shtml.
The following terms are trademarks or registered trademarks of other companies:
Adobe is a registered trademark of Adobe Systems Incorporated in the United
States, and/or other countries.
IT Infrastructure Library is a registered trademark of the Central Computer and
Telecommunications Agency which is now part of the Office of Government
Commerce.
Intel, Intel logo, Intel Inside, Intel Inside logo, Intel Centrino, Intel Centrino logo,
Celeron, Intel Xeon, Intel SpeedStep, Itanium, and Pentium are trademarks or
registered trademarks of Intel Corporation or its subsidiaries in the United States
and other countries.
Linux is a registered trademark of Linus Torvalds in the United States, other
countries, or both.
Microsoft, Windows, Windows NT, and the Windows logo are trademarks of
Microsoft Corporation in the United States, other countries, or both.
ITIL is a registered trademark, and a registered community trademark of the Office
of Government Commerce, and is registered in the U.S. Patent and Trademark
Office.
UNIX is a registered trademark of The Open Group in the United States and other
countries.
Cell Broadband Engine is a trademark of Sony Computer Entertainment, Inc. in the
United States, other countries, or both and is used under license therefrom.
Java and all Java-based trademarks and logos are trademarks or registered
trademarks of Oracle and/or its affiliates.
The United States Postal Service owns the following trademarks: CASS, CASS
Certified, DPV, LACSLink, ZIP, ZIP + 4, ZIP Code, Post Office, Postal Service, USPS
and United States Postal Service. IBM Corporation is a non-exclusive DPV and
LACSLink licensee of the United States Postal Service.
Other company, product or service names may be trademarks or service marks of
others.
Index
A
archiving business objects 33
B
business objects 33
C
creating a project 6
customer support 43
D
data objects 33
L
legal notices 51
N
non-IBM Web sites
links to 49
P
product accessibility
accessibility 45
product documentation
accessing 47
S
software services 43
support
customer 43
T
trademarks
list of 51
W
Web sites
non-IBM 49
Printed in USA
SC23-9880-04