The Challenges of Building Enterprise Content Taxonomies and the Role of Classification Technologies in Maintaining Their Effectiveness
Reginald J. Twigg, Ph.D. ([email protected])
Capture, Classification and Taxonomy, IBM ECM
© 2007 IBM Corporation
Information Management Software | Enterprise Content Management
Agenda
• The Challenge of Unstructured Content
• Key Concepts and Terms
• Taxonomy, Classification and ECM Adoption
• Classification Technologies for ECM
© 2008 IBM Corporation
The Challenge of Managing Unstructured Content
80% of Enterprise Data is Unstructured
Databases
• Billing statements
• Claims images
• Customer correspondence
• Mortgage docs
• Contracts
• Signed BOLs
• Healthcare EOBs
• Marketing collateral
• Website content
• Voice authorizations
• Signature cards
• Credit enrollments
• Material Safety Data Sheets
• ISO 9000 docs
• Plant schematics
• Product images
• Spec sheets
• ...and much more!
What is Enterprise Content?
Where do I start?
Organizing the explosion of unstructured content becomes critical:
• We’ve got 600 GB of content from basic content services all over the enterprise. How can we get this content efficiently mapped into our ECM taxonomy?
• We’ve been managing our content without classifying it for a few years now. How can our users navigate amongst this existing content in a way that’s intuitive for our business?
• The lawyers have to review 400,000 electronic documents for their case. How can we make sure they don’t waste their time?
Business Value of Classification for ECM
Key Business Drivers
1. ECM Taxonomy and Classification: Increase accessibility of content under management
• Automated, High Scale Classification
• Classify at ingestion and/or re-classify over time
• Taxonomy Evolution Tools
• Enhanced Accessibility
• Taxonomy Proposer
2. Compliance, Records, Legal Discovery: Increase legal discovery review effectiveness while reducing risk
• Legal Discovery Prioritization and Workflow Assignment
• Records Classification and Exception Handling
• Storage and Retention Policy Assignment
• Email Supervision and Monitoring
3. In Process Classification: Increase worker productivity and automate content related decisions
• Ad Hoc Category Suggestion
• Content-Based Workflow Selection
• Content Based Decision Making
4. Message Tagging, Classification and Monitoring: Reduce inquiry costs, automate message routing and increase customer satisfaction
• Email, Chat Routing
• Agent Response Suggestion
• Automatic Customer Response
Ability to Structure Content with Databases
[Figure: Percent of corporate information value managed in traditional databases, by application type, from OLTP and BI (narrow scope) to Compliance and Competitive Intelligence (wide scope). Labels: Structured Data, Unstructured Data, Data Creation and Demand. Source: Gartner]
Multiple Repositories Make Access Difficult
1 repository
5%
Don't know
“The Future of Content in the Enterprise,”
Connie Moore and Robert Markham
17%
More than 15
repositories
36%
25%
2-5 repositories
14%
10-15 repositories
4%
6-10 repositories
Base: 81 North American decision-makers
(multiple responses accepted)
And Then There’s SharePoint, File Shares and . . .
Key Concepts and Terms
Key Concepts

Metadata: a means of describing, locating, cataloging, and activating content as objects in a software ecosystem (literally, data about data).

Enterprise Catalog: a centralized and normalized metadata model for unstructured content for the purposes of providing consistent services across all ECM applications.

Taxonomy: a hierarchical structure of information components, any part of which can be used to classify a content item in relation to other items in the structure.

Classification: a coding of content items as members of a group for the purposes of cataloging them or associating them with a taxonomy.
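
The relationship between these terms can be sketched in a few lines of Python. This is only an illustration: the class names (`TaxonomyNode`, `ContentItem`), the `classify` helper and the sample categories are assumptions made for the example and do not describe any ECM product's data model. The taxonomy is a tree of named nodes, metadata is a dictionary attached to a content item, and classification records a taxonomy path in that metadata.

```python
from dataclasses import dataclass, field


@dataclass
class TaxonomyNode:
    """One category in a hierarchical taxonomy."""
    name: str
    children: dict = field(default_factory=dict)

    def add_path(self, path):
        """Create (or reuse) the chain of child nodes named in `path`."""
        node = self
        for part in path:
            node = node.children.setdefault(part, TaxonomyNode(part))
        return node

    def contains(self, path):
        """Return True if `path` names an existing node below this one."""
        node = self
        for part in path:
            if part not in node.children:
                return False
            node = node.children[part]
        return True


@dataclass
class ContentItem:
    """A content object plus its metadata (data about the data)."""
    item_id: str
    metadata: dict = field(default_factory=dict)


def classify(item, taxonomy, path):
    """Record `path` as the item's category, but only if the taxonomy has it."""
    if not taxonomy.contains(path):
        raise ValueError(f"unknown category: {'/'.join(path)}")
    item.metadata["category"] = "/".join(path)
    return item


# Hypothetical two-branch taxonomy and one content item.
root = TaxonomyNode("Enterprise Content")
root.add_path(["Finance", "Billing statements"])
root.add_path(["Legal", "Contracts"])

doc = ContentItem("doc-001", {"title": "Master services agreement"})
classify(doc, root, ["Legal", "Contracts"])
print(doc.metadata["category"])  # Legal/Contracts
```

Rejecting paths that are not in the taxonomy is one simple way to keep classification tied to a single taxonomy of record.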
Taxonomy Is . . .
• Not turning animals into trophies
• A system for organizing the corpus of business content
Taxonomy and Classification in ECM
• Classification Examples:
– Document Classing
– Foldering
• Taxonomy Examples:
– Enterprise Content Catalog
– Industry Standard Document Taxonomies (ISO, XMI)
• Methods:
– Rules-Based: Applies pre-determined rules for ‘if, then’ classification of text and properties
– Analytics-Based: Applies algorithms to interpret classes in order to apply classification rules to them
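
A minimal sketch of the rules-based method, assuming a hypothetical rule list: each rule is an 'if, then' pair of a condition over a document's text and properties and the document class to assign. The rule contents, property names (`amount_due`, `mime_type`) and class names are invented for illustration.

```python
import re

# Hypothetical rules: each pairs a document class with a predicate over
# the document's text and properties. The first matching rule wins.
RULES = [
    ("Contract", lambda text, props: re.search(r"\bagreement\b", text, re.I) is not None),
    ("Invoice", lambda text, props: "invoice" in text.lower() and "amount_due" in props),
    ("Email", lambda text, props: props.get("mime_type") == "message/rfc822"),
]


def classify_by_rules(text, props, rules=RULES, default="Unclassified"):
    """Return the class of the first rule whose condition holds."""
    for doc_class, condition in rules:
        if condition(text, props):
            return doc_class
    return default


print(classify_by_rules("Invoice 1042 for services", {"amount_due": "120.00"}))  # Invoice
print(classify_by_rules("This Agreement is made between the parties", {}))       # Contract
print(classify_by_rules("Quarterly plant schematic update", {}))                 # Unclassified
```

The third call shows the limit of rules alone: content that no rule anticipates falls through to the default class.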
ECM Taxonomy Illustrated
Taxonomy, Classification and ECM Adoption
Drive New Business Value from Content
Improve Content Access
Organize Unstructured Content
Content Classification Solutions
Derive Business Insight
Business Drivers for ECM Taxonomy Management
• Proliferating departmental solutions
– Content Management
– Collaboration (SP, Quickr, Team Rooms, Wikis)
• User-based classification and high workforce turnover
– Productivity declines as knowledge disappears
– Legal discovery is a secondary concern
• Mergers and Acquisitions – need to reconcile disparate content management practices, repositories and processes
Classification is Hard Work
Key Business Challenges
Most organizations face content taxonomy pain – especially as they standardize around ECM
– Mapping content to taxonomy during ingestion
– Reclassifying content under management
– Evolving taxonomies as new types of content emerge
– Integrating folksonomies (SharePoint) into a master taxonomy
1. ECM Taxonomy and Classification: Increase accessibility of content under management
• Automated, High Scale Classification
• Classify at ingestion and/or re-classify over time
• Taxonomy Evolution Tools
• Enhanced Accessibility
• Taxonomy Proposer
Organization is the Root Cause
• Most organizations face content taxonomy barriers – especially as they standardize around ECM
– Assigning categories en masse
– Reclassifying existing content as taxonomies evolve
– Merging taxonomies
– Integrating the wisdom of folksonomies
Challenges and Impacts of Merging Taxonomies
• Misclassification – change is constant, and master taxonomies must manage multiple custom taxonomies for each content source
• “Folksonomies” from departmental collaboration solutions are created by users and unmanaged by ECM standards
• Impact:
– Unreliable Metadata – Inconsistencies lose or mislabel content
– Process Misfires – Poor metadata triggers incorrect events and workflows
Scale is the Challenge – Automation is Essential
Lessons Learned From ERP Adoption
• Getting Classification Right: ‘Garbage in = garbage out’ is often used in metadata management projects to describe the problem of building a metadata model on inconsistent sources.
• Driving Process on Taxonomies: ERP systems depend on three master taxonomies – material, vendor and customer. These taxonomies drive events, workflow definition and the development of transaction-centric business process applications.
• Mastering Metadata: The ability to deploy new enterprise applications depends upon the re-usability, scalability and integrity of the metadata model.
• System of Record is Required for Standardization:
– Establishes an enterprise standard that can be audited
– Forms the foundation for building demonstrable best practices
– Enforces consistency of data capture and output
Customer Lessons for Mastering ECM Taxonomies
• ‘Master’ taxonomy of record required for
– Compliance
– Business process applications
• Merged master taxonomies become large and unwieldy
– Multiple taxonomies require integration and translation
– Centralized, decentralized, or hybrid?
• Intelligent Classification is increasingly used to manage:
– Taxonomy merging from multiple use cases
– Taxonomy/folksonomy translation from distributed content sources
A Look at ECM Classification Technologies
State of Classification Management Technologies
• ECM Classification/Taxonomy is an emerging discipline
– Industry standard taxonomies:
  • Focus on business function or transaction types
  • Have not reached the enterprise level
– Classification best practices:
  • Content ingestion
  • Application development reclassification
• Classification software focuses on content ingestion:
– Electronic content (email, Office documents, free-form text)
– Paper content (document images) requires OCR
• Search is not enough – must drive value in the business process
Criteria For ECM Classification Management Solutions
• Integrate with and support the ECM metadata model
• Interpret a highly-federated content ecosystem
• Go beyond search to catalog and manage content
• Build on advanced analytic technologies – rules alone are not enough
– Interpret content to extract meaningful (meta)data
– Employ multiple methods (engines) for classification
– Integrate teaching/learning
Common Platform for Electronic Content Classification
[Figure: one Classification Platform shared by four uses]
• Email Queue Classification and Monitoring
• Compliance, Records, Legal Discovery
• In Process Classification
• ECM Taxonomy and Classification
IBM Classification Module for Electronic Content
Organize your ECM content
• Automated classification and filtering
• Combines text analytics understanding with rules
• Acquires domain specificity from your own content
• Unique learning technology for adaptive classification
• Suggests new categories or even seeds an entirely new taxonomy
• Rectifies conflicting taxonomies
• Market proven, scalable platform
Understanding Content with Text Analytics
[Figure: Training (Teach) feeds a categorized corpus with categories A, B and C into the Classification Engine. Sample corpus text includes “The core market for this new product has been defined as such by IBM”, “Engineering requires clear requirements” and “Strategy is important to the marketing team”. Matching a new text, “The strategic value of this market is paramount to IBM”, returns a categories list with relevancies (scores): C: 97%, B: 54%, A: 12%. Feedback and Audit are also shown.]
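
The teach, match and feedback loop can be illustrated with a small stand-in classifier. The slide does not say which algorithm the Classification Engine uses, so the sketch below swaps in multinomial naive Bayes with add-one smoothing; the class name `TinyTextClassifier`, its methods and the training sentences are all assumptions. It returns every category with a relevance score, and feedback is simply another call to `teach`.

```python
import math
from collections import Counter, defaultdict


class TinyTextClassifier:
    """Multinomial naive Bayes with add-one smoothing (an assumed stand-in)."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # category -> word -> count
        self.doc_counts = Counter()              # category -> documents seen
        self.vocabulary = set()

    @staticmethod
    def tokenize(text):
        return text.lower().split()

    def teach(self, text, category):
        """Training and feedback are the same operation: add a labeled example."""
        words = self.tokenize(text)
        self.word_counts[category].update(words)
        self.doc_counts[category] += 1
        self.vocabulary.update(words)

    def score(self, text):
        """Return (category, relevance) pairs, highest relevance first.

        Requires at least one prior call to teach().
        """
        words = self.tokenize(text)
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        log_scores = {}
        for category, n_docs in self.doc_counts.items():
            counts = self.word_counts[category]
            denom = sum(counts.values()) + vocab_size
            logp = math.log(n_docs / total_docs)
            for word in words:
                logp += math.log((counts[word] + 1) / denom)
            log_scores[category] = logp
        # Normalize log scores into relevancies that sum to 1.
        top = max(log_scores.values())
        weights = {c: math.exp(s - top) for c, s in log_scores.items()}
        norm = sum(weights.values())
        ranked = [(c, w / norm) for c, w in weights.items()]
        return sorted(ranked, key=lambda pair: pair[1], reverse=True)


clf = TinyTextClassifier()
clf.teach("legal approval required for intellectual property", "A")
clf.teach("engineering requires clear requirements", "B")
clf.teach("strategic value of the core market", "C")

ranked = clf.score("the strategic value of this market")
print(ranked[0][0])  # C

# Feedback: a reviewer corrects the category, and the engine learns from it.
clf.teach("the strategic value of this market", "B")
```

The scores here are normalized to sum to 1, which is a choice made for the sketch; the percentages in the figure need not behave that way.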
Classification Workflow: Accelerating Content Organization
[Figure: sources (File System, Basic Content Services, Existing Unclassified Managed Content) feed the Classifier, which can automatically categorize the majority of content, filter out documents, pass documents to the Classification Review Tool, or send them to the taxonomy proposer.]
Reference: Integration Components
• Classifier (Runtime Application)
• Classification Review (UI)
• Taxonomy Proposer (UI)
• Content Extractor (training based on P8)
Components of the Solution for Text Classification
• Classifier
– Automatically classifies and filters out documents
– Moves some documents for manual review
• Classification Review Tool
– Allows user to manually review documents
• Content Extractor
– Extracts content from the ECM system for training
• Taxonomy Proposer
– User workflow to identify and name new categories or apply existing taxonomy from P8
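
How a classifier might divide documents among these components can be sketched as a routing function over relevance scores. The slides do not give the decision logic, so the thresholds, the `Route` names and the rule that poorly fitting documents go to the taxonomy proposer are all assumptions made for illustration.

```python
from enum import Enum


class Route(Enum):
    AUTO_CLASSIFY = "auto-classify"
    MANUAL_REVIEW = "manual review"
    TAXONOMY_PROPOSER = "taxonomy proposer"
    FILTER_OUT = "filter out"


# Hypothetical thresholds; real values would be tuned per deployment.
AUTO_THRESHOLD = 0.90
REVIEW_THRESHOLD = 0.50


def route_document(scores, is_relevant=True):
    """Pick a destination from a {category: relevance} mapping."""
    if not is_relevant:
        return Route.FILTER_OUT, None
    if not scores:
        return Route.TAXONOMY_PROPOSER, None
    category, relevance = max(scores.items(), key=lambda pair: pair[1])
    if relevance >= AUTO_THRESHOLD:
        return Route.AUTO_CLASSIFY, category
    if relevance >= REVIEW_THRESHOLD:
        return Route.MANUAL_REVIEW, category
    # No existing category fits well: a candidate for a new category.
    return Route.TAXONOMY_PROPOSER, None


print(route_document({"C": 0.97, "B": 0.54, "A": 0.12}))
print(route_document({"B": 0.54, "A": 0.12}))
print(route_document({"A": 0.12}))
```

Raising `AUTO_THRESHOLD` sends more documents to manual review, trading reviewer time for fewer misclassifications.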
Classification for Paper Documents
• Classification of paper documents occurs in capture process
• Use cases for paper document classification
– Recognition using OCR/ICR
– Classification to associate to folders or doc class
– Separation to reduce costs and improve process
Three Primary Types of Images – The Document Recognition Problem
[Figure: image types ordered from less advanced to more advanced recognition: Structured, Semi-Structured, Un-Structured]
The Document Separation Problem in Image Capture
Separation of documents is a significant expense for a high-volume capture system
– Typical ‘structured’ recognition technologies are not applicable
– Manual insertion of separator sheets is the primary workaround today
– 50% of document preparation labor is spent sorting documents and inserting separator pages (source: TAWPI)
Where does one document stop and the next begin?
[Figure: four candidate break points in a page sequence, each marked “Here?”]
Classification Methods for Paper Content (Images)
• Image Classification
– Based on the overall layout and structure of a document
– Includes lines, boxes, logos and placement of text
• Text Classification
– Based on detailed analysis of the text content of a page
• Rules-Based Classification
– Performed by searching for specific data or keywords
– Independent of layout
• Templated Classification
– Determined by the presence of one or more marks, barcodes or items of text in pre-defined locations
Waterfall Approach to Classification and Separation
Two-pass system:
• 1st pass: Classification
– Optimizes performance by using fastest classification techniques first
– Advanced Text Classification is the final “catch-all”
[Figure: pages 1 to 8 run through four classification stages in order of cost: Barcode Recognition (1 ms), Image Classification (20 ms), Rules Based (200 ms), Text Classification (1000 ms). Cell values are “?” (not yet identified), N/A, or a label such as First, Middle or Last page of Form X, Y or Z.]
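
The first pass can be sketched as a chain of classifiers tried from fastest to slowest, where each page stops at the first stage that can label it. The stage functions below are hypothetical stand-ins (a dictionary lookup, a keyword test) rather than real barcode, image or text engines, and the page fields (`barcode`, `layout`, `text`) are invented for the example.

```python
def waterfall_classify(page, stages):
    """Try each (name, classifier) stage in order; stop at the first answer."""
    for name, classifier in stages:
        label = classifier(page)
        if label is not None:
            return label, name
    return None, None


# Hypothetical stand-in classifiers, ordered fastest to slowest. Each returns
# a label, or None when it cannot decide.
def barcode_recognition(page):
    return page.get("barcode")


def image_classification(page):
    return {"grid": "Form Y"}.get(page.get("layout"))


def rules_based(page):
    return "Form Z" if "claim number" in page.get("text", "").lower() else None


def text_classification(page):
    # Catch-all: always returns some label in this sketch.
    return "Form X" if "invoice" in page.get("text", "").lower() else "Unknown"


STAGES = [
    ("barcode", barcode_recognition),
    ("image", image_classification),
    ("rules", rules_based),
    ("text", text_classification),
]

pages = [
    {"barcode": "Form X"},
    {"layout": "grid"},
    {"text": "Claim Number: 12345"},
    {"text": "Invoice total due"},
]
results = [waterfall_classify(p, STAGES) for p in pages]
for label, stage in results:
    print(f"{label} (decided by {stage} stage)")
```

Because most pages exit at an early, cheap stage, the slow text stage runs only on the pages that nothing else could identify.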
Why Invest in Automated Classification?
• Accelerate the time to value in your investment in ECM
• Free up your subject matter experts
• Ensure more accurate content catalogs
• Make your content easier to find and leverage
Summary
1. Accelerate ECM Standardization
Poor content classification undermines ECM value – maximize your ECM potential and time-to-value with automated classification
2. Automating Classification Always Pays
Typical employees spend 10 hours/week searching for information – slash that time and increase productivity
3. Classification Technologies Automate Classification to Drive Development of Best Practices
IBM Classification Module for IBM FileNet P8
Automatically organizing your content by understanding it
Contact Reggie Twigg ([email protected]) for
more information or to arrange a demonstration