Content Classification: How to implement Content Classification - A Technical Overview

Session Number ECA-2079
Josemina Magdalen ([email protected] )
Yigal Dayan ([email protected] )
Oren Paikowsky ([email protected] )
Agenda

Content Classification Overview

Content Classification Concepts

Content Classification Architecture

Content Classification in ECM
Content is Exploding
Content is Evolving
Content is Transforming
The marketplace is driving greater
volume, variety
and velocity
Organizations will need to redefine
their content strategy
In order to gain control, optimize business
outcomes, improve collaboration, achieve new
insight, and govern for reduced cost and risk
content in motion
© 2012 IBM Corporation
What does IBM Content Classification do?

Content Classification



discovers the intent of a document by analyzing its content
automatically learns from examples
allows you to auto-classify huge volumes of documents into
pre-trained categories, consistently and efficiently
What is IBM Content Classification used for?

Content Classification is most valuable when:

A large number of documents need to be categorized

Documents need to be categorized based on their content

An action needs to be taken as a result of the classification

Unstructured data needs to be ordered and given structure
What is IBM Content Classification used for?
(cont.)

Automatic classification advantages over manual
classification:

Reduces training cost

Reduces laborious activities

Consistent decisions, reduces errors


Coherent and legally defensible
Extremely fast
Why organizations need
Content Classification
Through automated, advanced classification, knowledge workers:
─ have quick access to relevant content
─ have the information they need to complete tasks
─ are not burdened with enforcing compliance and retention policies
─ can analyze content relevant to specific subject matter
Automated classification allows workers to focus on key business
tasks, rather than spend time with manual categorization of content
In short, Content Classification improves productivity
Classification Use Cases

Email archiving, retention, and management

Email routing

Organization of File Systems & Shared Drives

Categorizing Scanned Documents

Business Process decisions

Document Automatic Tagging


Medical Coding (ICD-10 and others)
Private vs. Public content identification
Agenda

Content Classification Overview

Content Classification Concepts

Content Classification Architecture

Content Classification in ECM
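Classification Process
1. Train: train using the Quick Start Tool to produce a Decision Plan
2. Deploy: deploy to the Classification Server
3. Auto Classify: the Classification Application sends documents to the Classification Server for classification
[Diagram: a sample document ("The core market for this new product has been defined as such by IBM") passes from the Classification Application to the Classification Server and comes back classified]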
Quick Start Tool
1. Manual categorization of uncategorized sample data
2. Train (Training)
3. Hints on improving training
4. Apply to real data (test on real data)
5. Report
4. Use Trained
Classification
IBM Content Classification
Quick Start Tool Demo
Categories can take on different meanings

Folders in FileNet Content Manager

Properties in FileNet Content Manager

Records classes in Enterprise Records

Item types in IBM Content Manager 8

Suggested actions in an IBM Content Collector Task Route

Topics associated with a taxonomy

Topics associated with document tagging

Content-centric decisions in BPM applications such as email routing
Classification by Contextual Understanding
Text Analysis, Statistics, and Learning by Example
Input (Email):
“Team, We need to determine how to handle the results of the most recent earnings report and how it will impact the reaction on Wall Street. We need to get out in front of this before the press does! Jack, get the status from Engineering ahead of time. Regards, John”
IBM Content Classification (Knowledge Base)
Output:
PR (92%)
FINANCE (82%)
ENGINEERING (32%)
Intent = PR
Feedback
Consumers of the output: custom & partner applications; IBM pre-built integrations (ECM, ...)
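The example output above can be read as a set of per-category scores from which the top-scoring category is taken as the intent. A minimal sketch of that selection step follows; the function name `pick_intent` and the dictionary shape are illustrative assumptions, not part of the Content Classification API.

```python
from typing import Dict, Tuple


def pick_intent(scores: Dict[str, float]) -> Tuple[str, float]:
    """Return the (category, score) pair with the highest relevancy score."""
    if not scores:
        raise ValueError("no category scores returned")
    category = max(scores, key=scores.get)
    return category, scores[category]


# Scores copied from the slide's example output (percentages).
example_scores = {"PR": 92.0, "FINANCE": 82.0, "ENGINEERING": 32.0}

intent, confidence = pick_intent(example_scores)
print(f"Intent = {intent} ({confidence:.0f}%)")  # prints "Intent = PR (92%)"
```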
Control the level of Classification automation
Advanced classification can be executed as an “assistance” to authors in user interfaces
Semi-automated advanced classification via monitoring
Automation levels, from 0% to 100%:
– Assisted Manual Classification: assisted classification in user interfaces like SharePoint or, in the future, in IBM’s Office integration
– Automation of High Confidence and Above
– Automation of Medium Confidence and Above
– Automation with Auditing
– Complete Automation
Data in motion: Periodic human oversight facilitates automatic adjustment of policies
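One way to picture the automation levels is as a confidence cut-off: results at or above the cut-off are applied automatically, and the rest are shown to a user as suggestions. The sketch below assumes hypothetical threshold values (90 and 70) and hypothetical names; the slides do not state how the product defines "high" or "medium" confidence.

```python
from enum import Enum
from typing import List


class Action(Enum):
    AUTO_CLASSIFY = "auto-classify"
    SUGGEST_TO_USER = "suggest to user"


# Hypothetical cut-offs; the real values are a deployment decision.
HIGH_CONFIDENCE = 90.0
MEDIUM_CONFIDENCE = 70.0


def decide(score: float, auto_threshold: float) -> Action:
    """Automate when the top score reaches the threshold, otherwise assist."""
    if score >= auto_threshold:
        return Action.AUTO_CLASSIFY
    return Action.SUGGEST_TO_USER


def automation_rate(scores: List[float], auto_threshold: float) -> float:
    """Fraction of documents that would be classified without a human."""
    automated = sum(1 for s in scores if decide(s, auto_threshold) is Action.AUTO_CLASSIFY)
    return automated / len(scores)


top_scores = [92.0, 82.0, 32.0, 75.0]
print(automation_rate(top_scores, HIGH_CONFIDENCE))    # 0.25
print(automation_rate(top_scores, MEDIUM_CONFIDENCE))  # 0.75
```

Lowering the threshold raises the share of automated decisions, which is why the deck pairs higher automation with auditing and periodic human oversight.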
Content Classification learns from user feedback to improve and adapt policies
Category
Recommendation
User Interactions
User Feedback
Classification Server
Content Classification Rules
Decision Plan
A decision plan is a sequence of rules and calls to statistical analysis


Rule capabilities:

String search

Word distance

Regular expressions

Pattern extraction

Boolean expressions
Decision plan capabilities:

Identify category (in more
than one taxonomy)

Set document metadata

Invoke statistical analysis

Language identification

Recommend actions
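To make "a sequence of rules and calls to statistical analysis" concrete, here is a minimal sketch of a decision plan as ordered checks with a statistical fallback. It is not the product's rule syntax: the rule contents, the `INV-` pattern, the category names, and `statistical_stub` are all hypothetical, and the stub merely stands in for a call to a trained knowledge base.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Result:
    category: Optional[str] = None
    metadata: Dict[str, str] = field(default_factory=dict)


def words_within(text: str, a: str, b: str, max_gap: int) -> bool:
    """Word-distance rule: True if a and b occur within max_gap words of each other."""
    tokens = re.findall(r"\w+", text.lower())
    pos_a = [i for i, t in enumerate(tokens) if t == a]
    pos_b = [i for i, t in enumerate(tokens) if t == b]
    return any(abs(i - j) <= max_gap for i in pos_a for j in pos_b)


def statistical_stub(text: str) -> str:
    """Placeholder for the statistical analysis call (hypothetical)."""
    return "GENERAL"


def run_plan(text: str) -> Result:
    result = Result()
    # Rule 1: regular expression + pattern extraction sets document metadata.
    match = re.search(r"\bINV-\d{4}\b", text)
    if match:
        result.metadata["invoice_id"] = match.group(0)
        result.category = "INVOICE"
        return result
    # Rule 2: string search AND word distance combined in a Boolean expression.
    if "earnings" in text.lower() and words_within(text, "front", "press", 6):
        result.category = "PR"
        return result
    # No rule fired: fall back to statistical analysis.
    result.category = statistical_stub(text)
    return result


print(run_plan("Please pay INV-1234 today").metadata)  # {'invoice_id': 'INV-1234'}
```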
Agenda

Content Classification Overview

Content Classification Concepts

Content Classification Architecture

Content Classification in ECM
How does Content Classification work?

Content Classification combines multiple methods of categorization
technologies to deliver automatic classification

Uses contextual analysis based on machine learning techniques

Uses natural language processing and semantic analysis

Uses rules-based categorization based on metadata or confidence score

Can be used in tandem or separately depending on requirements
IBM Content Classification Architecture
Component architecture
The server is configured and maintained by using the Management Console administration tool
The server exposes an API in four flavors (C, COM, Java, and SOAP) that enables remote connections
Knowledge bases and decision plans are configured and maintained by using Classification Workbench
Content Classification Architecture
Component architecture
Customers and business partners who
require programmatic access to
Content Classification functionality can
develop custom applications by using
Content Classification server remote
APIs
Content Classification Architecture
Component architecture/integrations
Integration with IBM ECM
repositories (IBM FileNet
Content Manager and IBM
Content Manager) supports
bulk classification and
manual classification of
repository content
Content Classification Architecture
Component architecture
Integration with IBM
Content Collector
enables users to
classify emails and
documents during
archiving/bulk
ingestion and take
action on the
classification results
Agenda

Content Classification Overview

Content Classification Concepts

Content Classification Architecture

Content Classification in ECM
IBM Content Classification adds value to IBM
ECM

Email, File System and SharePoint archiving with IBM Content Collector

Image-Based content classification with Datacap Taskmaster

Records and Retention Management with IBM Enterprise Records

Content Classification/reclassification with IBM P8, CM8 and File Systems

Content Analysis and Insight with IBM Content Analytics

Enhanced Search with IBM Content Analytics with Enterprise Search

Advanced Case Management with IBM Case Manager

Electronic Discovery with IBM eDiscovery Analyzer
Classification and Content Navigator
Classification at the point of entry
• Content Classification plug-in for Content Navigator puts users
in control of content categorization decisions
• New “Add & Classify” action assists the users with
categorization suggestions as content is added to FileNet
Content Manager
• Highly relevant category suggestions are returned to the user
based on the context of the content being added
• Users may override the suggestion and choose a different
category
• The system learns from the choices users select and refines its suggestions
over time
IBM Content Classification
Content Navigator plug-in Demo
Classification and Content Navigator
Classification at the point of entry
The Classification plug-in adds an Add and Classify
Document button to Content Navigator.
Classification and Content Navigator
Classification at the point of entry
When you click the Add button, the document is added to FileNet
Content Manager according to the classification results that are
returned by Content Classification.
Classification and Content Navigator
Classification at the point of entry
The document is added to the
appropriate folder in FileNet
Content Manager.
These properties were
automatically set according to the
classification results.
Classification and Content Navigator
Classification at the point of entry
You can override the classification result and
select a different folder or document class.
Classification and Content Navigator
Benefits
• The “Add and Classify” plug-in for IBM Content Navigator
puts users in control of their categorization decisions
• It guides the user with suggestions based on a trained set of
documents and based on user feedback
• The integration provides just the right amount of control with
just the right amount of user flexibility and independence
Classification and Datacap Integration
Content-based analytics for image capture
Consistent, appropriate classification of image-based content.
[Figure: comparison of page-identification methods for image capture. Recoverable labels: good on docs with logos/images; good for highly varied docs; highly accurate, labor intensive; highly effective for invoices, bills of lading; good for mixed docs, similar layouts, text; highly accurate, not always possible; labor intensive]
Content Classification provides text analytics and statistical probability
Ideal scenario for Enterprise Capture
Datacap Taskmaster Connector to Content
Classification


Taskmaster
– Extracts text using OCR (Optical Character Recognition)
– Calls Content Classification to identify the page
Content Classification analyzes the text content
– Uses natural language processing and semantic analysis
– Assigns a confidence score to each category suggestion (0 – 100)
– Returns the classification results to Taskmaster
How does Taskmaster with Classification work?


Taskmaster examines each page using multiple methods
– The fastest methods are executed first: barcode, pattern match, and fingerprint
– The slower methods that require OCR follow: text analytics and keywords
– Finally, rules examine the context to determine whether any remaining pages can be identified based on the surrounding pages
The Taskmaster document hierarchy specifies the page types contained in each document
– Separates and assembles the pages into documents
The system outputs classification result statistics to support optimization
Feedback loop improves future results
– Image fingerprints populated to fingerprint database
– Text classification trained with feedback to analytics engine
Exceptions and low-confidence results are reviewed and classified by users
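The ordering described above (fast methods first, OCR-based methods next, context rules last, and users for whatever remains) can be sketched as a simple cascade. This is an illustration only: the page dictionaries, the method stubs, and the "same type on both sides" context rule are assumptions, not Taskmaster's actual rules.

```python
from typing import Callable, Dict, List, Optional, Tuple

Page = Dict[str, str]  # hypothetical page representation


def by_barcode(page: Page) -> Optional[str]:
    return page.get("barcode_type")


def by_fingerprint(page: Page) -> Optional[str]:
    return page.get("fingerprint_type")


def by_text_analytics(page: Page) -> Optional[str]:
    # Stub standing in for OCR followed by a Content Classification call.
    text = page.get("ocr_text", "")
    return "invoice" if "invoice" in text.lower() else None


# Fastest methods first, slower OCR-based methods later.
METHODS: List[Tuple[str, Callable[[Page], Optional[str]]]] = [
    ("barcode", by_barcode),
    ("fingerprint", by_fingerprint),
    ("text analytics", by_text_analytics),
]


def identify(page: Page) -> Optional[str]:
    """Return the page type from the first method that recognizes the page."""
    for _name, method in METHODS:
        page_type = method(page)
        if page_type is not None:
            return page_type
    return None


def fill_from_context(types: List[Optional[str]]) -> List[Optional[str]]:
    """Context rule (assumed): a gap between two pages of the same type takes that type."""
    filled = list(types)
    for i in range(1, len(filled) - 1):
        if filled[i] is None and filled[i - 1] is not None and filled[i - 1] == filled[i + 1]:
            filled[i] = filled[i - 1]
    return filled


pages = [{"barcode_type": "invoice"}, {}, {"ocr_text": "Invoice total due"}, {}]
types = fill_from_context([identify(p) for p in pages])
needs_review = [i for i, t in enumerate(types) if t is None]
print(types, needs_review)  # ['invoice', 'invoice', 'invoice', None] [3]
```

Pages left as `None` correspond to the exceptions that the slide says are reviewed and classified by users.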
How can it work with my documents?

The key is to understand your documents

Use barcodes whenever possible for speed and accuracy

Documents with image structure work well with fingerprint matching





Documents with text content work well with Text Analytics if the first
page can be distinguished from trailing pages
Keywords and rules can catch exceptions
Pages without text, such as diagrams and photos, do not support keyword
and text analysis
Combining methods produces the best results
Taskmaster’s classification is more effective and less labor
intensive than traditional methods
Automatic Email Archiving and Records
Declaration / Retention
Compliance
Classification combined with collection and records declaration
assists companies in achieving compliance with business and legal mandates
Use IBM Content Classification, IBM Content Collector, and IBM
Enterprise Records to:

Organize content and make records declaration decisions
automatically

Classify records currently in an existing ECM repository: organize in
place using the Classification Center

Classify records during content collection process through modular
tasks in IBM Content Collector

Invoke Content Classification during the content collection process to
decide when and how to invoke records declaration tasks
IBM Content Collector Task Route
Archiving and Records Declaration Based on
Classification Results
1. Call the IBM Content Classification task to analyze
the emails. Note that attachments must be analyzed as well.
2. Email without business value (“Personal email”) is discarded.
3. Archive email by using the P8 File Document in Folder
task that uses the previously defined fields in the
classification decision plan (folder_name).
4. Records declaration uses the previously defined fields
(file_plan and folder_name) to assign the correct
records class to the business email in the P8 Declare
Record task.
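The four task route steps above can be sketched as plain control flow. The field names `folder_name` and `file_plan` and the task names come from the slide; everything else here (the `Email` type, the keyword test in `classify`, the folder and file plan values) is a hypothetical stand-in for the real Content Classification task and decision plan.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Email:
    subject: str
    body: str
    attachments: List[str] = field(default_factory=list)  # extracted attachment text


def classify(email: Email) -> Dict[str, str]:
    """Step 1 stub: analyze the email together with its attachments."""
    text = " ".join([email.subject, email.body, *email.attachments]).lower()
    if "contract" in text:
        # Hypothetical values that a decision plan might set.
        return {"category": "Business", "folder_name": "/Contracts", "file_plan": "Legal"}
    return {"category": "Personal email"}


def route(email: Email) -> List[str]:
    fields = classify(email)
    # Step 2: email without business value is discarded.
    if fields["category"] == "Personal email":
        return ["discard"]
    return [
        # Step 3: archive using the folder chosen by the decision plan.
        f"P8 File Document in Folder: {fields['folder_name']}",
        # Step 4: declare a record using file_plan and folder_name.
        f"P8 Declare Record: {fields['file_plan']} {fields['folder_name']}",
    ]


print(route(Email("Lunch?", "See you at noon")))
print(route(Email("Docs", "Please review", ["Signed contract for services"])))
```

Note that the second email is recognized only through its attachment text, which is why the slide stresses that attachments must be analyzed as well.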
Benefits
Compliance
Automated, advanced classification helps an organization to:




Organize content and make records declaration decisions automatically
Classify records currently in your ECM repository: organize in place using the
Classification Center
Take automated action without burdening your users:

set properties

extract metadata

place in folders

declare as a record and place in file plan
Monitor actions and optimize accuracy on an ongoing basis
With IBM Content Classification:


Knowledge workers are not burdened with manual enforcement of compliance and
retention policies
Productivity improves: workers can focus on key business tasks, rather than spend
time on the manual categorization of content
Content Classification for ECM Repositories
Provides document classification and categorization automation
within content management system
Classification provides services on documents/emails in P8 or CM8:
1. Automatic classification
2. Manual review
Classification uses statistical methods (Knowledge Base) and rule/keyword-based methods (Decision Plan) to determine document/email classification
[Diagram: P8 / CM8, Classification Center, IBM Content Classification]
Available as Sample Code
IBM Content Classification & Microsoft SharePoint:
Social Content Integration
• Content Classification used to classify social content that resides in SharePoint
• Provides accurate and consistent organization of content in collaborative environments
• Content Classification can be used in other SharePoint workflows for supporting a business
process
StoredIQ and Automatic Classification (SAC)
Manage Data-in-place and Retain Business Critical Data
• Don’t move data just to figure out what it is
• Records and Legal Collections need to be managed
• Increase signal to noise ratio in data set
• Identify content that has value to the different stakeholders
Practice Good Data Hygiene
• Dispose of Records past their retention periods
• Dispose of Legal Collections when cases end
• More aggressive disposition of data
StoredIQ Auto-Classification (SAC) Architecture
[Architecture diagram]
DATAIQ/ADMINIQ: import the Content Classification pre-built model; apply the specified Classification model against a specific InfoSet; apply Classification-based filters
GATEWAY SERVER: send “Apply Model” request to participating DataServers
DataServers apply the model to: Archive Platform, ECM, Forensic Images/Tapes, File Servers, Email Servers, Desktops, SharePoint & Enterprise Collaboration, Cloud, Media
The Business Value of Content Classification
Improving worker productivity
Accessibility & Usability
Process launched based on Classification of content in P8/CM8
Classification determines the repository location by analyzing context of content
Process route determined by Classification analysis
Classification provides score or metadata for routing rules
[Diagram labels: Classification, BPM/ACM, P8/CM8]
The Business Value of Content Classification
Improving worker productivity
Accessibility & Usability
Classification extracts metadata and populates a process task
Classification provides content metadata to populate task information
Content added during a business process is analyzed and classified
Content classified when added during a business process or case
[Diagram labels: Classification, BPM/ACM, P8/CM8]
The Business Value of Content Classification
Accessibility & Usability
Case Management and Business Process
Automatic or Assisted Routing Flow
Customer request is analyzed, auto-routed, and handled
1. User sends a request
2. Request is received by the Case Manager system
3. Content Classification analyzes text, assigns relevancy scores to categories in the knowledge base
4. Request is forwarded to the department or agent associated with the highest-scoring category
5. Request is handled in Case Manager
6. User is notified that request is handled
7. User actions can be interpreted as feedback to Classification so the percentage of automation will be higher in the future
Case Management and Business Process
Provides content classification for routing decisions as well as
ad-hoc classification for in-flight case documents.
Invoke Classification Web service
Decision based on Classification suggested results
Accessibility & Usability
The Business Value of Content Classification
Analytics – Driving business insight from content
• Augment Content Analytics with context-sensitive Classification
• Add categories from Classification as new facets for visual exploration in Content Analytics
• Teach Classification with examples exported from Content Analytics
• Ongoing classification of content that is analyzed by Content Analytics
The Business Value of Content Classification
Analytics – Driving business insight from content
UIMA pipeline
IBM Content Classification is automatically invoked via a UIMA annotator to generate metadata for analyzing each document:
– Classification is based on decision plans
– Documents in the index can be exported from IBM Content Analytics to train or create a new knowledge base in IBM Content Classification
– Classified categories and relevancy scores are stored in the index
Examples of applications that utilize the classification results:
– Automated filtering of documents
– Relevancy ranking and conceptual search based on the relevancy scores of categories
– Help text analysis by classification
The Content Classification UIMA annotator is part of the default IBM Content Analytics pipeline
[Diagram: Documents enter the UIMA pipeline (Language Identification, Tokenization, Word Analytics, Multi-word Analytics, Named Entity Recognition, Classification, Custom Analytics) and are written to the Index; the Classification annotator calls the Content Classification Server; an Application uses the Index to separate Category A Documents from Category B Documents]
The Business Value of Content Classification
Analytics – Document Clustering Data flow
• IBM Content Classification enables categorization of documents in a collection
• Available for text analytics collections only
• IBM Content Classification empowers the document clustering data flow as follows:
1. Sample documents in index to detect clusters
2. Detect clusters by mathematical algorithms (LDA/k-means)
3. Train knowledge base with resulting clusters
4. Apply categorization by detected clusters to all documents in the index
[Diagram: within a Collection, step 0 (crawling and document processing) fills the Text Index. The Doc Cluster session in the Indexer Service performs 1. Sampling and 2. Clustering, with an optional 2’. Refining. The Doc Cluster KB session performs 3. Training and hosts the Knowledge Base. Global Processing performs 4. Categorizing by the trained Knowledge Base]
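The four-step data flow (sample, cluster, train, categorize) can be sketched with a tiny k-means over word counts. This is a toy under stated assumptions: the documents are invented, k-means is used rather than LDA, initial centroids are simply the first k samples, and "training a knowledge base" is replaced by keeping the cluster centroids and assigning each document to the nearest one, which is not how the product's knowledge base works.

```python
from collections import Counter
from typing import Dict, List

Vector = Dict[str, float]


def vectorize(text: str) -> Vector:
    """Bag-of-words counts for one document."""
    return dict(Counter(text.lower().split()))


def distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two sparse vectors."""
    return sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in set(a) | set(b))


def mean(vectors: List[Vector]) -> Vector:
    total: Counter = Counter()
    for v in vectors:
        total.update(v)
    return {k: c / len(vectors) for k, c in total.items()}


def nearest(v: Vector, centroids: List[Vector]) -> int:
    return min(range(len(centroids)), key=lambda i: distance(v, centroids[i]))


def kmeans(vectors: List[Vector], k: int, iterations: int = 10) -> List[Vector]:
    centroids = vectors[:k]  # deterministic init: first k samples (an assumption)
    for _ in range(iterations):
        groups: List[List[Vector]] = [[] for _ in range(k)]
        for v in vectors:
            groups[nearest(v, centroids)].append(v)
        centroids = [mean(g) if g else centroids[i] for i, g in enumerate(groups)]
    return centroids


docs = [
    "invoice payment due", "server outage report",
    "invoice payment overdue", "server outage resolved",
    "payment invoice reminder", "outage server restart",
]
sample = [vectorize(d) for d in docs[:4]]                  # 1. sample documents
centroids = kmeans(sample, k=2)                            # 2. detect clusters
model = centroids                                          # 3. stand-in for the trained knowledge base
labels = [nearest(vectorize(d), model) for d in docs]      # 4. categorize all documents
print(labels)  # [0, 1, 0, 1, 0, 1]
```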
Summary

Classification is critical to many ECM objectives:
– Consistent organization of existing content
– Improve worker productivity
– Ongoing information management strategies
– Metadata management and enhancement
– Legal and regulatory compliance
– Analytics
– Resource optimization
– Cost control
[Figure labels, content types: Images, Spreadsheets, Email, Reports, Documents, Forms, Instant Messages]
Content Classification links


Content Classification page on the ECM Application
Center site
Content Classification on ibm.com
Content Classification
Putting Your Content in Motion
Classification sessions
at IOD 2013
ECA-1853 (Thu 8:15-9:30am)
Usability Sandbox: Content Classification - The Key to Organizing
your Content Breakers CD - Station 1
ECA-1394 (Once every day - registration required)
Usability Sandbox: Auto-Classification using IBM Content Navigator
- Breakers CD - Station 1
Please note
IBM’s statements regarding its plans, directions, and intent are subject to
change or withdrawal without notice at IBM’s sole discretion.
Information regarding potential future products is intended to outline our general
product direction and it should not be relied on in making a purchasing decision.
The information mentioned regarding potential future products is not a
commitment, promise, or legal obligation to deliver any material, code or
functionality. Information about potential future products may not be
incorporated into any contract. The development, release, and timing of any
future features or functionality described for our products remains at our sole
discretion.
Performance is based on measurements and projections using standard IBM
benchmarks in a controlled environment. The actual throughput or performance
that any user will experience will vary depending upon many factors, including
considerations such as the amount of multiprogramming in the user’s job
stream, the I/O configuration, the storage configuration, and the workload
processed. Therefore, no assurance can be given that an individual user will
achieve results similar to those stated here.
Acknowledgements and Disclaimers
Availability. References in this presentation to IBM products, programs, or services do not imply that they will be available in all countries in
which IBM operates.
The workshops, sessions and materials have been prepared by IBM or the session speakers and reflect their own views. They are provided for
informational purposes only, and are neither intended to, nor shall have the effect of being, legal or other guidance or advice to any participant.
While efforts were made to verify the completeness and accuracy of the information contained in this presentation, it is provided AS-IS without
warranty of any kind, express or implied. IBM shall not be responsible for any damages arising out of the use of, or otherwise related to, this
presentation or any other materials. Nothing contained in this presentation is intended to, nor shall have the effect of, creating any warranties or
representations from IBM or its suppliers or licensors, or altering the terms and conditions of the applicable license agreement governing the use
of IBM software.
All customer examples described are presented as illustrations of how those customers have used IBM products and the results they may have
achieved. Actual environmental costs and performance characteristics may vary by customer. Nothing contained in these materials is intended
to, nor shall have the effect of, stating or implying that any activities undertaken by you will result in any specific sales, revenue growth or other
results.
© Copyright IBM Corporation 2013. All rights reserved.
•U.S. Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with
IBM Corp.
IBM, the IBM logo, ibm.com, and IBM Content Classification are trademarks or registered trademarks of International Business
Machines Corporation in the United States, other countries, or both. If these and other IBM trademarked terms are marked on their
first occurrence in this information with a trademark symbol (® or ™), these symbols indicate U.S. registered or common law
trademarks owned by IBM at the time this information was published. Such trademarks may also be registered or common law
trademarks in other countries. A current list of IBM trademarks is available on the Web at “Copyright and trademark information” at
www.ibm.com/legal/copytrade.shtml
Other company, product, or service names may be trademarks or service marks of others.
Thank You
Your feedback is important!
• Access the Conference Agenda Builder to
complete your session surveys
o Any web or mobile browser at
http://iod13surveys.com/surveys.html
o Any Agenda Builder kiosk onsite
Communities
• On-line communities, User Groups, Technical Forums, Blogs, Social
networks, and more
o Find the community that interests you …
• Information Management bit.ly/InfoMgmtCommunity
• Business Analytics bit.ly/AnalyticsCommunity
• Enterprise Content Management bit.ly/ECMCommunity
• IBM Champions
o Recognizing individuals who have made the most outstanding contributions to
Information Management, Business Analytics, and Enterprise Content
Management communities
o ibm.com/champion