Content Classification: How to implement Content Classification - A Technical Overview
Session Number ECA-2079
Josemina Magdalen ([email protected])
Yigal Dayan ([email protected])
Oren Paikowsky ([email protected])

Agenda
• Content Classification Overview
• Content Classification Concepts
• Content Classification Architecture
• Content Classification in ECM

Content is Exploding, Evolving, and Transforming
• The marketplace is driving greater volume, variety, and velocity

Organizations will need to redefine their content strategy
• In order to gain control, optimize business outcomes, improve collaboration, achieve new insight, and govern for reduced cost and risk
• Content in motion
© 2012 IBM Corporation

What does IBM Content Classification do?
Content Classification:
• discovers the intent of a document by analyzing its content
• automatically learns from examples
• allows you to auto-classify huge volumes of documents into pre-trained categories, consistently and efficiently

What is IBM Content Classification used for?
Content Classification is most valuable when:
• A large number of documents need to be categorized
• Documents need to be categorized based on their content
• An action needs to be taken as a result of the classification
• You need to order the chaos and bring structure to unstructured data

What is IBM Content Classification used for? (cont.)
Automatic classification advantages over manual classification:
• Reduces training cost
• Reduces laborious activities
• Makes consistent decisions and reduces errors
• Coherent and legally defensible
• Extremely fast

Why organizations need Content Classification
Through automated, advanced classification, knowledge workers:
• have quick access to relevant content
• have the information they need to complete tasks
• are not burdened with enforcing compliance and retention policies
• can analyze content relevant to specific subject matter
Automated classification allows workers to focus on key business tasks rather than spend time on manual categorization of content. In short, Content Classification improves productivity.

Classification Use Cases
• Email archiving, retention, and management
• Email routing
• Organization of file systems and shared drives
• Categorizing scanned documents
• Business process decisions
• Automatic document tagging
• Medical coding (ICD-10 and others)
• Private vs. public content identification

Content Classification Concepts

Classification Process
1. Train
2. Deploy
3. Auto Classify
4. Use Trained Classification
(Diagram labels: Train using Quick Start Tool, Decision Plan, Classification Server, Classification Application)

Quick Start Tool
1. Manual categorization
2. Train
3. Hints on improving training
4. Apply to real data
5. Report
(Diagram labels: Uncategorized sample data, Training, Test on real data)

IBM Content Classification Quick Start Tool
Demo

Categories can take on different meanings
• Folders in FileNet Content Manager
• Properties in FileNet Content Manager
• Records classes in Enterprise Records
• Item types in IBM Content Manager 8
• Suggested actions in an IBM Content Collector Task Route
• Topics associated with a taxonomy
• Topics associated with document tagging
• Content-centric decisions in BPM applications such as email routing

Classification by Contextual Understanding
Text Analysis, Statistics, and Learning by Example
Input (email):
“Team, We need to determine how to handle the results of the most recent earnings report and how it will impact the reaction on Wall Street. We need to get out in front of this before the press does! Jack, get the status from Engineering ahead of time. Regards, John”
Output:
• PR (92%)
• FINANCE (82%)
• ENGINEERING (32%)
• Intent = PR
(Diagram labels: IBM Content Classification, Knowledge Base, Feedback, Custom & partner applications, IBM pre-built integrations (ECM, ...))

Control the level of Classification automation
• Advanced classification can be executed as an “assistance” to authors in user interfaces
• Semi-automated advanced classification via monitoring
Automation levels, from 100% down to 0%:
• Complete Automation
• Automation with Auditing
• Automation of Medium Confidence and Above
• Automation of High Confidence and Above
• Assisted Manual Classification
Assisted classification is available in user interfaces like SharePoint or, in the future, in IBM’s Office integration.

Data in motion: Periodic human oversight facilitates automatic adjustment of policies
• Content Classification learns from user feedback to improve and adapt policies
(Diagram labels: Category Recommendation, User Interactions, User Feedback, Classification Server)

Content Classification Rules
Decision Plan
A decision plan is a sequence of rules and calls to statistical analysis.
Rule capabilities:
• String search
• Word distance
• Regular expressions
• Pattern extraction
• Boolean expressions
Decision plan capabilities:
• Identify category (in more than one taxonomy)
• Set document metadata
• Invoke statistical analysis
• Language identification
• Recommend actions

Content Classification Architecture

How does Content Classification work?
Content Classification combines multiple categorization technologies to deliver automatic classification:
• Uses contextual analysis based on machine learning techniques
• Uses natural language processing and semantic analysis
• Uses rules-based categorization based on metadata or confidence score
• The methods can be used in tandem or separately, depending on requirements

IBM Content Classification Architecture
Component architecture
• The server is configured and maintained by using the Management Console administration tool
• The server exposes an API in four flavors (C, COM, Java, and SOAP) that enable remote connections
• Knowledge bases and decision plans are configured and maintained by using Classification Workbench
• Customers and business partners who require programmatic access to Content Classification functionality can develop custom applications by using Content Classification server remote APIs
Component architecture/integrations
• Integration with IBM ECM repositories (IBM FileNet Content Manager and IBM Content Manager) supports bulk classification and manual classification of repository content
• Integration with IBM Content Collector enables users to classify emails and documents during archiving/bulk ingestion and take action on the classification results

Content Classification in ECM

IBM Content Classification adds value to IBM ECM
• Email, file system, and SharePoint archiving with IBM Content Collector
• Image-based content classification with Datacap Taskmaster
• Records and retention management with IBM Enterprise Records
• Content classification/reclassification with IBM P8, CM8, and file systems
• Content analysis and insight with IBM Content Analytics
• Enhanced search with IBM Content Analytics with Enterprise Search
• Advanced case management with IBM Case Manager
• Electronic discovery with IBM eDiscovery Analyzer

Classification and Content Navigator
Classification at the point of entry
• Content Classification plug-in for Content Navigator puts users in control of content categorization decisions
• New “Add & Classify” action assists the users with categorization suggestions as content is added to FileNet Content Manager
• Highly relevant category suggestions are returned to the user based on the context of the content being added
• Users may override the suggestion and choose a different category
• System learns from the choices selected and refines suggestions over time

IBM Content Classification Content Navigator plug-in
Demo

Classification and Content Navigator
Classification at the point of entry
• The Classification plug-in adds an Add and Classify Document button to Content Navigator.
• When you click the Add button, the document is added to FileNet Content Manager according to the classification results that are returned by Content Classification.
• The document is added to the appropriate folder in FileNet Content Manager. Its properties are set automatically according to the classification results.
• You can override the classification result and select a different folder or document class.
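
The suggest, override, and learn loop described for "Add and Classify" can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it does not call the Content Classification or Content Navigator APIs, and every name in it (`FeedbackLog`, `rank_suggestions`, `add_and_classify`, the category names and scores) is hypothetical. The scores are assumed to come from an already trained knowledge base.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class FeedbackLog:
    """Hypothetical store of (suggested, chosen) pairs kept for later retraining."""
    entries: List[Tuple[str, str]] = field(default_factory=list)

    def record(self, suggested: str, chosen: str) -> None:
        self.entries.append((suggested, chosen))

    def override_rate(self) -> float:
        """Fraction of documents where the user rejected the top suggestion."""
        if not self.entries:
            return 0.0
        overrides = sum(1 for suggested, chosen in self.entries if suggested != chosen)
        return overrides / len(self.entries)


def rank_suggestions(scores: Dict[str, float], limit: int = 3) -> List[str]:
    """Return up to `limit` category names, highest score first."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [category for category, _ in ranked[:limit]]


def add_and_classify(scores: Dict[str, float], log: FeedbackLog,
                     user_choice: Optional[str] = None) -> str:
    """File a document under the user's choice, or the top suggestion if none is given."""
    suggestions = rank_suggestions(scores)
    if not suggestions:
        raise ValueError("no category suggestions available")
    suggested = suggestions[0]
    chosen = user_choice if user_choice is not None else suggested
    log.record(suggested, chosen)  # every decision, accepted or overridden, is feedback
    return chosen
```

Calling `add_and_classify(scores, log)` accepts the top suggestion, while passing `user_choice="Contracts"` models an override. A rising `override_rate()` would be one possible signal that the knowledge base needs retraining.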
Classification and Content Navigator
Benefits
• The “Add and Classify” plug-in for IBM Content Navigator puts users in control of their categorization decisions
• It guides the user with suggestions based on a trained set of documents and on user feedback
• The integration provides just the right amount of control with just the right amount of user flexibility and independence

Classification and Datacap Integration
Content-based analytics for image capture
• Consistent, appropriate classification of image-based content
• Content Classification provides text analytics and statistical probability
• Ideal scenario for Enterprise Capture
(Diagram labels, comparing classification methods: Good on docs with logos/images; Good for highly varied docs; Highly accurate, labor intensive; Highly effective for invoices, bills of lading; Good for mixed docs, similar layouts, text; Highly accurate, not always possible; Labor intensive)

Datacap Taskmaster Connector to Content Classification
• Taskmaster:
  – Extracts text using OCR (Optical Character Recognition)
  – Calls Content Classification to identify the page
• Content Classification:
  – Analyzes the text content
  – Uses natural language processing and semantic analysis
  – Assigns a confidence score to each category suggestion (0 – 100)
  – Returns the classification results to Taskmaster

How does Taskmaster with Classification work?
• Taskmaster examines each page using multiple methods
  – The fastest methods are executed first: barcode, pattern match, and fingerprint
  – The slower methods that require OCR follow: text analytics and keywords
  – Finally, rules examine the context to determine if any remaining pages can be identified based on the surrounding pages
• The Taskmaster document hierarchy specifies the page types contained in each document
  – Separates and assembles the pages into documents
• The system outputs classification result statistics to support optimization
• Feedback loop improves future results
  – Image fingerprints populated to fingerprint database
  – Text classification trained with feedback to analytics engine
• Exceptions and low-confidence results are reviewed and classified by users

How can it work with my documents?
• The key is to understand your documents
• Use barcodes whenever possible for speed and accuracy
• Documents with image structure work well with fingerprint matching
• Documents with text content work well with Text Analytics if the first page can be distinguished from trailing pages
• Keywords and rules can catch exceptions
• Pages without text, like diagrams and photos, do not support keyword and text analysis
• Combining methods produces the best results
• Taskmaster’s classification is more effective and less labor intensive than traditional methods

Automatic Email Archiving and Records Declaration / Retention
Compliance
Classification combined with collection and records declaration assists companies in achieving compliance with business and legal mandates.
Use IBM Content Classification, IBM Content Collector, and IBM Enterprise Records to:
• Organize content and make records declaration decisions automatically
• Classify records currently in an existing ECM repository: organize in place using the Classification Center
• Classify records during the content collection process through modular tasks in IBM Content Collector
• Invoke Content Classification during the content collection process to decide when and how to invoke records declaration tasks

IBM Content Collector Task Route
Archiving and Records Declaration Based on Classification Results
1. Call the IBM Content Classification task to analyze the emails. Note that attachments must be analyzed as well.
2. Email without business value (“Personal email”) is discarded.
3. Archive email by using the P8 File Document in Folder task that uses the previously defined fields in the classification decision plan (folder_name).
4. Records declaration uses the previously defined fields (file_plan and folder_name) to assign the correct records class to the business email in the P8 Declare Record task.

Benefits
Compliance
Automated, advanced classification helps an organization to:
• Organize content and make records declaration decisions automatically
• Classify records currently in your ECM repository: organize in place using the Classification Center
• Take automated action without burdening your users:
  – set properties
  – extract metadata
  – place in folders
  – declare as a record and place in file plan
• Monitor actions and optimize accuracy on an ongoing basis
With IBM Content Classification:
• Knowledge workers are not burdened with manual enforcement of compliance and retention policies
• Productivity improves: workers can focus on key business tasks rather than spend time on the manual categorization of content

Content Classification for ECM Repositories
• Provides document classification and categorization automation within the content management system
• Classification provides services on documents/emails in P8 or CM8:
  1. Automatic classification
  2. Manual review
• Classification uses statistical methods (Knowledge Base) and rule/keyword-based methods (Decision Plan) to determine document/email classification
(Diagram labels: P8 / CM8, Classification Center, IBM Content Classification)

Available as Sample Code
IBM Content Classification & Microsoft SharePoint: Social Content Integration
• Content Classification is used to classify social content that resides in SharePoint
• Provides accurate and consistent organization of content in collaborative environments
• Content Classification can be used in other SharePoint workflows for supporting a business process

StoredIQ and Automatic Classification (SAC)
• Manage data in place and retain business-critical data
• Don’t move data just to figure out what it is
• Records and Legal Collections need to be managed
• Increase signal-to-noise ratio in data set
• Identify content that has value to the different stakeholders
• Practice good data hygiene
• Dispose of Records past their retention periods
• Dispose of Legal Collections when cases end
• More aggressive disposition of data

StoredIQ Auto-Classification (SAC) Architecture
(Diagram labels: Apply Classification based filters; Import Content Classification pre-built model; DATAIQ/ADMINIQ; Apply specified Classification model against specific InfoSet; GATEWAY SERVER; Send “Apply Model” request to participating DataServers; Apply Model on Archive Platform, ECM, Forensic Images/Tapes, File Servers, Email Servers, Desktops, SharePoint & Enterprise Collaboration, Cloud Media)

The Business Value of Content Classification
Improving worker productivity
Accessibility & Usability
• Process launched based on Classification of content in P8/CM8
  – Process route determined by Classification analysis
• Classification determines the repository location by analyzing context of content
  – Classification provides score or metadata for routing rules
• Classification extracts metadata and populates a process task
  – Classification provides content metadata to populate task information
• Content added during a business process is analyzed and classified
  – Content classified when added during a business process or case
(Diagram labels: P8/CM8, Classification, BPM/ACM)

The Business Value of Content Classification
Accessibility & Usability
Case Management and Business Process
Automatic or Assisted Routing Flow
Customer request is analyzed, auto-routed, and handled:
1. User sends a request
2. Request is received by the Case Manager system
3. Content Classification analyzes text and assigns relevancy scores to categories in the knowledge base
4. Request is forwarded to the department or agent associated with the highest-scoring category
5. Request is handled in Case Manager
6. User is notified that request is handled
7. User actions can be interpreted as feedback to Classification so the percentage of automation will be higher in the future

Case Management and Business Process
Provides content classification for routing decisions as well as ad-hoc classification for in-flight case documents.
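
Steps 3 and 4 of the routing flow, combined with the idea of automating only above a chosen confidence level, can be sketched as follows. This is a minimal illustration, not the Case Manager or Content Classification API: the function `route_request`, the `ROUTES` mapping, the queue names, and the threshold of 80 are all assumptions made for the example. The scores reuse the PR / FINANCE / ENGINEERING figures from the email example earlier in the deck.

```python
from typing import Dict, Tuple

# Hypothetical category-to-queue mapping; a real deployment would define its own.
ROUTES: Dict[str, str] = {
    "PR": "press-office",
    "FINANCE": "finance-desk",
    "ENGINEERING": "engineering-desk",
}


def route_request(scores: Dict[str, float],
                  auto_threshold: float = 80.0,
                  routes: Dict[str, str] = ROUTES) -> Tuple[str, str]:
    """Pick the highest-scoring category and decide whether to route automatically.

    Returns (queue, mode), where mode is "auto" when the top score reaches the
    assumed threshold and "review" when a person should confirm the routing.
    """
    if not scores:
        raise ValueError("classification returned no categories")
    category = max(scores, key=lambda name: scores[name])
    queue = routes.get(category)
    if queue is None:
        # Unknown category: never automate, send to a person.
        return "manual-triage", "review"
    mode = "auto" if scores[category] >= auto_threshold else "review"
    return queue, mode
```

Raising or lowering `auto_threshold` moves the system along the automation scale: a lower value automates medium-confidence results as well, while a higher value leaves more requests for assisted manual routing.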
(Case Management and Business Process diagram labels: Invoke Classification Web service; Decision based on Classification suggested results)

The Business Value of Content Classification
Analytics – Driving business insight from content
• Augment Content Analytics with context-sensitive Classification
• Add categories from Classification as new facets for visual exploration in Content Analytics
• Teach Classification with examples exported from Content Analytics
• Ongoing classification of content that is analyzed by Content Analytics

The Business Value of Content Classification
Analytics – Driving business insight from content
UIMA pipeline
IBM Content Classification is automatically invoked via a UIMA annotator to generate metadata for analyzing each document:
• Classification is based on decision plans
• Documents in the index can be exported from IBM Content Analytics to train or create a new knowledge base in IBM Content Classification
• Classified categories and relevancy scores are stored in the index
Examples of applications that utilize the classification results:
• Automated filtering of documents
• Relevancy ranking and conceptual search based on the relevancy scores of categories
• Help text analysis by classification
The Content Classification UIMA annotator is part of the default IBM Content Analytics pipeline.
(Diagram labels: Content Classification Server, UIMA, Documents, Custom Analytics, Classification, Multi-word Analytics, Named Entity Recognition, Word Analytics, Tokenization, Language Identification, Category A Documents, Category B Documents, Application Index)

The Business Value of Content Classification
Analytics – Document Clustering Data flow
• IBM Content Classification enables categorization of documents in a collection
• Available for text analytics collections only
IBM Content Classification empowers the document clustering data flow as follows:
1. Sample documents in index to detect clusters
2. Detect clusters by mathematical algorithms (LDA/k-means)
3. Train knowledge base with resulting clusters
4. Apply categorization by detected clusters to all documents in the index
(Diagram steps: 0. Crawling and document processing; 1. Sampling; 2. Clustering; 2’. Refining; 3. Training; 4. Categorizing)
(Diagram labels: Collection, Text Index, Indexer Service, Doc Cluster session, Global Processing, Categorization by trained Knowledge Base, Sampling and clustering, Train Knowledge Base, Doc Cluster KB session, Hosting Knowledge Base, Knowledge Base)

Summary
Classification is critical to many ECM objectives:
• Consistent organization of existing content
• Improve worker productivity
• Ongoing information management strategies
• Metadata management and enhancement
• Legal and regulatory compliance
• Analytics
• Resource optimization
• Cost control
(Diagram labels: Images, Spreadsheets, Email, Reports, Documents, Forms, Instant Messages)

Content Classification links
• Content Classification page on the ECM Application Center site
• Content Classification on ibm.com
• Content Classification Putting Your Content in Motion

Classification sessions at IOD 2013
• ECA-1853 (Thu 8:15-9:30am) Usability Sandbox: Content Classification - The Key to Organizing your Content - Breakers CD - Station 1
• ECA-1394 (Once every day - registration required) Usability Sandbox: Auto-Classification using IBM Content Navigator - Breakers CD - Station 1

Please note
IBM’s statements regarding its plans, directions, and intent are subject to change or withdrawal without notice at IBM’s sole discretion.
Information regarding potential future products is intended to outline our general product direction and it should not be relied on in making a purchasing decision.
The information mentioned regarding potential future products is not a commitment, promise, or legal obligation to deliver any material, code or functionality. Information about potential future products may not be incorporated into any contract.
The development, release, and timing of any future features or functionality described for our products remains at our sole discretion.
Performance is based on measurements and projections using standard IBM benchmarks in a controlled environment. The actual throughput or performance that any user will experience will vary depending upon many factors, including considerations such as the amount of multiprogramming in the user’s job stream, the I/O configuration, the storage configuration, and the workload processed. Therefore, no assurance can be given that an individual user will achieve results similar to those stated here.

Acknowledgements and Disclaimers
Availability. References in this presentation to IBM products, programs, or services do not imply that they will be available in all countries in which IBM operates.
The workshops, sessions and materials have been prepared by IBM or the session speakers and reflect their own views. They are provided for informational purposes only, and are neither intended to, nor shall have the effect of being, legal or other guidance or advice to any participant. While efforts were made to verify the completeness and accuracy of the information contained in this presentation, it is provided AS-IS without warranty of any kind, express or implied. IBM shall not be responsible for any damages arising out of the use of, or otherwise related to, this presentation or any other materials. Nothing contained in this presentation is intended to, nor shall have the effect of, creating any warranties or representations from IBM or its suppliers or licensors, or altering the terms and conditions of the applicable license agreement governing the use of IBM software.
All customer examples described are presented as illustrations of how those customers have used IBM products and the results they may have achieved. Actual environmental costs and performance characteristics may vary by customer.
Nothing contained in these materials is intended to, nor shall have the effect of, stating or implying that any activities undertaken by you will result in any specific sales, revenue growth or other results.

© Copyright IBM Corporation 2013. All rights reserved.
• U.S. Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.
IBM, the IBM logo, ibm.com, and IBM Content Classification are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both. If these and other IBM trademarked terms are marked on their first occurrence in this information with a trademark symbol (® or ™), these symbols indicate U.S. registered or common law trademarks owned by IBM at the time this information was published. Such trademarks may also be registered or common law trademarks in other countries. A current list of IBM trademarks is available on the Web at “Copyright and trademark information” at www.ibm.com/legal/copytrade.shtml
Other company, product, or service names may be trademarks or service marks of others.

Thank You
Your feedback is important!
• Access the Conference Agenda Builder to complete your session surveys
  o Any web or mobile browser at http://iod13surveys.com/surveys.html
  o Any Agenda Builder kiosk onsite

Communities
• On-line communities, User Groups, Technical Forums, Blogs, Social networks, and more
  o Find the community that interests you …
    • Information Management bit.ly/InfoMgmtCommunity
    • Business Analytics bit.ly/AnalyticsCommunity
    • Enterprise Content Management bit.ly/ECMCommunity
• IBM Champions
  o Recognizing individuals who have made the most outstanding contributions to Information Management, Business Analytics, and Enterprise Content Management communities
    • ibm.com/champion