Natural Language Processing in the EHR Lifecycle

Insight Driven Health
Cecil O. Lynch, MD, MS
[email protected]
Health & Public Service
Outline
•Medical Data Landscape
•Value Proposition of NLP
•Strategies for voice and text processing
•Tooling options
•Integration with the EMR lifecycle
Copyright © 2010 Accenture All Rights Reserved. Accenture, its logo, and High Performance Delivered are trademarks of Accenture.
Medical Data Landscape
Medical Data – Where is it?
Two Types of Content
1. Structured Content (20%) - Typically found in a database
A. Fits a pre-defined data model
B. Fits well into relational tables
Examples
• Databases
• XML data
• Data warehouses
• Enterprise systems (CRM, ERP, etc.)
[Figure labels: UMLS, RxNorm]
2. Unstructured Content (80%) - Can be found throughout an organization
A. Does not fit a pre-defined data model
B. Does not fit well into relational tables
Examples - Text-based
• Email messages
• Office documents
• Web documents
• BLOB (Binary Large Object) field type (e.g. Transcribed Doctor’s Notes)
Examples - Non-Text-based
• Voice/Audio files (e.g. Dictated Doctor’s Notes)
• Images
• Video files
[Figure label: Medical Charts]
Slide from DataSkill
NLP Value Proposition
Data from IBM study at Seton Healthcare
Case Study 5 – BJC HealthCare
Making healthcare smarter
BJC HealthCare “NLP Results”
Results: Follow-up Appointments and Diagnoses
Element                   Precision   Recall
Alcohol Use               91.8%       96.2%
Alcohol Substance         95%         74%
Alcohol Volume            96.3%       100.0%
Alcohol Duration          86.7%       93.3%
Alcohol Quit Duration     100.0%      96.1%
Alcohol Family History    95.8%       83.3%
Tobacco Use               90.0%       93.0%
Medications               90.0%       92.0%
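Precision and recall are often combined into a single F1 score, the harmonic mean of the two. The slide does not report F1. The sketch below only derives it from the precision and recall figures listed above. The names `RESULTS` and `f1_score` are illustrative and not from the original study.

```python
# Precision/recall pairs copied from the results above, as fractions.
RESULTS = {
    "Alcohol Use": (0.918, 0.962),
    "Alcohol Substance": (0.95, 0.74),
    "Alcohol Volume": (0.963, 1.000),
    "Alcohol Duration": (0.867, 0.933),
    "Alcohol Quit Duration": (1.000, 0.961),
    "Alcohol Family History": (0.958, 0.833),
    "Tobacco Use": (0.900, 0.930),
    "Medications": (0.900, 0.920),
}


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    for element, (p, r) in RESULTS.items():
        print(f"{element:<24} F1 = {f1_score(p, r):.3f}")
```

A single number makes it easier to compare elements such as Alcohol Substance, where high precision hides a much lower recall.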
Strategies for Voice and Text Analytics
Strategic Approach
• Voice recognition to standard EMR UI
• Voice recognition to a standard model
• Voice recognition to unstructured text document
• Content analytics on unstructured documents written to EMR fields
• Content analytics on unstructured documents written to a data warehouse
• Content analytics used at runtime and for predictive analytics and decision support
Is there a limit to Structured Data?
Tooling Options
NLP Pipelines - UIMA
Unstructured Information Management Architecture
• 4 Major Software Divisions
– It specifies component interfaces in an analytics pipeline
– It describes a set of design patterns
– It suggests two data representations: an in-memory representation of annotations for high-performance analytics and an XML representation of annotations for integration with remote web services
– It suggests development roles, allowing tools to be used by users with diverse skills
Is an OASIS Standard
Reference implementation donated by IBM (SourceForge)
Maintained by the Apache Software Foundation
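The ideas above (a common component interface, in-memory annotations, and an XML form for exchange) can be sketched in a few lines. This is a toy illustration, not the UIMA API: UIMA's own in-memory structure is the CAS, and the `Document`, `Annotation`, `token_annotator`, and `to_xml` names below are hypothetical stand-ins.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Annotation:
    """A typed span over the document text (stand-off annotation)."""
    type: str
    begin: int
    end: int


@dataclass
class Document:
    """Toy shared structure: the text plus the annotations added so far."""
    text: str
    annotations: List[Annotation] = field(default_factory=list)

    def covered_text(self, ann: Annotation) -> str:
        return self.text[ann.begin:ann.end]


# Component interface: every annotator receives the document and adds to it.
Annotator = Callable[[Document], None]


def token_annotator(doc: Document) -> None:
    """Hypothetical annotator that marks whitespace-delimited tokens."""
    pos = 0
    for word in doc.text.split():
        begin = doc.text.index(word, pos)
        end = begin + len(word)
        doc.annotations.append(Annotation("Token", begin, end))
        pos = end


def run_pipeline(doc: Document, annotators: List[Annotator]) -> Document:
    """Run each annotator in order over the same in-memory document."""
    for annotate in annotators:
        annotate(doc)
    return doc


def to_xml(doc: Document) -> str:
    """Serialize the in-memory annotations to a simple XML form."""
    root = ET.Element("document")
    ET.SubElement(root, "text").text = doc.text
    for ann in doc.annotations:
        ET.SubElement(root, "annotation", type=ann.type,
                      begin=str(ann.begin), end=str(ann.end))
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    doc = run_pipeline(Document("Patient denies chest pain"), [token_annotator])
    print(to_xml(doc))
```

Because annotations only store offsets into the text, later annotators can add new types without touching what earlier ones produced.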
Tooling
Tooling - Continued
Tooling - Continued
cTAKES
• Clinical Text Analysis and Knowledge Extraction System (Mayo Clinic, Children's Hospital Boston)
– http://sourceforge.net/projects/ohnlp/files/cTAKES/
• Components
– Sentence boundary detector (OpenNLP)
– Rule-based tokenizer to separate punctuation from words
– Normalizer (NLM’s NORM)
– Part-of-speech tagger (OpenNLP)
– Phrasal chunker (OpenNLP)
– Dictionary lookup annotator
– Context annotator
– Negation detector (NegEx)
– Dependency parser
– Module for the identification of patient smoking status
– Drug mention annotator
– Context dependent tokenizer
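To give a feel for how a few of these components chain together, here is a much simplified sketch of sentence splitting, rule-based tokenization, dictionary lookup, and NegEx-style negation detection. It is not cTAKES code. The trigger list, the dictionary, and the five-token window are assumptions chosen for illustration, and the real NegEx trigger list is considerably longer.

```python
import re
from typing import Dict, List

# Illustrative trigger phrases only (assumption, not the NegEx list).
NEGATION_TRIGGERS = ["no", "denies", "without", "negative for"]
# Hypothetical dictionary of terms to look up.
DICTIONARY = ["chest pain", "fever", "cough"]
WINDOW = 5  # assumed scope: tokens inspected before a mention


def split_sentences(text: str) -> List[str]:
    """Naive sentence boundary detection on ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence: str) -> List[str]:
    """Rule-based tokenizer that separates punctuation from words."""
    return re.findall(r"\w+|[^\w\s]", sentence.lower())


def is_negated(tokens: List[str], start: int) -> bool:
    """True if a trigger phrase appears within the WINDOW tokens before start."""
    preceding = tokens[max(0, start - WINDOW):start]
    for trigger in NEGATION_TRIGGERS:
        trig = trigger.split()
        k = len(trig)
        for j in range(len(preceding) - k + 1):
            if preceding[j:j + k] == trig:
                return True
    return False


def find_mentions(text: str) -> List[Dict[str, object]]:
    """Dictionary lookup per sentence, with a negation flag on each mention."""
    mentions: List[Dict[str, object]] = []
    for sentence in split_sentences(text):
        tokens = tokenize(sentence)
        for term in DICTIONARY:
            term_tokens = term.split()
            n = len(term_tokens)
            for i in range(len(tokens) - n + 1):
                if tokens[i:i + n] == term_tokens:
                    mentions.append({"term": term,
                                     "negated": is_negated(tokens, i)})
    return mentions


if __name__ == "__main__":
    note = "Patient denies chest pain. He reports fever and cough."
    for mention in find_mentions(note):
        print(mention)
```

Restricting the negation check to the sentence containing the mention is why sentence boundary detection runs first: "denies" in the first sentence should not negate "fever" in the second.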
cTAKES Derivation
cTAKES
Refined Lucene OWL Code
Annotation
ClearTK
• ClearTK provides a framework for developing statistical natural language processing (NLP) components in Java and is built on top of Apache UIMA.
– http://code.google.com/p/cleartk/ (UCB)
– A common interface and wrappers for popular machine learning libraries such as SVMlight, LIBSVM, OpenNLP MaxEnt, and Mallet.
– A rich feature extraction library that can be used with any of the machine learning classifiers. Under the covers, ClearTK understands each of the native machine learning libraries and translates your features into a format appropriate to whatever model you're using.
– Infrastructure for creating NLP components for specific tasks such as part-of-speech tagging, BIO-style chunking, named entity recognition, semantic role labeling, temporal relation tagging, etc.
– Wrappers for common NLP tools such as the Snowball stemmer, the OpenNLP tools, the MaltParser dependency parser, and the Stanford CoreNLP tools.
– Corpus readers for collections like the Penn Treebank, ACE 2005, CoNLL 2003, Genia, TimeBank and TempEval.
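As a rough picture of what "translating features into a format appropriate to the model" involves, the sketch below turns named features into one line of the SVMlight input format (a label followed by ascending `index:value` pairs). It is not ClearTK's encoder. `FeatureIndexer` and `to_svmlight_line` are hypothetical names, and the feature strings are made up.

```python
from typing import Dict


class FeatureIndexer:
    """Maps string feature names to stable 1-based integer indices."""

    def __init__(self) -> None:
        self._index: Dict[str, int] = {}

    def index_of(self, name: str) -> int:
        # Assign the next free index the first time a feature name is seen.
        if name not in self._index:
            self._index[name] = len(self._index) + 1
        return self._index[name]


def to_svmlight_line(label: int, features: Dict[str, float],
                     indexer: FeatureIndexer) -> str:
    """Encode one instance as '<label> <index>:<value> ...', indices ascending."""
    pairs = sorted((indexer.index_of(name), value)
                   for name, value in features.items())
    body = " ".join(f"{idx}:{value:g}" for idx, value in pairs)
    return f"{label:+d} {body}".rstrip()


if __name__ == "__main__":
    indexer = FeatureIndexer()
    print(to_svmlight_line(+1, {"word=pain": 1.0, "pos=NN": 1.0}, indexer))
    print(to_svmlight_line(-1, {"pos=NN": 1.0, "word=the": 0.5}, indexer))
```

The same indexer has to be reused across training and classification so that a given feature name always maps to the same index.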
EMR Integration Options
Optimal Goal
• Goal is:
– Convert unstructured to structured data
– Code this data into standard Meaningful Use terminologies
– Write the data to standard information models for health care data elements in standard ISO Healthcare datatypes
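The three goals above amount to taking a text mention, attaching a terminology code, and storing the result in a typed structure. The sketch below shows that shape only. The field names are loosely inspired by coded-value datatypes and are assumptions, and the codes and code system name are placeholders, not real terminology content.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class CodedValue:
    """Simplified coded value: a code, its code system, and the source text."""
    code: str
    code_system: str
    display_name: str
    original_text: str


# Hypothetical lookup table; codes are placeholders, not real codes.
EXAMPLE_TERMINOLOGY: Dict[str, Dict[str, str]] = {
    "heart attack": {"code": "EX-0001", "display": "Myocardial infarction"},
    "high blood pressure": {"code": "EX-0002", "display": "Hypertension"},
}
EXAMPLE_CODE_SYSTEM = "EXAMPLE-TERMINOLOGY"  # placeholder system name


def code_mention(mention: str) -> Optional[CodedValue]:
    """Map a free-text mention to a coded value, or None if it is unknown."""
    entry = EXAMPLE_TERMINOLOGY.get(mention.strip().lower())
    if entry is None:
        return None
    return CodedValue(entry["code"], EXAMPLE_CODE_SYSTEM,
                      entry["display"], mention)


if __name__ == "__main__":
    print(code_mention("Heart Attack"))
    print(code_mention("unlisted finding"))
```

Keeping the original text next to the code preserves what the clinician actually wrote, which helps when a mapping later needs review.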
City of Hope – A Proposed Architecture
[Architecture diagram. Labels recovered from the slide: Reporting and Business Intelligence; Content Analytics; Natural Language Processing; Allscripts Healthcare Accelerator; Staging - Triplestore; HL7 RIM V3; Logical Layer; Connection; Physical Layer; Staging - Relational; Allscripts Database; EMR OLTP; EDW and Datamarts; OLAP; Analytics; Predictive Analytics; Statistics; Datamining; RDF; Triplestore Datamart; High Performance Analytics; ETL (label repeated several times)]
• Risk stratification
• Treatment/Protocol evaluations
• Research cohort comparisons
• Real-time clinical decision support
• Disease management
• Population health management
• Personalized medicine / genomics
• Performance assessment
• Patient profiling
• Treatment cost calculations
Tool Examples: SPARQL, OWL, IBM SLRP, IBM IODT, OntoBroker, Sesame, Jena
RDF – Resource Description Framework
OWL – Web Ontology Language
SPARQL – SPARQL Protocol and RDF Query Language
IBM SLRP – IBM Semantic Layer Research Platform
IBM IODT – IBM’s toolkit for ontology-driven development
OntoBroker – Semantic web middleware
Sesame – Framework for querying and analyzing RDF data.
Jena – Semantic Web Framework for Java
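RDF represents data as subject, predicate, object triples, and SPARQL queries work by matching patterns against them. The toy sketch below mimics that idea in plain Python to select a research cohort. It does not use Jena, Sesame, or any real triplestore, and every identifier and triple in it is invented for illustration.

```python
from typing import List, Optional, Tuple

Triple = Tuple[str, str, str]
Pattern = Tuple[Optional[str], Optional[str], Optional[str]]

# Hypothetical triples; all identifiers are made up.
TRIPLES: List[Triple] = [
    ("patient:1", "hasDiagnosis", "dx:diabetes"),
    ("patient:1", "smokingStatus", "status:current"),
    ("patient:2", "hasDiagnosis", "dx:diabetes"),
    ("patient:2", "smokingStatus", "status:never"),
    ("patient:3", "hasDiagnosis", "dx:asthma"),
]


def match(pattern: Pattern, triples: List[Triple] = TRIPLES) -> List[Triple]:
    """Return triples matching the pattern; None acts as a wildcard."""
    return [t for t in triples
            if all(p is None or p == v for p, v in zip(pattern, t))]


def cohort(diagnosis: str, smoking_status: str) -> List[str]:
    """Patients with the diagnosis AND the smoking status (a two-pattern join)."""
    with_dx = {s for s, _, _ in match((None, "hasDiagnosis", diagnosis))}
    with_status = {s for s, _, _ in
                   match((None, "smokingStatus", smoking_status))}
    return sorted(with_dx & with_status)


if __name__ == "__main__":
    print(cohort("dx:diabetes", "status:current"))
```

Because every fact is a triple, a new kind of NLP-derived element (for example an alcohol use finding) can be added as a new predicate without changing a table schema.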
WATSON for Healthcare
[Diagram labels: WEA Advisor Framework; Utilization Management Advisor; Diagnosis and Treatment Advisor; Tools, APIs, Methods; Data Platform; Massively Parallel Infrastructure]
Wrap Up – Questions?
[email protected]
Thank You - Credits
• IBM jStart Team
– Randall Wilcox, Kevin Conroy
• Dataskill
– Victor Bagwell - CIO
• City of Hope
– Naveen Raja, D.O. – CMIO
– Ying Liu, Ph.D. – Bioinformatics Group
• Accenture
– German Acuna
– Suniti Ponkshe
– Jim Traficant