Open Health Natural Language Processing Consortium (OHNLP)

Mayo Clinic:
Guergana Savova, Ph.D.
James Masanz
IBM Watson Research:
Anni Coden, Ph.D.
Michael Tanenblatt
Overview
• OHNLP? Oh, NLP?
• Demo of a clinical OHNLP system (cTAKES)
• Demo of a medical OHNLP system (MedKAT) with extensions to pathology (/P)
• How can I adapt the system to my data?
• Lively discussion: how can I get involved, OHNLP future steps…
Open Health Natural Language Processing Consortium
• www.ohnlp.org (part of caBIG Vocabulary Knowledge Center web presence)
• Goal
• Foster an open-source collaborative community around clinical NLP that can deliver best-of-breed annotators, leverage the dynamic features of UIMA flow-control, and establish the infrastructure for clinical NLP.
• Two open source releases as part of OHNLP
• Mayo’s pipeline for processing clinical notes (cTAKES)
• IBM’s pipeline for processing medical notes (MedKAT) and pathology reports (MedKAT/P)
Other non-OHNLP clinical NLP Systems
• Proprietary
• MedLEE (Columbia University)
• Topaz (University of Pittsburgh)
• Vanderbilt University
• caTIES (University of Pittsburgh)
• MPLUS/Onyx (University of Utah)
• VA Hospital system
• Open Source
• i2b2 HITEx (Health Information Text Extraction)
Clinical example: clinical Text Analysis and Knowledge Extraction System (cTAKES)
Presenters: Guergana Savova, James Masanz
Overview
• cTAKES
• Developed at Mayo Clinic
• Goals:
• Phenotype extraction
• Generic – to be used for a variety of retrievals and use cases
• Expandable – at the information model level and methods
• Modular
• Cutting edge technologies – best methods combining existing practices and novel research with rapid technology transfer
• Best software practices (80M+ notes)
• Commitment to both R and D in R&D
cTAKES: Components
• Clinical narrative as a sublanguage
• Core components
• Sentence boundary detection (OpenNLP technology)
• Tokenization (rule-based)
• Morphologic normalization (NLM’s LVG)
• POS tagging (OpenNLP technology)
• Shallow parsing (OpenNLP technology)
• Named Entity Recognition
• Dictionary mapping (lookup algorithm)
• Machine learning (MAWUI)
• Negation and context identification (NegEx)
Output Example: Disorder Object
• “No evidence of unstable angina.”
• Disorder
• Text: unstable angina
• Associated code: SNOMED 4557003
• Named entity type: disease/disorder
• Status: current
• Negation: true
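The disorder object listed above can be pictured as a small record. The sketch below is illustrative only: the class name and field names are assumptions chosen to mirror the slide, not the actual cTAKES type system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Disorder:
    """Hypothetical record mirroring the fields shown on the slide."""
    text: str          # covered text in the note
    code_system: str   # terminology the code comes from
    code: str          # concept code within that terminology
    entity_type: str   # named entity type
    status: str        # e.g. "current"
    negated: bool      # True when the mention is negated


# Values copied from the slide's example: "No evidence of unstable angina."
example = Disorder(
    text="unstable angina",
    code_system="SNOMED",
    code="4557003",
    entity_type="disease/disorder",
    status="current",
    negated=True,
)
```

Keeping negation as a field on the mention, rather than dropping negated mentions, lets downstream queries decide how to treat them.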
Methods
• Preliminary results:
• Savova, Guergana; Kipper-Schuler, Karin; Buntrock, James and Chute, Christopher. 2008. UIMA-based clinical information extraction system. LREC 2008: Towards enhanced interoperability for large HLT systems: UIMA for NLP.
• Manuscript with detailed system description and evaluation under review (JAMIA)
cTAKES demo
Medical example: Medical Knowledge Analysis System
MedKAT and MedKAT/P
Presenters: Anni Coden, Michael Tanenblatt
Overview
• MedKAT and MedKAT/P
• Developed at IBM
• Goal:
• Identification of concepts and their attributes based on a standard or proprietary terminology/ontology
• /P adaptation to pathology reports – relation extraction
• Modular, Generic, Expandable
• Terminology, Conceptual Model
• Easy adaptation to specific corpus and conventions
• Integration into institutional system
• Ongoing commitment to Research and Development
Core Components
• Document structure
• Syntactic tools (tokenization ... shallow parsing)
• Concept identification
• Negation
• Relationship extraction
Extracted data      F-score
Anatomic site       0.95
Histology           0.98
Size                1.00
Date                1.00
Grade               0.98
Gross Desc          0.80
Lymph Nodes         0.81
Primary Tumor       0.82
Metastatic Tumor    0.65
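The slide reports F-scores without defining the measure. Assuming the usual balanced F-score (F1, the harmonic mean of precision and recall), a value could be computed from evaluation counts roughly as follows. The function name and its arguments are illustrative, not part of MedKAT.

```python
def f_score(true_pos: int, false_pos: int, false_neg: int) -> float:
    """Balanced F-score (F1): harmonic mean of precision and recall.

    Assumption: the slide's "F-score" is F1; the slide does not say so.
    """
    if true_pos == 0:
        # No correct extractions: precision and recall are both zero.
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)
```

Under this definition a score of 1.00, as listed for Size and Date, requires that nothing was missed and nothing spurious was extracted on the evaluation set.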
Document Structure (figure slides)
Output
Cancer Disease Knowledge Representation Model
Demos
• Query by Model / Cancer
• Detailed view of annotations in Document Analyzer
• http://domino.research.ibm.com/comm/research_projects.nsf/pages/medicalinformatics.index.html
Adaptation
Presenters: Anni Coden, Michael Tanenblatt
Adaptation
• Sentence breaks
• Text case
• Part of speech tags
• Shallow parser
• Dictionary lookup
• Document structure
Sentence Breaks
• Some solutions:
• Use annotator to re-break sentences
• Retrain tagger
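A re-breaking step of the kind listed above might look like the following minimal sketch. The slide's example is not preserved, so the failure mode handled here (a general-purpose splitter breaking after clinical abbreviations) and the abbreviation list are assumptions; in a UIMA pipeline the same logic would sit in an annotator that adjusts sentence annotations.

```python
# Hypothetical abbreviation list; a real one would be derived from the corpus.
ABBREVIATIONS = {"dr.", "pt.", "vs.", "approx."}


def rebreak(sentences: list[str]) -> list[str]:
    """Merge a fragment into the previous one when the previous fragment
    ends in a known abbreviation, i.e. the split was probably wrong."""
    merged: list[str] = []
    for sentence in sentences:
        last_words = merged[-1].split() if merged else []
        if last_words and last_words[-1].lower() in ABBREVIATIONS:
            merged[-1] = merged[-1] + " " + sentence
        else:
            merged.append(sentence)
    return merged
```

For example, `rebreak(["Seen by Dr.", "Smith today."])` joins the two fragments back into one sentence. Retraining the sentence detector on in-domain text addresses the same problem at its source, at the cost of needing annotated data.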
Case/Part of Speech Tags
• Some solutions:
• Retrain tagger
• Use UIMA annotator to create a “true case” view
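The "true case" idea can be sketched as follows, assuming the problem is text typed entirely in capitals, which tends to confuse taggers trained on mixed-case text. The lexicon and function name are hypothetical. The output has the same length as the input, so annotations made on the recased view can be mapped back to the original text by offset.

```python
import re

# Hypothetical lexicon of tokens whose preferred casing is not lowercase.
TRUE_CASE = {"mri": "MRI", "ct": "CT"}


def true_case(text: str) -> str:
    """Return a recased copy of `text` with identical character offsets."""
    def fix(match: re.Match) -> str:
        word = match.group(0).lower()
        # Use the lexicon's casing when available, otherwise lowercase.
        return TRUE_CASE.get(word, word)

    recased = re.sub(r"[A-Za-z']+", fix, text)
    # Capitalize the first character of the text as a crude sentence start.
    return recased[:1].upper() + recased[1:]
```

In UIMA terms the recased string would be stored as a separate view of the same document, leaving the original text untouched.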
Part of Speech Tags
• Some solutions:
• Retrain tagger
• Use dictionary lookup to modify incorrect tags
• Create rule-based annotator to modify incorrect tags
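The last two options above can be combined in one post-processing pass over the tagger's output. This is a minimal sketch: the override table and the single rule are invented for illustration (Penn Treebank tag names are assumed), and a real system would derive both from observed tagging errors.

```python
# Hypothetical dictionary of tokens whose tag should always be overridden.
TAG_OVERRIDES = {"mg": "NN", "ml": "NN"}


def correct_tags(tagged: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Post-correct (token, tag) pairs produced by a general-purpose tagger."""
    corrected = []
    for index, (token, tag) in enumerate(tagged):
        lower = token.lower()
        if lower in TAG_OVERRIDES:
            # Dictionary lookup: a fixed tag for known problem tokens.
            tag = TAG_OVERRIDES[lower]
        elif (
            lower == "left"
            and index + 1 < len(tagged)
            and tagged[index + 1][1].startswith("NN")
        ):
            # Rule: "left" directly before a noun is an adjective
            # ("left ventricle"), not a past-tense verb.
            tag = "JJ"
        corrected.append((token, tag))
    return corrected
```

Post-correction leaves the tagger model untouched, which can be convenient when there is too little annotated in-domain text to retrain it.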
Shallow Parser (figure slides)
Dictionary Lookup
• Dictionary entries can be added, changed, deleted
• Dictionary entry attributes can be added, changed, deleted
• Search parameters can be modified
• Post processing filters
• Tokenization of text and dictionary should be the same
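The last point above is the easiest to get wrong, so here is a minimal sketch of a lookup in which the dictionary terms and the document text pass through the same tokenizer. All function names are hypothetical and the longest-match strategy is one possible choice, not necessarily the one MedKAT or cTAKES uses. The dictionary entry reuses the code from the earlier disorder example.

```python
import re


def tokenize(text: str) -> list[str]:
    """Single tokenizer shared by dictionary terms and document text."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(entries: dict[str, str]) -> dict[tuple[str, ...], str]:
    """Key each dictionary term by its token sequence."""
    return {tuple(tokenize(term)): code for term, code in entries.items()}


def lookup(text: str, index: dict[tuple[str, ...], str], max_len: int = 4):
    """Greedy longest-match lookup over token windows of up to `max_len`."""
    tokens = tokenize(text)
    hits = []
    i = 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            code = index.get(tuple(tokens[i:i + n]))
            if code is not None:
                hits.append((" ".join(tokens[i:i + n]), code))
                i += n
                break
        else:
            # No term starts at this token; move on.
            i += 1
    return hits


index = build_index({"Unstable angina": "4557003"})
matches = lookup("No evidence of unstable angina.", index)
```

If the dictionary were tokenized differently from the text (for instance, keeping hyphens in one and splitting on them in the other), matching terms would silently fail to line up. Here `max_len` plays the role of a search parameter.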
Document Structure
• Plain text or XML (e.g., CDA)
• Processes specific document section types (e.g., diagnosis)
• Detection of formatting (e.g., bullets)
• Detection of relations between sections
• Making implicit conventions explicit (e.g., meaning of title)
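For plain-text reports, section detection could start from something as simple as the sketch below. The heading convention (a capitalized label followed by a colon on its own line) is an assumption about the corpus, and the sample report text is made up for illustration; each institution's conventions would need their own pattern.

```python
import re

# Hypothetical heading convention: "DIAGNOSIS:" alone on a line.
HEADING = re.compile(r"^([A-Z][A-Za-z ]*):\s*$")


def split_sections(report: str) -> dict[str, list[str]]:
    """Group non-empty lines under the most recent section heading."""
    sections: dict[str, list[str]] = {}
    current = None
    for line in report.splitlines():
        stripped = line.strip()
        match = HEADING.match(stripped)
        if match:
            current = match.group(1).strip().lower()
            sections[current] = []
        elif current is not None and stripped:
            sections[current].append(stripped)
    return sections


# Made-up sample text, not taken from any real report.
sample = "DIAGNOSIS:\nAdenocarcinoma of the colon.\n\nGROSS DESCRIPTION:\nSegment of colon, 12 cm.\n"
sections = split_sections(sample)
```

Downstream annotators can then be restricted to selected section types, for example running concept identification only on `sections["diagnosis"]`.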
Discussion: Future of OHNLP.ORG
• Provided seed annotators and tools
• Goal: growing community
• Annotators, tools
• Methodologies
• Gold standards
• Common type system for plug-and-play
• What are the hurdles?
Hands-on Customization
MedKAT
• Dictionary adaptation
• Concept identification parameters
• Document structure detection
cTAKES
• Negation window
• Lookup window
• Dictionary modifications
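To make the "negation window" setting concrete, here is a deliberately simplified, NegEx-inspired sketch. It is not the cTAKES implementation: the trigger list is a tiny hypothetical subset, only triggers before the entity are checked, and the real NegEx algorithm also handles post-entity triggers and scope-terminating terms.

```python
# Tiny hypothetical subset of negation triggers.
NEGATION_TRIGGERS = {"no", "denies", "without"}


def is_negated(tokens: list[str], entity_start: int, window: int = 5) -> bool:
    """True if a trigger occurs within `window` tokens before the entity.

    `entity_start` is the index of the entity's first token.
    """
    begin = max(0, entity_start - window)
    return any(tok.lower() in NEGATION_TRIGGERS for tok in tokens[begin:entity_start])


# Sentence from the earlier output example; the entity starts at token 3.
tokens = ["No", "evidence", "of", "unstable", "angina", "."]
wide = is_negated(tokens, 3, window=5)
narrow = is_negated(tokens, 3, window=2)
```

With the wider window the trigger "No" is in range and the mention is negated; with a window of two tokens it is missed. A window that is too wide has the opposite risk of negating mentions the trigger was never meant to cover, which is why the setting is worth tuning per corpus.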
Questions?