Building the World’s Largest Database of Car Features from PDFs

John Akred, Chief Technology Officer, Silicon Valley Data Science
Robert Munro, Chief Executive Officer, Idibon

JOHN AKRED
@BigDataAnalysis
Founder & CTO, Silicon Valley Data Science
Consulting firm of elite data science and engineering teams who specialize in data-driven product development and business transformation.

© 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.

ROBERT MUNRO
@WWRob
Founder & CEO, Idibon
Developers of cloud-based natural language processing services that adapt to business-specific problems, in any language.

DATA STRATEGY
and choosing projects for maximum impact

DEFINE YOUR ROADMAP

IDENTIFY STRATEGIC WORKLOADS
[Diagram: USE CASE 1, USE CASE 2, and USE CASE 3 mapped to WORKLOAD A, WORKLOAD B, WORKLOAD C, and WORKLOAD D]

PRIORITIES

DIMENSIONS
ASSUMPTIONS (overcome them)
LATHER, RINSE, REPEAT

Putting this into action for EDMUNDS.com

Existing revenue streams:
• Ads
• Price quotes (leads)
Shopping is the focus:
• Need real-time inventory
• Accurately described VINs

THE PDFs

THE CHALLENGE
• Couldn’t keep pace with original equipment manufacturers (OEMs)
• Approach was largely manual, and backlogs would develop
• ~6.5% of VINs being held back
• Content operations team was a silo

THE QUESTION
What is the minimum viable data required to get a VIN live?
(Then come back to add features and specs.)
A FRESH LOOK
• Worked with Product Owners to define shell products (aka minimally viable data)
• Leveraged NLP to automate OEM data translations
• Fully integrated NLP workflow into existing tooling

Capabilities have to be integrated into the business process. Ultimately, you may be able to make predictions, but putting those into action is the real thing.

TRAINING HUMANS IS TRICKY
[Workflow diagram. Components: OEM order guide PDFs; ScraperWiki PDF extraction; Idibon prediction service; Idibon training service; data editor; Edmunds DCT. Flow labels: predict; raw data/predictions; Import CSV; curated data & expanded ontology; train model; update.]

[Screenshots: Vehicle source upload; Vehicle source]

Hierarchical Entity Resolution
'Micro-filter ventilation system with replaceable active-charcoal filters'
Attribute Group: Air_Conditioning
Attribute Name: Air_Filtration
Attribute Value: Interior_Active_Charcoal_Filter

Data beats algorithms; feedback beats data
Disambiguating “Ford”
[Bar chart, 0% to 100%. Legend: precision, recall, F-value. Labels: Linear model; Deep Learning; In-domain training; 10mins analyst feedback. Values shown: 0.948, 0.457, 0.615, 0.473.]

Edmunds ontology
'Micro-filter ventilation system with replaceable active-charcoal filters'
Air_Conditioning (depth 1)
Air_Filtration (depth 2)
Interior_Active_Charcoal_Filter (depth 3)
…
Climate_Control_Memory
…
Front_Air_Conditioning
Front Zone
Rear_Air_Conditioning
Rear Zones
…
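The depth-by-depth resolution shown on the ontology slide can be sketched as a walk down a tree, choosing the best-matching child at each level. This is only an illustration, not the Idibon service: the scoring here is plain token overlap with label names (a stand-in for a trained classifier), the nesting below Air_Conditioning is assumed, and Audio_System and Premium_Speakers are hypothetical nodes added so that depth 1 has something to choose between.

```python
import re

# Toy ontology. Node names under Air_Conditioning come from the slides; the
# exact nesting is an assumption. Audio_System and Premium_Speakers are
# hypothetical nodes, present only to give depth 1 a competing choice.
ONTOLOGY = {
    "Air_Conditioning": {
        "Air_Filtration": {"Interior_Active_Charcoal_Filter": {}},
        "Climate_Control_Memory": {},
        "Front_Air_Conditioning": {},
        "Rear_Air_Conditioning": {},
    },
    "Audio_System": {"Premium_Speakers": {}},
}


def tokens(text):
    """Lowercase alphanumeric tokens; hyphens and underscores act as separators."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def vocabulary(name, children):
    """Tokens from a node's label name plus the label names of all descendants."""
    vocab = tokens(name)
    for child_name, grandchildren in children.items():
        vocab |= vocabulary(child_name, grandchildren)
    return vocab


def resolve(text, ontology):
    """Walk the ontology one depth at a time, keeping the best-overlapping child."""
    words = tokens(text)
    path = []
    level = ontology
    while level:
        scored = [(len(words & vocabulary(n, c)), n) for n, c in level.items()]
        best_score, best_name = max(scored)
        if best_score == 0:
            break  # nothing matches at this depth, so stop with a partial path
        path.append(best_name)
        level = level[best_name]
    return path


phrase = "Micro-filter ventilation system with replaceable active-charcoal filters"
print(resolve(phrase, ONTOLOGY))
```

Resolving one depth at a time means each decision only compares siblings, which keeps every choice small even when the full ontology has thousands of leaves.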
4000 (!) car features
20,000 labeled items to date
- requires expertise; accuracy with few data points
Features:
- words, n-grams, word shapes, taxonomies
Negative sampling:
- other items within the same parent
Absence of data:
- use the label name as the first data item

Accuracy at depths in the ontology
[Figure: the Edmunds ontology excerpt (Air_Conditioning at depth 1, Air_Filtration at depth 2, Interior_Active_Charcoal_Filter at depth 3) beside a bar chart of Precision, Recall, and F-Value at Depth 1, Depth 2, and Depth 3, on a 0.000 to 1.000 axis. Values shown: 0.929, 0.794, 0.814.]
Controlled vocabulary allows accuracy with few data points

NEW CAPABILITIES
• 1–2 days to get a new Car Model live online vs. 2 weeks (~85% reduction)
• 95% reduction in backlog
• API for making predictions on unstructured vehicle data using Edmunds’ ontology assets

TAKEAWAYS
• Use data to highlight the real issues
• Change is difficult, agile methods help users embrace new approaches
• New processes must adapt to workflows

working with UNSTRUCTURED DATA

Entity Resolution
Recognizing references to objects:
“BMW 435” = 2014 BMW 4 Series 435i xDrive 2dr Coupe AWD (3.0L 6cyl Turbo 8A)

Concept Discovery
AWD = 4WD = xDrive = quattro
Manual transmission = stick shift

Sentiment Analysis
Positive comment referred to exterior design of competitor’s model

THANK YOU
John Akred [email protected]
Robert Munro [email protected]
Want these slides? Go to: TO ADD

APPENDIX

How do we get the analysts’ feedback?
How many ways could we ask someone to distinguish the right “Ford”?
Get it right, and save 90% of the labor ≈ 90% of total cost.
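Returning to the feature and sampling choices listed with the 4000 car features, the following is a minimal sketch of what word, n-gram, and word-shape features might look like, together with sibling-based negative sampling and seeding each label with its own name. All function names are hypothetical, taxonomy features are left out, and treating sibling label names as negatives is an assumption rather than something the slides state.

```python
import re


def word_shape(token):
    """Collapse a token to its shape, e.g. 'xDrive' -> 'xXxxxx', '435i' -> 'dddx'."""
    shape = []
    for ch in token:
        if ch.isupper():
            shape.append("X")
        elif ch.islower():
            shape.append("x")
        elif ch.isdigit():
            shape.append("d")
        else:
            shape.append(ch)
    return "".join(shape)


def extract_features(text, max_n=2):
    """Word and n-gram features (lowercased) plus one shape feature per token."""
    toks = re.findall(r"[A-Za-z0-9]+", text)
    feats = set()
    for n in range(1, max_n + 1):
        for i in range(len(toks) - n + 1):
            feats.add("ngram=" + " ".join(t.lower() for t in toks[i:i + n]))
    for t in toks:
        feats.add("shape=" + word_shape(t))
    return feats


def build_training_set(parent_children, labeled):
    """Return {label: (positives, negatives)} for every child label.

    parent_children: {parent: [child labels]}
    labeled: {label: [texts labeled so far]}
    """
    training = {}
    for children in parent_children.values():
        for label in children:
            # Absence of data: the label name itself is the first data item.
            positives = [label.replace("_", " ")] + labeled.get(label, [])
            # Negative sampling: items from other labels under the same parent.
            # Assumption: sibling label names are included as negatives too.
            negatives = [
                text
                for sibling in children
                if sibling != label
                for text in [sibling.replace("_", " ")] + labeled.get(sibling, [])
            ]
            training[label] = (positives, negatives)
    return training
```

Drawing negatives only from siblings keeps each classifier focused on the distinction it actually has to make at its depth, which is one way a controlled vocabulary can work with few data points.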
Super-user, guided by model suggestions

Targeting one label at a time

Simple accept/reject annotation for fast annotation
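The three annotation ideas above (model suggestions, one label at a time, accept/reject) could be combined roughly as follows. This is a sketch under assumptions: the function names and the label name Ford_Motor_Company are hypothetical, and showing the highest-scoring suggestions first is one possible ordering, not necessarily the one the authors used.

```python
def review_queue(candidates, target_label, limit=5):
    """Pick items the model suggests for one target label, best score first.

    candidates: list of (text, {label: score}) pairs holding model suggestions.
    """
    scored = [(scores.get(target_label, 0.0), text) for text, scores in candidates]
    scored = [(score, text) for score, text in scored if score > 0.0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:limit]]


def annotate(queue, target_label, decide):
    """Record a yes/no judgment per item; decide(text, label) returns True or False."""
    return [(text, target_label, bool(decide(text, target_label))) for text in queue]


# Hypothetical data: suggestions for two different labels.
candidates = [
    ("ford the river", {"Ford_Motor_Company": 0.2}),
    ("Ford F-150 pickup", {"Ford_Motor_Company": 0.9}),
    ("BMW 435", {"BMW": 0.8}),
]
queue = review_queue(candidates, "Ford_Motor_Company")
# A stand-in for the analyst's click: accept only items mentioning a truck model.
judgments = annotate(queue, "Ford_Motor_Company", lambda text, label: "F-150" in text)
print(judgments)
```

Because every question is a yes/no about a single label, the analyst never has to search the ontology, and each answer can go back to the model as a labeled example.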