Building the World’s Largest Database
of Car Features from PDFs
John Akred
Chief Technology Officer
Silicon Valley Data Science
Robert Munro
Chief Executive Officer
Idibon
JOHN AKRED
@BigDataAnalysis
Founder & CTO
Silicon Valley Data Science
Consulting firm of elite data
science and engineering
teams who specialize in
data-driven product
development and business
transformation.
© 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
Robert Munro
@WWRob
Founder & CEO
Idibon
Developers of cloud-based
natural language
processing services that
adapt to business-specific
problems, in any language.
DATA STRATEGY
and choosing projects for
maximum impact
DEFINE YOUR ROADMAP
IDENTIFY STRATEGIC WORKLOADS
[Diagram: use cases 1, 2, and 3, each made up of workloads drawn from A, B, C, and D]
PRIORITIES
DIMENSIONS
ASSUMPTIONS
(overcome them)
LATHER, RINSE, REPEAT
Putting this into action for
EDMUNDS.com
Existing revenue streams:
•  Ads
•  Price quotes (leads)
Shopping is the focus:
•  Need real-time
inventory
•  Accurately described
VINs
THE PDFs
THE CHALLENGE
•  Couldn’t keep pace with
original equipment
manufacturers (OEMs)
•  Approach was largely
manual, and backlogs
would develop
•  ~6.5% of VINs being held
back
•  Content operations team
was a silo
THE QUESTION
What is the
minimum viable data
required to get a VIN live?
(Then come back to add
features and specs.)
A FRESH LOOK
•  Worked with Product
Owners to define shell
products (aka minimally
viable data)
•  Leveraged NLP to
automate OEM data
translations
•  Fully integrated NLP
workflow into existing
tooling
Capabilities have to be
integrated into the business
process.
Ultimately, you may be able
to make predictions, but
putting those into action is
the real thing.
TRAINING HUMANS
IS TRICKY
[Workflow diagram. Components: OEM order guide PDFs; ScraperWiki PDF extraction; raw data/predictions; Idibon prediction service (predict); Import CSV; Edmunds DCT data editor; vehicle source upload; curated data & expanded ontology; Idibon training service (train model, update).]
Hierarchical Entity Resolution
'Micro-filter ventilation system with replaceable active-charcoal filters'
Attribute Group: Air_Conditioning
Attribute Name: Air_Filtration
Attribute Value: Interior_Active_Charcoal_Filter
Data beats algorithms; feedback beats data
Disambiguating “Ford”
[Bar chart: precision, recall, and F-value on a 0% to 100% axis. Labels: Linear model; Deep Learning; In-domain training; 10 mins analyst feedback. Values shown: 0.948, 0.457, 0.615, 0.473.]
Edmunds ontology
'Micro-filter ventilation system with replaceable active-charcoal filters'
Air_Conditioning (depth 1)
Air_Filtration (depth 2)
Interior_Active_Charcoal_Filter (depth 3)
…
Climate_Control_Memory
…
Front_Air_Conditioning
Front Zone
Rear_Air_Conditioning
Rear Zones
…
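As a rough illustration of resolving a feature string against the tree above, the sketch below walks a hand-built fragment of the ontology top-down and keeps the best-scoring child at each depth. It is not the production system: the nested-dict layout, the `examples` phrases, and the token-overlap scorer are all assumptions standing in for trained per-node models, and only the labels come from the slide.

```python
import re

def tokens(text):
    """Lowercase alphabetic tokens with a crude plural strip (illustrative only)."""
    return {t.rstrip("s") for t in re.findall(r"[a-z]+", text.lower())}

def leaf():
    return {"examples": [], "children": {}}

# Hypothetical fragment of the ontology. The "examples" phrases are invented
# stand-ins for labeled training items; only the labels come from the slide.
ONTOLOGY = {
    "Air_Conditioning": {
        "examples": ["ventilation system"],
        "children": {
            "Air_Filtration": {
                "examples": ["micro-filter"],
                "children": {"Interior_Active_Charcoal_Filter": leaf()},
            },
            "Front_Air_Conditioning": {
                "examples": [],
                "children": {"Front Zone": leaf()},
            },
            "Rear_Air_Conditioning": {
                "examples": [],
                "children": {"Rear Zones": leaf()},
            },
        },
    },
}

def score(text, label, node):
    """Best token overlap between the text and the label name or any example."""
    text_toks = tokens(text)
    candidates = [label.replace("_", " ")] + node["examples"]
    return max(len(text_toks & tokens(c)) for c in candidates)

def resolve(text, level=ONTOLOGY):
    """Walk the ontology top-down, keeping the best-scoring child at each depth."""
    path = []
    while level:
        label, node = max(level.items(), key=lambda kv: score(text, kv[0], kv[1]))
        if score(text, label, node) == 0:
            break  # nothing at this depth matches; stop with a partial path
        path.append(label)
        level = node["children"]
    return path

feature = "Micro-filter ventilation system with replaceable active-charcoal filters"
print(resolve(feature))
```

Resolving one depth at a time means each decision only has to separate a handful of siblings rather than every label in the ontology at once.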
4,000 (!) car features
20,000 labeled items to date
- requires expertise; accuracy with few data points
Features:
- words, n-grams, word shapes, taxonomies
Negative sampling:
- other items within the same parent
Absence of data:
- use the label name as the first data item
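The points above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the Idibon implementation: the function names, the feature string format, and the sample items are hypothetical, and taxonomy features are left out for brevity. It shows word, bigram, and word-shape features, negatives drawn from sibling labels under the same parent, and the label name used as the first positive item when a label has no data yet.

```python
import re

def word_shape(token):
    """Collapse a token to its shape, e.g. 'xDrive' -> 'xXx', '3.0L' -> 'd.dX'."""
    shape = re.sub(r"[A-Z]+", "X", token)
    shape = re.sub(r"[a-z]+", "x", shape)
    return re.sub(r"\d+", "d", shape)

def extract_features(text):
    """Words, bigrams, and word shapes as a set of string features."""
    toks = text.split()
    feats = {f"word={t.lower()}" for t in toks}
    feats |= {f"bigram={a.lower()}_{b.lower()}" for a, b in zip(toks, toks[1:])}
    feats |= {f"shape={word_shape(t)}" for t in toks}
    return feats

def build_training_set(label, items_by_sibling):
    """Build (features, target) rows for one label.

    items_by_sibling maps every label under the same parent to its labeled
    items. The label's own name is used as the first positive item, so a
    label with no data still has one example; negatives are the items of
    the other labels under that parent.
    """
    positives = [label.replace("_", " ")] + items_by_sibling[label]
    negatives = [
        item
        for other, items in items_by_sibling.items()
        if other != label
        for item in items
    ]
    rows = [(extract_features(t), 1) for t in positives]
    rows += [(extract_features(t), 0) for t in negatives]
    return rows

# Hypothetical labeled items under one parent; Climate_Control_Memory has none yet.
siblings = {
    "Air_Filtration": ["cabin pollen filter"],
    "Climate_Control_Memory": [],
}
rows = build_training_set("Climate_Control_Memory", siblings)
print(len(rows), sorted(rows[0][0]))
```

Restricting negatives to siblings keeps each classifier focused on the distinctions that matter at its level of the ontology.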
Accuracy at depths in the ontology
[Chart: precision, recall, and F-value on a 0.000 to 1.000 axis at ontology depths 1, 2, and 3, shown next to the ontology example Air_Conditioning (depth 1), Air_Filtration (depth 2), Interior_Active_Charcoal_Filter (depth 3). Values shown: 0.929, 0.794, 0.814.]
Controlled vocabulary allows accuracy with few
data points
NEW CAPABILITIES
•  1–2 days to get a new car
model live online vs. 2
weeks (~85% reduction)
•  95% reduction in backlog
•  API for making predictions
on unstructured vehicle
data using Edmunds’
ontology assets
TAKEAWAYS
•  Use data to highlight the
real issues
•  Change is difficult; agile
methods help users
embrace new
approaches
•  New processes must
adapt to workflows
working with
UNSTRUCTURED DATA
Entity Resolution
Recognizing references to objects:
“BMW 435” = 2014 BMW 4 Series 435i xDrive 2dr Coupe AWD (3.0L 6cyl Turbo 8A)
Concept Discovery
AWD = 4WD = xDrive = quattro
Manual transmission = stick shift
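One simple way to apply equivalences like the two lines above is a synonym table that rewrites every variant to a single canonical term. The sketch below is only an illustration: the choice of canonical names and the `normalize` function are assumptions, and in practice such groups would be discovered from data rather than written by hand.

```python
import re

# Synonym groups taken from the slide; the canonical names are arbitrary choices.
SYNONYMS = {
    "AWD": ["awd", "4wd", "xdrive", "quattro"],
    "manual transmission": ["manual transmission", "stick shift"],
}

# Map each lowercase variant to its canonical term.
LOOKUP = {v: canon for canon, variants in SYNONYMS.items() for v in variants}

# Longest variants first so multi-word phrases are tried before shorter ones.
PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(v) for v in sorted(LOOKUP, key=len, reverse=True)) + r")\b",
    re.IGNORECASE,
)

def normalize(text):
    """Replace every known variant in the text with its canonical term."""
    return PATTERN.sub(lambda m: LOOKUP[m.group(1).lower()], text)

print(normalize("435i xDrive with stick shift"))
```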
Sentiment Analysis
Positive comment referred to exterior design of competitor’s model
THANK YOU
John Akred
[email protected]
Robert Munro
[email protected]
Want these slides? Go to:
TO ADD
APPENDIX
How do we get the analysts’ feedback?
How many ways could we ask someone to distinguish the
right “Ford”?
Get it right, and save 90% of the labor ≈ 90% of total cost.
Super-user, guided by model suggestions
Targeting one label at a time
Simple accept/reject interface for fast
annotation