
Data Vault Automation at
de Bijenkorf
PRESENTED BY
ROB WINTERS
ANDREI SCORUS
Presentation agenda
◦ Project objectives
◦ Architectural overview
◦ The data warehouse data model
◦ Automation in the data warehouse
◦ Successes and failures
◦ Conclusions

About the presenters
Rob Winters
Head of Data Technology, the Bijenkorf
Project role:
◦ Project Lead
◦ Systems architect and administrator
◦ Data modeler
◦ Source system expert
◦ Developer (ETL, predictive models, reports)
◦ Stakeholder manager
Joined project September 2014

Andrei Scorus
BI Consultant, Incentro
Project role:
◦ Main ETL Developer
◦ ETL Developer
◦ Modeling support
Joined project November 2014

Project objectives
Technical Requirements
◦ Information requirements
  ◦ Have one place as the source for all reports
  ◦ Security and privacy
  ◦ Information management
  ◦ Integrate with production
    • One environment to quickly generate customer insights
    • Then feed those insights back to production
    • Then measure the impact of those changes in near real time
◦ Non-functional requirements
  ◦ System quality
    ◦ Extensibility
    ◦ Scalability
    ◦ Maintainability
    ◦ Security
    ◦ Flexibility
  ◦ Low Cost
Source system landscape
Source Type | Number of Sources | Examples | Load Frequency | Data Structure
Oracle DB | 2 | Virgo ERP | 2x/hour | Partial 3NF
MySQL | 3 | Product DB, Web Orders, DWH | 10x/hour | 3NF (Web Orders), improperly normalized
Event bus | 1 | Web/email events | 1x/minute | Tab delimited with JSON fields
Webhook | 1 | Transactional emails | 1x/minute | JSON
REST APIs | 5+ | GA, DotMailer | 1x/hour to 1x/day | JSON
SOAP APIs | 5+ | AdWords, Pricing | 1x/day | XML

Architectural overview
Tools

AWS
◦ S3
◦ Kinesis
◦ Elasticache
◦ Elastic Beanstalk
◦ EC2
◦ DynamoDB

Open Source
◦ Snowplow Event Tracker
◦ Rundeck Scheduler
◦ Jenkins Continuous Integration
◦ Pentaho PDI

Other
◦ HP Vertica
◦ Tableau
◦ Github
◦ RStudio Server

DWH internal architecture
• Traditional three-tier DWH
• ODS generated automatically from staging (see the sketch after this list)
• Ops mart reflects data in original source form
  • Helps offload queries from source systems
• Business marts materialized exclusively from vault
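As an illustration of the automatic ODS generation, a minimal sketch of the kind of view the loader might emit for one staging table; the schema and column names are assumptions borrowed from the staging example later in this deck, not the project's actual generated code.

-- Expose the latest version of each record in staging as the ODS object;
-- sourceFile is kept so every row stays traceable to its source file.
CREATE VIEW ods.customer AS
SELECT customerId, customerName, customerAddress, loadTs, sourceFile
FROM (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY customerId
                              ORDER BY loadTs DESC) AS rowNum
    FROM stg_oms.customer
) ranked
WHERE rowNum = 1;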
Bijenkorf Data Vault overview

Aligned to Data Vault 2.0
◦ Hash keys (a hub-load sketch follows below)
◦ Hashes used for CDC
◦ Parallel loading
◦ Maximum utilization of available resources
◦ Data unchanged into the vault

Model contains
◦ Sales transactions
◦ Customer and corporate locations
◦ Customers
◦ Products
◦ Payment methods
◦ E-mail
◦ Phone
◦ Product grouping
◦ Campaigns
◦ deBijenkorf card
◦ Social media

Excluded from the vault
◦ Event streams
◦ Server logs
◦ Unstructured data

Some statistics
◦ 18 hubs, 34 loading scripts
◦ 27 links, 43 loading scripts
◦ 39 satellites, 43 loading scripts
◦ 13 reference tables, 1 script per table

Data volumes
◦ ~1 TB base volume
◦ 10-12 GB daily
◦ ~250 source tables
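To make the hash-key and parallel-loading points concrete, a minimal sketch of a Data Vault 2.0 hub load, assuming hypothetical dv.hub_customer and stg_oms.customer tables; the project's real loading scripts are generated from templates, as shown later in this deck.

-- Hub load: insert unseen business keys, hashed per Data Vault 2.0.
-- Loads like this touch only their own hub, so they parallelize freely.
INSERT INTO dv.hub_customer (customerHashKey, customerBk, loadTs, recordSource)
SELECT MD5(UPPER(TRIM(s.customerId::varchar))) AS customerHashKey,
       s.customerId                            AS customerBk,
       MIN(s.loadTs)                           AS loadTs,
       MIN(s.sourceFile)                       AS recordSource
FROM stg_oms.customer s
WHERE NOT EXISTS (
    SELECT 1
    FROM dv.hub_customer h
    WHERE h.customerHashKey = MD5(UPPER(TRIM(s.customerId::varchar)))
)
GROUP BY s.customerId;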
Deep dive: Transactions in DV

[Diagram: transaction area of the vault model]

Deep dive: Customers in DV
[Diagram: customer area of the vault model; customer matching modeled as a same-as link on customer]

Challenges encountered during data modeling
Challenge | Issue Details | Resolution
Source issues | Source systems and original data unavailable for most information; data often transformed 2-4 times before access was available; business keys (ex. SKU) typically replaced with sequences | Business keys rebuilt in staging prior to vault loading
Modeling returns | Retail returns can appear in 1-3 ways across multiple tables with inconsistent keys; online returns appear as a state change on the original transaction and may/may not appear in ERP | Original model showed sale state on the line-item satellite; revised model recorded "negative sale" transactions and used a new link to connect to the original sale when possible (see the sketch below)
Fragmented knowledge | Information about the systems was held by multiple people; documentation was out of date | Talking to as many people as possible and testing hypotheses on the data
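The revised returns model can be sketched as a dedicated link: the return is stored as a "negative sale" transaction and, when the original sale can be identified, tied back to it. All table and column names here are illustrative, not the project's actual model.

-- Link tying a return ("negative sale") to the sale it reverses.
CREATE TABLE dv.lnk_return_original_sale (
  returnOriginalSaleHashKey  char(32)     NOT NULL, -- hash of both transaction keys
  returnTransactionHashKey   char(32)     NOT NULL, -- the negative-sale transaction
  originalTransactionHashKey char(32)     NOT NULL, -- the original sale, when known
  loadTs                     timestamp    NOT NULL,
  recordSource               varchar(255) NOT NULL
);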
Targeted benefits of DWH automation

Objective | Achievements
Speed of development | Integration of new sources or data from existing sources takes 1-2 steps; adding a new vault dependency takes one step
Simplicity | Five jobs handle all ETL processes across the DWH
Traceability | Every record/source file is traced in the database and every row automatically identified by source file in ODS
Code simplification | Replaced most common key definitions with dynamic variable replacement
File management | Every source file automatically archived to Amazon S3 in appropriate locations sorted by source, table, and date; entire source systems, periods, etc. can be replayed in minutes

Source loading automation
◦ Design of loader focused on process abstraction, traceability, and minimization of "moving parts"
◦ Final process consisted of two base jobs working in tandem: one for generating incremental extracts from source systems, one for loading flat files from all sources to staging tables
◦ Replication was desired but rejected due to limited access to source systems

Workflow of source integration
1. Source tables duplicated in staging with the addition of loadTs and sourceFile columns
2. Metadata for source file added
3. Loader automatically generates ODS, begins tracking source files for duplication and data quality

Example: Add additional table from existing source
Query generator automatically executes full duplication on first execution and incrementals afterward (a sketch of a generated incremental extract follows the example).

CREATE TABLE stg_oms.customer (
  customerId      int
, customerName    varchar(500)
, customerAddress varchar(5000)
, loadTs          timestamp NOT NULL
, sourceFile      varchar(255) NOT NULL
)
ORDER BY customerId
PARTITION BY date(loadTs)
;

INSERT INTO meta.source_to_stg_mapping
  (targetSchema, targetTable, sourceSystem, fileNamePattern, delimiter, nullField)
VALUES ('stg_oms','customer','OMS','OMS_CUSTOMER','TAB','NULL')
;
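The generated incremental extract might look like the following sketch; the lastModifiedTs column and the :lastLoadTs parameter are assumptions for illustration, with the real high-water mark coming from the loader's metadata.

-- First execution: run without the WHERE clause (full duplication).
-- Later executions: pull only rows changed since the last successful extract.
SELECT customerId, customerName, customerAddress
FROM customer
WHERE lastModifiedTs > :lastLoadTs;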
Vault loading automation

◦ Loader is fully metadata driven with focus on horizontal scalability and management simplicity
◦ To support speed of development and performance, variable-driven SQL templates are used throughout (a template sketch follows after this list)

1. All staging tables checked for changes
   • New sources automatically added
   • Last change epoch based on load stamps, advanced each time all dependencies execute successfully
2. List of dependent vault loads identified
   • Dependencies declared at time of job creation
   • Load prioritization possible but not utilized
3. Loads planned in hub, link, satellite order
   • Jobs parallelized across tables but serialized per job
   • Dynamic job queueing ensures appropriate execution order
4. Loads executed
   • Variables automatically identified and replaced
   • Each load records performance statistics and error messages
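A hedged sketch of such a variable-driven template, here for a satellite load with hash-based change data capture. The ${...} placeholder syntax and all names are assumptions rather than the project's actual convention, and staging is assumed to already carry the computed hash key.

-- Satellite-load template; the loader substitutes ${...} from metadata.
-- The hash diff drives CDC: a row is written only when the hashed
-- attribute payload differs from the satellite's current row.
INSERT INTO dv.${satTable} (${hashKey}, loadTs, hashDiff, ${attributeList}, recordSource)
SELECT s.${hashKey},
       s.loadTs,
       MD5(${attributeConcat}) AS hashDiff,
       ${attributeList},
       s.sourceFile
FROM stg.${stagingTable} s
LEFT JOIN (
    SELECT ${hashKey}, hashDiff,
           ROW_NUMBER() OVER (PARTITION BY ${hashKey}
                              ORDER BY loadTs DESC) AS rowNum
    FROM dv.${satTable}
) cur ON cur.${hashKey} = s.${hashKey} AND cur.rowNum = 1
WHERE s.loadTs > ${lastChangeEpoch}
  AND (cur.hashDiff IS NULL OR cur.hashDiff <> MD5(${attributeConcat}));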
Design goals for mart loading automation

Requirement | Solution | Benefit
Simple, standardized models | Metadata-driven Pentaho PDI | Easy development using parameters and variables
Easily extensible | Plugin framework | Rapid integration of new functionality
Rapid new job development | Recycle standardized jobs and transformations | Limited moving parts, easy modification
Low administration overhead | Leverage built-in logging and tracking | Easily integrated mart loading reporting with other ETL reports

Information mart automation flow
1. Retrieve commands
   • Each dimension and fact is processed independently
2. Get dependencies
   • Based on the defined transformation, get all related vault tables: links, satellites, or hubs
3. Retrieve changed data
   • From the related tables, build a list of unique keys that have changed since the last update of the fact or dimension (a hypothetical version of this query follows below)
   • Store the data in the database until further processing
4. Execute transformations
   • Multiple Pentaho transformations can be processed per command using the data captured in previous steps
5. Maintenance
   • Logging happens throughout the whole process
   • Cleanup after all commands have been processed
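The changed-key step can be sketched as a query like the one below; the meta.mart_load_log table and all other names are hypothetical stand-ins for the loader's actual metadata.

-- Collect vault keys whose satellite rows changed since the last mart load;
-- only these keys need to be re-materialized into the dimension.
SELECT DISTINCT s.customerHashKey
FROM dv.sat_customer_details s
WHERE s.loadTs > (SELECT MAX(lastLoadTs)
                  FROM meta.mart_load_log
                  WHERE targetTable = 'dim_customer');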
Primary uses of Bijenkorf DWH

Personalization
• DV drives recommendation engine and customer recommendations (updated nightly)
• Data pipeline supports near real-time updating of customer recommendations based on web activity

Customer Analysis
• Provided first unified data model of customer activity
• 80% reduction in unique customer keys
• Allowed for segmentation of customers based on combination of in-store and online activity

Business Intelligence
• DV-based marts replace joining dozens of tables across multiple sources with single facts/dimensions
• IT-driven reporting being replaced with self-service BI

Biggest drivers of success
AWS Infrastructure
• Cost: Entire infrastructure for less than one server in the data center
• Toolset: Most services available off the shelf, minimizing administration
• Freedom: No dependency on IT for development support
• Scalability: Systems automatically scaled to match DWH demands

Automation
• Speed: Enormous time savings after initial investment
• Simplicity: Able to run and monitor 40k+ queries per day with minimal effort
• Auditability: Enforced tracking and archiving without developer involvement

PDI framework
• Ease of use: Adding new commands takes at most 45 minutes
• Agile: Building the framework took 1 day
• Low profile: Average memory usage of 250MB

Biggest mistakes along the way
Reliance on documentation and requirements over expert users
• Initial integration design was based on provided documentation and models, which were rarely accurate
• Current users of the sources should have been engaged earlier to explain undocumented caveats

Late utilization of templates and variables
• Variables were utilized late in development, slowing progress significantly and creating consistency issues
• Good initial design of templates will significantly reduce development time in the mid/long run

Aggressive overextension of resources
• We attempted to design and populate the entire data vault prior to focusing on customer deliverables like reports (in addition to other projects)
• We have shifted focus to continuous release of new information rather than waiting for completeness

Primary takeaways
◦ Sources are like cars: the older they are, the more idiosyncrasies they have. Be cautious with design automation!
◦ Automation can enormously simplify and accelerate data warehousing. Don't be afraid to roll your own
◦ Balance stateful versus stateless and monolithic versus fragmented architecture design
◦ Cloud-based architecture built on column-store DBs is extremely scalable, cheap, and highly performant
◦ A successful vault can create a new problem: getting IT to think about business processes rather than system keys!

Rob Winters: [email protected]
Andrei Scorus: [email protected]