Big Data vs. big Data?
Business Analytics in Customer-Centric Organizations
Dr. Ilan Sasson
[email protected]
www.datascience.co.il
30/3/2015

Objectives
• What is Big Data Analytics?
• What is Data Science?
• DDD and its importance in customer- and service-oriented organizations
• The variety of data sources
• Basic concepts in creating Data Products
• Recommendations and approaches for building analytical capabilities
• Data scientists, and why this might interest you
• Fundamental concepts in data mining
• Managing business-analytics-driven projects (CRISP-DM)
• Data privacy and why it matters
• Code examples: data/text mining in R

[Figure: Trend of Google searches for “Big Data” and “Data Science” over time, showing the popularity of the terms]

Data Science – the connective tissue between big data processing technologies and data-driven decision making (DDD) (Provost & Fawcett, 2013)

Terminology
• Data-Driven Decision-Making (DDD) – the practice of basing decisions on the analysis of data rather than purely on intuition. (Provost & Fawcett, 2013)
• Data Science – a set of fundamental principles that support the extraction of information and knowledge from data. It involves principles, processes, and techniques for understanding phenomena via the (automated) analysis of data.
• Big Data Technologies – used to process and handle big data, including the preprocessing done before data mining techniques are applied.
→ The new approach to Business Analytics

Why do we really care?
• DDD affects firm performance → the more data-driven a firm is, the more productive it is: a 4%-6% increase in productivity, and DDD is also highly correlated with higher ROI, ROE, asset utilization and market value. (Brynjolfsson et al., Strength in numbers: How does data-driven decision making affect firm performance, 2013, MIT)
• Use of big data technologies correlates with significant additional productivity growth → productivity 3% higher than the average firm. (Tambe P., Big data know-how and business value, 2012, NYU)

Competitive Advantage
What can I now do that I couldn’t do before, or do better than I could do before?
3 Principles of the new era of computing
• Data will be the basis of competitive intelligence for any organization – companies, government entities, cities and individuals
• Data in this new era is not a limited resource
• Changing how we make decisions – decisions will be based not on intuition or past experience, but on predictive analytics.
• Changing how we create value – organizations, private and public, will become social enterprises.
• Changing how we deliver value – success will depend on the ability to create products and services for individuals, not market segments.
http://asmarterplanet.com/blog/2013/03/ibm-ceo-ginni-rometty-gaining-competitive-advantage-in-the-newera-of-computing.html

Big Data Everywhere!
• Lots of data is being collected and warehoused
– Transactional data
– Web data, e-commerce
– Purchases at department/grocery stores
– Bank/credit card transactions
– Social networks
– Multimedia content
– Scientific data
– Network sensors
– Mobile phones
– User-generated content
– Internet of Things
Data is becoming the new currency, a vital natural resource
Datafication – taking all aspects of life and turning them into data (The rise of big data, 2013, Foreign Affairs)

What to do with these data?
Aggregation and Statistics:
• Data warehouse
• OLAP
Indexing, Searching, and Querying:
• Keyword-based search
• Pattern matching (RDF/XML)
Knowledge discovery:
• Data mining
• Text mining
• Graph mining
• Statistical modeling

Big Data … Big Assumptions
• Collecting and using a lot of data rather than small samples (“N = All”)
• Accepting messiness in your data
• Giving up on knowing the causes

Big Data Use Cases
Big Data can play a significant economic role in
• Private commerce
• Public sector
• National economies

big Data – The enterprise perspective
Enterprise data is big… but it is not Google-big
[Diagram labels]
• Classic BI (IT-oriented): OLTP, ETL, OLAP; boundary: Dash-bored…
• Big Data Warehouse (business-oriented): OLTP / dark data / log / social / web, ETL’, augmented DWH + extreme-scale analytics

The existing solution – DWH
• Which types of banking products sell the most?
• What is the distribution of expenses by headquarters unit?
• What is the distribution of revenue by banking product?
• Which product types show a seasonal trend?
• What is the profitability by product across the time and branch dimensions?

The problem space?
• Who are the most likely prospects for a loan above NIS 300,000?
• What characterizes a churning customer?
• How can the credit-approval process for a new customer be shortened?
• What characterizes a profitable customer?
• Which new products should be offered to existing customers?
• How can processes in the organization be made more efficient?
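The DWH-style questions above (for example, the distribution of revenue by banking product) are aggregations over known dimensions. A minimal sketch of such a roll-up follows; the transaction records, field names and figures are hypothetical and are not taken from the deck.

```python
from collections import defaultdict

# Hypothetical transactions: (product, branch, revenue). Illustrative only.
transactions = [
    ("mortgage", "north", 120.0),
    ("credit_card", "north", 30.0),
    ("mortgage", "south", 80.0),
    ("savings", "south", 20.0),
    ("credit_card", "south", 50.0),
]

def revenue_share_by_product(rows):
    """Return each product's share of total revenue (an OLAP-style roll-up)."""
    totals = defaultdict(float)
    for product, _branch, revenue in rows:
        totals[product] += revenue
    grand_total = sum(totals.values())
    return {product: amount / grand_total for product, amount in totals.items()}

shares = revenue_share_by_product(transactions)
# Print products from largest to smallest revenue share.
for product, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{product}: {share:.0%}")
```

A roll-up like this answers "what happened" questions. The problem-space questions (who is likely to churn, who is a good loan prospect) cannot be answered by aggregation alone; they call for a model learned from historical data.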
The solution space: from Business Intelligence to Business Analytics

Large-Scale Data/Text Mining – Discovery-Based Analysis
~ A discovery-based analytical process
~ Tools/algorithms operating over the data space uncover hidden patterns
~ Clustering, prediction and association processes
~ Machine learning, unsupervised learning
~ Processes that require a large historical database

DWH/OLAP – Verification-Based Analysis
~ A verification-based analytical process
~ The user posits some hypothesis
~ Confirmation/refutation techniques are applied
~ User-driven processes – the ability to make sound assumptions, choose the tools, and interpret the results

The two are complementary processes.

Big Data Architecture & Pipeline
[Diagram labels: external/internal data source; Real Time Analytics; Streams; Network/Sensor; Internet of Things; Video/Audio; Entity Analytics; Information Ingestion; Unified Information Access (UIA); Master Data; Data Integration; Stream Processing; Exploration, Analytics, Discovery; Predictive, Operative, Descriptive, Prescriptive; Landing Area Zone & Archive; Raw Data; Structured Data; Unstructured Data; Text Analytics; Data Mining; Machine Learning; Complex Event Processing; Intelligence Analysis; Decision Management; BI & Predictive Analytics; Reporting & Discovery; Business Processes]

Data-Analytic Thinking
One of the most critical aspects of data science is the support of data-analytic thinking throughout the organization → a data-oriented business environment
• A basic understanding of the fundamental principles is needed →
– To assess and envision opportunities accurately (data-analytics projects)
– There is a professional advantage in being able to interact competently (data-analytics team)
– Business units must interact with the data science team (domain knowledge)
– Data science projects require close interaction with the business people responsible for decision making

Conveying the message…
• Data mining is moving from the research arena into the pragmatic world of business
• There is a continuous effort to refine algorithms and come up with new ones
• With new developments in algorithms and architecture, small-scale development teams can now build large-scale projects
• Practical data mining weighs the most advanced and accurate model against its cost and complexity in a real-world business environment
• New analytics tools and platforms make data mining much easier and more powerful for people at all levels of expertise
• The Hadoop-based computing ecosystem is evolving rapidly, making projects with very large-scale datasets much more affordable

The Ladder Approach
• Build a foundation
– Learn to think analytically (data mining models, visualization, statistics etc.)
– Develop a strategy and road map based on business needs (pick a theme)
– C-level management engagement (presentation)
– Adopt a step-by-step process (problem definition → results: CRISP-DM)
– Pick and learn a tool (R, Python etc.)
– Practice on small datasets
• Build a portfolio
– Deliverable POCs and pilot projects (3-5)
– Quick wins
– Practice on small datasets
– Write up findings (storytelling)
• Deliver solutions
– Adopt technology infrastructure (HDFS, MapReduce, NoSQL, Spark SQL etc.)
– Ongoing revisions of models (data products)
– Continue to apply advanced analytics
Business Scope & Deliverables

Rethinking the Business & IT Model
Data Management & Business Analytics are Core Business Competencies
o The Business Owns the Data
o Recognize Analytics as a Business-Driven and Owned Process
o Technology is an Enabler
Shift to Business Configurable and Controlled
o Acknowledge the Difference between Software Development and Business Analytics
o Redefine the IT Support Model to Enable the Business to Acquire, Assess, Analyze, Test, and Deploy Analytical Outcomes
Change the IT Funding & Financial Model
o The Current Infrastructure Model is Geared towards Legacy & Transactional Platforms
o Recognize Analytics as a Business-Driven and Owned Process
o Technology is an Enabler
Slide source: MetLife presentation, IBM Big Data conference, October 2013

Big Data Adoption
• Outlining a work plan (examining one or more scenarios for implementation)
• A 10-session course: Big Data, Data Science & Data Mining
• An 8-session Data-Analytic Thinking course for business people
• Establishing a «360°» team:
– R&D Team
– Infrastructure & Operations
– Business Unit Analysts
– Business IT Support Team

The Data Journey
Estimate: 80% of the information in the organization is unstructured and unmodeled, and is therefore not available for analysis with the existing, traditional tools.
In the first stage we will not boil the ocean…
• Big Data doesn’t have to be big – it can be managed and built incrementally.
• Big Data may or may not include social media (eventually it will).
• Big Data may or may not include external data (eventually it will).
• Sometimes information is good enough.

Data Management before Business Analytics
What the organization does today:
1. OLAP reports (dimensions: product, marketing, pricing)
2. Data mining models …?
• Internal Data – existing operational information in the data warehouse
• New Internal Data (Dark Data) – existing information that is not defined in the data warehouse: structured information, emails, textual information (agents, appraisers)…
• New External Data – information from external sources: the internet, competitors, social networks, location-based cellular data, telematics, sensors and more

Data Products
• Motivation: turning data assets → products and services
• A data product is an algorithm, software, application, presentation or reproducible report based on data analytics
• A data product is the production output of a statistical analysis, data mining, text mining, AI etc.
• “A data product is a product that facilitates an end goal through the use of data.” (DJ Patil)
• Initially online companies:
– search algorithms (Google)
– similar offerings (Amazon)
– recommendations for “people you may know” (Facebook)
• Developing and launching data products, particularly if you are an offline business → it won’t be second nature...
Data-as-a-Service (DaaS) – a cloud strategy used to make business-critical data accessible in a timely, protected and affordable manner → a B2B data "renting" service

The Model Assembly Line
• Do you have the data?
• Do you own the data? (legal issues; consider anonymized personal data)
• Data quality: is it high-quality and useful data?
• Business model: do you have one? (bundling, selling, free)
• Type of analysis: what types of analysis are you offering? (descriptive analytics vs. predictive analytics)
• Competitive advantage: do you have differentiation or a competitive advantage? (proprietary vs. commodity data)

The Model Assembly Line: A case study of DaaS – cellular companies
Do you have the data?
Location-based information – city centers
• Customers: developers of social applications
• Geographic breakdown: city center, shopping areas, entertainment areas, business centers
• Data update frequency: daily/weekly/monthly
• Data access: Online/Batch
• Customer classification: business, private
• Communication type: Voice, SMS
Do you own the data? Data quality?
Location-based information – main traffic arteries
• Customers: municipalities and government planning bodies
• Geographic breakdown: city, suburb
• Road type: intercity highway, urban road, freeway
• Data update frequency: daily/weekly/monthly
• Data access: Online/Batch
Business model, type of analysis, competitive advantage

Pricing Models
• Volume-based model – quantity-based pricing (amount), pay-per-call (PPCall)
• Data-type-based model – based on the type or attribute of the data
• Subscription-based model – an unlimited amount of data

Implementation Approaches
• The Full Service Approach: relying on a 3rd party to develop and maintain the model
• The Full Control Approach: in-house model development and deployment
• The Consultant Approach: hybrid methodology

Implementation Approaches: The Full Service Approach
o Pros:
o The ideal solution for companies that are resource constrained
o The ideal solution for companies lacking technical and analytics staff
o Model development can rely on expertise provided by the vendor
o The quickest path to implementation
o Cons:
o Reliance on the vendor to provide a solution without any independent review
o Not being able to make changes to the model directly
o Internal staff is not trained to ensure attainment of the desired results

Implementation Approaches: The Full Control Approach
o Pros:
o The ideal solution for companies with analytics and IT resources
o Helps to protect IP in the case of a novel idea or product
o Offers the most flexibility in making revisions or customizations to the model
o Cons:
o The firm can’t take advantage of any data or expertise accumulated by vendors and consultants
o If a fundamental modeling error has been made, it may never be discovered
o Historically the slowest path to deployment, with successful implementations measured in years(?)

Implementation Approaches: The Consultant Approach
Build your own core competencies coupled with high-end data science consultancy
o Pros:
o The ideal solution for companies lacking depth in their analytics department, but with available resources in systems and IT
o There is a built-in “independent review” phase in this approach
o Companies are able to make changes directly to the model as needed
o Cons:
o If companies lack internal technical or analytical resources, they may be at the mercy of the vendor in the future should a model update or revision be needed
o Some companies attempt to update vendor models but lack in-depth knowledge of the modeling techniques used; as a result, they may inadvertently make fundamental modeling errors
o Requires continuous management attention

Roles in Data Science
• Data Scientist – applied statistician × computer scientist
“Data Scientist (noun): better at statistics than any software engineer and better at software engineering than any statistician” (Josh Wills)
[Diagram labels: Computer science; Math; Statistics; Machine learning; Domain expertise; Communication and presentation skills; Data visualization]
– No one person can be the perfect data scientist → a team…?
“…shortage of 140,000 to 190,000 people with deep analytical skills as well as 1.5 million managers and analysts to analyze big data…” (McKinsey, 2011)

Data Scientist
Skills required to exploit big data
• Skills to work with business stakeholders to understand the business issue and context
• Analytical and decision-modeling skills for discovering relationships within data and proposing patterns
• Data management skills to build the relevant dataset used for the analysis
• A broad combination of soft and technical skills

Sample of Program Offerings
DB – Databases
BI – Business Intelligence, Data Warehousing
ST – Advanced-Level Statistics
BA – Business Analytics, Web Analytics
DM – Data Mining, Machine Learning, Text Mining, Natural-Language Processing
BD – Big Data Technologies, Visualization
KM – Knowledge Management, Social-Web Analysis

Cosmologists of the digital universe
http://online.wsj.com/article/SB10001424127887323478304578332850293360468.html?mod=itp

Building Models – Introduction
A model captures the knowledge exhibited by the data and encodes it in some language… no model can perfectly represent the real world
Automatic or semi-automatic extraction of information that is:
• Interesting
• Non-trivial
• Implicit
• Previously unknown
• Potentially useful
Typical tasks:
• Forecasting what may happen in the future
• Classifying items into groups by recognizing patterns
• Clustering items into groups based on their attributes
• Associating what events are likely to occur together
• Sequencing what events are likely to lead to later events

Building Models – Introduction
Models fall into two categories of data mining: descriptive and predictive
• Predictive tasks – use some variables to predict unknown or future values of other variables
• Descriptive tasks – find human-interpretable patterns that describe the data
Learning types: supervised learning, unsupervised learning, meta learning (ensemble learners)

Types of Data Mining Tasks
Many business problems have as an important component one of these DM tasks:
Unsupervised
• Affinity grouping (a.k.a. “associations”, “market-basket analysis”)
– What items are commonly purchased together?
• Similarity matching
– What other companies are like our best small business customers?
• Description/Profiling
– What does “normal behavior” look like?
• Clustering
– Do my customers form natural groups?
Supervised
• Predictive modeling (including causal modeling & link prediction)
– Will customer X churn next month/default on her loan?
– How much would prospect X spend?
– Who might be good “friends” on our social networking site?

Data Mining vs. Deployment
• Merging traditional & Big Data approaches
• Merging traditional & agile approaches
• Time to market – a slow process
• Disconnect between the business people (consumers) and the IT people (producers)
• The overall cost is high
• Breaking down the walls
• A discovery process, not a traditional SW development project
• The business owns the data…

Codification of the Process
Extracting useful knowledge from data to solve business problems can be treated systematically by following a process with reasonably well-defined stages
CRISP-DM – the Cross-Industry Standard Process for Data Mining (www.crisp-dm.org) (CRISP-DM; Shearer, 2000)
A structured process with critical points:
• Human intuition
• High-powered analytical tools
A well-understood process that places a structure on a problem which still involves art… science + craft + creativity + common sense

CRISP-DM
This process diagram makes explicit the fact that iteration is the rule rather than the exception… it is not a linear process
[Diagram callouts]
• The point of actually using your results
• Both mathematical and logical
• Preparatory activity: what data? where is the data? accuracy and reliability of the data
• The most substantial component (65%): time-consuming and labor-intensive

CRISP-DM
Business Understanding
A creative problem formulation – what is the problem?
Think carefully about the use scenario and the actual business need
• What exactly do we want to do?
• How exactly would we do it?
• What parts of this use scenario constitute possible data mining models?
Data Understanding
It is important to understand the strengths and limitations of the data. Historical data are often collected for purposes unrelated to the current business problem.
Estimating the costs and benefits of each data source, since data have varying degrees of reliability:
• Cost of acquiring the data
• Data manipulation
• Data quality

CRISP-DM
Data Preparation
Pre-processing tasks
• Data conversions
• Data transformations (e.g., normalization, scaling etc.)
• Missing values, outliers
• Redundant or non-informative features (i.e., feature selection, between-predictor correlations)
• Dimensionality reduction techniques (e.g., PCA, SVD)
Modeling
The primary place where data mining techniques are applied to the data. It is important to have some understanding of the fundamental ideas of data mining, including the sorts of techniques, algorithms and tuning parameters.
Evaluation
The purpose of the evaluation stage is to assess the data mining results rigorously and to gain confidence that they are valid and reliable before moving on.
Measuring model performance and generalization

Basic Principles – Privacy
• Collection limitation – Data should be obtained lawfully and fairly, while some very sensitive data should not be held at all.
• Data quality – Data should be relevant to the stated purposes, accurate, complete, and up-to-date; proper precautions should be taken to ensure this accuracy.
• Purpose specification – The purposes for which data will be used should be identified, and the data should be destroyed if it no longer serves the given purpose.
• Use limitation – Use of data for purposes other than those specified is forbidden.
Source: Organisation for Economic Co-operation and Development (OECD), 1980.
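Returning to the Data Preparation tasks listed in the CRISP-DM slides above, two of them (missing values and normalization) can be sketched in a few lines. This is a minimal illustration under assumptions: median imputation and min-max scaling are just one common choice each, and the function names and the sample column are hypothetical, not part of the deck.

```python
from statistics import median

def impute_median(values):
    """Replace missing entries (None) with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]

def min_max_scale(values):
    """Linearly rescale values to [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical customer ages with one missing value (illustrative only).
ages = [20, None, 30, 60]
prepared = min_max_scale(impute_median(ages))
print(prepared)
```

When a model is later evaluated on held-out data, the fill value and the min/max should be computed on the training portion only and then reused on the test portion, so that the evaluation measures generalization rather than leaked information.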
Data Science Course
• Applications and uses of Big Data
• A range of data mining models for Predictive and Descriptive Analytics and Exploratory Data Analysis, including among others:
– Cluster Analysis
– Association Analysis
– Decision Trees & Random Forest
– Support Vector Machine
– Neural Networks
– Anomaly Detection
• Graph mining and Social Network Analysis, with basic concepts such as:
– Degree & Degree Distribution
– Centrality, Betweenness, Closeness
– Centralization and more
• NLP-based methods for mining textual data for Text Categorization and Information Extraction; basic concepts of Information Retrieval; and Bag-of-Words-based methods for representing textual data
• Using the R environment for statistical exploration, mining and presentation of data
• Visualization and graphics approaches for applications based on textual data analysis (co-occurrence network, neighborhood graph and more)
• Advanced data management technologies and storage and processing architectures
• The CRISP-DM model for managing business analytics projects

Why R?
• R is a free and open source language and environment for statistical computing and graphics.
• R is already the most popular of the leading software packages for statistical analysis.
• Key features:
– It’s mature and widely used (NYT)
– Excellent graphics capabilities: http://www.sr.bham.ac.uk/~ajrs/R/r-gallery.html
– Highly extensible, with over 4300 user-contributed packages
– It’s easy to use and has excellent online help and associated documentation
– Manuals, tutorials, etc. provided by users of R: http://cran.r-project.org/other-docs.html

Big data represents a process with evolutionary trends: complexity, diversity and specialization

Thank you for listening
[email protected]