What is Big Data?
Mark Whitehorn, Co-Founder, Penguinsoft Consulting Ltd.
Global Sponsor:

It’s all about me…
Prof Mark Whitehorn
Chair of Analytics, School of Computing, University of Dundee, Scotland
- Consultant
- Writer (author)
- Teach a Masters in BI, and another in Data Science
- Also research work

Actually, it isn’t all about me…
- Andy Cobley
- Chris Hillman
- Prof. Angus Lamond
- Dr. Yasmeen Ahmad

What is Big Data?
Is it really just a marketing campaign?
“If you’re like me, the mere mention of Big Data now turns your stomach…. Why all the fuss? Why, indeed. Essentially, Big Data is a marketing campaign, pure and simple.”
Stephen Few
http://www.perceptualedge.com/articles/visual_business_intelligence/big_data_big_ruse.pdf

Big Data
Clearly I am not like Stephen Few. I don’t believe I have a particular axe to grind; I simply find this interesting.
This talk is designed to try to explain:
- what Big Data is
- what characteristics we have found useful
- why it may be of interest to you
- a paradox

Data
All computer applications manipulate data.
So, in the ’60s and ’70s we rapidly learnt to separate the data, and its manipulation, from the application.
Which led directly to the development of database engines and, ultimately, relational ones (DB2, Oracle, SQL Server).

Data
Data has always existed in two, very broad, flavours…..
- Data that is treated as small, discrete packages and is a good fit with the relational way of storing and querying data
- Data that is not as above

Data is stored in tables
Each table has a name: Car

LicenceNo  Make     Model     Year  Color
CER 162C   Triumph  Spitfire  1965  Green
EF 8972    Bentley  Mk.VI     1946  Black
YSK 114    Bentley  Mk.VI     1949  Red

- Data is atomic
- The table has columns and rows
- Each row represents a unique entity in the ‘real’ world……

Data
The manipulation consists typically of sub-setting the data by rows and columns and then maybe doing some sums:

    SELECT Make, Model     (chooses the columns)
    FROM Car
    WHERE Year < 1947      (chooses the rows)

Data
Note that this kind of manipulation is treating the data as atomic, which is fine, because the relational model assumes atomicity of data.
Note also that the rows are unordered.

Data
Data has always existed in two, very broad, flavours…..
- Data that is inherently atomic and is a good fit with the relational way of storing and querying data
- Data that is not as above

Examples
Examples of ‘other’ data:
- Images
- Music
- Word docs
- Sensor data
- Web logs
- Twitter
- Machines
- Point of Sale
- Mass spectrometers

What’s in a name?
So, what do we call the ‘rest’?
- Un-structured?
- Semi-structured?
- Multi-structured?
- Non-relational?
- Non-tabular?
What about: Big Data?

Other definitions?
VVVvvvv
- Volume
- Variety
- Velocity
- Value
- Very interesting
- Various other variations beginning with V…..

Big Data - not new?
So why have we focused, for the last 30 years, almost exclusively on the first flavour?
Because it:
- is easy (relatively easy - Jim Gray*)
- represents a significant proportion of the available data
*Jim Gray and Andreas Reuter, Transaction Processing: Concepts and Techniques (1993). Turing Award 1998.

Big Data has come of age
Two factors have changed:
- Rise of the Machines
- Increase in data capture
There is a great synergy here: we are acquiring far more big data and we now have the computational power to extract the information it contains.

Big Data is hard
Of the 3 Vs, perhaps the most important is Variety.
We often want to look inside the data:
- Frequently non-atomic
- Need custom functions for virtually every operation
  “Find the rotating wing aircraft in the image”
  “Identify the best customer”
  “What does the blogosphere think of our company?”

Big Data Examples
- Log file
- Mass spec.
- Images

Big Data
The problem here is that the order of the rows is significant.
We want to know which page views lead to other page views.
Of course we CAN do that in SQL, but it may not be efficient to do so.

Big Data Examples
- Log file
- Mass spectrometer
- Image

Big Data
The problem here is that the order of the rows is significant (as before).
And the number of rows is likely to be overwhelming.

SQL-MapReduce, Reduce Function
Data output from Mass Spectrometer

mz            intensity
335.2094368   0
335.2105961   0
335.2117553   0
335.2129146   53.024086
335.2140739   184.1607361
335.2152332   264.3601074
335.2163925   259.6187134
335.2175518   239.7870178
335.2187111   313.8243713
335.2198704   490.8760071
335.2210297   634.064209
335.222189    589.8432007
335.2233483   351.9743347
335.2245077   65.21440887
335.225671    0
336.890869    0
336.892037    75.75605011
336.893205    179.8110657
336.894373    247.535553
336.895541    225.6489563
336.8967091   140.6246338
337.1257588   0
337.1280972   86.48993683
337.1292664   170.0835876
337.1304357   215.8146362
337.1316049   188.9733276
337.1327741   110.2854233
337.1912444   0
337.192414    0
337.1935835   143.2112122
337.1947531   357.401123
337.1959227   467.1167297
337.1970923   411.569458
337.1982619   245.5514221
337.1994315   80.80451202

Detecting centroids of peaks is highly complex using SQL as it is not a set-based operation.
Almost 800 lines of complex SQL:

    SELECT file_id, scan_id, ren_tm, ms_lvl, mz
          ,i AS n_i
          ,SUM(i) OVER (PARTITION BY file_id, ms_lvl, ren_tm ORDER BY mz ASC
                        ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING) AS p_i
          ,(CASE WHEN (i > 0) THEN 1 ELSE 0 END) AS Ind
          ...
    FROM dd_stg.mzml
    WHERE ms_lvl = 1

[The slide overlays several further fragments of the same SQL (nested window functions over weighted_peak_mz * chrg and self-joins on DD_STG.S2_WEIGHTED_CURVE); they are interleaved in the source and cannot be separated.]

Procedural code uses 2 loops for same result:

    while (inputIterator.advanceToNextRow()) {
        currIntensity = inputIterator.getDoubleAt(5);
        maxIntensity = 0.0;
        //Initialise Temp Array
        for (int i = 0; i <= 50; i++) {
            curveArray[0][i] = 0;
            curveArray[1][i] = 0;
        }
        if (overlapFlag == 1) {
            count = 1;
        } else {
            count = 0;
        }
        //Find start of Curve, lastintensity is 0
        //or previous lastintensity is higher than lastintensity - overlapping peaks (double peak curve)
        if (currIntensity > 0 && lastIntensity == 0 || overlapFlag == 1) {
            //Populate Temp Array with Curve points and find maxIntensity to derive threshold
            while (currIntensity > 0) {
                if (maxIntensity < currIntensity) maxIntensity = currIntensity;
                if (overlapFlag == 1) {
                    overlapFlag = 0;
                    curveArray[0][count-1] = overlapMZ;
                    curveArray[1][count-1] = overlapIntensity;
                    PI = overlapIntensity;
                    currIntensity = inputIterator.getDoubleAt(5);
                }
                curveArray[0][count] = inputIterator.getDoubleAt(4);
                curveArray[1][count] = inputIterator.getDoubleAt(5);
                count++;
                inputIterator.advanceToNextRow();
                PI2 = PI;
                PI = currIntensity;
                currIntensity = inputIterator.getDoubleAt(5);
                if (currIntensity > PI && PI2 > PI) {
                    //Overlapping Peak found, store MZ and Intensity and start new Curve for next Iteration
                    overlapFlag = 1;
                    overlapMZ = inputIterator.getDoubleAt(4);
                    overlapIntensity = inputIterator.getDoubleAt(5);
                    break;
                }
            }
            //Process Temp Array to create intermediate metrics
            while (curveArray[1][curveCount] > 0) {
                if (curveArray[1][curveCount] > intensityThreshold) {
                    if (maxMZ < curveArray[0][curveCount]) {
                        maxMZ = curveArray[0][curveCount];
                    }
                    if (minIntensity > curveArray[1][curveCount] || minIntensity == 0) {
                        minIntensity = curveArray[1][curveCount];
                    }
                    if (minMZ > curveArray[0][curveCount] || minMZ == 0) {
                        minMZ = curveArray[0][curveCount];
                    }
                    sumIntensity = sumIntensity + curveArray[1][curveCount];
                    sumMZ = sumMZ + curveArray[0][curveCount];
                    sumMZByIntensity = sumMZByIntensity + (curveArray[0][curveCount] * curveArray[1][curveCount]);
                    curvePoints++;
                }
                curveCount++;
            }

Custom analysis and custom visualisation: vital tools in understanding big data

Big Data Examples
- Log file
- Mass spec.
- Images

Big Data
The problem here is that this data is not atomic.
A picture is worth a thousand words.
In other words, we don’t know what question we may ask in the future.

Summary so far…
- Just as you can always fit an aircraft engine into a car chassis, you can always put Big Data in a table, but it reaches a stage where it is no longer the most effective solution to do so
- The analysis is not sub-setting the data by rows and columns
- We are often interested in order
- Each class of big data usually requires a (lovingly hand-crafted) custom analysis

Big Data
Which means that we are going to be storing the data without imposing a schema upon it. In other words, we are going to be storing the data “schema-less”.

Paradox
The relational model guarantees that any question can be asked of the data and that a consistent answer will be delivered.
Big data doesn’t impose a schema on the data; the data is stored schema-less. One reason for this is that a schema would restrict the questions that we can ask of the data.
How does that work?

Paradox
The paradox is that we are saying that:
- if we impose a strict schema we can ask (and answer) any question
- if we impose no schema we can ask (and answer) any question

Paradox
The relational model doesn’t restrict the questions that can be asked of data.
This is essentially true as long as (note the qualification there!):
- we are treating the data as atomic
- we are not looking for order in the rows
- we can subset by row and column and we can do difficult sums on the end result

So, does the relational model allow us to drill inside the atomic data?
Well, no, but relational database engines often do.
The model assumes atomic data:
- A query that finds all the last names of the employees paid more than $40,000 is relational
- A query that finds all the employees where the third letter of their last name is ‘c’ is not relational

Paradox
So, if the data is atomic and is treated as atomic, and order is unimportant, then the relational model allows any question to be asked.

Paradox
Storing data in an ‘unstructured’ way allows you to ask any question of the data.
This is essentially true as long as (note the qualification here!):
- we are prepared to design new functions for every new type of query that we want to run
So, imagine that you have some satellite images. They are stored without a schema being imposed by the database engine.
We want to find rotating wing aircraft, so we write an algorithm that does that.
Now we want to find all the penguins, so we need another custom algorithm.
So, a schema-less database allows any question to be asked of the data, as long as we are prepared to write a new custom algorithm for each new type of query.

Case Study
Oil Rig data
Gone fishing
Sensor data

Case Study
Twitter
Who loves you?
Social/text/sentiment

Case Study
Big Data in the Life Sciences World
The massed spectrometers
Why would anyone do that?

Lessons learned
- Engagement
- Choose your battles: look for an area where you can gain competitive advantage
- Choose your platform carefully
- Programming: algorithm development
- Data scientists
- Custom algorithms
- Custom visualisations

Questions?
Thank You for Attending
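
Worked example for the Paradox slides
To make the distinction between the two employee queries concrete, here is a minimal sketch that runs both against a small in-memory SQLite database. The Employee table, its LastName and Salary columns, and the four sample rows are all made up for illustration; none of them come from the talk. SQLite simply stands in for "a relational database engine", and the ORDER BY is there only to make the output repeatable.

```python
import sqlite3


def run_queries():
    """Run the two employee queries from the Paradox slides on toy data."""
    # Hypothetical table and rows, invented purely for this illustration.
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE Employee (LastName TEXT, Salary INTEGER)")
    con.executemany(
        "INSERT INTO Employee VALUES (?, ?)",
        [("Jackson", 52000), ("Smith", 38000),
         ("Michaels", 41000), ("Brown", 45000)],
    )

    # Relational: each value is compared as a whole, indivisible unit.
    well_paid = [row[0] for row in con.execute(
        "SELECT LastName FROM Employee "
        "WHERE Salary > 40000 ORDER BY LastName")]

    # Not relational in the strict sense: substr() looks inside the value.
    # The engine allows it even though the model assumes atomic data.
    third_letter_c = [row[0] for row in con.execute(
        "SELECT LastName FROM Employee "
        "WHERE substr(LastName, 3, 1) = 'c' ORDER BY LastName")]

    con.close()
    return well_paid, third_letter_c


if __name__ == "__main__":
    well_paid, third_letter_c = run_queries()
    print("Paid more than 40,000:", well_paid)
    print("Third letter is 'c':  ", third_letter_c)
```

The second query works only because the engine ships a built-in string function. For images or mass spectrometer scans there is no such built-in, which is where the custom algorithms come in.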
© Copyright 2024