What is Big Data?
Mark Whitehorn, Co-Founder, Penguinsoft Consulting Ltd.
Global Sponsor:
It’s all about me…
Prof Mark Whitehorn
Chair of Analytics
School of Computing
University of Dundee
Scotland
Consultant
Writer (author)
Teach a Masters in BI, and another in Data Science
Also research work
Actually, it isn’t all about me…
Andy Cobley
Chris Hillman
Prof. Angus Lamond
Dr. Yasmeen Ahmad
What is Big data?
Is it really just a marketing campaign?
http://www.perceptualedge.com/articles/visual_business_intelligence/big_data_big_ruse.pdf
“If you’re like me, the mere mention of Big Data now turns your stomach….Why all
the fuss? Why, indeed. Essentially, Big Data is a marketing campaign, pure and
simple.”
Stephen Few
Big data
Clearly I am not like Stephen Few.
I don’t believe I have a particular axe to grind; I simply find this interesting.
This talk tries to explain:
- what Big Data is
- what characteristics we have found useful
- why it may be of interest to you
- a paradox
Data
All computer applications manipulate data
Data
So, in the ’60s and ’70s we rapidly learnt to separate the data, and its manipulation, from the application.
This led directly to the development of database engines and, ultimately, relational ones (DB2, Oracle, SQL Server).
Data
Data has always existed in two very broad flavours:
- Data that is treated as small, discrete packages and is a good fit with the relational way of storing and querying data
- Data that is not as above
Data is stored in tables
Each table has a name: Car

LicenceNo   Make      Model      Year   Color
CER 162C    Triumph   Spitfire   1965   Green
EF 8972     Bentley   Mk.VI      1946   Black
YSK 114     Bentley   Mk.VI      1949   Red

Data is atomic
The table is made up of columns and rows
Each row represents a unique entity in the ‘real’ world
Data
The manipulation typically consists of sub-setting the data by rows and columns and then maybe doing some sums:
SELECT Make, Model  -- chooses the columns
FROM Car
WHERE Year < 1947   -- chooses the rows
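The query above can be tried out with a minimal sketch using Python's built-in sqlite3 module. The table contents are the three cars from the slides; the column types and the choice of LicenceNo as primary key are assumptions made for illustration.

```python
import sqlite3

# In-memory database holding the Car table from the slides.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Car ("
    "LicenceNo TEXT PRIMARY KEY, "  # assumed key; the slides do not name one
    "Make TEXT, Model TEXT, Year INTEGER, Color TEXT)"
)
conn.executemany(
    "INSERT INTO Car VALUES (?, ?, ?, ?, ?)",
    [
        ("CER 162C", "Triumph", "Spitfire", 1965, "Green"),
        ("EF 8972", "Bentley", "Mk.VI", 1946, "Black"),
        ("YSK 114", "Bentley", "Mk.VI", 1949, "Red"),
    ],
)

# Subset by columns (Make, Model) and by rows (Year < 1947).
early_cars = conn.execute(
    "SELECT Make, Model FROM Car WHERE Year < 1947"
).fetchall()
print(early_cars)  # [('Bentley', 'Mk.VI')]
```

Only the 1946 Bentley satisfies the row condition, and only two of its five columns are returned.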
Data
Note that this kind of manipulation treats the data as atomic, which is fine, because the relational model assumes atomicity of data.
Note also that the rows are unordered.
Data
Data has always existed in two very broad flavours:
- Data that is inherently atomic and is a good fit with the relational way of storing and querying data
- Data that is not as above
Examples
Examples of ‘other’ data:
- Images
- Music
- Word docs
- Sensor data
- Web logs
- Twitter
- Machines
  - Point of Sale
  - Mass spectrometers
What’s in a name?
So, what do we call the ‘rest’?
- Un-structured?
- Semi-structured?
- Multi-structured?
- Non-relational?
- Non-tabular?
What about:
- Big data?
Other definitions?
VVVvvvv
- Volume
- Variety
- Velocity
- Value
- Very interesting
- Various other variations beginning with V…
Big Data – not new?
So why have we focused, for the last 30 years, almost exclusively on the first flavour?
Because it:
- is easy (relatively easy: Jim Gray*)
- represents a significant proportion of the available data
*Jim Gray and Andreas Reuter, Transaction Processing: Concepts and Techniques (1993)
Turing Award 1998
Big Data has come of age
Two factors have changed:
- Rise of the Machines
- Increase in data capture
There is a great synergy here:
- We are acquiring far more big data, and we now have the computational power to extract the information it contains
Big Data is hard
Of the 3 Vs, perhaps the most important is Variety
We often want to look inside the data
- Frequently non-atomic
- Need custom functions for virtually every operation
  - “Find the rotating wing aircraft in the image”
  - “Identify the best customer”
  - “What does the blogosphere think of our company?”
Big Data
Examples
- Log file
- Mass spec.
- Images
Big Data
The problem here is that the order of the rows is significant
We want to know which page views lead to other page views
Of course we CAN do that in SQL, but it may not be efficient to do so
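Why row order matters for a log file can be shown with a short procedural sketch. The log rows, the session identifiers and the `page_transitions` helper are all hypothetical; the point is only that the answer depends on sorting the rows and then comparing each row with its neighbour.

```python
from collections import Counter
from itertools import groupby
from operator import itemgetter

# Hypothetical log rows: (session_id, timestamp, page), deliberately out of order.
log_rows = [
    ("s1", 3, "/checkout"),
    ("s2", 1, "/home"),
    ("s1", 1, "/home"),
    ("s1", 2, "/product"),
    ("s2", 2, "/product"),
]

def page_transitions(rows):
    """Count how often one page view is directly followed by another in a session."""
    transitions = Counter()
    ordered = sorted(rows, key=itemgetter(0, 1))  # order by session, then by time
    for _, views in groupby(ordered, key=itemgetter(0)):
        pages = [page for _, _, page in views]
        # Pair each page with the one that follows it.
        transitions.update(zip(pages, pages[1:]))
    return transitions

transitions = page_transitions(log_rows)
print(transitions)
```

A relational engine can express the same pairing, for example with a self-join or a window function, but each row then has to be matched with its predecessor rather than simply filtered.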
Big Data
Examples
- Log file
- Mass spectrometer
- Image
Big Data
The problem here is that the order of the rows is significant (as before)
And the number of rows is likely to be overwhelming
SQL-MapReduce, Reduce Function
Data output from Mass Spectrometer

mz            intensity
335.2094368   0
335.2105961   0
335.2117553   0
335.2129146   53.024086
335.2140739   184.1607361
335.2152332   264.3601074
335.2163925   259.6187134
335.2175518   239.7870178
335.2187111   313.8243713
335.2198704   490.8760071
335.2210297   634.064209
335.222189    589.8432007
335.2233483   351.9743347
335.2245077   65.21440887
335.225671    0
336.890869    0
336.892037    75.75605011
336.893205    179.8110657
336.894373    247.535553
336.895541    225.6489563
336.8967091   140.6246338
337.1257588   0
337.1280972   86.48993683
337.1292664   170.0835876
337.1304357   215.8146362
337.1316049   188.9733276
337.1327741   110.2854233
337.1912444   0
337.192414    0
337.1935835   143.2112122
337.1947531   357.401123
337.1959227   467.1167297
337.1970923   411.569458
337.1982619   245.5514221
337.1994315   80.80451202
Detecting centroids of peaks is highly complex using SQL as it is not a set-based operation
Almost 800 lines of complex SQL
[Slide: overlapping excerpts of the SQL solution; only the legible fragments are reproduced here]
SELECT file_id ,scan_id ,ren_tm ,ms_lvl ,mz ,i AS n_i
,SUM(i) OVER (PARTITION BY file_id, ms_lvl, ren_tm ORDER BY mz ASC ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING) AS p_i
,(CASE WHEN (i > 0) THEN 1 ELSE 0 END) AS Ind
…
SELECT A.file_id ,A.ren_tm ,A.scan_id ,A.ms_lvl ,A.CurveNum ,A.Weighted_Peak_mz ,A.sum_i
,A.ren_tm - B.ren_tm AS Diff_Ren_Tm
,A.Weighted_Peak_mz - B.Weighted_Peak_mz AS Diff_WP
,B.CurveNum AS L_CurveNum
,B.Weighted_Peak_mz AS L_Weighted_Peak_mz
,B.ren_tm AS L_ren_tm
,B.sum_i AS L_Sum_I
FROM DD_STG.S2_WEIGHTED_CURVE AS A
INNER JOIN DD_STG.S2_WEIGHTED_CURVE AS B
ON (A.Weighted_Peak_mz - B.Weighted_Peak_mz) BETWEEN 0.00000 AND 1.000000
AND A.ren_tm = B.ren_tm
AND A.CurveNum <> B.CurveNum
AND B.max_i > (0.66667 * A.max_i)
…
LEFT JOIN DD_TAB.CHARGE_STATES AS C
ON CAST(J.Diff_WP AS DECIMAL(18,2)) = CAST(C.chrg_mz_diff AS DECIMAL(18,2))
Procedural code uses 2 loops for same result
while (inputIterator.advanceToNextRow()) {
    currIntensity = inputIterator.getDoubleAt(5);
    maxIntensity = 0.0;
    // Initialise Temp Array
    for (int i = 0; i <= 50; i++) {
        curveArray[0][i] = 0;
        curveArray[1][i] = 0;
    }
    if (overlapFlag == 1) {
        count = 1;
    } else {
        count = 0;
    }
    // Find start of Curve: lastIntensity is 0, or previous lastIntensity is higher
    // than lastIntensity (overlapping peaks, i.e. a double peak curve)
    if (currIntensity > 0 && lastIntensity == 0 || overlapFlag == 1) {
        // Populate Temp Array with Curve points and find maxIntensity to derive threshold
        while (currIntensity > 0) {
            if (maxIntensity < currIntensity)
                maxIntensity = currIntensity;
            if (overlapFlag == 1) {
                overlapFlag = 0;
                curveArray[0][count - 1] = overlapMZ;
                curveArray[1][count - 1] = overlapIntensity;
                PI = overlapIntensity;
                currIntensity = inputIterator.getDoubleAt(5);
            }
            curveArray[0][count] = inputIterator.getDoubleAt(4);
            curveArray[1][count] = inputIterator.getDoubleAt(5);
            count++;
            inputIterator.advanceToNextRow();
            PI2 = PI;
            PI = currIntensity;
            currIntensity = inputIterator.getDoubleAt(5);
            if (currIntensity > PI && PI2 > PI) {
                // Overlapping Peak found, store MZ and Intensity and start new Curve for next Iteration
                overlapFlag = 1;
                overlapMZ = inputIterator.getDoubleAt(4);
                overlapIntensity = inputIterator.getDoubleAt(5);
                break;
            }
        }
        // Process Temp Array to create intermediate metrics
        while (curveArray[1][curveCount] > 0) {
            if (curveArray[1][curveCount] > intensityThreshold) {
                if (maxMZ < curveArray[0][curveCount]) {
                    maxMZ = curveArray[0][curveCount];
                }
                if (minIntensity > curveArray[1][curveCount] || minIntensity == 0) {
                    minIntensity = curveArray[1][curveCount];
                }
                if (minMZ > curveArray[0][curveCount] || minMZ == 0) {
                    minMZ = curveArray[0][curveCount];
                }
                sumIntensity = sumIntensity + curveArray[1][curveCount];
                sumMZ = sumMZ + curveArray[0][curveCount];
                sumMZByIntensity = sumMZByIntensity + (curveArray[0][curveCount] * curveArray[1][curveCount]);
                curvePoints++;
            }
            curveCount++;
        }
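The two-loop idea can be illustrated with a much simplified Python sketch. It is not the function shown on the slides: it assumes a peak is simply a run of non-zero intensities, takes the centroid to be the intensity-weighted mean of mz, and ignores the overlapping-peak and intensity-threshold handling that the Java excerpt performs. The readings are a subset of the mass spectrometer output shown earlier, and `peak_centroids` is a hypothetical name.

```python
# A few (mz, intensity) readings copied from the mass spectrometer output.
readings = [
    (336.890869, 0.0),
    (336.892037, 75.75605011),
    (336.893205, 179.8110657),
    (336.894373, 247.535553),
    (336.895541, 225.6489563),
    (336.8967091, 140.6246338),
    (337.1257588, 0.0),
    (337.1280972, 86.48993683),
    (337.1292664, 170.0835876),
    (337.1304357, 215.8146362),
    (337.1316049, 188.9733276),
    (337.1327741, 110.2854233),
    (337.1912444, 0.0),
]

def peak_centroids(rows):
    """Return one intensity-weighted mean mz per run of non-zero intensities."""
    centroids = []
    i = 0
    while i < len(rows):  # outer loop: look for the start of a curve
        if rows[i][1] <= 0:
            i += 1
            continue
        sum_intensity = 0.0
        sum_mz_by_intensity = 0.0
        while i < len(rows) and rows[i][1] > 0:  # inner loop: walk along the curve
            mz, intensity = rows[i]
            sum_intensity += intensity
            sum_mz_by_intensity += mz * intensity
            i += 1
        centroids.append(sum_mz_by_intensity / sum_intensity)
    return centroids

centroids = peak_centroids(readings)
print(centroids)
```

The rows must arrive sorted by mz for this to work, which is exactly the dependence on row order that makes the set-based SQL version so long.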
Custom analysis and custom visualisation: vital tools in understanding big data
Big Data
Examples
- Log file
- Mass spec.
- Images
Big Data
The problem here is that this data is not atomic
A picture is worth a thousand words
In other words, we don’t know what question we may ask in the future
Summary so far…
Just as you can always fit an aircraft engine into a car
chassis, you can always put Big Data in a table, but it
reaches a stage where it is no longer the most effective
solution to do so
The analysis is not sub-setting the data by rows and columns
We are often interested in order
Each class of big data usually requires a (lovingly hand-crafted) custom analysis
Big Data
Which means that we are going to be storing the data
without imposing a schema upon it – in other words, we
are going to be storing the data “schema-less”
Paradox
The relational model guarantees that any question can be asked of the data and that a consistent answer will be delivered.
How does that work?
Big data doesn’t impose a schema on the data; the data is stored schema-less. One reason for this is that a schema would restrict the questions that we can ask of the data.
How does that work?
Paradox
The paradox is that we are saying that:
- if we impose a strict schema we can ask (and answer) any question
- if we impose no schema we can ask (and answer) any question
Paradox
The relational model doesn’t restrict the questions that can be asked of data
This is essentially true as long as (note the qualification there!):
- we are treating the data as atomic
- we are not looking for order in the rows
We can subset by row and column and we can do difficult sums on the end result.
So, does the relational model allow us to drill inside the atomic data?
Well, no, but relational database engines often do
The model assumes atomic data
A query that finds all the last names of the employees paid more than $40,000 is
relational
A query that finds all the employees where the third letter of their last name is ‘c’ is not
relational
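The distinction between these two queries can be seen in a small sketch using Python's built-in sqlite3 module. The Employee table, its column names and its rows are invented for illustration; `substr` is a function supplied by the SQLite engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical Employee table; names and salaries are made up for illustration.
conn.execute("CREATE TABLE Employee (LastName TEXT, Salary INTEGER)")
conn.executemany(
    "INSERT INTO Employee VALUES (?, ?)",
    [("Black", 52000), ("Jackson", 38000), ("Smith", 41000), ("Jones", 30000)],
)

# Treats every value as atomic: choose rows by Salary, choose the LastName column.
well_paid = conn.execute(
    "SELECT LastName FROM Employee WHERE Salary > 40000 ORDER BY LastName"
).fetchall()

# Looks inside a value: substr() comes from the engine, not from the relational model.
third_letter_c = conn.execute(
    "SELECT LastName FROM Employee "
    "WHERE substr(LastName, 3, 1) = 'c' ORDER BY LastName"
).fetchall()

print(well_paid)       # [('Black',), ('Smith',)]
print(third_letter_c)  # [('Jackson',)]
```

Both queries run, but only the first stays within the model's assumption of atomic data; the second relies on the engine being willing to open up a value.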
Paradox
So, if the data is atomic and is treated as atomic, and order is unimportant, then the relational model allows any question to be asked
Paradox
Storing data in an ‘unstructured’ way allows you to ask any question of the data
This is essentially true as long as (note the qualification here!):
- we are prepared to design new functions for every new type of query that we want to run
So, imagine that you have some satellite images. They are stored without a schema being imposed by the database engine
- We want to find rotating wing aircraft, so we write an algorithm that does that
- Now we want to find all the penguins; we need another custom algorithm
- So, a schema-less database allows any question to be asked of the data, as long as we are prepared to write a new custom algorithm for each new type of query
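A toy sketch of this trade-off follows. The store, the blobs and both functions are hypothetical stand-ins: real image analysis such as finding aircraft or penguins is far harder, but the shape is the same. The store holds uninterpreted bytes, and every new kind of question needs its own custom function.

```python
# Hypothetical schema-less store: each key maps to an uninterpreted blob of bytes.
store = {
    "img-001": bytes([0, 0, 200, 255, 0, 10]),
    "img-002": bytes([5, 5, 5, 5, 5, 5]),
    "img-003": bytes([255, 250, 0, 0, 0, 0]),
}

def has_bright_region(blob, threshold=200):
    """Custom function for one question: is any byte at or above the threshold?"""
    return any(value >= threshold for value in blob)

def mean_level(blob):
    """Custom function for a different question: the average byte value."""
    return sum(blob) / len(blob)

def query(blobs, custom_function):
    """The store only hands over raw blobs; all interpretation lives in the function."""
    return {key: custom_function(blob) for key, blob in blobs.items()}

bright = [key for key, hit in query(store, has_bright_region).items() if hit]
levels = query(store, mean_level)
print(bright)  # ['img-001', 'img-003']
```

Nothing in the store had to change between the two questions; all of the new work went into writing the second function.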
Case Study
Oil Rig data
Gone fishing
Sensor data
Case Study
Twitter
Who loves you?
Social/text/sentiment
Case Study
Big Data in the Life Sciences World
The massed spectrometers
Why would anyone do that?
Lessons learned
Engagement
Choose your battles: look for an area where you can gain competitive advantage
Choose your platform carefully
Programming (algorithm development)
Data scientists
- Custom algorithms
- Custom visualisations
Questions?
Thank You for Attending