
Medical Diagnosis for Cardiac Arrests using Data
Mining and Bayes’ Theorem
Dr. Tariq Mohammed Saied Al Tayee
Mr. Viswan Vimbi
Department of Information Technology
College of Applied Sciences
Sohar, Sultanate of Oman
[email protected]
Department of Information Technology
College of Applied Sciences
Sohar, Sultanate of Oman
[email protected]
Abstract—Data Mining is the process of employing one or more
techniques to automatically analyze and extract knowledge from
data contained within a database. It can be used in all branches
of industry, including health care management, where data mining
on a voluminous medical data set enables prediction or early
detection of a condition such as cardiac arrest.
The goal of this paper is to show that a data mining
technique using Bayes’ Theorem helps in predicting cardiac
disorders. Bayes’ Theorem is a technique for estimating the
likelihood of a property given a set of data as input.
Medical attributes such as age, sex, blood pressure and blood
sugar help in early detection or prediction of cardiac problems
in patients. Medical practitioners usually take decisions based on
their intuition and experience rather than on the knowledge-rich
data hidden in the database. This affects the quality of service
provided to patients. The technique used here tries to establish
that data mining extracts significant knowledge from medical
data sets related to cardiac disorders.
The results obtained show that, with a system designed using
Bayes’ Theorem and supported by databases of patient records, a
knowledge-rich environment can be created that helps to
significantly improve the quality of clinical diagnosis through
early detection or prediction of cardiac disorders.
Numerous past studies have described the significance of data
mining in the detection of heart-related diseases using Bayes’
Theorem. The results of these studies differ from one another in
various ways. Some of this work is discussed in this paper.
Palaniappan and Awang (2008) designed a model called the
Intelligent Heart Disease Prediction System (IHDPS) using the
data mining technique of Bayes’ Theorem; the results met the
objective of the tool by helping to determine important
information related to heart disease.
II. BACKGROUND
A. Cardiac Disorders
Cardiac disorders are diseases that affect the cardiovascular
system, that is, the heart or blood vessels [1]. These disorders
remain the biggest cause of death around the world and usually
affect older adults. To prevent them, it is essential to pay due
attention to detection and treatment [2]. Data mining tools,
particularly Bayes’ theorem, have therefore been used to detect
cardiac disorders [3].
Keywords—data Mining, Bayes Theorem, K fold, supervised
learning, classification, clustering
I. INTRODUCTION
B. Bayes’ Theorem
Bayes’ Theorem, from probability theory, is used to calculate
the chance that an instance belongs to a target class given a
summary (or conjunction) of attribute values. According to this
theorem, the predicted class for a tested instance, defined by a
set of attribute values V = v1 ∧ v2 ∧ … ∧ vn, is the one with the
highest conditional probability. The theorem is stated as follows:
Data mining refers to the ability to analyze data sets in order
to gain significant and in-depth knowledge about any field of
interest. It is an influential technological method that helps
the medical industry to use the relevant and important
information within its data warehouses more effectively. It
can be applied in defense, medicine, science and other
industries [1, 5, 7, 8, 9]. Statistical methods such as Bayes’
Theorem are used extensively in data mining processes and
provide the chance to revise the probabilities of events as more
information is obtained. This method is used as a tool for the
detection of cardiac arrests in modern health care centers.
Cardiac arrest, also known as circulatory arrest, can be
explained as a sudden halt in blood circulation due to the
failure of the heart to contract [11]. The impact of a cardiac
arrest is, in some cases, reversible if treated early. In most
cases, however, cardiac arrest leads to death within minutes.
In the theorem, T is the hypothesis to be tested and E is the
evidence that confirms or disconfirms it [2]. For any
proposition S, P(S) is the degree of subjective probability that
S is true, and P(T) is the estimate of the probability of the
hypothesis before the new piece of evidence is considered, which
is known as the prior probability of T [3].
The advanced data mining technique in the form of Bayes’ theorem
helps to prevent these situations, as in the Intelligent Heart
Disease Prediction System, which uses Naive Bayes, a method based
on Bayes’ theorem. It supports the analysis of complex medical
data such as age, sex, blood pressure and blood sugar, which can
effectively predict the likelihood of patients developing
cardiac disorders [2]. Medical data sets with significant
variables such as age, sex, blood pressure and blood sugar are
helpful in the early prediction of cardiac problems in patients.
Moreover, the technique allows significant knowledge, for
example patterns and relationships between medical factors
related to heart disease, to be established.
For example, the terms of the equation can be read as follows:
P(T|E) = chance of having a cardiac disorder (T) given a positive symptom (E).
P(E|T) = chance of a positive symptom (E) given that the patient has a cardiac disorder.
P(T) = chance of having a cardiac disorder.
P(¬T) = chance of not having a cardiac disorder.
P(E|¬T) = chance of a positive symptom (E) given that the patient does not have a cardiac disorder.
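The displayed equation itself is not reproduced in this copy. Assuming the five quantities listed above are the ones it uses, the standard form of Bayes’ theorem that combines them can be written as below. This is a reconstruction from those definitions, not necessarily the authors’ typeset formula.

```latex
P(T \mid E) = \frac{P(E \mid T)\,P(T)}
                   {P(E \mid T)\,P(T) + P(E \mid \neg T)\,P(\neg T)},
\qquad P(\neg T) = 1 - P(T)
```

The denominator is P(E), the overall chance of observing the symptom, expanded over the two cases T and ¬T.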
E. Data Processing
Cleaning and filtering data from a huge amount of data is quite
complex, and accuracy is of great importance when processing
disease information [1]. To make the data suitable for the data
mining process, it must be transformed effectively; in the
Bayes’ data mining system, the data is converted into reliable
data sets with suitable characteristics [2]. The heart disease
data warehouse is refined by removing duplicate records so that
it yields accurate outcomes, and is thereby made suitable for
clustering of the cardiac disease data [3].
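The cleaning step described above can be illustrated with a minimal sketch. It assumes each patient record is a Python dictionary; the function name clean_records, the field names and the sample values are hypothetical and are not taken from the paper, which does not specify how the cleaning was implemented.

```python
def clean_records(records, required_fields):
    """Drop records with missing required values, then drop duplicates.

    Two records count as duplicates when they agree on every required field.
    """
    seen = set()
    cleaned = []
    for record in records:
        values = tuple(record.get(field) for field in required_fields)
        # Skip records where any required attribute is absent or empty.
        if any(value is None or value == "" for value in values):
            continue
        # Skip records already seen with the same attribute values.
        if values in seen:
            continue
        seen.add(values)
        cleaned.append(record)
    return cleaned


# Hypothetical sample data, for illustration only.
raw = [
    {"age": 57, "sex": 1, "blood_pressure": 130},
    {"age": 57, "sex": 1, "blood_pressure": 130},   # duplicate record
    {"age": 43, "sex": 0, "blood_pressure": None},  # missing value
    {"age": 61, "sex": 0, "blood_pressure": 140},
]
cleaned = clean_records(raw, ["age", "sex", "blood_pressure"])
print(len(cleaned), "records kept out of", len(raw))
```

Only the first and last records survive: one is dropped as a duplicate and one for its missing blood pressure value.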
The applications of the theorem are widespread and are not
restricted to the financial and mathematical fields [2]. In
particular, Bayes’ theorem can be used to assess how far a
medical test result can be trusted, by combining how likely a
person is to have the disease with the general accuracy of the
test [6].
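As a worked illustration of this use of the theorem, the sketch below computes the chance of disease after a positive test from the prior chance of disease and the accuracy of the test. The function name posterior and all numeric values are assumptions chosen for illustration; they are not figures from the paper.

```python
def posterior(prior, sensitivity, false_positive_rate):
    """Return P(T|E) from P(T), P(E|T) and P(E|not T) using Bayes' theorem."""
    for name, value in (("prior", prior),
                        ("sensitivity", sensitivity),
                        ("false_positive_rate", false_positive_rate)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be a probability in [0, 1]")
    # P(E): overall chance of a positive result, over both T and not T.
    evidence = sensitivity * prior + false_positive_rate * (1.0 - prior)
    if evidence == 0.0:
        raise ValueError("a positive result is impossible under these inputs")
    return sensitivity * prior / evidence


# Hypothetical inputs: 1% of patients have the disorder, the test detects
# 90% of true cases and wrongly flags 5% of healthy patients.
p = posterior(prior=0.01, sensitivity=0.90, false_positive_rate=0.05)
print(f"P(disorder | positive test) = {p:.3f}")
```

With these assumed inputs the posterior is about 0.154, which shows why the prior chance of disease matters as much as the accuracy of the test.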
C. Early Detection of Important Patterns related to Cardiac
Disease from Heart Disease Data Warehouse using Bayes’
Theorem
From its beginning, data mining has been guided by the need to
solve practical issues related to cardiac disease [1]. Data
concerning cardiovascular factors and diseases is stored in very
large databases, and hence it is essential to obtain accurate
results from the analysis. Bayes’ Theorem is widely used as an
effective data mining analysis to support medical
decision-making concerning classification and diagnosis. There
are several situations in which the decisions are required to be
reliable and effective [2].
III. EXPERIMENT METHODOLOGY
We have taken 14 attributes from the medical dataset [10], as
shown in Table 1, with chronic disorder being the diagnosis
attribute.
One research study [3] observes that Bayes’ networks are used
extensively to detect the varying patterns of cardiac disease in
huge volumes of disease data, as they provide diagnostic
reasoning by making probabilistic inferences about the disease
under conditions of uncertainty [3]. The Intelligent Heart
Disease Prediction System uses the Naive Bayes data mining
technique on the .NET platform in order to obtain accurate
predictions. According to the research conducted by Soni et al.
[4], Bayes’ theorem provides 86.53% accuracy in predicting
patterns of cardiac disease.
D. Identification of the significance of Medical Data Set
Problems
The health care industry collects large amounts of healthcare
data that unfortunately are not mined to discover hidden
information for efficient decision-making. Discovering the
hidden patterns, and the relationships among them, in the
medical data sets is significant for the analysis of cardiac
diseases [3, 6]. Hence, the advanced data mining
Sl. No.  Attribute            Description
1        Age                  In years
2        Sex                  Male = 1, Female = 0
3        Chest Pain           Types of chest pain
4        Blood Pressure       Blood pressure taken during rest
5        Cholesterol          mg/dl
6        Fasting Blood Sugar  true = 1 when > 120 mg/dl, false = 0
7        ECG                  Electrocardiographic results during rest
8        Heart Rate           Maximum heart rate
9        Induced Angina       1 if experiencing angina, 0 if not
10       Old Peak             0 = no depression, 1 = depression
11       Slope                Slope of peak exercise ST segment
12       Thal                 Value 3: normal, value 6: fixed defect, value 7: reversible defect
13       CA                   Number of major vessels colored by fluoroscopy (value 0-3)
14       Chronic disorder     Angiographic disease status
Table 1: Attributes of Heart Disease Data Sets
the variable ‘age’ with the range of 57 years, variable ‘sex’ of
participants of study, the variable blood sugar with the range
of 118, the variable ‘chest pain’ of individuals with the range
of 71, the variable blood pressure with 49 and heart rate of 58
are suffering from chronic disorder [4].
To analyze early prediction of the occurrence of cardiac
arrests, the attributes age, sex, blood pressure, blood sugar,
chest pain and heart rate are fed to the Naive Bayes classifier,
which in turn applies Bayes’ theorem for analysis and decision
making. The significance of the Naive Bayes classifier is that
it provides accurate and reliable results from a small amount of
training data. Only categorical attributes were used, and for
simplicity the number of attributes (in table 1) was reduced to
7 (table 2). The reduced dataset is fed to the classification
model using K-fold cross validation.
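The paper does not state the value of K or how the folds were formed (the experiments were run in Weka). As a hedged sketch of the general idea, the following splits record indices into K folds, each of which serves once as the test set while the remaining folds form the training set. The function name k_fold_indices, the choice K = 10 and the fixed seed are assumptions made for illustration.

```python
import random


def k_fold_indices(n_records, k, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross validation."""
    if not 2 <= k <= n_records:
        raise ValueError("k must be between 2 and the number of records")
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, so fold sizes differ by at most 1.
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# 200 records as in the experiment; k = 10 is an assumed value.
splits = list(k_fold_indices(n_records=200, k=10))
for fold_number, (train, test) in enumerate(splits, start=1):
    print(f"fold {fold_number}: {len(train)} training, {len(test)} test records")
```

Every record appears in exactly one test fold, so each of the 200 records is used for evaluation once and for training K - 1 times.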
Sl. No.  Attribute            Description
1        Age                  In years
2        Sex                  Male = 1, Female = 0
3        Chest Pain           Types of chest pain
4        Blood Pressure       Blood pressure taken during rest
5        Fasting Blood Sugar  true = 1 when > 120 mg/dl, false = 0
6        Heart Rate           Maximum heart rate
7        Chronic disorder     Angiographic disease status
Table 2: Reduced Attributes list
The data set of 200 records (150 males and 50 females) with
the 6 attributes is used for the experiment. The diagnosis
attribute, chronic disorder, is the class identifier: the value
0 indicates no cardiac ailment, while the value 1 indicates the
presence of a cardiac ailment.
Figure 3 Pre-processed using Weka tool
The classification used is a supervised learning method that
extracts a model describing the significant data classes or
predicts future trends; this method is widely used in pattern
recognition and artificial intelligence and is widely
recognized in medical diagnosis.
Hence, the study uses Naive Bayes classification through
clustering in order to diagnose the presence of heart disease in
patients, as Naive Bayes has been observed to perform with a
good prediction probability of 95% using different attributes
[2].
In this study the classifier assumes that the attributes are
independent; learning speed and classification speed are the
significant advantages of the Bayes classifier.
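Under the independence assumption, the score of a class c for attribute values v1, …, vn is proportional to P(c) multiplied by each P(vi | c), and the predicted class is the one with the highest score. The sketch below is a minimal categorical Naive Bayes written for illustration; it is not the Weka implementation used in the experiments. The class name, the Laplace smoothing parameter alpha and the toy records are assumptions.

```python
import math
from collections import Counter, defaultdict


class CategoricalNaiveBayes:
    """Naive Bayes for categorical attributes with Laplace smoothing."""

    def __init__(self, alpha=1.0):
        if alpha <= 0:
            raise ValueError("alpha must be positive")
        self.alpha = alpha

    def fit(self, rows, labels):
        self.class_counts = Counter(labels)
        # value_counts[(attribute_index, label)][value] -> number of records
        self.value_counts = defaultdict(Counter)
        self.attribute_values = [set() for _ in rows[0]]
        for row, label in zip(rows, labels):
            for i, value in enumerate(row):
                self.value_counts[(i, label)][value] += 1
                self.attribute_values[i].add(value)
        return self

    def log_scores(self, row):
        """Return log(P(c) * product of P(v_i | c)) for every class c."""
        total = sum(self.class_counts.values())
        scores = {}
        for label, count in self.class_counts.items():
            score = math.log(count / total)
            for i, value in enumerate(row):
                seen = self.value_counts[(i, label)][value]
                n_values = len(self.attribute_values[i])
                score += math.log((seen + self.alpha)
                                  / (count + self.alpha * n_values))
            scores[label] = score
        return scores

    def predict(self, row):
        scores = self.log_scores(row)
        return max(scores, key=scores.get)


# Hypothetical toy records: (sex, chest pain type, fasting blood sugar).
rows = [
    ("male", "typical", "high"),
    ("male", "typical", "normal"),
    ("male", "atypical", "high"),
    ("female", "none", "normal"),
    ("female", "none", "normal"),
    ("female", "atypical", "normal"),
]
labels = [1, 1, 1, 0, 0, 0]  # 1 = chronic disorder present, 0 = absent

model = CategoricalNaiveBayes().fit(rows, labels)
print(model.predict(("male", "typical", "high")))
print(model.predict(("female", "none", "normal")))
```

Training only counts attribute values per class and prediction only multiplies a handful of probabilities, which is the source of the learning and classification speed mentioned above. Logarithms are summed instead of multiplying probabilities to avoid numerical underflow when there are many attributes.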
Fig 4(a)
IV. RESULTS AND ANALYSIS
Experiments were conducted with the Weka tool. The selected
data set was pre-processed and filtered with supervised
classification to obtain the diagnostic classifier (fig 3), and
figures 4(a) to 4(g) visualize the 6 attributes influenced by
the diagnostic classifier, chronic disorder. From the figures
Fig 4(b)
Figures 4(a) to 4(g) depict the influence of
the diagnostic classifier (chronic disorder)
on the attributes sex, age, blood pressure,
blood sugar, chest pain and heart rate.
analysis was made from a sample of 200 patients, from whom
information on age, sex, blood sugar, blood pressure, chest pain
and heart rate was collected. The data was entered into the data
mining software. The findings indicate that if the patient’s age
ranges up to 57 years, the patient is male, the blood pressure
level is 49, the blood sugar level is 118, chest pain equals 71
and heart rate equals 58, then there is a strong possibility
that the patient has a cardiac disorder. Furthermore, these
findings indicate that the value of data mining should not be
underestimated in detecting cardiac and other chronic disorders
so that precautions can be taken. The graphs and tables
additionally helped in predicting the influence of one
characteristic of the individual on another and, comparatively,
the aggregate impact of all the individual characteristics on
the disorder [1, 2, 6].
Fig 4(c)
Fig 4(d)
The above results for the patients’ data concerning age, sex,
blood pressure, blood sugar, chest pain, heart rate and chronic
disorder patterns, with a correlation coefficient of 0.8532,
demonstrate the accuracy of the results obtained from the
patients’ data using Bayes’ theorem. They demonstrate the
significant impact of age, sex, blood pressure, blood sugar,
chest pain, heart rate and chronic disorder factors on the
decision of cardiac arrest.
V. CONCLUSION
Fig 4(e)
It is concluded from the findings of this study that data
mining is an effective process of applying different techniques
to automatically evaluate and extract the desired reliable data
from a database. It helped in the early detection of the
disease, cardiac arrest, from a large amount of data. The
paper demonstrated that using the data mining technique Naive
Bayes in the analysis of heart disease patients assists in
predicting cardiac disorders.
Fig 4(f)
The analysis of the patients’ data concerning age, sex, blood
pressure, blood sugar, chest pain and chronic disorder patterns
records the correlation results obtained from the patients’ data
using Bayes’ theorem, where the theorem demonstrates the
accuracy of the predicted decision. However, the results do not
show the intensity of cardiac arrest or pain in a prediction. In
future work we intend to study fuzziness in mining data for the
range of intensity of cardiac pain or arrest.
REFERENCES
[1] S. B. Patil & Y. S. Kumaraswamy (2009), Intelligent and Effective Heart
Attack Prediction System Using Data Mining and Artificial Neural
Network, European Journal of Scientific Research, 31(4), pp.642-56.
[2] S. Palaniappan & R. Awang (2008), Intelligent Heart Disease Prediction
System Using Data Mining Techniques, IJCSNS International Journal of
Computer Science and Network Security, 8(8).
Fig 4(g)
It can be seen from the above results that attention paid to
the various patient attributes can help physicians in
determining the impact of cardiac disorders. The above
[3] D.-A. Sitar-Taut & A.-V. Sitar-Taut (2010), Overview on How Data
Mining Tools May Support Cardiovascular Disease, Journal of Applied
Computer Science & Mathematics, 8(4), pp.1-24.
[4] J. Soni, U. Ansari & D. Sharma (2011), Predictive Data Mining for
Medical Diagnosis: An Overview of Heart Disease Prediction,
International Journal of Computer Applications, 17(8), pp.43-48.
[5] Frank Lemke, Johann-Adolf Mueller (2003), Medical data analysis using
self-organizing data mining technologies, Systems Analysis Modelling
Simulation, 43(10), pp.1399-1408.
[6] Latha Parthiban & R. Subramanian (2008), Intelligent Heart Disease
Prediction System using CANFIS and Genetic Algorithm, International
Journal of Biological, Biomedical and Medical Sciences, 3(3).
[7] I. H. Witten & E. Frank (2005), Data Mining: Practical Machine Learning
Tools and Techniques, Morgan Kaufmann Publishers, San Francisco.
[8] J. Han & M. Kamber (2006), Data Mining: Concepts and Techniques,
Morgan Kaufmann Publishers, San Francisco.
[9] M. H. Dunham (2003), Data Mining: Introductory and Advanced Topics,
Pearson Education, United States.
[10] A. Asuncion & D. J. Newman (2007), UCI Machine Learning
Repository
[http://www.ics.uci.edu/~mlearn/MLRepository.html]. Irvine, CA:
University of California, School of Information and Computer Science.
[11] Jameson J.N. St. C, Dennis L Kasper, Harrison Tinsley Randolph,
Braunwald Eugene, Fauci Anthony S, Hauser Stephen L, Longo Dan L
(2005), Harrison’s Principles of Internal Medicine, New York, McGraw-Hill Medical Publication Division, ISBN 0-07-140235-7.