Australian Journal of Information Technology and Communication Volume II Issue I
ISSN 2203-2843
A novel algorithm applied to filter spam e-mails
using Machine Learning Techniques
Maninder Singh#1, Ranjan Sharma*2
#1 Guru Nanak Dev University, Amritsar (Punjab)
Corresponding author: [email protected]
*2 [email protected]
Abstract— Email spam is one of the major problems of today's
Internet, causing financial damage to companies and annoying
individual users. Among the approaches developed to stop spam,
filtering is an important and popular one. In this paper, a novel
algorithmic approach is applied to filter spam emails. The
Spambase dataset, obtained from the UCI repository, is used to
evaluate its performance. The results show that the proposed
algorithm outperforms other existing algorithms.
Keywords—Spam mail-filter, WEKA, MATLAB, mail classification,
privacy protection, information security, Machine learning, Data
mining, Decision trees, clustering
I. INTRODUCTION
Email has become one of the fastest and most economical
forms of communication in all aspects of everyday life [1].
However, its usefulness is being diminished by the sheer growth
in the volume of email. Nowadays, a typical user receives
about 20-40 email messages every day. Mass unsolicited
electronic mail, often known as spam, has recently increased
enormously and has become a serious threat to society as well as
to the Internet. The flood of spam consumes not only computing,
storage and network resources but also the human time and
attention needed to dismiss unwanted emails. Thus, users spend a
significant part of their working time processing email. Since
the cost of spam is borne mostly by the recipient, many
individuals and businesses send bulk messages in the form of
spam [2]. Not only is spam flooding our mailboxes, but locating
important and vital information among the huge number of emails
has turned into a laborious and time-consuming daily activity.
The amount of spam sent over the Internet has been rising
dramatically in recent years, and no decline is expected in the
near future. Therefore, email management is an important and
growing problem for individuals and organizations.
II. RELATED RESEARCH
Jindal and Liu (2007) [4] observed that mining opinions from
product reviews, forum posts and blogs is an important
research topic with many applications. Existing research had
focused on the extraction, classification and summarization of
opinions from these sources. They studied review spam in the
context of product reviews and noted that, at the time, there was
no published study on this topic, although Web page spam and
email spam had been investigated extensively. Review spam is
quite different from Web page spam and email spam, and thus
requires different detection techniques.
III. PROBLEM STATEMENT
In this e-world, most transactions and business take place
through email [3]. Email has become a powerful tool for
communication because it saves a great deal of time and cost.
However, owing to social networks and advertisers, most emails
contain unwanted information, called spam. Even though many
algorithms have been developed for email spam classification,
none of them achieves 100% accuracy in classifying spam emails.
Spam, also known as Unsolicited Commercial Email, is generated
by sending unsolicited commercial messages to many recipients
without their permission. Spammers use a computer program to
check almost every website on the Internet. The program scans
the code of every web page for email addresses and saves each
address it finds to the spammer's database of millions of
harvested addresses.
IV. INTRODUCTION TO DATA MINING
Data mining is a powerful new technology with great potential
to help companies focus on the most important information in the
data they have collected about the behaviour of their customers
and potential customers. It discovers information within the data
that queries and reports can't effectively reveal.
Generally, data mining (sometimes called data or knowledge
discovery) is the process of analyzing data from different
perspectives and summarizing it into useful information:
information that can be used to increase revenue, cut costs, or
both. Data mining software is one of a number of analytical tools
for analyzing data. It allows users to analyze data from many
different dimensions or angles, categorize it, and summarize the
relationships identified. Technically, data mining is the process
of finding correlations or patterns among dozens of fields in
large relational databases.
V. ROLE OF DATA MINING IN VARIOUS FIELDS
Although data mining is still in its infancy, companies in a
wide range of industries - including retail, finance, health care,
manufacturing, transportation, and aerospace - are already using
data mining tools and techniques to take advantage of historical
data. By using pattern recognition technologies and statistical
and mathematical techniques to sift through warehoused
information, data mining helps analysts recognize significant
facts, relationships, trends, patterns, exceptions and anomalies
that might otherwise go unnoticed.
For businesses, data mining is used to discover patterns and
relationships in the data in order to help make better business
decisions. Data mining can help spot sales trends, develop
smarter marketing campaigns, and accurately predict customer
loyalty. Specific uses of data mining include:
1. Market segmentation: Identify the common characteristics
of customers who buy the same products from your company.
2. Customer churn: Predict which customers are likely to
leave your company and go to a competitor.
3. Fraud detection: Identify which transactions are most likely
to be fraudulent.
4. Direct marketing: Identify which prospects should be
included in a mailing list to obtain the highest response rate.
5. Interactive marketing: Predict what each individual
accessing a Web site is most likely interested in seeing.
6. Market basket analysis: Understand what products or
services are commonly purchased together; e.g., beer and
diapers.
7. Trend analysis: Reveal the difference between a typical
customer this month and last.
VI. PROPOSED ALGORITHM
The following steps are included in the proposed algorithm:
Algorithm: Decision Tree:
Algorithm: Generate_decision_tree. Generate a decision tree
from the training tuples of data partition D.
Input:
Step 1: Data partition, D, which is a set of training tuples and
their associated class labels;
Step 2: Attribute_list, the set of candidate attributes;
Step 3: Attribute_selection_method, a procedure to determine
the splitting criterion that “best” partitions the data tuples into
individual classes. This criterion consists of a splitting_attribute
and, possibly, either a split point or a splitting subset.
Output: A decision tree.
Method:
Step 1: Create a node N;
Step 2: If tuples in D are all of the same class, C, then
Step 3: Return N as a leaf node labelled with the class C;
Step 4: If attribute_list is empty then
Step 5: Return N as a leaf node labelled with the majority class in
D; // majority voting
Step 6: Apply attribute_selection_method to find the “best”
splitting_criterion;
Step 7: Label node N with splitting_criterion;
Step 8: If splitting_attribute is discrete valued and
Step 9: Attribute_list - splitting_attribute;
Step 10: For each outcome j of splitting_criterion
Step 11: Let Dj be the set of data tuples in D satisfying the
outcome j;
Step 12: If Dj is empty then
Step 13: Attach a leaf labelled with the majority class in D to
node N;
Step 14: Else attach the node returned by generate_decision_tree
to node N;
Step 15: Return N;
VII. RESULTS AND DISCUSSION
First, the bar graph shows a comparative analysis of the
different algorithms in terms of the percentage of instances,
obtained using WEKA. The plot shows that the proposed J48
algorithm has the lowest error and the highest accuracy. The
percentage of correctly classified instances is highest for the
proposed J48 algorithm. The outcome is shown below as Fig. 1.
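For readers who wish to experiment, the Generate_decision_tree
procedure of Section VI can be sketched in a few lines of Python.
This is an illustrative sketch only, not the implementation used
in this work: it handles discrete attributes with multiway
splits, and it assumes information gain as the
attribute_selection_method (WEKA's J48 is a C4.5 implementation,
which uses the related gain ratio). The helper names, the
`domains` argument and the toy features `has_free` and
`has_dollar` are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, labels, attribute):
    """Entropy reduction obtained by splitting on a discrete attribute."""
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attribute], []).append(label)
    remainder = sum(len(p) / len(labels) * entropy(p) for p in partitions.values())
    return entropy(labels) - remainder


def generate_decision_tree(rows, labels, attribute_list, domains):
    """Return a class label (leaf) or an (attribute, {outcome: subtree}) pair."""
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1:      # Steps 2-3: all tuples share one class
        return labels[0]
    if not attribute_list:         # Steps 4-5: majority voting
        return majority
    # Step 6: attribute selection (information gain is an assumption here)
    best = max(attribute_list, key=lambda a: information_gain(rows, labels, a))
    remaining = [a for a in attribute_list if a != best]  # Step 9
    branches = {}
    for outcome in domains[best]:  # Step 10: one branch per outcome j
        subset = [(r, c) for r, c in zip(rows, labels) if r[best] == outcome]
        if not subset:             # Steps 12-13: empty Dj gives a majority leaf
            branches[outcome] = majority
        else:                      # Step 14: recurse on Dj
            sub_rows, sub_labels = zip(*subset)
            branches[outcome] = generate_decision_tree(
                list(sub_rows), list(sub_labels), remaining, domains)
    return (best, branches)        # Step 15


def classify(tree, row):
    """Follow branches until a leaf (a plain class label) is reached."""
    while isinstance(tree, tuple):
        attribute, branches = tree
        tree = branches[row[attribute]]
    return tree


# Hypothetical toy data: two binary word-presence features per message.
rows = [
    {"has_free": "yes", "has_dollar": "yes"},
    {"has_free": "yes", "has_dollar": "no"},
    {"has_free": "no", "has_dollar": "yes"},
    {"has_free": "no", "has_dollar": "no"},
    {"has_free": "no", "has_dollar": "no"},
]
labels = ["spam", "spam", "spam", "ham", "ham"]
domains = {"has_free": ["yes", "no"], "has_dollar": ["yes", "no"]}

tree = generate_decision_tree(rows, labels, ["has_free", "has_dollar"], domains)
print(classify(tree, {"has_free": "no", "has_dollar": "yes"}))  # spam
```

The Spambase attributes are continuous word and character
frequencies, so a practical implementation would also need the
split-point handling mentioned in the Input description, which
this sketch omits.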
Fig. 1 Comparative analysis (percentage parameters).
Fig. 3 Comparative analysis.
The next figure shows a comparative analysis of the different
algorithms against several error and agreement parameters. The
plot shows that the proposed J48 algorithm has the lowest error
and the highest accuracy. The kappa statistic reaches its highest
value, 0.8812, for the proposed J48 algorithm. The outcome is
shown below as Fig. 2.
Fig. 2 Comparative analysis of algorithms in kappa statistic, mean
absolute error, root mean squared error.
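The three parameters named in the caption above can be computed
as in the following minimal sketch. It assumes a two-class
confusion matrix with rows as actual classes, and it assumes that
the error measures compare the predicted probability of the spam
class with a 0/1 indicator of the true class, which is our
reading of how such figures are usually reported. The function
names and all numbers are made up for illustration and are not
results from this work.

```python
from math import sqrt


def kappa_statistic(confusion):
    """Cohen's kappa from a square confusion matrix (rows = actual class)."""
    size = len(confusion)
    total = sum(sum(row) for row in confusion)
    # Observed agreement: fraction of instances on the diagonal.
    observed = sum(confusion[i][i] for i in range(size)) / total
    # Chance agreement: product of row and column marginals per class.
    expected = sum(
        sum(confusion[i]) * sum(row[i] for row in confusion)
        for i in range(size)
    ) / total ** 2
    return (observed - expected) / (1 - expected)


def mean_absolute_error(actual, predicted):
    """Average absolute difference between indicator and predicted probability."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def root_mean_squared_error(actual, predicted):
    """Square root of the average squared difference."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


# Illustrative values only.
confusion = [[45, 5],    # actual ham:  45 predicted ham, 5 predicted spam
             [10, 40]]   # actual spam: 10 predicted ham, 40 predicted spam
actual = [1, 0, 1, 0]                 # 1 = spam, 0 = ham
predicted = [0.9, 0.2, 0.6, 0.1]      # predicted probability of spam

kappa = kappa_statistic(confusion)
mae = mean_absolute_error(actual, predicted)
rmse = root_mean_squared_error(actual, predicted)
print(round(kappa, 4), round(mae, 4), round(rmse, 4))
```

A kappa of 1 means perfect agreement with the true labels and 0
means agreement no better than chance, which is why a higher
kappa together with lower error values indicates a better
classifier.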
Fig. 3 shows a comparative analysis of the different algorithms
against further parameters. The plot shows that the proposed J48
algorithm has the highest accuracy. The Recall, F-Measure and ROC
Area values are also highest for the proposed J48 algorithm.
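As a reminder of how Recall, F-Measure and ROC Area are defined,
the sketch below computes them for the spam class. The counts
and scores are hypothetical and the function names are ours. The
ROC area is computed here as the fraction of (spam, ham) pairs
in which the spam message receives the higher score, with ties
counting one half; this is one standard way to obtain it and may
differ from how any particular tool computes it.

```python
def precision_recall_f(tp, fp, fn):
    """Precision, recall and F-measure for the positive (spam) class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


def roc_area(scores_pos, scores_neg):
    """Area under the ROC curve via pairwise comparison of scores."""
    wins = sum(
        (p > n) + 0.5 * (p == n)   # 1 if ranked correctly, 0.5 for a tie
        for p in scores_pos
        for n in scores_neg
    )
    return wins / (len(scores_pos) * len(scores_neg))


# Hypothetical counts: 40 spam caught, 5 ham flagged as spam, 10 spam missed.
precision, recall, f_measure = precision_recall_f(tp=40, fp=5, fn=10)

# Hypothetical classifier scores for three spam and two ham messages.
auc = roc_area([0.9, 0.8, 0.4], [0.7, 0.3])
print(round(precision, 4), round(recall, 4), round(f_measure, 4), round(auc, 4))
```

Recall penalises missed spam and precision penalises legitimate
mail flagged as spam; the F-measure is their harmonic mean, so
it is high only when both are high.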
Fig. 4 Accuracy in MATLAB
An accuracy of 99.97% is achieved in MATLAB, with an error rate
of 0.0217. The previous result achieved by the J48 algorithm on
the Spambase dataset was 92.195% correctly classified instances.
Various parameters were modified with the introduction of other
new and modified features.
VIII. CONCLUSION
As the technology of machine learning continues to develop and
mature, learning algorithms need to be brought to the desktops of
people who work with data and understand the application
domain from which it arises. It is necessary to get the algorithms
out of the laboratory and into the work environment of those who
can use them. Mining frequent item sets for association rule
mining from large transactional databases is a crucial task.
Many approaches have been discussed, and they still have scope
for improvement. WEKA is a significant step in the
transfer of machine learning technology into the workplace.
WEKA has proved itself to be a useful and even essential tool in
the analysis of real world data sets. It reduces the level of
complexity involved in getting real world data into a variety of
machine learning schemes and evaluating the output of those
schemes. It has also provided a flexible aid for machine learning
research and a tool for introducing people to machine learning in
an educational environment. This research work focuses on
improving the e-mail spam classification rate using the
integrated MATLAB and WEKA tools. The proposed scheme has shown
an accuracy rate of 99.97%, which is attributed to MATLAB's rich
learning rate. Thus, the proposed approach shows a significant
improvement over the available techniques. In the near future,
we will use different kinds of data sets to validate the
proposed work. However, only the J48 algorithm has been
considered in this work, so more machine learning algorithms
will be considered in future work.
IX. REFERENCES
[1] Jianchao Han, Juan C. Rodriguez, Mohsen Beheshti,
“Discovering Decision Tree Based Diabetes Prediction Model,”
International Conference, ASEA 2009, Communications in Computer
and Information Science, Volume 30, 2009, pp. 99-109.
ISSN 1865-0929, Springer.
[2] Benevenuto, Fabrício, Gabriel Magno, Tiago Rodrigues,
and Virgílio Almeida. "Detecting spammers on twitter."
In Collaboration, electronic messaging, anti-abuse and
spam conference (CEAS), vol. 6, p. 12. 2010.
[3] Salama, Gouda I., M. B. Abdelhalim, and Magdy Abdelghany
Zeid. "Experimental comparison of classifiers for
breast cancer diagnosis." In Computer Engineering &
Systems (ICCES), 2012 Seventh International Conference
on, pp. 180-185. IEEE, 2012.
[4] Jindal, Nitin, and Bing Liu. "Analyzing and detecting
review spam." In Data Mining, 2007. ICDM 2007. Seventh
IEEE International Conference on, pp. 547-552. IEEE,
2007.
[5] Maninder Singh. "A REVIEW ON DATA MINING
ALGORITHMS." In IJCSITR, Vol. 2, Issue 2, pp. 8-14,
2014.