
International Journal of Research In Science & Engineering
Volume: 1 Special Issue: 2
e-ISSN: 2394-8299
p-ISSN: 2394-8280
HANDLING INTERNET TRAFFIC USING BIG DATA ANALYTICS
Ranganatha T.G.¹, Narayana H.M.²
¹PG Student, Department of Computer Science and Engineering, M.S.E.C, [email protected]
²Associate Professor, Department of Computer Science and Engineering, M.S.E.C, [email protected]
ABSTRACT
Handling internet traffic is not easy these days: its explosive growth makes it hard to collect, store and analyze on a single machine. Hadoop has become a popular framework for massive data analytics. It facilitates scalable data processing and storage services on a distributed computing system consisting of commodity hardware. In this paper, I present a Hadoop-based traffic analysis and control system, which accepts input from Wireshark (a log file) and produces output in the form of a summary containing the entire internet traffic details. I have also implemented a congestion control algorithm to control online network traffic in the internet.
Keywords: Single machine, Hadoop, Commodity hardware, Wireshark.
1. INTRODUCTION
The Internet has made great progress in recent years and brought much more convenience to people's daily lives, yet the fact that it still provides only a best-effort service to applications has never changed since its invention.
Mininet is a network emulator that creates an instant virtual network on a laptop. It runs a collection of end hosts, switches, routers and links on a single Linux kernel, using lightweight virtualization to make a single system look like a complete network, with the system and code running in the same kernel.
OpenDaylight is a controller used to control the flows running in Mininet; Mininet connects to the controller and sets up an 'n'-tree topology.
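To make this concrete, the following is a minimal sketch, not the exact setup used in this work, of how a small tree topology could be created with Mininet's Python API and attached to a remote controller such as OpenDaylight; the controller address, the OpenFlow port and the tree depth/fanout are illustrative assumptions.

```python
#!/usr/bin/env python
# Minimal sketch: build a small tree topology in Mininet and attach it to a
# remote controller (e.g. an OpenDaylight instance). The controller address,
# port and tree size are illustrative assumptions, not the paper's setup.
from mininet.net import Mininet
from mininet.topolib import TreeTopo
from mininet.node import RemoteController

def run():
    topo = TreeTopo(depth=2, fanout=2)          # 4 hosts, 3 switches
    net = Mininet(topo=topo,
                  controller=lambda name: RemoteController(
                      name, ip='127.0.0.1', port=6633))
    net.start()
    net.pingAll()                                # quick connectivity check
    net.stop()

if __name__ == '__main__':
    run()
```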
Wireshark is a tool used to capture packets in the network. It is a free, open-source packet analyzer used for network troubleshooting, analysis, software and communication protocol development, and education.
Hadoop was originally designed for batch-oriented processing jobs, such as creating web page indices or analyzing log data. Hadoop is widely used by IBM, Yahoo!, Facebook, Twitter, etc., to develop and execute large-scale analytics or applications for huge data sets. Apache Hadoop is a platform that provides pragmatic, cost-effective, scalable infrastructure for building many of the types of applications described earlier. Made up of a distributed file system called the Hadoop Distributed File System (HDFS) and a computation layer that implements a processing paradigm called MapReduce, Hadoop is an open-source, batch data processing system for enormous amounts of data. We live in a flawed world, and Hadoop is designed to survive in it by not only tolerating hardware and software failures, but also treating them as first-class conditions that happen regularly. Hadoop uses a cluster of plain old commodity servers with no specialized hardware or network infrastructure to form a single, logical storage and compute platform, or cluster, that can be shared by multiple individuals or groups. Computation in Hadoop MapReduce is performed in parallel, automatically, with a simple abstraction for developers that obviates complex synchronization and network programming. Unlike many other distributed data processing systems, Hadoop runs the user-provided processing logic on the machine where the data lives rather than dragging the data across the network, which is a huge win for performance.
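As an illustration of the MapReduce model described above, the following hedged sketch shows a Hadoop Streaming style mapper and reducer that total the bytes exchanged per source/destination IP pair from a packet log; the assumed input layout (source IP, destination IP, frame length per line) and the script name are illustrative, and Hadoop Streaming in Python stands in here for whatever MapReduce jobs the paper actually uses.

```python
#!/usr/bin/env python
# Hedged sketch of a Hadoop Streaming job that sums bytes per (src, dst) pair.
# Assumed input: one packet per line as "src_ip dst_ip frame_length"
# (the real log layout produced by Wireshark/tshark may differ).
import sys

def mapper():
    for line in sys.stdin:
        parts = line.split()
        if len(parts) < 3:
            continue                              # skip malformed lines
        src, dst, length = parts[0], parts[1], parts[2]
        try:
            print('%s-%s\t%d' % (src, dst, int(length)))
        except ValueError:
            continue

def reducer():
    # Hadoop sorts mapper output by key, so equal keys arrive consecutively.
    current_key, total = None, 0
    for line in sys.stdin:
        key, _, value = line.rstrip('\n').partition('\t')
        if key != current_key:
            if current_key is not None:
                print('%s\t%d' % (current_key, total))
            current_key, total = key, 0
        total += int(value)
    if current_key is not None:
        print('%s\t%d' % (current_key, total))

if __name__ == '__main__':
    # invoked as "traffic_summary.py map" or "traffic_summary.py reduce"
    mapper() if sys.argv[1:] == ['map'] else reducer()
```

The same file could be passed to the Hadoop Streaming jar as both mapper and reducer, producing the per-pair byte summary described in this paper.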
The main contribution of my work lies in designing and implementing the control of internet traffic through big-data analytics. First, I create a virtual network using the Mininet tool, which instantly builds a virtual network on my laptop containing switches, routers and hosts; it is controlled using the OpenDaylight controller. To capture the packet flows from the virtual network, a tool such as Wireshark is used. The captured packet log is saved in a text file, and that log file is given as input to Hadoop to process its large volume of data. The output is a summary report that contains the flow analysis details, such as sender IP, destination IP and the number of bytes sent. Using that file, the traffic is then controlled with a congestion control algorithm to manage the online traffic.
The main objectives of the work include:
 To design and implement a traffic flow identification system using Hadoop.
 The traffic flow identification system will be very useful for network administrators to monitor faults and also to plan for the future.
2. BACKGROUND WORK
Over the past few years, a lot of tools have been developed and widely used for monitoring internet traffic. Mininet is a tool widely used to set up a virtual network on a laptop, so that the flow of packets in the virtual network can be simulated and identified. Wireshark is a popular traffic analyzer that offers a user-friendly graphical interface. Tcpdump is also a very popular tool for capturing and analyzing internet traffic. OpenDaylight is a controller used to control the packets in the Mininet virtual network: it decides from where to where the packets need to be sent.
Most MapReduce applications on Hadoop are developed to analyze large text, log or web files. In this work, packet processing and analysis for Hadoop is performed on the trace file in blocks, so each block of the file is processed and the results are produced in parallel in a distributed environment.
Methodology
For flow analysis, we use MapReduce algorithms to reassemble the flows in the network, and the k-means algorithm for efficient clustering of the packets within the network (a sketch of the flow reassembly step is given below).
For flow control, we use a congestion control algorithm to control the internet traffic with Hadoop. With this, we are able to control the packet flows easily and in a very effective manner.
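To make the flow reassembly idea concrete, the hedged sketch below groups packets by their classic 5-tuple in plain Python; the packet dictionary fields and sample values are assumptions, and the actual flow builder runs as a MapReduce job rather than in memory.

```python
# Hedged sketch: reassemble flows by grouping packets on their 5-tuple.
# Each packet is assumed to be a dict with src/dst IPs, ports, protocol and a
# byte length; the paper's flow builder does this as a distributed MapReduce job.
from collections import defaultdict

def build_flows(packets):
    """Group packets into flows keyed by the classic 5-tuple."""
    flows = defaultdict(lambda: {'packets': 0, 'bytes': 0})
    for pkt in packets:
        key = (pkt['src_ip'], pkt['dst_ip'],
               pkt['src_port'], pkt['dst_port'], pkt['proto'])
        flows[key]['packets'] += 1
        flows[key]['bytes'] += pkt['length']
    return flows

# Example usage with two hypothetical packets of the same flow:
sample = [
    {'src_ip': '10.0.0.1', 'dst_ip': '10.0.0.2',
     'src_port': 5001, 'dst_port': 80, 'proto': 'TCP', 'length': 1500},
    {'src_ip': '10.0.0.1', 'dst_ip': '10.0.0.2',
     'src_port': 5001, 'dst_port': 80, 'proto': 'TCP', 'length': 600},
]
print(build_flows(sample))
```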
3. LITERATURE SURVEY
A lot of research has been done to measure the performance of internet traffic using Hadoop. J. Shafer, S. Rixner, and A. L. Cox [2] discuss the performance of the distributed Hadoop file system. Hadoop is the most accepted framework for managing huge amounts of data in a distributed environment, and it makes use of a user-level file system in a distributed manner. HDFS (the Hadoop Distributed File System) is portable across both hardware and software platforms. In this paper a detailed performance analysis of HDFS was done, and it revealed several performance issues. The first issue was an architectural bottleneck in the Hadoop implementation, which resulted in inefficient usage of HDFS. The second limitation was portability limitations that prevented the Java implementation from using features of the native platform. The paper tries to find solutions for the bottleneck and portability problems in HDFS.
T. Benson, A. Akella, and D. A. Maltz [3] wrote a paper on "Network traffic characteristics of data centers in the wild". In this paper the researchers conduct an empirical study of the network in a few data centers belonging to different types of organizations, enterprises and universities. In spite of the great interest in developing networks for data centers, only little is known about the characteristics of network-level traffic. They gather information about the SNMP topology, its statistics and also packet-level traces, examine the packet-level and flow-level transmission properties, and observe the influence of the network traffic on network utilization, congestion, link utilization and packet drops.
A. W. Moore and K. Papagiannaki [4] present traffic classification on the basis of the full packet payload. In this paper a comparison is made between port-based classification and content-based classification, using full-payload packet traces collected from an internet site. The outcome of the comparison shows how much traffic can be classified based on the use of well-known ports; the paper shows that port-based classification can identify about 70% of the overall traffic. L. Bernaille and R. Teixeira [5] state that port-based classification is not a reliable method for analysis, and propose a technique that relies on the observation of the first five packets of a TCP connection to identify the application.
J. Erman, M. Arlitt, and A. Mahanti [5], in "Traffic classification using clustering algorithms", evaluate three different clustering algorithms, namely K-Means, DBSCAN, and AutoClass, for the network traffic classification problem. Their analysis is based on each algorithm's ability to produce clusters that have a high predictive power for a single traffic class, and on its ability to generate a minimal number of clusters that contain the majority of the connections. The results showed that the AutoClass algorithm produces the best overall accuracy.
4. EXISTING SYSTEM
Today the number of internet users is growing very rapidly. Each and every person uses the internet in one way or another, so internet traffic also keeps increasing. On a single machine it is not easy to handle such large internet traffic, and storing and processing these large volumes of data is not possible on a single system.
The problem is that handling internet traffic using a single server does not scale to bigger networks and also carries the risk of a single point of failure.
5. PROPOSED SYSTEM
Figure 1: System Architecture
Overview
Handling internet traffic consists of three main components, namely Mininet (the network), Wireshark and the Hadoop cluster. Figure 1 shows the key components required for flow analysis. The functions of these three components are described below:
Mininet:
Mininet is the tool used to set up the network. Mininet is a network emulator that creates a realistic virtual network, running real kernel, switch and application code on a single machine (VM, cloud or native); it uses lightweight virtualization to make a single system look like a complete network.
Wireshark:
Wireshark is the tool used to capture, filter and inspect the packet flow in the network. This network analysis tool, formerly known as Ethereal, captures packets in real time and displays them in a human-readable format. Wireshark includes filters, color-coding and other features that let you dig deep into network traffic and inspect individual packets.
Hadoop Cluster: This consists of two parts:
1. Flow analysis holds the entire detail of the traffic analysis of the big data. It takes care of the MapReduce and clustering algorithms used to obtain the output.
2. Flow control takes care of how to control the large amount of traffic without collision and loss of packets.
I. Setting Up the Network Through Mininet
Mininet is a tool used to create the virtual network within a laptop, so that any number of switches can be connected between the sender and destination hosts.
Figure 2: Mininet Setup
In the above figure, host1 and host2 are the source and destination of the virtual network within the laptop, and s1 and s2 are the switches present between the hosts. The corresponding paths are controlled using the OpenDaylight controller, through which the virtual network set up on the computer can be managed; flows and operations on the network can be modified or changed through OpenDaylight.
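As a hedged illustration, the Figure 2 setup (host1 - s1 - s2 - host2 under a remote OpenDaylight controller) might be expressed with Mininet's Python API roughly as follows; the host, switch and controller parameters are assumptions for illustration, not the exact script used in this work.

```python
#!/usr/bin/env python
# Hedged sketch of the Figure 2 setup: h1 - s1 - s2 - h2, managed by a remote
# OpenDaylight controller. Names and addresses are illustrative assumptions.
from mininet.net import Mininet
from mininet.topo import Topo
from mininet.node import RemoteController

class TwoSwitchTopo(Topo):
    def build(self):
        h1 = self.addHost('h1')
        h2 = self.addHost('h2')
        s1 = self.addSwitch('s1')
        s2 = self.addSwitch('s2')
        self.addLink(h1, s1)
        self.addLink(s1, s2)
        self.addLink(s2, h2)

if __name__ == '__main__':
    net = Mininet(topo=TwoSwitchTopo(),
                  controller=lambda name: RemoteController(
                      name, ip='127.0.0.1', port=6633))
    net.start()
    print(net.ping([net.get('h1'), net.get('h2')]))   # h1 <-> h2 reachability
    net.stop()
```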
II. Capturing Packet Flows Using Wireshark
Wireshark is the tool used to capture and inspect the packet flow within the network.
After setting up the network through Mininet, the next step is to capture the packet flows from the source host to the destination host, across the switches that connect the end hosts. The Wireshark tool captures the packet flow details in the form of a log file; the collected log is stored in a text file and then passed on to the next step.
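One hedged way to produce such a text log is to drive tshark, Wireshark's command-line capture tool, from Python and keep only the fields the later analysis needs; the interface name, capture duration, selected fields and output path below are assumptions, not the exact capture configuration used in this work.

```python
# Hedged sketch: capture packets on a Mininet interface with tshark and store
# them as a simple text log (src, dst, length per line) for later Hadoop input.
# Interface name, duration and output path are illustrative assumptions;
# live capture typically requires root privileges.
import subprocess

def capture_to_log(interface='s1-eth1', seconds=60, out_path='traffic.log'):
    cmd = [
        'tshark', '-i', interface,
        '-a', 'duration:%d' % seconds,        # stop capturing after N seconds
        '-T', 'fields',                        # print only the selected fields
        '-e', 'ip.src', '-e', 'ip.dst', '-e', 'frame.len',
    ]
    # Fields are tab-separated by default, which the later mapper can split on.
    with open(out_path, 'w') as out:
        subprocess.run(cmd, stdout=out, check=True)

if __name__ == '__main__':
    capture_to_log()
```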
III. Analyzing Packet Flows in Hadoop
For a traffic trace collected with a tool like Wireshark, there is no obvious flow information available, so the first step before analysis is to recover flows from the individual packets. Our system implements a set of MapReduce applications, including a flow builder, which can quickly and efficiently reassemble flows and conversations even if they are stored across several traffic files. The second step, flow clustering, aims to extract groups of flows that share common or similar characteristics and patterns within the same cluster. In this paper I have chosen the k-means algorithm to identify different groups of flows according to their statistical data.
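As an illustration of this clustering step, the hedged sketch below runs k-means over simple per-flow statistics (packet count and total bytes); the feature choice, log scaling and number of clusters are assumptions, and scikit-learn stands in here for the distributed MapReduce implementation described above.

```python
# Hedged sketch: cluster flows by simple statistics with k-means.
# Features (packets, bytes) and k=3 are illustrative assumptions; the paper's
# version runs as a distributed MapReduce job rather than using scikit-learn.
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical per-flow statistics: [packet_count, total_bytes]
flows = np.array([
    [10,      5000],
    [12,      6200],
    [900,  1200000],
    [850,  1100000],
    [3,        180],
    [4,        240],
], dtype=float)

# Log-scale both features so very large flows do not dominate the distances.
features = np.log1p(flows)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(features)
for flow, label in zip(flows, kmeans.labels_):
    print('flow packets=%d bytes=%d -> cluster %d' % (flow[0], flow[1], label))
```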
IV. Congestion Control Algorithm to Control Flows in Hadoop
As seen above, the flows cannot be handled by a single system: if a huge amount of traffic arrives, it cannot be coped with, so we planned to control the flows within the network. Through this, a lot of packet congestion in the network can be avoided.
Figure 3: Flows controlled in Hadoop
Figure 3 above shows the entire network and how the flows can be controlled. I have implemented the congestion control algorithm to control the flows from the source host to the destination hosts.
If the number of bytes of a flow exceeds a certain range, the path from host-1 to host-2 is changed; otherwise, if the bytes do not exceed the range, the old path from host-1 to host-2 is kept for packet transmission. The check is on bytes only; the algorithm I wrote handles control based on bytes alone (a sketch of this check is given after the two cases below):
Case 1: If bytes >= the specified threshold, the path is changed accordingly.
Case 2: If bytes < the specified threshold, the same old path is used.
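The two cases above can be expressed in a few lines. The sketch below reads per-flow byte totals from the Hadoop summary and selects a path once a flow crosses the threshold; the threshold value, the alternate path and the install_path helper (which in a real deployment would push flow rules through the OpenDaylight controller) are hypothetical placeholders rather than the actual implementation.

```python
# Hedged sketch of the byte-threshold path selection (Case 1 / Case 2).
# THRESHOLD_BYTES, the alternate path and install_path() are hypothetical
# placeholders; in the real system the path change would be carried out via
# OpenDaylight flow rules on the Mininet switches.
THRESHOLD_BYTES = 1000000

PRIMARY_PATH   = ['h1', 's1', 's2', 'h2']
ALTERNATE_PATH = ['h1', 's1', 's3', 's2', 'h2']   # assumed detour via a third switch

def install_path(path):
    """Placeholder: push flow entries for `path` via the SDN controller."""
    print('installing path:', ' -> '.join(path))

def choose_path(flow_bytes):
    # Case 1: heavy flow, switch to the alternate path.
    if flow_bytes >= THRESHOLD_BYTES:
        return ALTERNATE_PATH
    # Case 2: light flow, keep the original path.
    return PRIMARY_PATH

def control_flows(summary_lines):
    """summary_lines: Hadoop output lines of the form 'src-dst<TAB>total_bytes'."""
    for line in summary_lines:
        key, _, total = line.rstrip('\n').partition('\t')
        install_path(choose_path(int(total)))

if __name__ == '__main__':
    control_flows(['10.0.0.1-10.0.0.2\t2500000',
                   '10.0.0.3-10.0.0.4\t480'])
```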
6. CONCLUSION
In this paper, I have presented work on handling internet traffic using big data analytics with Hadoop. The network is set up, a trace file is obtained and given as input to Hadoop, and flow analysis is performed on it. We have also implemented a congestion control algorithm, so that flow control of the internet traffic can be carried out within the Hadoop cluster.
REFERENCES
1. M. Yu, L. Jose, and R. Miao, "Software defined traffic measurement with OpenSketch," in Proceedings of the 10th USENIX Symposium on Networked Systems Design and Implementation (NSDI), vol. 13, 2013.
2. J. Shafer, S. Rixner, and A. L. Cox, "The Hadoop Distributed Filesystem: Balancing Portability and Performance," in Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2010.
3. T. Benson, A. Akella, and D. A. Maltz, "Network traffic characteristics of data centers in the wild," in Proceedings of the 10th ACM SIGCOMM Conference on Internet Measurement. ACM, 2010, pp. 267–280.
4. A. W. Moore and K. Papagiannaki, "Toward the accurate identification of network applications," in Passive and Active Network Measurement. Springer, 2005, pp. 41–54.
5. J. Erman, M. Arlitt, and A. Mahanti, "Traffic classification using clustering algorithms," in Proceedings of the 2006 SIGCOMM Workshop on Mining Network Data. ACM, 2006, pp. 281–286.
6. Apache Hadoop Website, http://hadoop.apache.org/
7. Yuanjun Cai, Bin Wu, Xinwei Zhang, Min Luo and Jinzhao Su, "Flow identification and characteristics mining from internet traffic with Hadoop," 978-1-4799-4383-8, IEEE, 2014.