International Journal of Research In Science & Engineering
Volume: 1, Special Issue: 2          e-ISSN: 2394-8299, p-ISSN: 2394-8280

HANDLING INTERNET TRAFFIC USING BIG DATA ANALYTICS

Ranganatha T.G.1, Narayana H.M.2
1 PG Student, Department of Computer Science and Engineering, M.S.E.C, [email protected]
2 Associate Professor, Department of Computer Science and Engineering, M.S.E.C, [email protected]

ABSTRACT

Handling internet traffic these days is not easy: its explosive growth makes it hard to collect, store and analyze internet traffic on a single machine. Hadoop has become a popular framework for massive data analytics. It facilitates scalable data processing and storage services on a distributed computing system consisting of commodity hardware. In this paper, I present a Hadoop-based traffic analysis and control system, which accepts input from Wireshark (a log file) and outputs a summary containing the entire internet traffic details. I have also implemented a congestion control algorithm to control the online network traffic in the internet.

Keywords: Single machine, Hadoop, Commodity hardware, Wireshark.

1. INTRODUCTION

The Internet has made great progress and brought much more convenience to people's daily lives in recent years, yet the fact that it still provides only a best-effort service to applications has never changed since its invention.

Mininet is a network emulator that creates an instant virtual network on a laptop. It runs a collection of end-hosts, switches, routers and links on a single Linux kernel, using lightweight virtualization to make a single system look like a complete network, with system and application code running in the same kernel. OpenDaylight is a controller used to control the flows running in Mininet; Mininet connects to the controller and sets up an 'n'-tree topology. Wireshark is a tool used to capture packets in the network. It is a free, open-source packet analyzer used for network troubleshooting, analysis, software and communication protocol development, and education.

Hadoop was originally designed for batch-oriented processing jobs, such as creating web page indices or analyzing log data. Hadoop is widely used by IBM, Yahoo!, Facebook, Twitter and others to develop and execute large-scale analytics and applications for huge data sets. Apache Hadoop is a platform that provides pragmatic, cost-effective, scalable infrastructure for building many of the types of applications described earlier. Made up of a distributed file system called the Hadoop Distributed File System (HDFS) and a computation layer that implements a processing paradigm called MapReduce, Hadoop is an open-source, batch data processing system for enormous amounts of data. We live in a flawed world, and Hadoop is designed to survive in it by not only tolerating hardware and software failures but also treating them as first-class conditions that happen regularly. Hadoop uses a cluster of plain commodity servers, with no specialized hardware or network infrastructure, to form a single logical storage and compute platform, or cluster, that can be shared by multiple individuals or groups. Computation in Hadoop MapReduce is performed in parallel, automatically, with a simple abstraction for developers that obviates complex synchronization and network programming. Unlike many other distributed data processing systems, Hadoop runs the user-provided processing logic on the machine where the data lives rather than dragging the data across the network, which is a huge win for performance.
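To make the MapReduce programming model concrete, the following is a minimal sketch of a Hadoop Streaming mapper and reducer in Python that sum the bytes sent per source IP from a packet log. The tab-separated record layout (source IP, destination IP, bytes) and the file names are assumptions made for illustration only; a log exported from Wireshark would first have to be converted into this form.

#!/usr/bin/env python3
# mapper.py - reads one packet record per line ("src_ip<TAB>dst_ip<TAB>bytes")
# and emits "src_ip<TAB>bytes" pairs for Hadoop Streaming.
import sys

for line in sys.stdin:
    fields = line.strip().split("\t")
    if len(fields) < 3 or not fields[2].isdigit():
        continue                      # skip malformed records
    print(fields[0] + "\t" + fields[2])

#!/usr/bin/env python3
# reducer.py - sums the byte counts per source IP; Hadoop Streaming delivers
# the mapper output grouped and sorted by key.
import sys

current_ip, total = None, 0
for line in sys.stdin:
    ip, size = line.strip().split("\t")
    if ip != current_ip and current_ip is not None:
        print(current_ip + "\t" + str(total))   # flush the previous key
        total = 0
    current_ip = ip
    total += int(size)
if current_ip is not None:
    print(current_ip + "\t" + str(total))       # flush the last key

Such a job can be submitted with the Hadoop Streaming jar, for example: hadoop jar hadoop-streaming-*.jar -files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py -input /traffic/packets.txt -output /traffic/summary, where the jar name and HDFS paths are placeholders.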
The main contribution of my work lies in designing and implementing the analysis and control of internet traffic through big data analytics. First, I created a virtual network using the Mininet tool, which instantly builds a virtual network on my laptop containing switches, routers and hosts, controlled through the OpenDaylight controller. To capture the packets flowing through the virtual network, a tool such as Wireshark is used: the packet log is captured, saved to a text file, and given as input to Hadoop, which processes the large log data and produces a summary report containing the flow analysis details, such as sender IP, destination IP and the number of bytes sent. Using that summary, the traffic is then controlled with the congestion control algorithm to manage the online traffic.

The main objective of the work is to design and implement a traffic flow identification system using Hadoop. Such a system will be very useful for a network administrator to monitor faults and to plan for the future.

2. BACKGROUND WORK

Over the past few years, a lot of tools have been developed and widely used for monitoring internet traffic. Mininet is a tool widely used to set up a virtual network on a laptop, so that we can simulate and identify the flow of packets in the virtual network. Wireshark is a popular traffic analyzer that offers a user-friendly graphical interface. Tcpdump is also a very popular tool for capturing and analyzing internet traffic. OpenDaylight is a controller used to control the packets in the Mininet virtual network: it decides from where to where the packets need to be sent. Most MapReduce applications on Hadoop are developed to analyze large text, log or web files; here, we instead build packet processing and analysis for Hadoop that analyzes a trace file in blocks, processing each block and producing results in parallel in a distributed environment.

Methodology

For flow analysis we use MapReduce jobs to reassemble the flows in the network, and the k-means algorithm for efficient clustering of the packets within the network. For flow control we use a congestion control algorithm to manage internet traffic with Hadoop. In this way the packet flow can be controlled easily and effectively.
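As a rough illustration of the clustering step, the sketch below runs a plain k-means in Python over simple per-flow features. The feature choice (total bytes and packet count per flow), the example values and the number of clusters are assumptions made for illustration; they are not the exact statistics used by the system.

# kmeans_flows.py - minimal k-means over per-flow features (illustrative only).
import numpy as np

def kmeans(points, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # pick k distinct flows as the initial centroids
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # assign each flow to its nearest centroid
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # move each centroid to the mean of the flows assigned to it
        new_centroids = np.array([
            points[labels == c].mean(axis=0) if np.any(labels == c) else centroids[c]
            for c in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break                      # converged
        centroids = new_centroids
    return labels, centroids

# each row is one flow: (total bytes, packet count)
flows = np.array([[1200, 10], [350, 4], [980000, 700],
                  [1500, 12], [870000, 640]], dtype=float)
labels, centroids = kmeans(flows, k=2)
print(labels)      # cluster index per flow, e.g. small flows vs. heavy flows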
3. LITERATURE SURVEY

A lot of research has been done on measuring the performance of internet traffic using Hadoop. J. Shafer, S. Rixner and A. L. Cox [2] discuss the performance of the distributed Hadoop file system. Hadoop is a widely accepted framework for managing huge amounts of data in a distributed environment, and it makes use of a user-level file system operated in a distributed manner. HDFS (the Hadoop Distributed File System) is portable across both hardware and software platforms. Their paper gives a detailed performance analysis of HDFS and exposes several performance issues. The first issue is an architectural bottleneck in the Hadoop implementation which results in inefficient usage of HDFS; the second is a portability limitation that prevents the Java implementation from using features of the native platform. The paper tries to find solutions for both the bottleneck and the portability problems in HDFS.

T. Benson, A. Akella and D. A. Maltz [3] wrote a paper on "Network traffic characteristics of data centers in the wild". The researchers conduct an empirical study of the network in a few data centers belonging to different types of organizations, enterprises and universities. In spite of the great interest in designing networks for data centers, little is known about the characteristics of network-level traffic. They gather information about the SNMP topology and statistics as well as packet-level traces, examine packet-level and flow-level transmission properties, and observe the influence of the traffic on network utilization, congestion, link utilization and packet drops.

A. W. Moore and K. Papagiannaki [4] perform traffic classification on the basis of the full packet payload. The paper compares port-based classification with content-based classification, using full-payload packet traces collected from an internet site. The comparison shows that traffic can be classified based on the utilization of well-known ports, and that port-based classification can identify about 70% of the overall traffic. L. Bernaille and R. Teixeira argue that port-based classification is not a reliable method of analysis, and propose a technique that relies on observing the first five packets of a TCP connection to identify the application.

J. Erman, M. Arlitt and A. Mahanti [5], in "Traffic Classification Using Clustering Algorithms", evaluate three different clustering algorithms, namely K-Means, DBSCAN and AutoClass, for the network traffic classification problem. Their analysis is based on each algorithm's ability to produce clusters that have high predictive power for a single traffic class, and on its ability to generate a minimal number of clusters that contain the majority of the connections. The results show that the AutoClass algorithm produces the best overall accuracy.

4. EXISTING SYSTEM

Today the number of internet users is growing very rapidly; each and every person uses the internet in one way or another, so internet traffic also keeps increasing. It is not easy to handle very large volumes of internet traffic on a single machine, and storing and processing such large data is not possible on a single system. Handling internet traffic with a single server is not scalable to larger networks and introduces the risk of a single point of failure.

5. PROPOSED SYSTEM

Figure 1: System Architecture

I. Overview

Handling internet traffic consists of three main components, namely Mininet (the network), Wireshark and the Hadoop cluster. Figure 1 shows the key components required for flow analysis. The functions of these three components are described below.

Mininet: Mininet is the tool used to set up the network. It is a network emulator that creates a realistic virtual network, running real kernel, switch and application code on a single machine (VM, cloud or native), and uses lightweight virtualization to make a single system look like a complete network.
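For reference, a minimal sketch of such a setup using the Mininet Python API is shown below. The two-switch topology mirrors Figure 2, while the controller address and port (127.0.0.1:6633) are assumptions for a locally running OpenDaylight instance; the script must be run with root privileges.

#!/usr/bin/env python3
# topo_sketch.py - two hosts and two switches managed by a remote controller.
from mininet.net import Mininet
from mininet.node import RemoteController
from mininet.cli import CLI

def build():
    net = Mininet(controller=None)
    # attach to an external (e.g. OpenDaylight) controller - address is assumed
    net.addController('c0', controller=RemoteController,
                      ip='127.0.0.1', port=6633)
    h1, h2 = net.addHost('h1'), net.addHost('h2')
    s1, s2 = net.addSwitch('s1'), net.addSwitch('s2')
    # h1 -- s1 -- s2 -- h2, as in Figure 2
    net.addLink(h1, s1)
    net.addLink(s1, s2)
    net.addLink(s2, h2)
    net.start()
    CLI(net)        # interactive prompt for pingall, iperf and similar tests
    net.stop()

if __name__ == '__main__':
    build()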
Wireshark: Wireshark is the tool used to capture, filter and inspect the packet flow in the network. It is a network analysis tool formerly known as Ethereal; it captures packets in real time and displays them in a human-readable format. Wireshark includes filters, color coding and other features that let you dig deep into network traffic and inspect individual packets.

Hadoop cluster: This consists of two parts, namely flow analysis and flow control. Flow analysis holds the entire traffic analysis of the big data; it relies on MapReduce and clustering algorithms to obtain the output. Flow control takes care of how to control a large amount of traffic without collisions and loss of packets.

II. Setting Up the Network Through Mininet

Mininet is a tool used to build the virtual network within a laptop, so that we can connect any number of switches between the sender and destination hosts.

Figure 2: Mininet Setup

In the figure above, host1 and host2 are the source and destination of the virtual network within the laptop. S1 and S2 are the switches placed between the hosts, and the corresponding paths are controlled using the OpenDaylight controller, through which the virtual network set up on the computer can be managed. Flows and operations on the network can be modified or changed through OpenDaylight.

III. Capturing the Packet Flow Using Wireshark

Wireshark is the tool used to capture and examine the packet flow within the network. After the network has been set up through Mininet, the next step is to capture the packet flows from the source host to the destination host across the switches that connect the end hosts. Wireshark captures the packet flow details in the form of log files; the collected log is stored in a text file and passed on to the next step.

IV. Analyzing Packet Flows in Hadoop

For a traffic trace collected with a tool like Wireshark, there is no explicit flow information available, so the first step before analysis is to recover flows from the individual packets. Our system implements a set of MapReduce applications, including a flow builder, which can quickly and efficiently reassemble flows and conversations even if they are stored across several trace files. The second step, flow clustering, aims to extract groups of flows that share common or similar characteristics and patterns within the same cluster. In this paper I have chosen the k-means algorithm to identify different groups of flows according to their statistical data.

V. Congestion Control Algorithm to Control Flows in Hadoop

As noted above, the flows cannot be handled by a single system when a huge amount of traffic arrives, so we control the flows within the network; in this way much of the congestion and packet loss in the network can be avoided.

Figure 3: Flows controlled in Hadoop

Figure 3 shows how the flows in the entire network can be controlled. I have implemented the congestion control algorithm to control the flows from the source host to the destination host. If the number of bytes in a flow exceeds a specified threshold, the path from host-1 to host-2 is changed; otherwise, if the byte count does not exceed the threshold, the old path from host-1 to host-2 continues to be used for packet transmission. The algorithm checks only the byte count:

Case 1: if bytes >= threshold, the path is changed accordingly.
Case 2: if bytes < threshold, the path remains the same. A minimal sketch of this rule is given below.
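The sketch below, in Python, expresses this byte-threshold rule. The threshold value, the flow-record format and the two candidate paths (including switch s3 on the detour) are assumptions made for illustration; in the actual system the chosen path would be installed on the switches through the OpenDaylight controller rather than merely printed.

# path_select.py - choose a path for each flow from its byte count
# (an illustrative sketch of the Case 1 / Case 2 rule described above).

THRESHOLD_BYTES = 1000000                                 # assumed limit; tune per network
PRIMARY_PATH = ['host-1', 's1', 's2', 'host-2']           # old path (Case 2)
ALTERNATE_PATH = ['host-1', 's1', 's3', 's2', 'host-2']   # detour path (Case 1)

def choose_path(flow_bytes):
    """Return the path a flow should take according to the two cases above."""
    if flow_bytes >= THRESHOLD_BYTES:    # Case 1: heavy flow, switch the path
        return ALTERNATE_PATH
    return PRIMARY_PATH                  # Case 2: keep the existing path

# example flow summaries as produced by the flow-analysis step: (src, dst, bytes)
flows = [('10.0.0.1', '10.0.0.2', 250000),
         ('10.0.0.1', '10.0.0.2', 4800000)]

for src, dst, nbytes in flows:
    print(src, '->', dst, nbytes, 'bytes:', ' -> '.join(choose_path(nbytes)))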
6. CONCLUSION

In this paper, I have presented work on handling internet traffic using big data analytics with Hadoop. The network is set up, a trace file is obtained and given as input to Hadoop, and flow analysis is performed on it. I have also implemented the congestion control algorithm to control the internet traffic: flow control is done using the congestion control algorithm within the Hadoop cluster.

REFERENCES

1. M. Yu, L. Jose, and R. Miao, "Software defined traffic measurement with OpenSketch," in Proceedings of the 10th USENIX Symposium on Networked Systems Design and Implementation (NSDI), vol. 13, 2013.
2. J. Shafer, S. Rixner, and A. L. Cox, "The Hadoop Distributed Filesystem: Balancing Portability and Performance," in Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2010.
3. T. Benson, A. Akella, and D. A. Maltz, "Network traffic characteristics of data centers in the wild," in Proceedings of the 10th ACM SIGCOMM Conference on Internet Measurement. ACM, 2010, pp. 267-280.
4. A. W. Moore and K. Papagiannaki, "Toward the accurate identification of network applications," in Passive and Active Network Measurement. Springer, 2005, pp. 41-54.
5. J. Erman, M. Arlitt, and A. Mahanti, "Traffic classification using clustering algorithms," in Proceedings of the 2006 SIGCOMM Workshop on Mining Network Data. ACM, 2006, pp. 281-286.
6. Apache Hadoop website, http://hadoop.apache.org/
7. Yuanjun Cai, Bin Wu, Xinwei Zhang, Min Luo, and Jinzhao Su, "Flow identification and characteristics mining from internet traffic with Hadoop," 978-1-4799-4383-8, IEEE, 2014.