Oluwatosin Alabi1, Joe Beckman2, Dheeraj Gurugubelli3
1 Purdue University; 2 Purdue University; 3 Purdue University

Keywords: Data Spillage, Hadoop Clouds, Data Carving, Data Wiping, Digital Forensics

Data spillage in Hadoop clusters has been identified by the National Security Agency as a security threat because sensitive information that is stored on these clusters and spilled has the potential to be seen by those without permission to access such data. Military and other government entities perceive data spillage as a security threat when sensitive information is introduced onto one or more unauthorized platforms. This project focuses on tracking sensitive information spilled through the introduction of a sensitive document onto a non-sensitive Hadoop cluster by a user. The goal of the project is to contain the spilled data more quickly and aid in the secure removal of that data from the impacted cluster. We seek to establish a procedure to respond to this type of data spillage event by tracking the spillage through a forensic process within a Hadoop environment on a cloud infrastructure.

Oluwatosin Alabi: Hadoop cluster configuration and system evaluation
Dheeraj Gurugubelli: Cyber forensics analysis and evaluation
ALL: Experimental design and data analysis
Joe Beckman: Data management and recommendation of policy-related controls

Project Deliverables:
1. Final Project Report: Hadoop: Understanding, Preventing, and Mitigating the Impacts of Data Spillage from Hadoop Clusters using Information Security Controls. Expected Delivery Date: 12/8/2014.
2. Project Poster: The project team will generate a poster for presentation of the project at the annual CERIAS Symposium. Expected Delivery Date: 3/24 - 3/25/2015.
3. Research Conference Presentation/Journal Paper:
3.1. Network and Distributed System Security Symposium/International Journal of Security and Its Applications. Expected Delivery Date: 1/1/2015
3.2.
Storage Networking Industry Association Data Storage Innovation Conference. Expected Delivery: April 7-9, 2015, Santa Clara, CA, USA.

Table of Contents
Executive Summary
1. Introduction
1.1 Scope
1.2 Significance
1.3 Problem Statement
1.4 Assumptions
1.5 Limitations
2. Literature Review
2.1 Hadoop Distributed File System and Data Spillage
2.2 Digital Forensics and Cloud Environments
3. Approach/Methodology
4.1 Data Management Plan
4. Results and Conclusions
5.
Schedule
6. Budget
7. Final Discussion and Future Directions
8. Bibliography
9. Biographical sketches of the team members

Executive Summary
Data spillage in Hadoop clusters has been identified by the National Security Agency as a security threat because sensitive information that is stored on these clusters and spilled has the potential to be seen by those without permission to access such data. Military and other government entities perceive data spillage as a security threat when sensitive information is introduced onto one or more unauthorized platforms. As the use of Hadoop clusters to manage large amounts of data grows both inside and outside of government, the ability to locate and remove data effectively and efficiently in Hadoop clusters will become increasingly important.

This project focuses on tracking classified information spillage in Hadoop-based clusters in order to contain the spill more quickly and aid the data wiping process. The goal of this project is to establish a procedure to handle data spillage by tracking the spilled data through the Hadoop Distributed File System (HDFS) and applying digital forensics processes to aid in the removal of spilled data within an impacted Hadoop cluster on a cloud infrastructure.
Our approach to the problem is novel because we merged an understanding of the HDFS structure with digital forensic procedures to identify, acquire, analyze, and report the locations of spilled data and any remnants on virtual disks. Toward that end, we used the procedural framework illustrated below to track and analyze user-introduced data spillage in the Hadoop-based cloud environment. By following this procedure, we were able to image each impacted cluster node for analysis, locate all occurrences of a .pdf file after it was loaded onto the Hadoop Distributed File System (HDFS) cluster, and recover the file once it was deleted from the cluster using HDFS commands.

Data Spillage in Hadoop Clouds

1. Introduction
Data spillage in Hadoop clusters has been identified by the National Security Agency as a security threat because sensitive information that is stored on these clusters and spilled has the potential to be seen by those without permission to access such data. Military and other government entities perceive data spillage as a security threat when sensitive information is introduced onto one or more unauthorized platforms. This project focuses on tracking sensitive information in the Hadoop Distributed File System (HDFS) after a user introduces a sensitive document onto a non-sensitive Hadoop cluster. The goal of the project is to find all occurrences of the spilled data (defined in this project as a user-introduced sensitive file) on the Hadoop cluster and to image the impacted cluster nodes for forensic analysis, in order to aid in removing that sensitive data from the impacted Hadoop cluster.
We seek to establish a procedure that can be used to respond to this type of data spillage event by tracking the spilled information using digital forensics processes within a Hadoop environment on a cloud infrastructure.

1.1 Scope
The scope of this project covers locating all instances of deleted sensitive data on an impacted, virtualized Hadoop cluster and preserving that information on the impacted node images for forensic analysis, under the following technical constraints:
• Hadoop Cluster: To fit the study within the required timeline, the number of nodes and the size of each node had to be carefully chosen. This study used an 8-node cluster with 80 GB of storage space on each node.
• Virtualization: The Hadoop cluster used in this experiment was created using VMware ESXi software. Therefore, all nodes in the cluster exist not as physical servers but as virtual machines.
• Storage Disk Type: This study is valid only when the storage units used are Hard Disk Drives (HDDs).

1.2 Significance
Data spillage is a critical threat to the confidentiality of sensitive data. As use of increasingly large data sets grows, use of Hadoop clusters to handle these data sets is also growing, and with it, the potential for data spillage events. In this context, the ability to completely remove sensitive data from a Hadoop cluster is critical, especially in areas that impact national security. Without a process to effect the complete removal of sensitive data from Hadoop clusters, user-induced data spillage events in Hadoop clusters could impact the privacy of billions of people worldwide and, in classified United States government settings, the national security of the United States.
1.3 Problem Statement
Data spillage in Hadoop clusters has been identified by the National Security Agency as a security threat because sensitive information spilled onto unauthorized Hadoop clusters has the potential to be seen by those without permission to access such data, which in turn could negatively impact the national security of the United States.

1.4 Assumptions
This study assumes that the following are true.
• The drives used for storage in the cluster are Hard Disk Drives.
• AccessData's FTK Imager version 2.6 and Forensic Toolkit version 5 correctly represent the file structure of a forensic image file and of the PDF files used in this study.
• The file load onto HDFS is successful, and the location of the spilled data on the data nodes can be determined from the metadata in the logs of the name node.

1.5 Limitations
The following are limitations of this study.
• Time seriously constrained this research. Forensic procedures such as image acquisition and analysis consumed, on average, 13-18 hours for each node.
• The computing infrastructure available for digital forensics processing was limited. Higher transfer and processing capability could have significantly reduced the amount of time spent processing the impacted cluster nodes.
• This research addressed a subset of the configurable options available when deploying a virtual machine-based Hadoop cluster. We believe that these configuration options had little or no impact on the results of the study, but re-running the experiment with different configuration options was outside the scope of this study.
• This research used VMware as a virtual machine platform. Other products may produce different results.
• The VMware ESXi platform was used to acquire the .vmdk images from the data store. Changes in the versions of that software could produce different results.

2.
Literature Review
2.1 Hadoop Distributed File System and Data Spillage
Big data sets are characterized by the following attributes (Ramanathan et al., 2013): high volume (data size), variety (multiple sources and data types), velocity (the rate at which new information is added to the data set), and value (the utility and quality of the data set). A growing number of organizations are implementing data repositories using private cloud infrastructures to facilitate shared access to their data repositories, referred to as data lakes (EMC White Paper, 2014), and to high performance computing resources. The National Institute of Standards and Technology (NIST) has defined cloud computing as “a model for enabling convenient, on demand network access to a shared pool of configurable resources (e.g. networks, servers, storage, application, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction” (Mell & Grance, 2011). Due to the trend toward the use of cloud computing infrastructures for data storage and processing, companies must address new and different security risks than those associated with traditional data storage and processing systems. One particular security risk is the loss of control over sensitive and protected data within an organization’s information technology infrastructure, which characterizes data leakage. Data leakage is defined as the accidental or intentional distribution of classified or private information to an unauthorized entity (Anjali, Geetanjali, Shivlila, R. Shetkar, & B., 2013). Data spillage, a specific type of data leakage, occurs when classified or sensitive information is moved onto an unauthorized or undesignated compute node or memory medium (e.g. disk). The ability to control the privacy of sensitive information, then, is a critical component of protecting national infrastructure and security.
Currently, research related to understanding and determining incident response techniques for dealing with data spillage within Hadoop’s distributed file system, HDFS, is limited. Hadoop is the open source implementation of the Google MapReduce parallel computing framework. There are two major components of a Hadoop system: the HDFS file system for data storage, and a parallel data processing framework. HDFS is one of a number of distributed file systems, such as PVFS, Lustre, and the Google File System (GFS). Unlike PVFS and Lustre, HDFS does not use RAID as part of its data protection mechanism. Instead, HDFS replicates data over multiple nodes, called DataNodes, to ensure reliability (DeRoos, 2014). The HDFS architecture is modeled on the Unix file system, which stores files as blocks. Each block stored on a DataNode holds up to 64 MB or 128 MB of data, as defined by the system administrator. Each group of blocks has metadata descriptions that are stored by the NameNode. The NameNode manages the storage of file locations and monitors the availability of DataNodes in the system, as described in Figure 1 below. Although there are a number of system-level configurations that system administrators can implement to help secure Hadoop systems, they do not eliminate data spillage incidents related to user error (The Apache Software Foundation, 2014). This study focuses specifically on a case in which a user loads a confidential or sensitive file onto a Hadoop cluster that is not authorized to store or process that classified or sensitive data.

Figure 1. Hadoop Distributed File System

2.2 Digital Forensics and Cloud Environments
Lu and Lin (2010) researched techniques for providing secure data provenance in cloud computing environments.
They proposed a scheme characterized by a) information confidentiality for sensitive documents stored in the cloud, b) anonymous authentication of user access, and c) provenance tracking of disputed documents. Lu and Lin (2010) were pioneers in proposing a feasible security scheme to ensure the confidentiality of sensitive data stored in cloud environments. Research has also been conducted to identify technical issues in digital forensics investigations performed on cloud-based computing platforms. The authors of this research argue that, due to the decentralized nature of clouds and of data processing in the cloud, traditional digital investigative approaches to evidence collection and recovery are not practical in cloud environments (Birk & Wegener, 2011). We propose to extend existing research by applying cloud-based digital forensics frameworks to incident management within Hadoop clusters. Our research addresses this process in the context of the Hadoop Distributed File System (HDFS), where published research is currently limited.

The theoretical framework used in the analysis portion of our investigation is the cloud forensics framework used and published by Martini and Choo (2014). These authors validated the cloud forensics framework outlined by McKemmish (1999) and the National Institute of Standards and Technology (NIST) for conducting digital forensics investigations (Kent, Chevalier, Grance, & Dang, 2006; Martini & Choo, 2014). The framework, depicted in Figure 2, describes a four-stage iterative process that includes identification, collection, analysis, and reporting of digital artifacts within the HDFS data storage system.

Figure 2. Digital investigation process overview used to guide this study.

2.3 Data Carving
When a file is deleted in most file systems, including HDFS, only the reference to that data, called a pointer, is deleted but the data itself remains.
In the FAT file system, for example, when a file is deleted the file’s directory entry is simply changed to reflect that the space the data occupies is now unallocated. The first character of the file name is replaced with a marker. The actual file data is left unchanged, and the data remnants remain present until they are overwritten with other data (Carrier & Spafford, 2003). Similarly in HDFS, when a file is deleted, only the pointer to the file on the name node is deleted. The data remnants remain unchanged on the data nodes until that data is overwritten.

According to the Digital Forensic Research Workshop, “Data carving is the process of extracting a collection of data from a larger data set. Data carving techniques frequently occur during a digital investigation when the unallocated file system space is analyzed to extract files. The files are 'carved' from the unallocated space using file type-specific header and footer values. File system structures are not used during the process,” (Garfinkel, 2007). Simply stated, file carving is the process of extracting the remnants of data from a larger storage space. Data carving techniques are an important part of digital investigations, and digital forensics examiners commonly look for data remnants in unallocated file system space. Beek (2011) wrote a white paper explaining data carving concepts in which he referred to several data carving tools. In his paper, Beek also explained the difference between data carving and data recovery. According to Beek, data recovery retrieves data based on the file system structure, an approach that is not useful once a system has been formatted. Further, Beek states that, for carving, the file system in use is not important to the retrieval process. In the case examined by our research, however, we depend on HDFS to identify the nodes that need to be carved.
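The header/footer technique named in the DFRWS definition above can be sketched in a few lines of Python. This is a minimal illustration only, not the procedure used by FTK or any other tool in this study. The function name carve_pdfs, the max_size parameter, and the assumption that the image fits in memory are all introduced here for illustration. The sketch pairs each %PDF header (hex 25 50 44 46) with the first %%EOF footer (hex 25 25 45 4F 46) that follows it.

```python
PDF_HEADER = bytes.fromhex("25504446")    # "%PDF"
PDF_FOOTER = bytes.fromhex("2525454f46")  # "%%EOF"


def carve_pdfs(image, max_size=10 * 1024 * 1024):
    """Return (offset, data) pairs for candidate PDFs found in a raw image.

    image    -- bytes of the raw (dd) image; assumed small enough for memory
    max_size -- hypothetical upper bound on a carved file, in bytes
    """
    carved = []
    start = image.find(PDF_HEADER)
    while start != -1:
        # Only look for the footer within max_size bytes of the header.
        window_end = min(len(image), start + max_size)
        footer = image.find(PDF_FOOTER, start, window_end)
        if footer != -1:
            end = footer + len(PDF_FOOTER)
            carved.append((start, image[start:end]))
            # Resume the header search after the carved region.
            start = image.find(PDF_HEADER, end)
        else:
            # No footer in range: skip this header and keep searching.
            start = image.find(PDF_HEADER, start + 1)
    return carved
```

Stopping at the first footer keeps the sketch simple, but a PDF saved with incremental updates contains more than one %%EOF marker and would be truncated, which is one reason practical carvers add validation. An 80 GB node image would also need to be read in chunks or memory-mapped rather than loaded whole.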
Garfinkel (2007) proposed a file carving taxonomy which includes the following suggested types of file carving.
• Block-Based Carving
• Statistical Carving
• Header/Footer Carving
• Header/Maximum (file) size Carving
• Header/Embedded Length Carving
• File structure based Carving
• Semantic Carving
• Carving with Validation
• Fragment Recovery Carving
• Repackaging Carving

Every file stored on disk has a file type, and each file type is associated with header and footer values. For example, a PDF file starts with “%PDF” and ends with “%%EOF”, and can be discovered by a string search of the disk space for the hex values of its header and footer signatures. There are a number of techniques to “carve” data remnants of deleted files from a disk, and the string search method is one of them (Povar & Bhadran, 2011). The Boyer-Moore searching algorithm, described in R. S. Boyer and J. S. Moore's 1977 paper “A Fast String Searching Algorithm”, is one of the best known ways to perform a substring search in a given search space. The header and footer signatures for some common file types are shown in Figure 3 below:

Extension | Header (Hex) | Footer (Hex)
DOC | D0 CF 11 E0 A1 B1 1A E1 | 57 6F 72 64 2E 44 6F 63 75 6D 65 6E 74 2E
XLS | D0 CF 11 E0 A1 B1 1A E1 | FE FF FF FF 00 00 00 00 00 00 00 00 57 00 6F 00 72 00 6B 00 62 00 6F 00 6F 00 6B 00
PPT | D0 CF 11 E0 A1 B1 1A E1 | 50 00 6F 00 77 00 65 00 72 00 50 00 6F 00 69 00 6E 00 74 00 20 00 44 00 6F 00 63 00 75 00 6D 00 65 00 6E 00 74
ZIP | 50 4B 03 04 14 | 50 4B 05 06 00
JPG | FF D8 FF E0 00 10 4A 46 49 46 00 01 01 | D9 (“Better To Use File size Check”)
GIF | 47 49 46 38 39 61 4E 01 53 00 C4 | 21 00 00 3B 00
PDF | 25 50 44 46 2D 31 2E | 25 25 45 4F 46

Figure 3: Headers and Footers for Common File Types

Povar and Bhadran (2011) proposed an algorithm to carve PDF files. The algorithm contains 6 steps, as quoted below.
Step 1. Look for the header signature (%PDF)
Step 2. Check for the version number [file offset 6-8]
Step 3.
If version no. > 1.1 go to Step4, else go to Step6
Step 4. Search for the string “Linearized” in first few bytes of the file
Step 5. If it finds the string in Step 4, then length of the file is preceded by a “/L ” character sequence. Carved file size = embedded length;// 479579, else go to step 6.
Step 6. Use search algorithms to find footer signature (%%EOF). Searching will be continued until the carved file size<=User specified file size.

2.4 Contributions of Literature
Frameworks and processes exist in the literature that support digital forensics processes in virtual and cloud environments. These artifacts are not, however, specific to Hadoop environments. To guide our efforts to locate deleted data in HDFS, image the impacted disks for digital forensic analysis, and support the secure removal of remnants of sensitive data within the cluster, we will extend the cloud forensics framework from NIST (Kent, Chevalier, Grance, & Dang, 2006) and McKemmish (1999), pictured in Figure 4 below, to support digital forensics operations in HDFS.

Figure 4: Cloud Cyberforensics framework adapted for HDFS

3. Approach/Methodology
Following the scoping of the project and preparation of the Hadoop cluster environment, the four basic stages of digital forensics analysis are outlined below, together with some of the practical considerations related to conducting a forensic investigation in a cloud computing environment, such as one hosting HDFS.
1. Evidence Source Identification and Preservation: The focus of this stage is to identify and preserve potential sources of evidence. In the case of HDFS, the name node and data nodes are the points to consider when identifying where evidence may be located. The name node can be used to determine where and how a file is distributed across the distributed file system’s data nodes.
We must also consider other components related to the removal of evidence or to system administration, such as the trash folder and the virtual machine snapshots taken to capture the state of, and data stored within, a virtual machine at specified points in time.
2. Collection and Recovery: This stage is aimed at the collection of the data/information identified and preserved during stage one. In a virtual environment, access to the physical media is limited, so we instead collect an image of the disk contents provisioned to the virtual machine (VM), stored in the Virtual Machine Disk (VMDK) file format. These VMDK files are retrieved from the VM’s local disk, accessed through the vSphere client’s Datastore. The VMDK file must be converted to a raw (dd) image format as part of the imaging steps needed for stage three.
3. Examination and Analysis: This stage is aimed at examining the forensic data and analyzing that data, using a scientifically sound process, to help gather facts related to the incident under investigation. In our investigation we divide this into two steps:
a. Data remanence: Here we capture the information/data stored in the VMDK file and inspect it for relevant factual evidence related to the incident under investigation. We use the forensics tools EnCase, from Guidance Software, and Forensic Toolkit (FTK), from AccessData, to help process and analyze the .vmdk files retrieved from the HDFS cluster nodes. The aim is to recover any trace evidence that may relate to the data spillage incident and to the removal of sensitive information from the data nodes.
b. Data Recovery: Here we examine the image files using the EnCase, FTK, and Stellar forensic investigation tools to recover data that was deleted from the file system managed by the VM. Both EnCase and FTK allow raw disk images to be loaded and processed as dd file types.
Stellar requires that images be mounted in Windows as a logical drive using a third party tool (Buchanan-Wollaston, Storer, & Glisson, 2013).
4. Reporting and Presentation: This stage relates to the legal presentation of the collected evidence and investigation. In this project we are concerned with reporting the evidence and the practical steps that were implemented to capture and recover evidence using a scientifically sound, repeatable process. This will be documented and presented to the project Technical Directors and other stakeholders.

4.1 Data Management Plan
Managing data throughout the research life cycle is essential in order for this experiment to be replicated for further testing, to serve as guidance for various federal agencies, and to be re-tested as new versions of Hadoop are released. The ability to publish and re-create this work depends on thorough documentation of processes and configuration settings. Initial data management will document the specifications of the hardware and software used, the processes used in configuring both the Hadoop cluster and the physical and virtual nodes, and a detailed description of the tagged data and the process of loading it into the cluster. Any software or data used will be backed up to PURR when such backups are technically feasible and will not violate applicable laws. Results of forensic analysis will also be thoroughly documented and uploaded to PURR. PURR will serve the data management requirements of this project well because it provides data storage and tools for uploading project data, communicating that data, and other services such as data security, fidelity, backup, and mirroring. Purdue Libraries also offers consulting services, at no cost to project researchers, to facilitate the selection and uploading of data, inclusive of the generating application and the metadata necessary to ensure proper long-term data stewardship.
Documentation of results and processes will also be uploaded to the team's shared space on Google Docs as a backup to PURR. The contact information of the project's appointed data manager will also be uploaded to PURR to facilitate long-term access to project data.

4. Results and Conclusions
Using the framework and digital forensics processes detailed above, we successfully located and recovered from the impacted DataNodes the deleted .pdf file that we had designated as sensitive. Four tests were conducted on the four images acquired during the acquisition process. For the purpose of testing, a Known Evidence File (KEF) was introduced onto the Hadoop Distributed File System. The KEF used is a PDF document.

Test 1: Metadata Search from Image 1: NameNode.vmdk (Before deletion of the KEF)
The aim of this test was to find metadata in the NameNode image before deletion of the KEF. The expected outcome was to find the metadata indicating the nodes to which the KEF had been replicated. FTK Imager was able to mount the directory structure of the VM image added as evidence. The event logging file, hdfs-audit.log, was located, and the replication of the KEF was logged. The logs indicated the source path of the KEF, the IP address of the name node, and the disk IDs of the DataNodes to which the data was replicated. Thus, the DataNodes that needed to be imaged were identified.

Figure 5: A snapshot of the hdfs-audit.log

Test 2: Data Carving from Image 2: DataNode.vmdk (Before deletion of the KEF replica)
The aim of this test was to find the KEF in the file structure of the DataNode and identify the path and the physical location to which it had been copied. The expected outcome was to locate the complete KEF. FTK Imager was able to mount the directory structure of the VM image added as evidence. The KEF, a Probability and Statistics book in PDF format, was located.

Figure 6: Path to the KEF pdf.
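The search performed in Test 2 can be approximated on a mounted DataNode image with a short script. The sketch below is illustrative only and is not how FTK Imager locates files. It assumes that the DataNode storage directory is readable as an ordinary directory tree, that block data files follow the HDFS naming pattern blk_<id> with companion .meta checksum files, and that the file of interest fits in a single block so that the block begins with the %PDF header. The function name find_pdf_blocks is hypothetical.

```python
import os

PDF_HEADER = b"%PDF"


def find_pdf_blocks(root):
    """Return sorted paths of block files under root that start with %PDF.

    root -- directory where the DataNode storage tree (for example the
            dfs/dn folder) is assumed to be mounted
    """
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # Block data files are named blk_<id>; the .meta companions
            # hold checksums rather than file content, so skip them.
            if not name.startswith("blk_") or name.endswith(".meta"):
                continue
            path = os.path.join(dirpath, name)
            with open(path, "rb") as handle:
                if handle.read(len(PDF_HEADER)) == PDF_HEADER:
                    hits.append(path)
    return sorted(hits)
```

A file larger than the configured block size is split across several block files, and only the first of them carries the header, so a check like this would identify where such a file starts but not all of its blocks.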
Figure 7: The physical location of the KEF pdf

Figure 8: The KEF in Natural View in FTK Imager

The KEF file is located in the dfs directory, within the dn folder, in the 35th subdirectory. Thus, it is confirmed that the file was replicated to the DataNodes indicated in the metadata on the NameNode. The physical location of the KEF is known and could be used in the data wiping process.

Test 3: Log Search from Image 3: NameNode.vmdk (After deletion of the KEF)
The aim of this test was to find logs in the NameNode image after deletion of the KEF. The expected outcome was to find logs indicating the deletion of the KEF. FTK Imager was able to mount the directory structure of the VM image added as evidence. The event logging file, hdfs-audit.log, was located, and the deletion of the KEF was logged. The logs indicated that the KEF had been deleted and moved to the .Trash directory. Thus, the KEF was successfully deleted and its replicas were deleted on the DataNodes.

Figure 9: The hdfs-audit.log confirming the delete of the KEF

Test 4: Data Carving from Image 4: DataNode.vmdk (After deletion of the KEF replica)
The aim of this test was to carve the KEF from the DataNode image after deletion of the KEF. The expected outcome was to find and carve the KEF. Forensic Toolkit 5.3 was able to mount the VM image and process the image added as an evidence file. Though there are many carving techniques available, as mentioned in the literature review, the Header/Footer carving method was used for this test. Searching the node image for 25 50 44 46 (the hex header for PDF) returned many PDF files and eventually led to the discovery of the KEF PDF. The KEF was found at exactly the same path found in Test 2, which deviates from the expected location. When a file is deleted, the pointer to the file is deleted and the space is labeled as unallocated space. Contrary to that expectation, the KEF was found at the same path location.
Test 3 provides evidence that the file was deleted, which implies that it was removed from the file structure. The results nevertheless show that it is still present in the dfs directory. Thus, the KEF PDF was carved.

Figure 10: KEF found in the Header/Footer Search

Figure 11: Using the Save Selection option the hex was exported

Figure 12: The file found is confirmed to be KEF

5. Schedule
The chart above shows the proposed timeline for this project, with completed items represented by green cells. Current progress, represented by the yellow cell, is behind original projections. The project team has faced significant and unanticipated challenges during the analysis of the cluster nodes, including reloading data onto the cluster and re-analyzing the new data.

6. Budget
6.1 Proposed Budget
Team Hours (3 members, 12 weeks, $1,984/member): $5,592.00
Fringe Benefits (3 members, $750/member): $2,250.00
Purdue Indirect (1, $15,677.86 * .54): $8,466.00
Conference Travel (3 members, $2,000.00/member): $6,000.00
TOTAL: $22,308.00

6.2 Actual Expenditures
Team Hours (3 members, 14 weeks, $2,315/member): $6,944.00
Fringe Benefits (3 members, $750/member): $2,250.00
Purdue Indirect (1, $15,677.86 * .54): $8,466.00
USB 1TB External Hard Disk Drive: $70.00
Conference Travel (3 members, $2,000.00/member): $6,000.00
TOTAL: $23,730.00

6.3 Discussion of Discrepancies
Actual expenditures exceeded the proposed budget due to the greater number of hours that team members spent working on the project and the purchase of an external hard disk drive used to store node images. Team members worked more hours on the project than initially budgeted due to the challenges of locating the text file on the data node images. These challenges forced the loading onto the Hadoop cluster of a second file, which was in .pdf format. Following this load, the cluster nodes were re-imaged and re-examined.
To store several 80 GB node images for examination and data management, the team required a 1 TB external hard drive, both to speed the examination process and because the PURR system, listed in the initial proposal as the storage medium for these images, had insufficient storage for archiving them.

7. Final Discussion and Future Directions

This study illuminated in greater depth several important issues involved in locating sensitive data in Hadoop clusters. Many of these issues were related not to Hadoop itself but to the environment in which Hadoop is run. Because the study was limited in physical resources, the team had to navigate issues with the virtual environment as well as those related to Hadoop. Although the team did not anticipate it, the changes to the DataNode analysis required by the virtualized environment appear to make the analysis process more challenging.

As a result, several areas of future work related to this study exist. It is important to understand how the process of locating removed data and preserving it for forensic examination differs between a physical Hadoop cluster and a virtual one. Further, different virtualization techniques may also require changes to the process.

Beyond finding and preserving evidence of deleted sensitive information in the Hadoop environment, an organization that uses these techniques is also likely to want to remove all traces of the sensitive data to a specification more robust than simply issuing Hadoop commands to remove the data. In these cases, the procedures discussed in this study would need to be augmented to include the destruction of remaining traces of data using DOJ or other data removal standards.
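As a rough illustration of what such an augmentation might involve, the following Python sketch overwrites a file in place before removing it. It does not implement any particular removal standard and was not part of this study's procedure. The function names, the passes parameter, and the use of random data are assumptions made for illustration, and the path passed in would be the physical location of a block file identified as in Test 2.

```python
import os


def overwrite_file(path: str, passes: int = 1, chunk_size: int = 1 << 20) -> None:
    """Overwrite the file's contents in place with random bytes (hypothetical helper)."""
    size = os.path.getsize(path)
    with open(path, "r+b") as f:
        for _ in range(passes):
            f.seek(0)
            remaining = size
            while remaining > 0:
                n = min(chunk_size, remaining)
                f.write(os.urandom(n))
                remaining -= n
            # Push the overwritten data out of Python and OS buffers.
            f.flush()
            os.fsync(f.fileno())


def overwrite_and_remove(path: str, passes: int = 1) -> None:
    """Overwrite the file, then delete its directory entry."""
    overwrite_file(path, passes)
    os.remove(path)
```

An in-place overwrite of this kind gives no guarantee on journaling or copy-on-write file systems, on solid-state drives, or where virtual disk snapshots exist, and it would have to be repeated for every replica of the block on every affected DataNode.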
Finally, because data is replicated across several nodes within the Hadoop cluster according to Hadoop configuration settings, a removal process that requires taking impacted nodes out of the cluster could be extremely costly in performance and time. Future work should therefore also study whether an organization can remove data within the Hadoop cluster to a desired specification without removing nodes from their duties in the cluster. A full answer to the questions posed by this problem will involve not only finding and preserving data from Hadoop clusters, but also the live removal of remaining traces of data to a specification selected by the organization, and the automation of that process.

8. Bibliography

Anjali, N. B., Geetanjali, P. R., Shivlila, P., R. Shetkar, S., & B., K. (2013). Data leakage detection. International Journal of Computer and Mobile Computing, 2(May), 283-288.

Apache Software Foundation. (2014). Apache Hadoop 2.6.0 - Cluster Setup. Retrieved October 12, 2014, from http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/ClusterSetup.html

Beek, C. (2011). Introduction to file carving. White paper. McAfee.

Birk, D., & Wegener, C. (2011, May). Technical issues of forensic investigations in cloud computing environments. In Systematic Approaches to Digital Forensic Engineering (SADFE), 2011 IEEE Sixth International Workshop on (pp. 1-10). IEEE.

Buchanan-Wollaston, J., Storer, T., & Glisson, W. (2013). Comparison of the data recovery function of forensic tools. In Advances in Digital Forensics IX (pp. 331-347). Springer Berlin Heidelberg.

Carrier, B., & Spafford, E. H. (2003). Getting physical with the digital investigation process. International Journal of Digital Evidence, 2(2), 1-20.

DeRoos, D. (2014). Hadoop for Dummies. Wiley & Sons. New York. pp. 10-14.

EMC White Paper. (2014). Security and compliance for scale-out Hadoop data lakes.

Garfinkel, S. L. (2007). Carving contiguous and fragmented files with fast object validation. Digital Investigation, 4, 2-12.

Kent, A. K., Chevalier, S., Grance, T., & Dang, H. (2006). Guide to Integrating Forensic Techniques into Incident Response. Gaithersburg, MD: NIST Special Publication 800-86. Retrieved from http://cybersd.com/sec2/800-86Summary.pdf

Lu, R., Lin, X., Liang, X., & Shen, X. S. (2010, April). Secure provenance: the essential of bread and butter of data forensics in cloud computing. In Proceedings of the 5th ACM Symposium on Information, Computer and Communications Security (pp. 282-292). ACM.

Martini, B., & Choo, K.-K. R. (2014). Distributed filesystem forensics: XtreemFS as a case study. Digital Investigation, 1-19. doi:10.1016/j.diin.2014.08.002

McKemmish, R. (1999). What is forensic computing? (pp. 1-6). Retrieved from http://aic.gov.au/documents/9/C/A/%7B9CA41AE8-EADB-4BBF-989464E0DF87BDF7%7Dti118.pdf

Mell, P., & Grance, T. (2009). The NIST definition of cloud computing. National Institute of Standards and Technology, 53(6), 50.

Povar, D., & Bhadran, V. K. (2011). Forensic data carving. In Digital Forensics and Cyber Crime (pp. 137-148). Springer Berlin Heidelberg.

Ramanathan, A., Pullum, L., Steed, C. A., Quinn, S. S., Chennubhotla, C. S., & Parker, T. (2013). Integrating Heterogeneous Healthcare Datasets and Visual Analytics for Disease Biosurveillance and Dynamics.

9. Biographical sketches of the team members

Oluwatosin Alabi is a doctoral research assistant for Drs. Dark and Springer. Her research background in system modeling and analysis allows her to approach this project from a statistical modeling perspective. Over the summer, she worked on developing a data motion power consumption model for multicore systems for the Department of Energy (DOE). In addition, she is building expertise and experience in big data analytics under her adviser Dr. Springer, the head of the Discovery Advancements Through Analytics (D.A.T.A.) Lab.
Joe Beckman is a Ph.D. student specializing in secure data structures for national health data infrastructure. His most recent experience includes security and privacy analysis of laws, policies, and information technologies for the Office of the National Coordinator for Health Information Technology within the Department of Health and Human Services.

Dheeraj Gurugubelli is currently pursuing his second master's degree, in Computer Science and Information Technology, at Purdue University. Dheeraj is a qualified professional with diverse interests and experiences. Prior to his graduate studies at the University of Warwick, Dheeraj spent four years studying computer science engineering at MVGR College of Engineering. Continuing his education, he graduated with a master's degree in Cyber Security and Management from the University of Warwick, UK. While at Warwick, Dheeraj worked at Hewlett Packard Cloud and Security Labs, building a prototype for the semantic analysis phase of the HP cloud compliance automation tool. Further, Dheeraj worked at Purdue University as a Research Scholar in the domain of cyber security and digital forensics. He is an active member of IEEE.