EMC Isilon Hadoop Starter Kit for Pivotal HD
Release 3.0
EMC
November 04, 2014

Contents

1 EMC Isilon Hadoop Starter Kit for Pivotal HD with VMware Big Data Extensions
  1.1 Introduction
  1.2 Environment
  1.3 Installation Overview
  1.4 Prerequisites
  1.5 Prepare Network Infrastructure
  1.6 Prepare Isilon
  1.7 Prepare NFS Clients
  1.8 Prepare VMware Big Data Extensions
  1.9 Prepare Hadoop Virtual Machines
  1.10 Install Pivotal Command Center
  1.11 Deploy a Pivotal HD Cluster
  1.12 Initialize HAWQ
  1.13 Adding a Hadoop User
  1.14 Functional Tests
  1.15 Searching Wikipedia
  1.16 Where To Go From Here
  1.17 Known Limitations
  1.18 Legal
  1.19 References

CHAPTER 1
EMC Isilon Hadoop Starter Kit for Pivotal HD with VMware Big Data Extensions

This document describes how to create a Hadoop environment that uses Pivotal HD with an EMC Isilon Scale-Out NAS providing HDFS-accessible shared storage. VMware Big Data Extensions is used to provision and manage the compute resources.

Note: The latest version of this document is available at http://hsk-phd.readthedocs.org/.

1.1 Introduction

IDC published an update to their Digital Universe study in December 2012 and found that the rate of digital data creation is not only continuing to grow, but is actually accelerating. By the end of this decade we will create 40 zettabytes of new digital information yearly, the equivalent of 1.7 MB of digital information for every man, woman, and child every second of every day. This information explosion is creating new opportunities for businesses to leverage digital information to serve their customers better, faster, and more cost effectively through Big Data Analytics applications. Unlike traditional solutions such as an RDBMS, Hadoop technologies can cost-effectively manage structured, semi-structured, and unstructured data.
The need to track and analyze consumer behavior, maintain inventory and space, target marketing offers on the basis of consumer preferences, and attract and retain consumers are some of the factors pushing the demand for Big Data Analytics solutions using Hadoop technologies. According to a market report published by Transparency Market Research (http://www.transparencymarketresearch.com), "Hadoop Market - Global Industry Analysis, Size, Share, Growth, Trends, and Forecast, 2012-2018," the global Hadoop market was worth USD 1.5 billion in 2012 and is expected to reach USD 20.9 billion in 2018, growing at a CAGR of 54.7% from 2012 to 2018.

Hadoop, like any new technology, can be time consuming and expensive for customers to deploy and make operational. When we surveyed a number of our customers, two main challenges to getting started were identified: confusion over which Hadoop distribution to use and how to deploy using existing IT assets and knowledge. Hadoop software is distributed by several vendors, including Pivotal, Hortonworks, and Cloudera, with proprietary extensions. In addition to these distributions, Apache distributes a free open source version. From an infrastructure perspective, many Hadoop deployments start outside the IT data center and do not leverage the existing IT automation, storage efficiency, and protection capabilities. Many customers cited the time it took IT to deploy Hadoop as the primary reason to start with a deployment outside of IT.

This guide is intended to simplify Hadoop deployments by creating a shared storage model with EMC Isilon scale-out NAS and Pivotal HD, reducing both the time and the cost of deployment by leveraging tools that automate Hadoop cluster deployments.

1.1.1 Audience

This document is intended for IT program managers, IT architects, developers, and IT management who want to deploy Hadoop quickly with automation tools and leverage EMC Isilon for HDFS shared storage. It can be used by somebody who does not yet have an EMC Isilon cluster: the free EMC Isilon OneFS Simulator can be installed as a virtual machine for testing and training. However, this document can also be used by somebody who will be installing in a production environment, as best practices are followed whenever possible.

1.1.2 Apache Hadoop Projects

Apache Hadoop is an open source, batch data processing system for enormous amounts of data. Hadoop runs as a platform that provides cost-effective, scalable infrastructure for building Big Data analytic applications. All Hadoop clusters contain a distributed file system called the Hadoop Distributed File System (HDFS) and a computation layer called MapReduce.

The Apache Hadoop project contains the following subprojects:

• Hadoop Distributed File System (HDFS) – A distributed file system that provides high-throughput access to application data.
• Hadoop MapReduce – A software framework for writing applications to reliably process large amounts of data in parallel across a cluster.

Hadoop is supplemented by an ecosystem of Apache projects, such as Pig, Hive, Sqoop, Flume, Oozie, Whirr, HBase, and ZooKeeper, that extend the value of Hadoop and improve its usability.

Version 2 of Apache Hadoop introduces YARN, a sub-project of Hadoop that separates the resource management and processing components. YARN was born of a need to enable a broader array of interaction patterns for data stored in HDFS beyond MapReduce.
The YARN-based architecture of Hadoop 2.0 provides a more general processing platform that is not constrained to MapReduce. For full details of the Apache Hadoop project, see http://hadoop.apache.org/.

1.1.3 Pivotal HD with HAWQ

Open-source Hadoop frameworks have rapidly emerged as preferred technology to grapple with the explosion of structured and unstructured big data. You can accelerate your Hadoop investment with an enterprise-ready, fully supported Hadoop distribution that ensures you can harness—and quickly gain insight from—the massive data being driven by new apps, systems, machines, and the torrent of customer sources. HAWQ™ adds SQL's expressive power to Hadoop to accelerate data analytics projects, simplify development while increasing productivity, expand Hadoop's capabilities, and cut costs.

HAWQ can help your organization render Hadoop queries faster than any Hadoop-based query interface on the market by adding rich, proven, parallel SQL processing facilities. HAWQ leverages your existing business intelligence products, analytic products, and your workforce's SQL skills to bring more than 100X performance improvement (in some cases up to 600X) to a wide range of query types and workloads. The world's fastest SQL query engine on Hadoop, HAWQ is 100 percent SQL compliant. The first true SQL processing for enterprise-ready Hadoop, Pivotal HD with HAWQ is available to you as software only or as an appliance-based solution. More information on Pivotal HD and HAWQ can be found at http://www.pivotal.io/big-data/pivotal-hd.

1.1.4 Isilon Scale-Out NAS For HDFS

EMC Isilon is the only scale-out NAS platform natively integrated with the Hadoop Distributed File System (HDFS). Using HDFS as an over-the-wire protocol, you can deploy a powerful, efficient, and flexible data storage and analytics ecosystem. In addition to native integration with HDFS, EMC Isilon storage easily scales to support massively large Hadoop analytics projects. Isilon scale-out NAS also offers the unmatched simplicity, efficiency, flexibility, and reliability that you need to maximize the value of your Hadoop data storage and analytics workflow investment.

1.1.5 Overview of Isilon Scale-Out NAS for Big Data

The EMC Isilon scale-out platform combines modular hardware with unified software to provide the storage foundation for data analysis. Isilon scale-out NAS is a fully distributed system that consists of nodes of modular hardware arranged in a cluster. The distributed Isilon OneFS operating system combines the memory, I/O, CPUs, and disks of the nodes into a cohesive storage unit to present a global namespace as a single file system. The nodes work together as peers in a shared-nothing hardware architecture with no single point of failure. Every node adds capacity, performance, and resiliency to the cluster, and each node acts as a Hadoop namenode and datanode. The namenode daemon is a distributed process that runs on all the nodes in the cluster. A compute client can connect to any node through HDFS. As nodes are added, the file system expands dynamically and redistributes data, eliminating the work of partitioning disks and creating volumes. The result is a highly efficient and resilient storage architecture that brings all the advantages of an enterprise scale-out NAS system to storing data for analysis.
Unlike traditional storage, Hadoop's ratio of CPU, RAM, and disk space depends on the workload—factors that make it difficult to size a Hadoop cluster before you have had a chance to measure your MapReduce workload. Expanding data sets also make upfront sizing decisions problematic. Isilon scale-out NAS lends itself perfectly to this scenario: it lets you increase CPUs, RAM, and disk space by adding nodes to dynamically match storage capacity and performance with the demands of a dynamic Hadoop workload.

An Isilon cluster optimizes data protection. OneFS protects data more efficiently and reliably than HDFS. The HDFS protocol, by default, replicates a block of data three times. In contrast, OneFS stripes the data across the cluster and protects it with forward error correction codes, which consume less space than replication while providing better protection.

1.2 Environment

1.2.1 Versions

The test environment used for this document consists of the following software versions:

• Pivotal HD 2.1.0
• Pivotal Command Center 2.3.0
• Pivotal HAWQ 1.2.1.0
• Isilon OneFS 7.2.0
• VMware vSphere Enterprise 5.5.0
• VMware Big Data Extensions 2.0

1.2.2 Hosts

A typical Hadoop environment is composed of many types of hosts, described in the list below.

List of Hosts in Hadoop Environments

• DNS/DHCP Server: It is recommended to have a DNS and DHCP server that you can control. It will be used by all other hosts for DNS and DHCP services.
• VMware vCenter: VMware vCenter manages all ESXi hosts and their VMs. It is required for Big Data Extensions (BDE).
• VMware Big Data Extensions vApp Server (bde): This will run the Serengeti components that assist in deploying Hadoop clusters.
• VMware ESXi: Physical machines run the VMware ESXi operating system to host the virtual machines.
• Linux workstation (workstation): You should have a Linux workstation with a GUI that you will use to control everything.
• Isilon nodes (isiloncluster1-1, 2, 3, ...): You will need one or more Isilon Scale-Out NAS nodes. For functional testing, you can use a single Isilon OneFS Simulator node instead of the Isilon appliance. The nodes are clustered together into an Isilon cluster. Isilon nodes run the OneFS operating system.
• Isilon InsightIQ: This is optional licensed software from Isilon that can be used to monitor the health and performance of your Isilon cluster. It is recommended for any performance testing.
• Pivotal Command Center (hadoopmanager-server-0): This host will manage your Hadoop cluster. It runs Pivotal Command Center, which deploys and monitors the Hadoop components. In smaller environments, you may choose to deploy Pivotal Command Center to your Hadoop Master host.
• Hadoop Master (mycluster1-master-0): This host will run the YARN Resource Manager, Job History Server, Hive Metastore Server, etc. In general, it will run all "master" services except for the HDFS Name Node.
• Hadoop Worker Node (mycluster1-worker-0, 1, 2, ...): There will be any number of worker nodes, depending on the compute requirements. Each node will run the YARN Node Manager, HBase Region Server, etc. During the installation process, these nodes will run the HDFS Data Node, but these will become idle, like the HDFS Name Node, once Isilon provides HDFS.
Due to the many hosts involved, all commands that must be typed are prefixed with the standard system prompt identifying the user and host name. For example:

[user@workstation ~]$ ping myhost.lab.example.com

1.3 Installation Overview

Below is an overview of the installation process that this document describes.

1. Confirm prerequisites.
2. Prepare your network infrastructure, including DNS and DHCP.
3. Prepare your Isilon cluster.
4. Prepare VMware Big Data Extensions (BDE).
5. Use BDE to provision multiple virtual machines.
6. Install Pivotal Command Center.
7. Use Pivotal Command Center to deploy Pivotal HD to the virtual machines.
8. Initialize HAWQ.
9. Perform key functional tests.

1.4 Prerequisites

1.4.1 Isilon

• For low-capacity, non-performance testing of Isilon, the EMC Isilon OneFS Simulator can be used instead of a cluster of physical Isilon appliances. It can be downloaded for free from http://www.emc.com/getisilon. Refer to the EMC Isilon OneFS Simulator Install Guide for details, and be sure to follow the section for running the virtual nodes in VMware ESX. Only a single virtual node is required, but adding additional nodes will allow you to explore other features such as data protection, SmartPools (tiering), and SmartConnect (network load balancing).
• For physical Isilon nodes, you should have already completed the console-based installation process for your first Isilon node and added at least two other nodes, for a minimum of 3 Isilon nodes.
• You must have OneFS version 7.1.1.0 with patch-130611, or version 7.2.0.0 or higher.
• You must obtain a OneFS HDFS license code and install it on your Isilon cluster. You can get your free OneFS HDFS license from http://www.emc.com/campaign/isilon-hadoop/index.htm.
• It is recommended, but not required, to have a SmartConnect Advanced license for your Isilon cluster.
• To allow scripts and other small files to be easily shared between all nodes in your environment, it is highly recommended to enable NFS (Unix Sharing) on your Isilon cluster. By default, the entire /ifs directory is already exported and this can remain unchanged. This document assumes that a single Isilon cluster is used for this NFS export as well as for HDFS. However, there is no requirement that the NFS export be on the same Isilon cluster that you are using for HDFS.

1.4.2 VMware

• VMware vSphere Enterprise or Enterprise Plus.
• VMware vSphere vCenter should be installed and functioning.
• All hosts referenced in this document can be virtualized with VMware ESXi.
• At a minimum, for non-performance testing, you will need a single VMware ESXi host with 72 GB of RAM.
• VMDK files will be used for the operating system and applications of each host. Additionally, VMDKs will be used for temporary files needed to process Hadoop jobs (e.g. MapReduce spill and shuffle files). VMDKs can be located on physical disks attached to the ESXi host or on a shared NFS datastore hosted on an Isilon cluster. If using an Isilon cluster, be aware that the VMDK I/O will compete with the HDFS I/O, reducing overall performance.
• All ESXi hosts must be managed by vCenter.
• As a requirement for the VMware Big Data Extensions vApp, you must define a vSphere cluster.

1.4.3 Networking
• For the best performance, a single 10 Gigabit Ethernet switch should connect to at least one 10 Gigabit port on each ESXi host running a worker VM. Additionally, the same switch should connect to at least one 10 Gigabit port on each Isilon node.
• A single dedicated layer-2 network can be used to connect all ESXi hosts and Isilon nodes. Although multiple networks can be used for increased security, monitoring, and robustness, this adds complications that should be avoided when possible.
• At least an entire /24 IP address block should be allocated to your lab network. This will allow a DNS reverse lookup zone to be delegated to your own lab DNS server.
• If not using DHCP for dynamic IP address allocation, a static set of IP addresses must be assigned to Big Data Extensions so that it may allocate them to new virtual machines that it provisions. You will usually need just one IP address for each virtual machine.
• If using the EMC Isilon OneFS Simulator, you will need at least two static IP addresses (one for the node's ext-1 interface, another for the SmartConnect service IP). Each additional Isilon node will require an additional IP address.
• At a minimum, you will need to allocate to your Isilon cluster one IP address per Access Zone per Isilon node. In general, you will need one Access Zone for each separate Hadoop cluster that will use Isilon for HDFS storage. For the best possible load balancing during an Isilon node failure scenario, the recommended number of IP addresses is given by the formula below. Of course, this is in addition to any IP addresses used for non-HDFS pools.

  # of IP addresses = 2 * (# of Isilon Nodes) * (# of Access Zones)

  For example, 20 IP addresses are recommended for 5 Isilon nodes and 2 Access Zones.
• This document assumes that Internet access is available to all servers to download various components from Internet repositories.

1.4.4 DHCP and DNS

• A DHCP server on this subnet is highly recommended. Although a DHCP server is not a requirement for a Hadoop environment, this document assumes that one exists.
• A DNS server is required, and you must have the ability to create DNS records and zone delegations.
• Although not required, it is assumed in this document that a Windows Server 2008 R2 server is running on this subnet providing DHCP and DNS services. To distinguish this DNS server from a corporate or public DNS server outside of your lab, it will be referred to as your lab DNS server.
• It is recommended, and this document assumes, that the DHCP server creates forward and reverse DNS records when it provides IP address leases to hosts. The procedure for doing this is documented later in this document.
• It is recommended that a forward lookup DNS zone be delegated to your lab DNS server. For instance, if your company's domain is example.com, then DNS requests for lab.example.com, and all subdomains of lab.example.com, should be delegated to your lab DNS server.
• It is recommended that your DNS server delegate a subdomain to your Isilon cluster. For instance, DNS requests for subnet0-pool0.isiloncluster1.lab.example.com or isiloncluster1.lab.example.com should be delegated to the Service IP defined on your Isilon cluster.
• To allow for a convenient way of changing the HDFS Name Node used by all Hadoop applications and services, it is recommended to create a DNS record for your Isilon cluster's HDFS Name Node service. This should be a CNAME alias to your Isilon SmartConnect zone. Specify a TTL of 1 minute to allow for quick changes while you are experimenting. For example, create a CNAME record for mycluster1-hdfs.lab.example.com that targets subnet0-pool0.isiloncluster1.lab.example.com. If you later want to redirect all HDFS I/O to another cluster or a different pool on the same Isilon cluster, you simply need to change the DNS record and restart all Hadoop services.

1.4.5 Other

• You will need one Linux workstation which you will use to perform most configuration tasks. No services will run on this workstation.
  – It should have a GUI and a web browser.
  – It must have Python 2.6.6 or a higher 2.x version.
  – CentOS 6.5 has been used for testing, but most other systems should also work. However, be aware that CentOS 6.4, supplied with BDE 2.0, must be upgraded to support Python 2.6.6.
  – sshpass must be installed. This document shows how to install sshpass on CentOS 6.x.
• Several useful scripts and file templates are provided in the archive file isilon-hadoop-tools-x.x.tar.gz. Download the latest version from https://github.com/claudiofahey/isilon-hadoop-tools/releases.
• Time synchronization is critical for Hadoop. It is highly recommended to configure all ESXi hosts and Isilon nodes to use NTP. In general, you do not need to run NTP clients in your VMs.

1.5 Prepare Network Infrastructure

1.5.1 Configure DHCP and DNS on Windows Server 2008 R2

DHCP and DNS servers are often configured incorrectly in a lab environment. This section shows some key steps to configure a Windows Server 2008 R2 server for DHCP and DNS. Other operating systems can certainly be used as long as they operate the same way from the client perspective.

1. Open Server Manager.
2. Forward Lookup Zone
   (a) Create a forward lookup zone for your lab domain (lab.example.com).
   (b) To avoid possible DNS dynamic update issues, select the option to allow unsecure dynamic updates. This may not be required in your environment, and the best settings for a production deployment should be carefully considered.
3. Create a reverse lookup zone for your lab subnet (128.111.10.in-addr.arpa). If you have multiple subnets, this will be the subnet used by your worker VMs.
4. Configure your DHCP server for dynamic DNS.
   (a) Right-click on IPv4 (see the screen shot below), then click Properties.
   (b) Configure the DNS settings as shown below. These settings will allow the DHCP server to automatically create DNS A and PTR records for DHCP clients.
   (c) Specify the credential used by the DHCP service to communicate with the DNS server.
5. Create a new DHCP scope for your lab subnet.
   (a) Set the following scope options for your environment.
6. Test for a proper installation.
   (a) A CentOS or Red Hat Linux machine configured for DHCP should specify the DHCP_HOSTNAME parameter. In particular, the file /etc/sysconfig/network should contain "DHCP_HOSTNAME=myhost" and it should not specify "HOSTNAME=myhost". It is this setting that allows the DHCP server to automatically create DNS records for the host. After editing this file, restart the network with "service network restart". See the example below.
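For reference, a minimal /etc/sysconfig/network on such a DHCP client might look like the following (the host name myhost is just an example, and NETWORKING=yes is the usual CentOS default; your file may contain additional lines):

[root@myhost ~]# cat /etc/sysconfig/network
NETWORKING=yes
DHCP_HOSTNAME=myhost
[root@myhost ~]# service network restart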
   (b) A Windows machine only needs to be configured for DHCP on an interface.
   (c) Confirm that the forward DNS record myhost.lab.example.com is dynamically created on your DNS server. You should be able to ping the FQDN from all hosts in your environment.

[user@workstation ~]$ ping myhost.lab.example.com
PING myhost.lab.example.com (10.111.128.240) 56(84) bytes of data.
64 bytes from myhost.lab.example.com (10.111.128.240): icmp_seq=1 ttl=64 time=1.47 ms

   (d) Confirm that the reverse DNS record for that IP address can be properly resolved to myhost.lab.example.com.

[user@workstation ~]$ dig +noall +answer -x 10.111.128.240
240.128.111.10.in-addr.arpa. 900 IN PTR myhost.lab.example.com.

1.6 Prepare Isilon

1.6.1 Assumptions

This document makes the assumptions listed below. These are not necessarily requirements, but they are usually valid and simplify the process.

• It is assumed that you are not using a directory service such as Active Directory for Hadoop users and groups.
• It is assumed that you are not using Kerberos authentication for Hadoop.

1.6.2 SmartConnect For HDFS

A best practice for HDFS on Isilon is to utilize two SmartConnect IP address pools for each access zone. One IP address pool should be used by Hadoop clients to connect to the HDFS name node service on Isilon, and it should use the dynamic IP allocation method to minimize connection interruptions in the event that an Isilon node fails. Note that dynamic IP allocation requires a SmartConnect Advanced license. A Hadoop client uses a specific SmartConnect IP address pool simply by using its zone name (DNS name) in the HDFS URI (e.g. hdfs://subnet0-pool1.isiloncluster1.lab.example.com:8020).

A second IP address pool should be used for HDFS data node connections, and it should use the static IP allocation method to ensure that data node connections are balanced evenly among all Isilon nodes. To assign specific SmartConnect IP address pools for data node connections, you will use the "isi hdfs racks modify" command.

If IP addresses are limited and you have a SmartConnect Advanced license, you may choose to use a single dynamic pool for name node and data node connections. This may result in uneven utilization of Isilon nodes. If you do not have a SmartConnect Advanced license, you may choose to use a single static pool for name node and data node connections. This may result in some failed HDFS connections immediately after Isilon node failures.

For more information, see EMC Isilon Best Practices for Hadoop Data Storage (http://www.emc.com/collateral/white-paper/h12877-wp-emc-isilon-hadoop-best-practices.pdf).
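To make the URI usage concrete: once a Hadoop node has been deployed and configured (later in this document), a command like the one below would list the HDFS root through the subnet0-pool1 name node pool created in the next section. This is only an illustration; the host and user names are the examples used throughout this document.

[hduser1@mycluster1-master-0 ~]$ hdfs dfs -ls hdfs://subnet0-pool1.isiloncluster1.lab.example.com:8020/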
1.6.3 OneFS Access Zones

Access zones on OneFS are a way to select a distinct configuration for the OneFS cluster based on the IP address that the client connects to. For HDFS, this configuration includes authentication methods, HDFS root path, and authentication providers (AD, LDAP, local, etc.). By default, OneFS includes a single access zone called System.

If you will only have a single Hadoop cluster connecting to your Isilon cluster, then you can use the System access zone with no additional configuration. However, to have more than one Hadoop cluster connect to your Isilon cluster, it is best to have each Hadoop cluster connect to a separate OneFS access zone. This will allow OneFS to present each Hadoop cluster with its own HDFS namespace and an independent set of users. For more information, see Security and Compliance for Scale-out Hadoop Data Lakes (http://www.emc.com/collateral/white-paper/h13354-wp-security-compliance-scale-out-hadoop-data-lakes.pdf).

To view your current list of access zones and the IP pools associated with them:

isiloncluster1-1# isi zone zones list
Name    Path
------------
System  /ifs
------------
Total: 1

isiloncluster1-1# isi networks list pools -v
subnet0:pool0
        In Subnet: subnet0
       Allocation: Static
           Ranges: 1
                   10.111.129.115-10.111.129.126
  Pool Membership: 4
                   1:10gige-1 (up)
                   2:10gige-1 (up)
                   3:10gige-1 (up)
                   4:10gige-1 (up)
 Aggregation Mode: Link Aggregation Control Protocol (LACP)
      Access Zone: System (1)
     SmartConnect:
         Suspended Nodes   : None
         Auto Unsuspend ... 0
         Zone              : subnet0-pool0.isiloncluster1.lab.example.com
         Time to Live      : 0
         Service Subnet    : subnet0
         Connection Policy : Round Robin
         Failover Policy   : Round Robin
         Rebalance Policy  : Automatic Failback

To create a new access zone and an associated IP address pool:

isiloncluster1-1# mkdir -p /ifs/isiloncluster1/zone1
isiloncluster1-1# isi zone zones create --name zone1 \
--path /ifs/isiloncluster1/zone1
isiloncluster1-1# isi networks create pool --name subnet0:pool1 \
--ranges 10.111.129.127-10.111.129.138 --ifaces 1-4:10gige-1 \
--access-zone zone1 --zone subnet0-pool1.isiloncluster1.lab.example.com \
--sc-subnet subnet0 --dynamic
Creating pool 'subnet0:pool1': OK
Saving: OK

Note: If you do not have a SmartConnect Advanced license, you will need to omit the --dynamic option.

To allow the new IP address pool to be used by data node connections:

isiloncluster1-1# isi hdfs racks create /rack0 --client-ip-ranges \
0.0.0.0-255.255.255.255
isiloncluster1-1# isi hdfs racks modify /rack0 --add-ip-pools subnet0:pool1
isiloncluster1-1# isi hdfs racks list
Name    Client IP Ranges         IP Pools
---------------------------------------------
/rack0  0.0.0.0-255.255.255.255  subnet0:pool1
---------------------------------------------
Total: 1

1.6.4 Sharing Data Between Access Zones

Access zones in OneFS provide a measure of multi-tenancy in that data within one access zone cannot be accessed by another access zone. In certain use cases, however, you may actually want to make the same dataset available to more than one Hadoop cluster. This can be done by using fully-qualified paths to refer to data in other access zones. To use this approach, you will configure your Hadoop jobs to simply access the datasets from a common shared HDFS namespace. For instance, you would start with two independent Hadoop clusters, each with its own access zone on Isilon. Then you can add a third access zone on Isilon, with its own IP addresses and HDFS root, containing a dataset that is shared with the other Hadoop clusters.

1.6.5 User and Group IDs

Isilon clusters and Hadoop servers each have their own mapping of user IDs (uid) to user names and group IDs (gid) to group names. When Isilon is used only for HDFS storage by the Hadoop servers, the IDs do not need to match. This is due to the fact that the HDFS wire protocol only refers to users and groups by their names, and never their numeric IDs.
In contrast, the NFS wire protocol refers to users and groups by their numeric IDs. Although NFS is rarely used in traditional Hadoop environments, the high-performance, enterprise-class, and POSIX-compatible NFS functionality of Isilon makes NFS a compelling protocol for certain workflows. If you expect to use both NFS and HDFS on your Isilon cluster (or simply want to be open to the possibility in the future), it is highly recommended to maintain consistent names and numeric IDs for all users and groups on Isilon and your Hadoop servers.

In a multi-tenant environment with multiple Hadoop clusters, numeric IDs for users in different clusters should be distinct. For instance, the user sqoop in Hadoop cluster A will have ID 610, and this same ID will be used in the Isilon access zone for Hadoop cluster A as well as on every server in Hadoop cluster A. The user sqoop in Hadoop cluster B will have ID 710, and this ID will be used in the Isilon access zone for Hadoop cluster B as well as on every server in Hadoop cluster B.

1.6.6 Configure Isilon For HDFS

Note: In the steps below, replace zone1 with System to use the default System access zone, or specify the name of a new access zone that you previously created.

1. Open a web browser to your Isilon cluster's web administration page. If you don't know the URL, simply point your browser to https://isilon_node_ip_address:8080, where isilon_node_ip_address is any IP address on any Isilon node that is in the System access zone. This usually corresponds to the ext-1 interface of any Isilon node.
2. Login with your root account. You specified the root password when you configured your first node using the console.
3. Check, and edit as necessary, your NTP settings. Click Cluster Management -> General Settings -> NTP.

Next, perform the remaining configuration from the OneFS command line.

1. SSH into any node in your Isilon cluster as root.
2. Confirm that your Isilon cluster is at OneFS version 7.1.1.0 or higher.

isiloncluster1-1# isi version
Isilon OneFS v7.1.1.0 ...

3. For OneFS version 7.1.1.0, you must have patch-130611 installed. You can view the list of patches you have installed with:

isiloncluster1-1# isi pkg info
patch-130611: This patch allows clients to use version 2.4 of the Hadoop Distributed File System (HDFS) with an Isilon cluster.

4. Install the patch if needed:

[user@workstation ~]$ scp patch-130611.tgz root@mycluster1-hdfs:/tmp

isiloncluster1-1# gunzip < /tmp/patch-130611.tgz | tar -xvf -
isiloncluster1-1# isi pkg install patch-130611.tar
Preparing to install the package...
Checking the package for installation...
Installing the package
Committing the installation...
Package successfully installed.

5. Verify your HDFS license.

isiloncluster1-1# isi license
Module  License Status  Configuration   Expiration Date
--------------------------------------------------------
HDFS    Evaluation      Not Configured  September 4, 2014

6. Create the HDFS root directory. This is usually called hadoop and must be within the access zone directory.

isiloncluster1-1# mkdir -p /ifs/isiloncluster1/zone1/hadoop

7. Set the HDFS root directory for the access zone.

isiloncluster1-1# isi zone zones modify zone1 \
--hdfs-root-directory /ifs/isiloncluster1/zone1/hadoop
8. Increase the HDFS daemon thread count.

isiloncluster1-1# isi hdfs settings modify --server-threads 256

9. Set the HDFS block size used for reading from Isilon.

isiloncluster1-1# isi hdfs settings modify --default-block-size 128M

10. Create an indicator file so that you can easily determine when you are looking at your Isilon cluster via HDFS.

isiloncluster1-1# touch \
/ifs/isiloncluster1/zone1/hadoop/THIS_IS_ISILON_isiloncluster1_zone1

11. Extract the Isilon Hadoop Tools to your Isilon cluster. They can be placed in any directory under /ifs. It is recommended to use /ifs/isiloncluster1/scripts, where isiloncluster1 is the name of your Isilon cluster.

[user@workstation ~]$ scp isilon-hadoop-tools-x.x.tar.gz \
root@isilon_node_ip_address:/ifs/isiloncluster1/scripts

isiloncluster1-1# tar -xzvf \
/ifs/isiloncluster1/scripts/isilon-hadoop-tools-x.x.tar.gz \
-C /ifs/isiloncluster1/scripts
isiloncluster1-1# mv /ifs/isiloncluster1/scripts/isilon-hadoop-tools-x.x \
/ifs/isiloncluster1/scripts/isilon-hadoop-tools

12. Execute the script isilon_create_users.sh. This script will create all required users and groups for the Hadoop services and applications.

Warning: The script isilon_create_users.sh will create local user and group accounts on your Isilon cluster for Hadoop services. If you are using a directory service such as Active Directory, and you want these users and groups to be defined in your directory service, then DO NOT run this script. Instead, refer to the OneFS documentation and EMC Isilon Best Practices for Hadoop Data Storage (http://www.emc.com/collateral/white-paper/h12877-wp-emc-isilon-hadoop-best-practices.pdf).

Script Usage: isilon_create_users.sh --dist <DIST> [--startgid <GID>] [--startuid <UID>] [--zone <ZONE>]

  dist      This will correspond to your Hadoop distribution: phd
  startgid  Group IDs will begin with this value. For example: 501
  startuid  User IDs will begin with this value. This is generally the same value as startgid. For example: 501
  zone      Access Zone name. For example: System

isiloncluster1-1# bash \
/ifs/isiloncluster1/scripts/isilon-hadoop-tools/onefs/isilon_create_users.sh \
--dist phd --startgid 501 --startuid 501 --zone zone1

13. Execute the script isilon_create_directories.sh. This script will create all required directories with the appropriate ownership and permissions.

Script Usage: isilon_create_directories.sh --dist <DIST> [--fixperm] [--zone <ZONE>]

  dist     This will correspond to your Hadoop distribution: phd
  fixperm  If specified, ownership and permissions will be set on existing directories.
  zone     Access Zone name. For example: System

isiloncluster1-1# bash \
/ifs/isiloncluster1/scripts/isilon-hadoop-tools/onefs/isilon_create_directories.sh \
--dist phd --fixperm --zone zone1

14. Map the hdfs user to the Isilon superuser. This will allow the hdfs user to chown (change ownership of) all files.

Warning: The command below will restart the HDFS service on Isilon to ensure that any cached user mapping rules are flushed. This will temporarily interrupt any HDFS connections coming from other Hadoop clusters.

isiloncluster1-1# isi zone zones modify --user-mapping-rules="hdfs=>root" \
--zone zone1
isiloncluster1-1# isi services isi_hdfs_d disable ; \
isi services isi_hdfs_d enable
The service 'isi_hdfs_d' has been disabled.
The service 'isi_hdfs_d' has been enabled.
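At this point you can optionally sanity-check the results directly on OneFS. The exact listing depends on the distribution and script versions, but the HDFS root should now contain directories such as tmp and user, owned by the accounts created above. For example:

isiloncluster1-1# ls -l /ifs/isiloncluster1/zone1/hadoop
isiloncluster1-1# ls -ld /ifs/isiloncluster1/zone1/hadoop/tmp \
/ifs/isiloncluster1/zone1/hadoop/user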
1.6.7 Create DNS Records For Isilon

You will now create the required DNS records that will be used to access your Isilon cluster.

1. Create a delegation record so that DNS requests for the zone isiloncluster1.lab.example.com are delegated to the Service IP that will be defined on your Isilon cluster. The Service IP can be any unused static IP address in your lab subnet.
   (a) If using a Windows Server 2008 R2 server, see the screenshot below to create the delegation.
   (b) Enter the DNS domain that will be delegated. This should identify your Isilon cluster. Click Next.
   (c) Enter the DNS domain again. Then enter the Isilon SmartConnect Zone Service IP address. At this time, you can ignore any warning messages about not being able to validate the server. Click OK.
   (d) Click Next and then Finish.
2. If you are using your own lab DNS server, you should create a CNAME alias for your Isilon SmartConnect zone. For example, create a CNAME record for mycluster1-hdfs.lab.example.com that targets subnet0-pool0.isiloncluster1.lab.example.com.
3. Test name resolution.

[user@workstation ~]$ ping mycluster1-hdfs.lab.example.com
PING subnet0-pool0.isiloncluster1.lab.example.com (10.111.129.115) 56(84) bytes of data.
64 bytes from 10.111.129.115: icmp_seq=1 ttl=64 time=1.15 ms

1.7 Prepare NFS Clients

To allow scripts and other small files to be easily shared between all servers in your environment, it is highly recommended to enable NFS (Unix Sharing) on your Isilon cluster. By default, the entire /ifs directory is already exported and this can remain unchanged. However, Isilon best practices suggest creating a dedicated NFS export for /ifs/isiloncluster1/scripts.

1. On Isilon, create an NFS export for /ifs/isiloncluster1/scripts. Enable read/write and root access from all hosts in your lab subnet.
2. Mount your NFS export on your workstation and the BDE server. (Note: The BDE post-deployment script can automatically perform these steps for VMs deployed using BDE.)

[root@workstation ~]$ yum install nfs-utils
[root@workstation ~]$ mkdir /mnt/scripts
[root@workstation ~]$ echo \
subnet0-pool0.isiloncluster1.lab.example.com:/ifs/isiloncluster1/scripts \
/mnt/scripts nfs \
nolock,nfsvers=3,tcp,rw,hard,intr,timeo=600,retrans=2,rsize=131072,wsize=524288 \
>> /etc/fstab
[root@workstation ~]$ mount -a

3. If you want to locate any VMDKs on an Isilon cluster instead of on local disks, add an NFS datastore to each of your ESXi hosts.
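Before moving on, you can quickly verify that the scripts export is mounted and writable from the workstation. The write test below only succeeds as root if root access was enabled on the export; nfs_write_test is just a throwaway file name.

[root@workstation ~]$ df -h /mnt/scripts
[root@workstation ~]$ touch /mnt/scripts/nfs_write_test && rm /mnt/scripts/nfs_write_test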
1.8 Prepare VMware Big Data Extensions

1.8.1 Installing Big Data Extensions

Refer to the VMware Big Data Extensions documentation (http://pubs.vmware.com/bde-2/index.jsp) for complete installation details. For a typical deployment, only the steps in the following topics need to be performed:

• Installing Big Data Extensions
  – Deploy the Big Data Extensions vApp in the vSphere Web Client
  – Install the Big Data Extensions Plug-In
  – Connect to a Serengeti Management Server
• Managing vSphere Resources for Hadoop and HBase Clusters
  – Add a Datastore in the vSphere Web Client
  – Add a Network in the vSphere Web Client

1.8.2 How to Login to the Serengeti CLI

The Serengeti CLI is installed on your Big Data Extensions vApp server. First change directory to the path that contains your BDE spec files, which define the VMs to create.

[root@bde ~]# cd /mnt/scripts/isilon-hadoop-tools/etc
[root@bde etc]# java -jar /opt/serengeti/cli/serengeti-cli-*.jar
serengeti>connect --host localhost:8443
Enter the username: root
Enter the password: ***
Connected

1.8.3 View Resources

Use the commands below to view your BDE resources. Their names will be needed later.

serengeti>network list
serengeti>datastore list
serengeti>resourcepool list

1.9 Prepare Hadoop Virtual Machines

First you will create one virtual machine to run Pivotal Command Center. It can be provisioned using any tool you like; this document shows how to use BDE to deploy a "cluster" consisting of a single server that will run Pivotal Command Center. Then you will create your actual Hadoop nodes (e.g. Resource Manager, Node Managers) using BDE.

1.9.1 Provision Virtual Machines Using Big Data Extensions

1. Copy the file isilon-hadoop-tools/etc/template-bde.json to isilon-hadoop-tools/etc/mycluster1-bde.json.
2. Edit the file mycluster1-bde.json as desired.

Table 1.1: Description of mycluster1-bde.json

• nodeGroups.instanceNum: The number of hosts that will be created in this node group. This should be 1 for all node groups except the worker group.
• nodeGroups.storage.sizeGB: Size of additional disks for each host. Note that the OS disk is always 20 GB and the swap disk will depend on the memory.
• nodeGroups.storage.type: If "shared", virtual disks will be located on an NFS export. If "local", virtual disks will be located on disks directly attached to the ESXi hosts.
• nodeGroups.cpuNum: Number of CPU cores assigned to the VM. This can be changed after the cluster has been deployed.
• nodeGroups.memCapacityMB: Amount of RAM assigned to the VM. This can be changed after the cluster has been deployed.
• nodeGroups.rpNames: Resource pool name. For available names, type "resourcepool list" in the Serengeti CLI. For more details, refer to the BDE documentation.

3. Login to the Serengeti CLI.
4. If you want to use a dedicated Pivotal Command Center server, create the Pivotal Command Center cluster.

serengeti>cluster create --name hadoopmanager \
--specFile basic-server.json --password 1 --networkName defaultNetwork
Hint: Password are from 8 to 128 characters, and can include alphanumeric
characters ([0-9, a-z, A-Z]) and the following special characters: _, @, #, $, %, ^, &, *
Enter the password: ***
Confirm the password: ***
COMPLETED 100%

node group: server, instance number: 1
roles:[basic]
NAME                    IP              STATUS         TASK
------------------------------------------------------------
hadoopmanager-server-0  10.111.128.240  Service Ready

cluster hadoopmanager created
Note: If any errors occur during cluster creation, you can try to resume the cluster creation with the --resume parameter. If that does not work, you may need to delete and recreate the cluster. For example:

serengeti>cluster create --name mycluster1 --resume
serengeti>cluster delete --name mycluster1

5. Create the Hadoop cluster.

serengeti>cluster create --name mycluster1 --specFile mycluster1-bde.json \
--password 1 --networkName defaultNetwork
Hint: Password are from 8 to 128 characters, and can include alphanumeric
characters ([0-9, a-z, A-Z]) and the following special characters: _, @, #, $, %, ^, &, *
Enter the password: ***
Confirm the password: ***
COMPLETED 100%

node group: namenode, instance number: 1
roles:[basic]
NAME                   IP              STATUS         TASK
-----------------------------------------------------------
mycluster1-namenode-0  10.111.128.241  Service Ready

node group: worker, instance number: 3
roles:[basic]
NAME                 IP              STATUS         TASK
---------------------------------------------------------
mycluster1-worker-0  10.111.128.244  Service Ready
mycluster1-worker-1  10.111.128.242  Service Ready
mycluster1-worker-2  10.111.128.245  Service Ready

node group: master, instance number: 1
roles:[basic]
NAME                 IP              STATUS         TASK
---------------------------------------------------------
mycluster1-master-0  10.111.128.243  Service Ready

cluster mycluster1 created

6. After the cluster has been provisioned, you should check the following:
   (a) The VMs should be distributed evenly across your hosts.
   (b) The data disks (not including the boot and swap disks) should be of equal size and distributed evenly across datastores.

1.9.2 BDE Post Deployment Script

Once the VMs come up, there are a few customizations that you will want to make. The Python script isilon-hadoop-tools/bde/bde_cluster_post_deploy.py automates this process. This script performs the following actions:

• Retrieves the list of VMs in the BDE cluster.
• Writes a file consisting of the FQDNs of each VM. This will be used to script other cluster-wide activities.
• Enables password-less SSH to each VM.
• Updates /etc/sysconfig/network to set DHCP_HOSTNAME instead of HOSTNAME and restarts the network service. With properly configured DHCP and DNS, this will result in correct forward and reverse DNS records with FQDNs such as mycluster1-worker-0.lab.example.com.
• Installs several packages using Yum.
• Mounts NFS directories.
• Mounts each virtual data disk in /data/n.
• Overwrites /etc/rc.local and /etc/sysctl.conf with recommended parameters.

To run the script, follow these steps:

1. Login to your workstation (shown as user@workstation in the prompts below).
2. Ensure that you are running a Python version 2.6.6 or higher but less than 3.0.

[user@workstation ~]$ python --version
Python 2.6.6

3. If you do not have sshpass installed, you may install it on CentOS 6.x using the following commands:

[root@workstation ~]$ wget \
http://dl.fedoraproject.org/pub/epel/6/x86_64/sshpass-1.05-1.el6.x86_64.rpm
[root@workstation ~]$ rpm -i sshpass-1.05-1.el6.x86_64.rpm

4. If you have not previously created your SSH key, run the following.

[user@workstation ~]$ ssh-keygen -t rsa

5. Copy the file isilon-hadoop-tools/etc/template-phd-post.json to isilon-hadoop-tools/etc/mycluster1-post.json.
6. Edit the file mycluster1-post.json with parameters that apply to your environment. A rough sketch of such a file is shown below, and each field is described in the table that follows.
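The authoritative structure of this file is the template-phd-post.json that you just copied, so keep the template's structure and only change values. The sketch below is only an illustration: every value is a placeholder taken from the examples used elsewhere in this document, and your passwords, paths, and names will differ. In particular, keep whatever entries the template already defines for ssh_commands (the field table notes it can run scripts that create users, adjust mount points, etc.).

[user@workstation ~]$ cat /mnt/scripts/isilon-hadoop-tools/etc/mycluster1-post.json
{
  "ser_host": "https://bde.lab.example.com:8443",
  "ser_username": "root",
  "ser_password": "changeme",
  "skip_configure_network": "false",
  "dhcp_domain": ".lab.example.com",
  "cluster_name": "mycluster1",
  "host_file_name": "mycluster1-hosts.txt",
  "node_password": "changeme",
  "name_filter_regex": "",
  "tools_root": "/mnt/scripts/isilon-hadoop-tools",
  "nfs_mounts": [
    {"mount_point": "/mnt/scripts",
     "path": "subnet0-pool0.isiloncluster1.lab.example.com:/ifs/isiloncluster1/scripts"}
  ],
  "ssh_commands": []
}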
Table 1.2: Description of mycluster1-post.json

• ser_host: The URL of the BDE web service. For example: https://bde.lab.example.com:8443
• ser_username: The user name used to authenticate to the BDE web service. This is the same account used to login using the Serengeti CLI.
• ser_password: The password for the above account.
• skip_configure_network: If "false" (the default), the script will set the DHCP_HOSTNAME parameter on the host. If "false", you must also set the dhcp_domain setting in this file. Set to "true" if you are using static IP addresses or DHCP without DNS integration.
• dhcp_domain: The DNS suffix that is appended to the host name to create a FQDN. It should begin with a dot. Ignored if skip_configure_network is true. For example: .lab.example.com
• cluster_name: The name of the BDE cluster.
• host_file_name: This file will be created, and it will contain the FQDN of each host in the BDE cluster.
• node_password: The root password of the hosts created by BDE. This was specified when the cluster was created.
• name_filter_regex: If non-empty, specify the name of a single host in your BDE cluster to apply this script to just that host.
• tools_root: The fully-qualified path to the Isilon Hadoop Tools. It must be in an NFS mount.
• nfs_mounts: A list of one or more NFS mounts that will be imported on each host in the BDE cluster.
• nfs_mounts.mount_point: NFS mount point. For example: /mnt/isiloncluster1
• nfs_mounts.path: NFS path. For example: host.domain.com:/directory
• ssh_commands: A list of commands that will be executed on each host. This can be used to run scripts that will create users, adjust mount points, etc.

7. Edit the file isilon-hadoop-tools/bde/create_phd_users.sh with the appropriate gid_base and uid_base values. These should match the values used when you ran isilon_create_users.sh on Isilon in a previous step.
8. Run bde_cluster_post_deploy.py:

[user@workstation ~]$ cd /mnt/scripts/isilon-hadoop-tools
[user@workstation isilon-hadoop-tools]$ bde/bde_cluster_post_deploy.py \
etc/mycluster1-post.json
...
Success!

9. Repeat the above steps for your Pivotal Command Center cluster named hadoopmanager.

1.9.3 Resize Root Disk

By default, the / (root) partition size for a VM created by BDE is 20 GB. This is sufficient for a Hadoop worker, but it should be increased for your Pivotal Command Center server (hadoopmanager-server-0) and the master node (mycluster1-master-0). Follow the steps below on each of these nodes.

1. Remove the old data disk.
   (a) Edit /etc/fstab.

[root@mycluster1-master-0 ~]# vi /etc/fstab

   (b) Remove the line containing "/dev/sdc1", save the file, and then unmount it.

[root@mycluster1-master-0 ~]# umount /dev/sdc1
[root@mycluster1-master-0 ~]# rmdir /data/1

   (c) Shut down the VM.
   (d) Use the vSphere Web Client to remove virtual disk 3.
2. Use the vSphere Web Client to increase the size of virtual disk 1 to 250 GB.
3. Power on the VM and SSH into it.
4. Extend the partition.

Warning: Perform the steps below very carefully. Failure performing these steps may result in an unusable system or lost data.
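Before repartitioning, it is worth confirming that the operating system now sees the enlarged disk (an optional check; the size reported should reflect the new 250 GB):

[root@mycluster1-master-0 ~]# fdisk -l /dev/sda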
[root@mycluster1-master-0 ~]# fdisk /dev/sda

WARNING: DOS-compatible mode is deprecated. It's strongly recommended to
switch off the mode (command 'c') and change display units to
sectors (command 'u').

Command (m for help): d
Partition number (1-4): 3

Command (m for help): n
Command action
   e   extended
   p   primary partition (1-4)
p
Partition number (1-4): 3
First cylinder (33-32635, default 33):
Using default value 33
Last cylinder, +cylinders or +size{K,M,G} (33-32635, default 32635):
Using default value 32635

Command (m for help): w
The partition table has been altered!

Calling ioctl() to re-read partition table.

WARNING: Re-reading the partition table failed with error 16: Device or resource busy.
The kernel still uses the old table. The new table will be used at
the next reboot or after you run partprobe(8) or kpartx(8)
Syncing disks.

[root@mycluster1-master-0 ~]# reboot

5. After the server reboots, resize the file system.

[root@mycluster1-master-0 ~]# resize2fs /dev/sda3

1.9.4 Fill Disk

VMware Big Data Extensions creates VMDKs for each of the Hadoop servers. In an Isilon environment, these VMDKs are not used for HDFS, of course, but large jobs that spill temporary intermediate data to local disks will utilize the VMDKs. BDE creates VMDKs that are lazy-zeroed. This means that the VMDKs are created very quickly, but the drawback is that the first write to each sector of the virtual disk is significantly slower than subsequent writes to the same sector. As a result, optimal VMDK performance may not be achieved until after several days of normal usage. To accelerate this, you can run the script fill_disk.py. This script will create a temporary file on each drive on each Hadoop server. The file will grow until the disk runs out of space, and then the file will be deleted. To use the script, provide it with a file containing a list of Hadoop server FQDNs.

[user@workstation etc]$ /mnt/scripts/isilon-hadoop-tools/bde/fill_disk.py \
mycluster1-hosts.txt

1.9.5 Resizing Your Cluster

VMware Big Data Extensions can be used to resize an existing cluster. It will allow you to create additional VMs, change the amount of RAM for each VM, and change the number of CPUs for each VM. This can be done through the vSphere Web Client or from the Serengeti CLI using the "cluster resize" command. For instance:

serengeti>cluster resize --name mycluster1 --nodeGroup worker \
--instanceNum 20
serengeti>cluster resize --name mycluster1 --nodeGroup worker \
--cpuNumPerNode 16 --memCapacityMbPerNode 131072

After creating new VMs, you will want to run the BDE Post Deployment script and the Fill Disk script on the new nodes. Then use Pivotal Command Center to deploy the appropriate Hadoop components. When changing the CPUs and RAM, you will usually want to change the amount allocated to YARN or your other services using your Hadoop cluster manager.

1.10 Install Pivotal Command Center

Note: The steps below walk you through one way of installing Pivotal Command Center and Pivotal HD. This is intended to guide you through your first POC. In particular, Kerberos security is not enabled. For complete details, refer to the Pivotal HD documentation (http://pivotalhd.docs.pivotal.io/doc/2100/index.html).

1. Download the following files from https://network.pivotal.io/products/pivotal-hd. This document assumes they have been downloaded to your Isilon cluster in /mnt/scripts/downloads.

• PHD 2.1.0
• PHD 2.1.0: Pivotal Command Center 2.3.0
• PHD 2.1.0: Pivotal HAWQ 1.2.1.0 (optional)
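The exact archive names depend on the build you download; with the versions used in this document, the download directory would contain something like the following (names taken from the commands later in this section):

[root@hadoopmanager-server-0 ~]# ls /mnt/scripts/downloads
PADS-1.2.1.0-10335.tgz  PCC-2.3.0-443.x86_64.tgz  PHD-2.1.0.0-175.tgz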
2. Extract Pivotal Command Center and install it.

[root@hadoopmanager-server-0 ~]# mkdir phd
[root@hadoopmanager-server-0 ~]# cd phd
[root@hadoopmanager-server-0 phd]# tar --no-same-owner -zxvf \
/mnt/scripts/downloads/PCC-2.3.0-443.x86_64.tgz
[root@hadoopmanager-server-0 phd]# cd PCC-2.3.0-443
[root@hadoopmanager-server-0 PCC-2.3.0-443]# ./install
You may find the logs in /usr/local/pivotal-cc/pcc_installation.log
Performing pre-install checks ...    [ OK ]
[INFO]: User gpadmin does not exist. It will be created as part of the installation process.
[INFO]: By default, the directory '/home/gpadmin' is created as the default home directory
for the gpadmin user. This directory must be consistent across all nodes.
Do you wish to use the default home directory? [Yy|Nn]: y
[INFO]: Continuing with default /home/gpadmin as home directory for gpadmin
...
You have successfully installed Pivotal Command Center 2.3.0.
You now need to install a PHD cluster to monitor with Pivotal Command Center.
You can view your cluster statuses here:
https://hadoopmanager-server-0.lab.example.com:5443/status

[root@hadoopmanager-server-0 PCC-2.3.0-443]# service commander status
nodeagent is running
Jetty is running
httpd is running
Pivotal Command Center HTTPS is running
Pivotal Command Center Background Worker is running

3. Import packages into Pivotal Command Center.

[root@hadoopmanager-server-0 ~]# su - gpadmin
[gpadmin@hadoopmanager-server-0 ~]$ tar zxf \
/mnt/scripts/downloads/PHD-2.1.0.0-175.tgz
[gpadmin@hadoopmanager-server-0 ~]$ tar zxf \
/mnt/scripts/downloads/PADS-1.2.1.0-10335.tgz
[gpadmin@hadoopmanager-server-0 ~]$ icm_client import -s \
PHD-2.1.0.0-175/
stack: PHD-2.1.0.0
[INFO] Importing stack
[INFO] Finding and copying available stack rpms from PHD-2.1.0.0-175/... [OK]
[INFO] Creating local stack repository... [OK]
[INFO] Import complete
[gpadmin@hadoopmanager-server-0 ~]$ icm_client import -s \
PADS-1.2.1.0-10335/
stack: PADS-1.2.1.0
[INFO] Importing stack
[INFO] Finding and copying available stack rpms from PADS-1.2.1.0-10335/... [OK]
[INFO] Creating local stack repository... [OK]
[INFO] Import complete

1.11 Deploy a Pivotal HD Cluster

1. Make a copy of the cluster configuration template directory.

[gpadmin@hadoopmanager-server-0 ~]$ cd \
/mnt/scripts/isilon-hadoop-tools/etc
[gpadmin@hadoopmanager-server-0 etc]$ cp -rv template-pcc \
mycluster1-pcc

2. Edit mycluster1-pcc/clusterConfig.xml.
   (a) Replace all occurrences of mycluster1 with your actual BDE cluster name. This will identify your host names.
   (b) Replace the value of the url element within isi-hdfs with your Isilon HDFS URL. For example: hdfs://mycluster1-hdfs.lab.example.com:8020.
   (c) Update the yarn-nodemanager, hawq-segment, and pxf-service elements with the complete list of worker host names. You may copy and paste from the file isilon-hadoop-tools/etc/mycluster1-hosts.txt. Do not include your master node in these lists.
   (d) Update datanode.disk.mount.points so that it refers to the local disks that you will use for temporary files. For example, if you deployed your VMs to ESXi hosts with 4 local disks, the value will be /data/1,/data/2,/data/3,/data/4. You should confirm that these mount points exist on your workers using the df command.
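For instance, on a worker with the four /data/n mounts created by the BDE post-deployment script (your mount points and sizes may differ):

[root@mycluster1-worker-0 ~]# df -h /data/1 /data/2 /data/3 /data/4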
3. Deploy your cluster.

    [gpadmin@hadoopmanager-server-0 etc]$ icm_client deploy \
    --confdir mycluster1-pcc --selinuxoff --iptablesoff
    Please enter the root password for the cluster nodes: ***
    PCC creates a gpadmin user on the newly added cluster nodes (if any).
    Please enter a non-empty password to be used for the gpadmin user: ***
    Verifying input
    Starting install
    [==================================================================] 100%
    Results:
    mycluster1-hdfs.lab.example.com... [Success]
    mycluster1-worker-1.lab.example.com... [Success]
    mycluster1-master-0.lab.example.com... [Success]
    mycluster1-worker-2.lab.example.com... [Success]
    mycluster1-worker-0.lab.example.com... [Success]
    Details at /var/log/gphd/gphdmgr/gphdmgr-webservices.log
    Cluster ID: 5

4. Start your cluster.

    [gpadmin@hadoopmanager-server-0 etc]$ icm_client start \
    --clustername mycluster1
    Starting services
    Starting cluster
    [==================================================================] 100%
    Results:
    Details at /var/log/gphd/gphdmgr/gphdmgr-webservices.log

1.12 Initialize HAWQ

1. Initialize HAWQ.

    [gpadmin@mycluster1-master-0 ~]$ source /usr/local/hawq/greenplum_path.sh
    [gpadmin@mycluster1-master-0 ~]$ gpssh-exkeys -f \
    /mnt/scripts/isilon-hadoop-tools/etc/mycluster1-hosts.txt
    [STEP 1 of 5] create local ID and authorize on local host
    [STEP 2 of 5] keyscan all hosts and update known_hosts file
    [STEP 3 of 5] authorize current user on remote hosts
    ... send to mycluster1-worker-0.lab.example.com
    Enter password for mycluster1-worker-0.lab.example.com: ***
    ... send to mycluster1-worker-1.lab.example.com
    ... send to mycluster1-worker-2.lab.example.com
    [STEP 4 of 5] determine common authentication file content
    [STEP 5 of 5] copy authentication files to all remote hosts
    ... finished key exchange with mycluster1-worker-0.lab.example.com
    ... finished key exchange with mycluster1-worker-1.lab.example.com
    ... finished key exchange with mycluster1-worker-2.lab.example.com
    [INFO] completed successfully
    [gpadmin@mycluster1-master-0 ~]$ /etc/init.d/hawq init
    20141009:06:05:35:009986 gpinitsystem:mycluster1-master-0:gpadmin-[INFO]:
    -Checking configuration parameters, please wait...
    ...
    20141009:06:06:55:009986 gpinitsystem:mycluster1-master-0:gpadmin-[INFO]:
    -HAWQ instance successfully created
    ...
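As an optional smoke test (not part of the official procedure), you can confirm that the new HAWQ instance accepts connections before moving on. This assumes the default gpadmin database created by the installer; psql is used the same way in the HAWQ functional test later in this guide. Re-run greenplum_path.sh first if this is a new shell.

    [gpadmin@mycluster1-master-0 ~]$ source /usr/local/hawq/greenplum_path.sh
    [gpadmin@mycluster1-master-0 ~]$ psql -c 'SELECT 1;'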
1.13 Adding a Hadoop User

You must add a user account for each Linux user that will submit MapReduce jobs. The procedure below can be used to add a user named hduser1.

Warning: The steps below will create local user and group accounts on your Isilon cluster. If you are using a directory service such as Active Directory, and you want these users and groups to be defined in your directory service, then DO NOT run these steps. Instead, refer to the OneFS documentation and EMC Isilon Best Practices for Hadoop Data Storage (http://www.emc.com/collateral/white-paper/h12877-wp-emc-isilon-hadoop-best-practices.pdf).

1. Add the user to Isilon.

    isiloncluster1-1# isi auth groups create hduser1 --zone zone1 \
    --provider local
    isiloncluster1-1# isi auth users create hduser1 --primary-group hduser1 \
    --zone zone1 --provider local \
    --home-directory /ifs/isiloncluster1/zone1/hadoop/user/hduser1

2. Add the user to the Hadoop nodes. Usually, this only needs to be performed on the master-0 node.

    [root@mycluster1-master-0 ~]# adduser hduser1

3. Create the user's home directory on HDFS.

    [root@mycluster1-master-0 ~]# sudo -u hdfs hdfs dfs -mkdir -p /user/hduser1
    [root@mycluster1-master-0 ~]# sudo -u hdfs hdfs dfs -chown hduser1:hduser1 \
    /user/hduser1
    [root@mycluster1-master-0 ~]# sudo -u hdfs hdfs dfs -chmod 755 /user/hduser1

1.14 Functional Tests

The tests below should be performed to ensure a proper installation. Perform the tests in the order shown. You must create the Hadoop user hduser1 before proceeding.

1.14.1 HDFS

    [root@mycluster1-master-0 ~]# sudo -u hdfs hdfs dfs -ls /
    Found 5 items
    -rw-r--r--   1 root  hadoop        0 2014-08-05 05:59 /THIS_IS_ISILON
    drwxr-xr-x   - hbase hbase       148 2014-08-05 06:06 /hbase
    drwxrwxr-x   - solr  solr          0 2014-08-05 06:07 /solr
    drwxrwxrwt   - hdfs  supergroup  107 2014-08-05 06:07 /tmp
    drwxr-xr-x   - hdfs  supergroup  184 2014-08-05 06:07 /user
    [root@mycluster1-master-0 ~]# sudo -u hdfs hdfs dfs -put -f /etc/hosts /tmp
    [root@mycluster1-master-0 ~]# sudo -u hdfs hdfs dfs -cat /tmp/hosts
    127.0.0.1 localhost
    [root@mycluster1-master-0 ~]# sudo -u hdfs hdfs dfs -rm -skipTrash /tmp/hosts
    [root@mycluster1-master-0 ~]# su - hduser1
    [hduser1@mycluster1-master-0 ~]$ hdfs dfs -ls /
    Found 5 items
    -rw-r--r--   1 root  hadoop        0 2014-08-05 05:59 /THIS_IS_ISILON
    drwxr-xr-x   - hbase hbase       148 2014-08-05 06:28 /hbase
    drwxrwxr-x   - solr  solr          0 2014-08-05 06:07 /solr
    drwxrwxrwt   - hdfs  supergroup  107 2014-08-05 06:07 /tmp
    drwxr-xr-x   - hdfs  supergroup  209 2014-08-05 06:39 /user
    [hduser1@mycluster1-master-0 ~]$ hdfs dfs -ls
    ...

1.14.2 YARN / MapReduce

    [hduser1@mycluster1-master-0 ~]$ hadoop jar \
    /usr/lib/gphd/hadoop-mapreduce/hadoop-mapreduce-examples.jar \
    pi 10 1000
    ...
    Estimated value of Pi is 3.14000000000000000000
    [hduser1@mycluster1-master-0 ~]$ hadoop fs -mkdir in

You can put any file into the in directory. It will be used as the data source for subsequent tests.

    [hduser1@mycluster1-master-0 ~]$ hadoop fs -put -f /etc/hosts in
    [hduser1@mycluster1-master-0 ~]$ hadoop fs -ls in
    ...
    [hduser1@mycluster1-master-0 ~]$ hadoop fs -rm -r out
    [hduser1@mycluster1-master-0 ~]$ hadoop jar \
    /usr/lib/gphd/hadoop-mapreduce/hadoop-mapreduce-examples.jar \
    wordcount in out
    ...
    [hduser1@mycluster1-master-0 ~]$ hadoop fs -ls out
    Found 4 items
    -rw-r--r--   1 hduser1 hduser1   0 2014-08-05 06:44 out/_SUCCESS
    -rw-r--r--   1 hduser1 hduser1  24 2014-08-05 06:44 out/part-r-00000
    -rw-r--r--   1 hduser1 hduser1   0 2014-08-05 06:44 out/part-r-00001
    -rw-r--r--   1 hduser1 hduser1   0 2014-08-05 06:44 out/part-r-00002
    [hduser1@mycluster1-master-0 ~]$ hadoop fs -cat out/part*
    localhost    1
    127.0.0.1    1
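If you want a heavier optional MapReduce test than wordcount, the same examples jar ships with the TeraGen/TeraSort programs in stock Apache Hadoop 2, so the sketch below should also work here; the row count (10 million rows, roughly 1 GB) and the output paths teragen1 and terasort1 are arbitrary values chosen for this example.

    [hduser1@mycluster1-master-0 ~]$ hadoop jar \
    /usr/lib/gphd/hadoop-mapreduce/hadoop-mapreduce-examples.jar \
    teragen 10000000 teragen1
    [hduser1@mycluster1-master-0 ~]$ hadoop jar \
    /usr/lib/gphd/hadoop-mapreduce/hadoop-mapreduce-examples.jar \
    terasort teragen1 terasort1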
Browse to the YARN Resource Manager GUI at http://mycluster1-master-0.lab.example.com:8088/.

Browse to the MapReduce History Server GUI at http://mycluster1-master-0.lab.example.com:19888/. In particular, confirm that you can view the complete logs for task attempts.

1.14.3 Hive

    [hduser1@mycluster1-master-0 ~]$ hadoop fs -mkdir -p sample_data/tab1
    [hduser1@mycluster1-master-0 ~]$ cat - > tab1.csv
    1,true,123.123,2012-10-24 08:55:00
    2,false,1243.5,2012-10-25 13:40:00
    3,false,24453.325,2008-08-22 09:33:21.123
    4,false,243423.325,2007-05-12 22:32:21.33454
    5,true,243.325,1953-04-22 09:11:33

Type <Control+D>.

    [hduser1@mycluster1-master-0 ~]$ hadoop fs -put -f tab1.csv sample_data/tab1
    [hduser1@mycluster1-master-0 ~]$ hive
    hive> DROP TABLE IF EXISTS tab1;
    CREATE EXTERNAL TABLE tab1 (
       id INT,
       col_1 BOOLEAN,
       col_2 DOUBLE,
       col_3 TIMESTAMP
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION '/user/hduser1/sample_data/tab1';
    DROP TABLE IF EXISTS tab2;
    CREATE TABLE tab2 (
       id INT,
       col_1 BOOLEAN,
       col_2 DOUBLE,
       month INT,
       day INT
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
    INSERT OVERWRITE TABLE tab2
    SELECT id, col_1, col_2, MONTH(col_3), DAYOFMONTH(col_3)
    FROM tab1 WHERE YEAR(col_3) = 2012;
    ...
    OK
    Time taken: 28.256 seconds
    hive> show tables;
    OK
    tab1
    tab2
    Time taken: 0.889 seconds, Fetched: 2 row(s)
    hive> select * from tab1;
    OK
    1    true    123.123       2012-10-24 08:55:00
    2    false   1243.5        2012-10-25 13:40:00
    3    false   24453.325     2008-08-22 09:33:21.123
    4    false   243423.325    2007-05-12 22:32:21.33454
    5    true    243.325       1953-04-22 09:11:33
    Time taken: 1.083 seconds, Fetched: 5 row(s)
    hive> select * from tab2;
    OK
    1    true    123.123    10    24
    2    false   1243.5     10    25
    Time taken: 0.094 seconds, Fetched: 2 row(s)
    hive> select * from tab1 where id=1;
    OK
    1    true    123.123    2012-10-24 08:55:00
    Time taken: 15.083 seconds, Fetched: 1 row(s)
    hive> select * from tab2 where id=1;
    OK
    1    true    123.123    10    24
    Time taken: 13.094 seconds, Fetched: 1 row(s)
    hive> exit;

1.14.4 Pig

    [hduser1@mycluster1-master-0 ~]$ pig
    grunt> a = load 'in';
    grunt> dump a;
    ...
    Success!
    ...
    grunt> quit;

1.14.5 HBase

    [hduser1@mycluster1-master-0 ~]$ hbase shell
    hbase(main):001:0> create 'test', 'cf'
    0 row(s) in 3.3680 seconds
    => Hbase::Table - test
    hbase(main):002:0> list 'test'
    TABLE
    test
    1 row(s) in 0.0210 seconds
    => ["test"]
    hbase(main):003:0> put 'test', 'row1', 'cf:a', 'value1'
    0 row(s) in 0.1320 seconds
    hbase(main):004:0> put 'test', 'row2', 'cf:b', 'value2'
    0 row(s) in 0.0120 seconds
    hbase(main):005:0> scan 'test'
    ROW      COLUMN+CELL
    row1     column=cf:a, timestamp=1407542488028, value=value1
    row2     column=cf:b, timestamp=1407542499562, value=value2
    2 row(s) in 0.0510 seconds
    hbase(main):006:0> get 'test', 'row1'
    COLUMN   CELL
    cf:a     timestamp=1407542488028, value=value1
    1 row(s) in 0.0240 seconds
    hbase(main):007:0> quit
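If you would like to remove the HBase test table when you are done (optional, not part of the original test sequence), the standard disable/drop sequence in the HBase shell works:

    [hduser1@mycluster1-master-0 ~]$ hbase shell
    hbase(main):001:0> disable 'test'
    hbase(main):002:0> drop 'test'
    hbase(main):003:0> quit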
1.14.6 HAWQ

    [gpadmin@mycluster1-master-0 ~]$ psql
    psql (8.2.15)
    Type "help" for help.
    gpadmin=# \l
                 List of databases
       Name    |  Owner  | Encoding | Access privileges
    -----------+---------+----------+-------------------
     gpadmin   | gpadmin | UTF8     |
     postgres  | gpadmin | UTF8     |
     template0 | gpadmin | UTF8     |
     template1 | gpadmin | UTF8     |
    (4 rows)
    gpadmin=# \c gpadmin
    psql (8.4.20, server 8.2.15)
    WARNING: psql version 8.4, server version 8.2.
             Some psql features might not work.
    You are now connected to database "gpadmin".
    gpadmin=# create table test (a int, b text);
    NOTICE: Table doesn't have 'DISTRIBUTED BY' clause -- Using column named 'a'
    as the Greenplum Database data distribution key for this table.
    HINT: The 'DISTRIBUTED BY' clause determines the distribution of data. Make
    sure column(s) chosen are the optimal data distribution key to minimize skew.
    CREATE TABLE
    gpadmin=# insert into test values (1, '435252345');
    INSERT 0 1
    gpadmin=# select * from test;
     a |     b
    ---+-----------
     1 | 435252345
    (1 row)
    gpadmin=# \q

1.14.7 Pivotal Command Center

Browse to the Pivotal Command Center GUI at https://hadoopmanager-server-0.lab.example.com:5443. Login using the following default account:

Username: gpadmin
Password: Gpadmin1

1.14.8 Additional Functional Tests

If you will be using other Hadoop components and applications, refer to http://pivotalhd.docs.pivotal.io/doc/2100/RunningPHDSamplePrograms.html for additional functional tests.

1.15 Searching Wikipedia

One of the many unique features of Isilon is its multi-protocol support. This allows you, for instance, to write a file using SMB (Windows) or NFS (Linux/Unix) and then read it using HDFS to perform Hadoop analytics on it. In this section, we exercise this capability by downloading the entire Wikipedia database (excluding media) to Isilon using your favorite browser. As soon as the download completes, we'll run a Hadoop grep to search the entire text of Wikipedia using our Hadoop cluster. Because this search doesn't rely on a word index, your regular expression can be as complicated as you like.

1. First, let's connect your client (with your favorite web browser) to your Isilon cluster.

(a) If you are using a Windows host or other SMB client:

i. Click Start -> Run.
ii. Enter: \\subnet0-pool0.isiloncluster1.lab.example.com\ifs
iii. You may authenticate as root with your Isilon root password.
iv. Browse to \ifs\isiloncluster1\zone1\hadoop\tmp.
v. Create a directory here called wikidata. This is where you will download the Wikipedia data to.

(b) If you are using a Linux host or other NFS client:

i. Mount your NFS export.

    [root@workstation ~]$ mkdir /mnt/isiloncluster1
    [root@workstation ~]$ echo \
    subnet0-pool0.isiloncluster1.lab.example.com:/ifs \
    /mnt/isiloncluster1 nfs \
    nolock,nfsvers=3,tcp,rw,hard,intr,timeo=600,retrans=2,rsize=131072,wsize=524288 >> /etc/fstab
    [root@workstation ~]$ mount -a
    [root@workstation ~]$ mkdir -p \
    /mnt/isiloncluster1/isiloncluster1/zone1/hadoop/tmp/wikidata

2. Open your favorite web browser and go to http://dumps.wikimedia.org/enwiki/latest.

3. Locate the file enwiki-latest-pages-articles.xml.bz2 and download it directly to the wikidata folder on Isilon. Your web browser will be writing this file to the Isilon file system using SMB or NFS.

Note: This file is approximately 10 GB in size and contains the entire text of the English version of Wikipedia. If this is too large, you may want to download one of the smaller files such as enwiki-latest-all-titles.gz.

Note: If you get an access-denied error, the quickest resolution is to SSH into any node in your Isilon cluster as root and run chmod -R a+rwX /ifs/isiloncluster1/zone1/hadoop/tmp.
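As an alternative to the browser download in step 3, Linux clients can fetch the dump from the command line straight into the NFS-mounted wikidata directory. This is a convenience sketch only: it assumes wget is installed on your workstation and that the dump URL is the latest-pages-articles file under the dumps.wikimedia.org path given in step 2.

    [root@workstation ~]$ cd /mnt/isiloncluster1/isiloncluster1/zone1/hadoop/tmp/wikidata
    [root@workstation wikidata]$ wget \
    http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2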
4. Now let's run the Hadoop grep job. We'll search for all two-word phrases that begin with EMC.

    [hduser1@mycluster1-master-0 ~]$ hadoop fs -ls /tmp/wikidata
    [hduser1@mycluster1-master-0 ~]$ hadoop fs -rm -r /tmp/wikigrep
    [hduser1@mycluster1-master-0 ~]$ hadoop jar \
    /usr/lib/gphd/hadoop-mapreduce/hadoop-mapreduce-examples.jar \
    grep /tmp/wikidata /tmp/wikigrep "EMC [^ ]*"

5. When the job completes, use your favorite text file viewer to view the output file /tmp/wikigrep/part-r-00000. If you are using Windows, you may open the file in WordPad.

6. That's it! Can you think of anything else to search for?

1.16 Where To Go From Here

You are now ready to fine-tune the configuration and performance of your Isilon Hadoop environment. You should consider the following areas for fine-tuning.

• YARN NodeManager container memory

1.17 Known Limitations

Although Kerberos is supported by OneFS, this document does not address a Hadoop environment secured with Kerberos. Instead, refer to EMC Isilon Best Practices for Hadoop Data Storage (http://www.emc.com/collateral/white-paper/h12877-wp-emc-isilon-hadoop-best-practices.pdf).

1.18 Legal

To learn more about how EMC products, services, and solutions can help solve your business and IT challenges, contact your local representative or authorized reseller (http://www.emc.com/contact-us/contact-us.esp), visit www.emc.com (http://www.emc.com), or explore and compare products in the EMC Store (https://store.emc.com/?EMCSTORE_CPP).

Copyright © 2014 EMC Corporation. All Rights Reserved.

EMC believes the information in this publication is accurate as of its publication date. The information is subject to change without notice.

The information in this publication is provided "as is." EMC Corporation makes no representations or warranties of any kind with respect to the information in this publication, and specifically disclaims implied warranties of merchantability or fitness for a particular purpose.

Use, copying, and distribution of any EMC software described in this publication requires an applicable software license.

For the most up-to-date listing of EMC product names, see EMC Corporation Trademarks on EMC.com. VMware and vSphere are registered trademarks or trademarks of VMware, Inc. in the United States and/or other jurisdictions. All other trademarks used herein are the property of their respective owners.

1.19 References

EMC Community - Isilon Hadoop Starter Kit Home Page (https://community.emc.com/docs/DOC-26892)
EMC Community - Isilon Hadoop Info Hub (https://community.emc.com/docs/DOC-39529)
https://community.emc.com/community/connect/everything_big_data
http://bigdatablog.emc.com/
http://www.emc.com/collateral/white-paper/h12877-wp-emc-isilon-hadoop-best-practices.pdf
http://www.emc.com/collateral/white-paper/h13354-wp-security-compliance-scale-out-hadoop-data-lakes.pdf
https://github.com/claudiofahey/isilon-hadoop-tools
http://pubs.vmware.com/bde-2/index.jsp
http://hadoop.apache.org/
http://www.transparencymarketresearch.com
http://www.pivotal.io/