
EMC Isilon Hadoop Starter Kit for
Pivotal HD
Release 3.0
EMC
November 04, 2014
Contents
1 EMC Isilon Hadoop Starter Kit for Pivotal HD with VMware Big Data Extensions
  1.1 Introduction
  1.2 Environment
  1.3 Installation Overview
  1.4 Prerequisites
  1.5 Prepare Network Infrastructure
  1.6 Prepare Isilon
  1.7 Prepare NFS Clients
  1.8 Prepare VMware Big Data Extensions
  1.9 Prepare Hadoop Virtual Machines
  1.10 Install Pivotal Command Center
  1.11 Deploy a Pivotal HD Cluster
  1.12 Initialize HAWQ
  1.13 Adding a Hadoop User
  1.14 Functional Tests
  1.15 Searching Wikipedia
  1.16 Where To Go From Here
  1.17 Known Limitations
  1.18 Legal
  1.19 References
CHAPTER 1
EMC Isilon Hadoop Starter Kit for Pivotal HD with VMware Big Data Extensions
This document describes how to create a Hadoop environment utilizing Pivotal HD and an EMC Isilon Scale-Out
NAS for HDFS accessible shared storage. VMware Big Data Extensions is used to provision and manage the compute
resources.
#RememberRuddy
Note: The latest version of this document is available at http://hsk-phd.readthedocs.org/.
1.1 Introduction
IDC published an update to their Digital Universe study in December 2012 and found that the rate of digital data
creation is not only continuing to grow, but the rate is actually accelerating. By the end of this decade we will create
40 Zettabytes of new digital information yearly or the equivalent of 1.7MB of digital information for every man,
woman, and child every second of every day.
This information explosion is creating new opportunities for our businesses to leverage digital information to serve
their customers better, faster, and more cost effectively through Big Data Analytics applications. Hadoop technologies
can be cost effective solutions and can manage structured, semi-structured and unstructured data unlike traditional
solutions such as RDBMS. The need to track and analyze consumer behavior, maintain inventory and space, target
marketing offers on the basis of consumer preferences and attract and retain consumers, are some of the factors pushing
the demand for Big Data Analytics solutions using Hadoop technologies. According to a new market report published
by Transparency Market Research (http://www.transparencymarketresearch.com) “Hadoop Market - Global Industry
Analysis, Size, Share, Growth, Trends, and Forecast, 2012- 2018,” the global Hadoop market was worth USD 1.5
billion in 2012 and is expected to reach USD 20.9 billion in 2018, growing at a CAGR of 54.7% from 2012 to 2018.
Hadoop, like any new technology, can be time consuming and expensive for our customers to get deployed and operational. When we surveyed a number of our customers, two main challenges to getting started were identified:
confusion over which Hadoop distribution to use and how to deploy using existing IT assets and knowledge. Hadoop
software is distributed by several vendors including Pivotal, Hortonworks, and Cloudera with proprietary extensions.
In addition to these distributions, Apache distributes a free open source version. From an infrastructure perspective
many Hadoop deployments start outside the IT data center and do not leverage the existing IT automation, storage
efficiency, and protection capabilities. Many customers cited the time it took IT to deploy Hadoop as the primary
reason to start with a deployment outside of IT.
This guide is intended to simplify Hadoop deployments by creating a shared storage model with EMC Isilon scale-out
NAS and Pivotal HD, and to reduce the time and cost of deployment by leveraging tools that can automate
Hadoop cluster deployments.
1.1.1 Audience
This document is intended for IT program managers, IT architects, Developers, and IT management to easily deploy
Hadoop with automation tools and leverage EMC Isilon for HDFS shared storage. It can be used by somebody who
does not yet have an EMC Isilon cluster by downloading the free EMC Isilon OneFS Simulator which can be installed
as a virtual machine for testing and training. However, this document can also be used by somebody who will be
installing in a production environment as best-practices are followed whenever possible.
1.1.2 Apache Hadoop Projects
Apache Hadoop is an open source, batch data processing system for enormous amounts of data. Hadoop runs as a
platform that provides cost-effective, scalable infrastructure for building Big Data analytic applications. All Hadoop
clusters contain a distributed file system called the Hadoop Distributed File System (HDFS) and a computation layer
called MapReduce.
The Apache Hadoop project contains the following subprojects:
• Hadoop Distributed File System (HDFS) – A distributed file system that provides high-throughput access to
application data.
• Hadoop MapReduce – A software framework for writing applications to reliably process large amounts of data
in parallel across a cluster.
Hadoop is supplemented by an ecosystem of Apache projects, such as Pig, Hive, Sqoop, Flume, Oozie, Whirr, HBase,
and Zookeeper that extend the value of Hadoop and improve its usability.
Version 2 of Apache Hadoop introduces YARN, a sub-project of Hadoop that separates the resource management and
processing components. YARN was born of a need to enable a broader array of interaction patterns for data stored in
HDFS beyond MapReduce. The YARN-based architecture of Hadoop 2.0 provides a more general processing platform
that is not constrained to MapReduce.
For full details of the Apache Hadoop project see http://hadoop.apache.org/.
1.1.3 Pivotal HD with HAWQ
Open-source Hadoop frameworks have rapidly emerged as preferred technology to grapple with the explosion of
structured and unstructured big data. You can accelerate your Hadoop investment with an enterprise-ready, fully
supported Hadoop distribution that ensures you can harness—and quickly gain insight from—the massive data being
driven by new apps, systems, machines, and the torrent of customer sources.
HAWQ™ adds SQL’s expressive power to Hadoop to accelerate data analytics projects, simplify development while
increasing productivity, expand Hadoop’s capabilities and cut costs. HAWQ can help your organization render Hadoop
queries faster than any Hadoop-based query interface on the market by adding rich, proven, parallel SQL processing
facilities. HAWQ leverages your existing business intelligence products, analytic products, and your workforce’s SQL
skills to bring more than 100X performance improvement (in some cases up to 600X) to a wide range of query types
and workloads. The world’s fastest SQL query engine on Hadoop, HAWQ is 100 percent SQL compliant.
The first true SQL processing for enterprise-ready Hadoop, Pivotal HD with HAWQ is available to you as software
only or as an appliance-based solution.
More information on Pivotal HD and HAWQ can be found on http://www.pivotal.io/big-data/pivotal-hd.
1.1.4 Isilon Scale-Out NAS For HDFS
EMC Isilon is the only scale-out NAS platform natively integrated with the Hadoop Distributed File System (HDFS).
Using HDFS as an over-the-wire protocol, you can deploy a powerful, efficient, and flexible data storage and analytics
ecosystem.
In addition to native integration with HDFS, EMC Isilon storage easily scales to support massively large Hadoop
analytics projects. Isilon scale-out NAS also offers unmatched simplicity, efficiency, flexibility, and reliability that you
need to maximize the value of your Hadoop data storage and analytics workflow investment.
1.1.5 Overview of Isilon Scale-Out NAS for Big Data
The EMC Isilon scale-out platform combines modular hardware with unified software to provide the storage foundation for data analysis. Isilon scale-out NAS is a fully distributed system that consists of nodes of modular hardware
arranged in a cluster. The distributed Isilon OneFS operating system combines the memory, I/O, CPUs, and disks of
the nodes into a cohesive storage unit to present a global namespace as a single file system.
The nodes work together as peers in a shared-nothing hardware architecture with no single point of failure. Every node
adds capacity, performance, and resiliency to the cluster, and each node acts as a Hadoop namenode and datanode.
The namenode daemon is a distributed process that runs on all the nodes in the cluster. A compute client can connect
to any node through HDFS.
As nodes are added, the file system expands dynamically and redistributes data, eliminating the work of partitioning disks and creating volumes. The result is a highly efficient and resilient storage architecture that brings all the
advantages of an enterprise scale-out NAS system to storing data for analysis.
Unlike traditional storage, Hadoop’s ratio of CPU, RAM, and disk space depends on the workload—factors that make
it difficult to size a Hadoop cluster before you have had a chance to measure your MapReduce workload. Expanding
data sets also makes sizing decisions upfront problematic. Isilon scale-out NAS lends itself perfectly to this scenario:
Isilon scale-out NAS lets you increase CPUs, RAM, and disk space by adding nodes to dynamically match storage
capacity and performance with the demands of a dynamic Hadoop workload.
An Isilon cluster optimizes data protection. OneFS more efficiently and reliably protects data than HDFS. The HDFS
protocol, by default, replicates a block of data three times. In contrast, OneFS stripes the data across the cluster and
protects the data with forward error correction codes, which consume less space than replication with better protection.
1.2 Environment
1.2.1 Versions
The test environment used for this document consists of the following software versions:
• Pivotal HD 2.1.0
• Pivotal Command Center 2.3.0
• Pivotal HAWQ 1.2.1.0
• Isilon OneFS 7.2.0
• VMware vSphere Enterprise 5.5.0
• VMware Big Data Extensions 2.0
1.2.2 Hosts
A typical Hadoop environment is composed of many types of hosts. Below is a description of these hosts.
List of Hosts in Hadoop Environments

DNS/DHCP Server
    It is recommended to have a DNS and DHCP server that you can control. This will be used by all other
    hosts for DNS and DHCP services.

VMware vCenter
    VMware vCenter will manage all ESXi hosts and their VMs. It is required for Big Data Extensions (BDE).

VMware Big Data Extensions vApp Server (bde)
    This will run the Serengeti components to assist in deploying Hadoop clusters.

VMware ESXi
    Physical machines will run the VMware ESXi operating system to allow virtual machines to run in them.

Linux workstation (workstation)
    You should have a Linux workstation with a GUI that you can use to control everything.

Isilon nodes (isiloncluster1-1, 2, 3, ...)
    You will need one or more Isilon Scale-Out NAS nodes. For functional testing, you can use a single Isilon
    OneFS Simulator node instead of the Isilon appliance. The nodes will be clustered together into an Isilon
    cluster. Isilon nodes run the OneFS operating system.

Isilon InsightIQ
    This is optional licensed software from Isilon and can be used to monitor the health and performance of
    your Isilon cluster. It is recommended for any performance testing.

Pivotal Command Center (hadoopmanager-server-0)
    This host will manage your Hadoop cluster. It will run Pivotal Command Center, which will deploy and
    monitor the Hadoop components. In smaller environments, you may choose to deploy Pivotal Command
    Center to your Hadoop Master host.

Hadoop Master (mycluster1-master-0)
    This host will run the YARN Resource Manager, Job History Server, Hive Metastore Server, etc. In general,
    it will run all "master" services except for the HDFS Name Node.

Hadoop Worker Node (mycluster1-worker-0, 1, 2, ...)
    There will be any number of worker nodes, depending on the compute requirements. Each node will run
    the YARN Node Manager, HBase Region Server, etc. During the installation process, these nodes will run
    the HDFS Data Node but these will become idle like the HDFS Name Node.
Due to the many hosts involved, all commands that must be typed are prefixed with the standard system prompt
identifying the user and host name. For example:
[user@workstation ~]$ ping myhost.lab.example.com
1.3 Installation Overview
Below is the overview of the installation process that this document will describe.
1. Confirm prerequisites.
2. Prepare your network infrastructure including DNS and DHCP.
3. Prepare your Isilon cluster.
4. Prepare VMware Big Data Extensions (BDE).
5. Use BDE to provision multiple virtual machines.
6. Install Pivotal Command Center.
7. Use Pivotal Command Center to deploy Pivotal HD to the virtual machines.
8. Initialize HAWQ.
9. Perform key functional tests.
1.4 Prerequisites
1.4.1 Isilon
• For low-capacity, non-performance testing of Isilon, the EMC Isilon OneFS Simulator can be used instead of
a cluster of physical Isilon appliances. This can be downloaded for free from http://www.emc.com/getisilon.
Refer to the EMC Isilon OneFS Simulator Install Guide for details. Be sure to follow the section for running
the virtual nodes in VMware ESX. Only a single virtual node is required but adding additional nodes will allow
you to explore other features such as data protection, SmartPools (tiering), and SmartConnect (network load
balancing).
• For physical Isilon nodes, you should have already completed the console-based installation process for your
first Isilon node and added at least two other nodes for a minimum of 3 Isilon nodes.
• You must have OneFS version 7.1.1.0 with patch-130611 or version 7.2.0.0 and higher.
• You must obtain a OneFS HDFS license code and install it on your Isilon cluster. You can get your free OneFS
HDFS license from http://www.emc.com/campaign/isilon-hadoop/index.htm.
• It is recommended, but not required, to have a SmartConnect Advanced license for your Isilon cluster.
• To allow for scripts and other small files to be easily shared between all nodes in your environment, it is highly
recommended to enable NFS (Unix Sharing) on your Isilon cluster. By default, the entire /ifs directory is already
exported and this can remain unchanged. This document assumes that a single Isilon cluster is used for this NFS
export as well as for HDFS. However, there is no requirement that the NFS export be on the same Isilon cluster
that you are using for HDFS.
1.4.2 VMware
• VMware vSphere Enterprise or Enterprise Plus
• VMware vSphere vCenter should be installed and functioning.
• All hosts referenced in this document can be virtualized with VMware ESXi.
• At a minimum for non-performance testing, you will need a single VMware ESXi host with 72 GB of RAM.
• VMDK files will be used for the operating system and applications for each host. Additionally, VMDKs will be
used for temporary files needed to process Hadoop jobs (e.g. MapReduce spill and shuffle files). VMDKs can
be located on physical disks attached to the ESXi host or on a shared NFS datastore hosted on an Isilon cluster.
If using an Isilon cluster, be aware that the VMDK I/O will compete with the HDFS I/O, reducing the overall
performance.
• All ESXi hosts must be managed by vCenter.
• As a requirement for the VMware Big Data Extensions vApp, you must define a vSphere cluster.
1.4.3 Networking
• For the best performance, a single 10 Gigabit Ethernet switch should connect to at least one 10 Gigabit port on
each ESXi host running a worker VM. Additionally, the same switch should connect to at least one 10 Gigabit
port on each Isilon node.
• A single dedicated layer-2 network can be used to connect all ESXi hosts and Isilon nodes. Although multiple
networks can be used for increased security, monitoring, and robustness, it adds complications that should be
avoided when possible.
• At least an entire /24 IP address block should be allocated to your lab network. This will allow a DNS reverse
lookup zone to be delegated to your own lab DNS server.
• If not using DHCP for dynamic IP address allocation, a static set of IP addresses must be assigned to Big Data
Extensions so that it may allocate them to new virtual machines that it provisions. You will usually need just
one IP address for each virtual machine.
• If using the EMC Isilon OneFS Simulator, you will need at least two static IP addresses (one for the node’s ext-1
interface, another for the SmartConnect service IP). Each additional Isilon node will require an additional IP
address.
• At a minimum, you will need to allocate to your Isilon cluster one IP address per Access Zone per Isilon node. In
general, you will need one Access Zone for each separate Hadoop cluster that will use Isilon for HDFS storage.
For the best possible load balancing during an Isilon node failure scenario, the recommended number of IP
addresses is given by the formula below. Of course, this is in addition to any IP addresses used for non-HDFS
pools.
# of IP addresses = 2 * (# of Isilon Nodes) * (# of Access Zones)
For example, 20 IP addresses are recommended for 5 Isilon nodes and 2 Access Zones.
• This document will assume that Internet access is available to all servers to download various components from
Internet repositories.
1.4.4 DHCP and DNS
• A DHCP server on this subnet is highly recommended. Although a DHCP server is not a requirement for a
Hadoop environment, this document will assume that one exists.
• A DNS server is required and you must have the ability to create DNS records and zone delegations.
• Although not required, it is assumed in this document that a Windows Server 2008 R2 server is running on this
subnet providing DHCP and DNS services. To distinguish this DNS server from a corporate or public DNS
server outside of your lab, this will be referred to as your lab DNS server.
• It is recommended, and this document will assume, that the DHCP server creates forward and reverse DNS
records when it provides IP address leases to hosts. The procedure for doing this is documented later in this
document.
• It is recommended that a forward lookup DNS zone be delegated to your lab DNS server. For instance, if
your company’s domain is example.com, then DNS requests for lab.example.com, and all subdomains of
lab.example.com, should be delegated to your lab DNS server.
• It is recommended that your DNS server delegate a subdomain to your Isilon cluster. For instance, DNS requests
for subnet0-pool0.isiloncluster1.lab.example.com or isiloncluster1.lab.example.com should be delegated to the
Service IP defined on your Isilon cluster.
• To allow for a convenient way of changing the HDFS Name Node used by all Hadoop applications and services,
it is recommended to create a DNS record for your Isilon cluster’s HDFS Name Node service. This should be a
CNAME alias to your Isilon SmartConnect zone. Specify a TTL of 1 minute to allow for quick changes while
you are experimenting. For example, create a CNAME record for mycluster1-hdfs.lab.example.com that targets
subnet0-pool0.isiloncluster1.lab.example.com. If you later want to redirect all HDFS I/O to another cluster or
a different pool on the same Isilon cluster, you simply need to change the DNS record and restart all Hadoop
services.
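Once this CNAME record exists, you can check it from your workstation with dig. The record and target names below are the examples used in this document, and the 60-second TTL corresponds to the 1-minute TTL suggested above; your output will differ if you chose other names.
[user@workstation ~]$ dig +noall +answer mycluster1-hdfs.lab.example.com
mycluster1-hdfs.lab.example.com. 60 IN CNAME subnet0-pool0.isiloncluster1.lab.example.com.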
1.4.5 Other
• You will need one Linux workstation which you will use to perform most configuration tasks. No services will
run on this workstation.
– This should have a GUI and a web browser.
– This must have Python 2.6.6 or higher 2.x version.
– CentOS 6.5 has been used for testing but most other systems should also work. However, be aware that
CentOS 6.4, supplied with BDE 2.0, must be upgraded to support Python 2.6.6.
– sshpass must be installed. This document shows how to install sshpass on CentOS 6.x.
• Several useful scripts and file templates are provided in the archive file isilon-hadoop-tools-x.x.tar.gz. Download
the latest version from https://github.com/claudiofahey/isilon-hadoop-tools/releases.
• Time synchronization is critical for Hadoop. It is highly recommended to configure all ESXi hosts and Isilon
nodes to use NTP. In general, you do not need to run NTP clients in your VMs.
1.5 Prepare Network Infrastructure
1.5.1 Configure DHCP and DNS on Windows Server 2008 R2
DHCP and DNS servers are often configured incorrectly in a lab environment. This section will show you some key
steps to configure a Windows Server 2008 R2 server for DHCP and DNS. Other operating systems can certainly be
used as long as they operate the same way from the client perspective.
1. Open Server Manager.
2. Forward Lookup Zone
(a) Create a forward lookup zone for your lab domain (lab.example.com).
(b) To avoid possible DNS dynamic update issues, select the option to allow unsecure dynamic updates. This
may not be required in your environment and the best settings for a production deployment should be
carefully considered.
3. Create a reverse lookup zone for your lab subnet (128.111.10.in-addr.arpa). If you have multiple subnets, this
will be the subnet used by your worker VMs.
4. Configure your DHCP server for dynamic DNS.
(a) Right-click on IPv4 (see the screen shot below), then click Properties.
(b) Configure the DNS settings as shown below. These settings will allow the DHCP server to automatically
create DNS A and PTR records for DHCP clients.
(c) Specify the credential used by the DHCP service to communicate with the DNS server.
5. Create a new DHCP scope for your lab subnet.
(a) Set the following scope options for your environment.
6. Test for a proper installation.
(a) A CentOS or Red Hat Linux machine configured for DHCP should specify the DHCP_HOSTNAME parameter.
In particular, the file /etc/sysconfig/network should contain "DHCP_HOSTNAME=myhost" and it should not
specify "HOSTNAME=myhost". It is this setting that allows the DHCP server to automatically create DNS
records for the host. After editing this file, restart the network with "service network restart". A sample
file is shown after this list.
(b) A Windows machine only needs to be configured for DHCP on an interface.
(c) Confirm that the forward DNS record myhost.lab.example.com is dynamically created on your DNS server.
You should be able to ping the FQDN from all hosts in your environment.
[user@workstation ~]$ ping myhost.lab.example.com
PING myhost.lab.example.com (10.111.128.240) 56(84) bytes of data.
64 bytes from myhost.lab.example.com (10.111.128.240): icmp_seq=1
ttl=64 time=1.47 ms
(d) Confirm that the reverse DNS record for that IP address can be properly resolved to myhost.lab.example.com.
[user@workstation ~]$ dig +noall +answer -x 10.111.128.240
240.128.111.10.in-addr.arpa. 900 IN PTR myhost.lab.example.com.
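As a reference for step (a) above, a minimal /etc/sysconfig/network for a DHCP client might look like the sketch below. The host name myhost is the example used in this section; note that there is no HOSTNAME line.
NETWORKING=yes
DHCP_HOSTNAME=myhost
After saving the file, restart the network with "service network restart" as described above.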
1.6 Prepare Isilon
1.6.1 Assumptions
This document makes the assumptions listed below. These are not necessarily requirements but they are usually valid
and simplify the process.
• It is assumed that you are not using a directory service such as Active Directory for Hadoop users and groups.
• It is assumed that you are not using Kerberos authentication for Hadoop.
1.6.2 SmartConnect For HDFS
A best practices for HDFS on Isilon is to utilize two SmartConnect IP address pools for each access zone. One
IP address pool should be used by Hadoop clients to connect to the HDFS name node service on Isilon and it
should use the dynamic IP allocation method to minimize connection interruptions in the event that an Isilon node
fails. Note that dyanamic IP allocation requires a SmartConnect Advanced license. A Hadoop client uses a specific
SmartConnect IP address pool simply by using its zone name (DNS name) in the HDFS URI (e.g. hdfs://subnet0pool1.isiloncluster1.lab.example.com:8020).
A second IP address pool should be used for HDFS data node connections and it should use the static IP allocation
method to ensure that data node connections are balanced evenly among all Isilon nodes. To assign specific SmartConnect IP address pools for data node connections, you will use the “isi hdfs racks modify” command.
If IP addresses are limited and you have a SmartConnect Advanced license, you may choose to use a single dynamic
pool for name node and data node connections. This may result in uneven utilization of Isilon nodes.
If you do not have a SmartConnect Advanced license, you may choose to use a single static pool for name node and
data node connections. This may result in some failed HDFS connections immediately after Isilon node failures.
For more information, see EMC Isilon Best Practices for Hadoop Data Storage (http://www.emc.com/collateral/white-paper/h12877-wp-emc-isilon-hadoop-best-practices.pdf).
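Once the IP address pools and the DNS delegation described later in this document are in place, you can confirm that SmartConnect is answering queries for a pool's zone name by resolving it repeatedly from your workstation. The pool name below is the example dynamic pool used in this document; each query should return one IP address from the pool's range, and with a round-robin connection policy the address should change between queries.
[user@workstation ~]$ for i in 1 2 3; do dig +short subnet0-pool1.isiloncluster1.lab.example.com; done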
1.6.3 OneFS Access Zones
Access zones on OneFS are a way to select a distinct configuration for the OneFS cluster based on the IP address
that the client connects to. For HDFS, this configuration includes authentication methods, HDFS root path, and
authentication providers (AD, LDAP, local, etc.). By default, OneFS includes a single access zone called System.
If you will only have a single Hadoop cluster connecting to your Isilon cluster, then you can use the System access
zone with no additional configuration. However, to have more than one Hadoop cluster connect to your Isilon cluster,
it is best to have each Hadoop cluster connect to a separate OneFS access zone. This will allow OneFS to present each
Hadoop cluster with its own HDFS namespace and an independent set of users.
For more information, see Security and Compliance for Scale-out Hadoop Data Lakes (http://www.emc.com/collateral/white-paper/h13354-wp-security-compliance-scale-out-hadoop-data-lakes.pdf).
To view your current list of access zones and the IP pools associated with them:
isiloncluster1-1# isi zone zones list
Name    Path
-------------
System  /ifs
-------------
Total: 1
isiloncluster1-1# isi networks list pools -v
subnet0:pool0
        In Subnet: subnet0
        Allocation: Static
        Ranges: 1
                10.111.129.115-10.111.129.126
        Pool Membership: 4
                1:10gige-1 (up)
                2:10gige-1 (up)
                3:10gige-1 (up)
                4:10gige-1 (up)
        Aggregation Mode: Link Aggregation Control Protocol (LACP)
        Access Zone: System (1)
        SmartConnect:
                Suspended Nodes  : None
                Auto Unsuspend ... 0
                Zone             : subnet0-pool0.isiloncluster1.lab.example.com
                Time to Live     : 0
                Service Subnet   : subnet0
                Connection Policy: Round Robin
                Failover Policy  : Round Robin
                Rebalance Policy : Automatic Failback
To create a new access zone and an associated IP address pool:
isiloncluster1-1# mkdir -p /ifs/isiloncluster1/zone1
isiloncluster1-1# isi zone zones create --name zone1 \
--path /ifs/isiloncluster1/zone1
isiloncluster1-1# isi networks create pool --name subnet0:pool1 \
--ranges 10.111.129.127-10.111.129.138 --ifaces 1-4:10gige-1 \
--access-zone zone1 --zone subnet0-pool1.isiloncluster1.lab.example.com \
--sc-subnet subnet0 --dynamic
Creating pool 'subnet0:pool1':   OK
Saving:                          OK
Note: If you do not have a SmartConnect Advanced license, you will need to omit the --dynamic option.
To allow the new IP address pool to be used by data node connections:
isiloncluster1-1# isi hdfs racks create /rack0 --client-ip-ranges \
0.0.0.0-255.255.255.255
isiloncluster1-1# isi hdfs racks modify /rack0 --add-ip-pools subnet0:pool1
isiloncluster1-1# isi hdfs racks list
Name    Client IP Ranges         IP Pools
---------------------------------------------
/rack0  0.0.0.0-255.255.255.255  subnet0:pool1
---------------------------------------------
Total: 1
1.6.4 Sharing Data Between Access Zones
Access zones in OneFS provide a measure of multi-tenancy in that data within one access zone cannot be accessed by
another access zone. In certain use cases, however, you may actually want to make the same dataset available to more
than one Hadoop cluster. This can be done by using fully-qualified paths to refer to data in other access zones.
To use this approach, you will configure your Hadoop jobs to simply access the datasets from a common shared HDFS
namespace. For instance, you would start with two independent Hadoop clusters, each with its own access zone on
Isilon. Then you can add a 3rd access zone on Isilon, with its own IP addresses and HDFS root, and containing a
dataset that is shared with other Hadoop clusters.
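For example, a job or user in either Hadoop cluster could list a shared dataset by addressing the third access zone's SmartConnect zone name with a fully-qualified HDFS URI rather than the cluster's default file system. The pool name and path below are hypothetical.
[user@mycluster1-master-0 ~]$ hadoop fs -ls \
hdfs://subnet0-pool2.isiloncluster1.lab.example.com:8020/shared-data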
1.6.5 User and Group IDs
Isilon clusters and Hadoop servers each have their own mapping of user IDs (uid) to user names and group IDs (gid) to
group names. When Isilon is used only for HDFS storage by the Hadoop servers, the IDs do not need to match. This
is due to the fact that the HDFS wire protocol only refers to users and groups by their names, and never their numeric
IDs.
In contrast, the NFS wire protocol refers to users and groups by their numeric IDs. Although NFS is rarely used in
traditional Hadoop environments, the high-performance, enterprise-class, and POSIX-compatible NFS functionality
of Isilon makes NFS a compelling protocol for certain workflows. If you expect to use both NFS and HDFS on
your Isilon cluster (or simply want to be open to the possibility in the future), it is highly recommended to maintain
consistent names and numeric IDs for all users and groups on Isilon and your Hadoop servers. In a multi-tenant
environment with multiple Hadoop clusters, numeric IDs for users in different clusters should be distinct.
For instance, the user sqoop in Hadoop cluster A will have ID 610 and this same ID will be used in the Isilon access
zone for Hadoop cluster A as well as every server in Hadoop cluster A. The user sqoop in Hadoop cluster B will have
ID 710 and this ID will be used in the Isilon access zone for Hadoop cluster B as well as every server in Hadoop cluster
B.
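A quick consistency check is to compare the numeric IDs reported on a Hadoop server with those defined on Isilon for the matching access zone. The sqoop user and ID 610 below are the example values from the paragraph above; the isi auth command shown is an assumption for OneFS 7.x and its exact flags may vary by release.
[root@mycluster1-master-0 ~]# id sqoop
uid=610(sqoop) gid=610(sqoop) groups=610(sqoop)
isiloncluster1-1# isi auth users view sqoop --zone zone1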
1.6.6 Configure Isilon For HDFS
Note: In the steps below, replace zone1 with System to use the default System access zone or you may specify the
name of a new access zone that you previously created.
1. Open a web browser to your Isilon cluster's web administration page. If you don't know the URL, simply
point your browser to https://isilon_node_ip_address:8080, where isilon_node_ip_address is any IP address on
any Isilon node that is in the System access zone. This usually corresponds to the ext-1 interface of any Isilon
node.
2. Login with your root account. You specified the root password when you configured your first node using the
console.
3. Check, and edit as necessary, your NTP settings. Click Cluster Management -> General Settings -> NTP.
1. SSH into any node in your Isilon cluster as root.
2. Confirm that your Isilon cluster is at OneFS version 7.1.1.0 or higher.
isiloncluster1-1# isi version
Isilon OneFS v7.1.1.0 ...
3. For OneFS version 7.1.1.0, you must have patch-130611 installed. You can view the list of patches you have
installed with:
isiloncluster1-1# isi pkg info
patch-130611:
This patch allows clients to use
version 2.4 of the Hadoop Distributed File System (HDFS)
with an Isilon cluster.
4. Install the patch if needed:
[user@workstation ~]$ scp patch-130611.tgz root@mycluster1-hdfs:/tmp
isiloncluster1-1# gunzip < /tmp/patch-130611.tgz | tar -xvf -
isiloncluster1-1# isi pkg install patch-130611.tar
Preparing to install the package...
Checking the package for installation...
Installing the package
Committing the installation...
Package successfully installed.
5. Verify your HDFS license.
isiloncluster1-1# isi license
Module   License Status   Configuration    Expiration Date
------------------------------------------------------------
HDFS     Evaluation       Not Configured   September 4, 2014
6. Create the HDFS root directory. This is usually called hadoop and must be within the access zone directory.
isiloncluster1-1# mkdir -p /ifs/isiloncluster1/zone1/hadoop
7. Set the HDFS root directory for the access zone.
isiloncluster1-1# isi zone zones modify zone1 \
--hdfs-root-directory /ifs/isiloncluster1/zone1/hadoop
8. Increase the HDFS daemon thread count.
isiloncluster1-1# isi hdfs settings modify --server-threads 256
9. Set the HDFS block size used for reading from Isilon.
isiloncluster1-1# isi hdfs settings modify --default-block-size 128M
10. Create an indicator file so that we can easily determine when we are looking at your Isilon cluster via HDFS.
isiloncluster1-1# touch \
/ifs/isiloncluster1/zone1/hadoop/THIS_IS_ISILON_isiloncluster1_zone1
11. Extract the Isilon Hadoop Tools to your Isilon cluster. This can be placed in any directory under /ifs. It is
recommended to use /ifs/isiloncluster1/scripts where isiloncluster1 is the name of your Isilon cluster.
[user@workstation ~]$ scp isilon-hadoop-tools-x.x.tar.gz \
root@isilon_node_ip_address:/ifs/isiloncluster1/scripts
isiloncluster1-1# tar -xzvf \
/ifs/isiloncluster1/scripts/isilon-hadoop-tools-x.x.tar.gz \
-C /ifs/isiloncluster1/scripts
isiloncluster1-1# mv /ifs/isiloncluster1/scripts/isilon-hadoop-tools-x.x \
/ifs/isiloncluster1/scripts/isilon-hadoop-tools
12. Execute the script isilon_create_users.sh. This script will create all required users and groups for the Hadoop
services and applications.
Warning: The script isilon_create_users.sh will create local user and group accounts on your Isilon
cluster for Hadoop services. If you are using a directory service such as Active Directory, and you
want these users and groups to be defined in your directory service, then DO NOT run this script.
Instead, refer to the OneFS documentation and EMC Isilon Best Practices for Hadoop Data Storage
(http://www.emc.com/collateral/white-paper/h12877-wp-emc-isilon-hadoop-best-practices.pdf).
Script Usage: isilon_create_users.sh --dist <DIST> [--startgid <GID>] [--startuid <UID>] [--zone <ZONE>]
dist This will correspond to your Hadoop distribution - phd
startgid Group IDs will begin with this value. For example: 501
startuid User IDs will begin with this value. This is generally the same as gid_base. For example: 501
zone Access Zone name. For example: System
isiloncluster1-1# bash \
/ifs/isiloncluster1/scripts/isilon-hadoop-tools/onefs/isilon_create_users.sh \
--dist phd --startgid 501 --startuid 501 --zone zone1
13. Execute the script isilon_create_directories.sh. This script will create all required directories with the appropriate ownership and permissions.
Script Usage: isilon_create_directories.sh --dist <DIST> [--fixperm] [--zone <ZONE>]
dist This will correspond to your Hadoop distribution - phd
fixperm If specified, ownership and permissions will be set on existing directories.
zone Access Zone name. For example: System
isiloncluster1-1# bash \
/ifs/isiloncluster1/scripts/isilon-hadoop-tools/onefs/isilon_create_directories.sh \
--dist phd --fixperm --zone zone1
14. Map the hdfs user to the Isilon superuser. This will allow the hdfs user to chown (change ownership of) all files.
Warning: The command below will restart the HDFS service on Isilon to ensure that any cached user
mapping rules are flushed. This will temporarily interrupt any HDFS connections coming from other Hadoop
clusters.
isiloncluster1-1# isi zone zones modify --user-mapping-rules="hdfs=>root" \
--zone zone1
isiloncluster1-1# isi services isi_hdfs_d disable ; \
isi services isi_hdfs_d enable
The service ‘isi_hdfs_d’ has been disabled.
The service ‘isi_hdfs_d’ has been enabled.
1.6.7 Create DNS Records For Isilon
You will now create the required DNS records that will be used to access your Isilon cluster.
1. Create a delegation record so that DNS requests for the zone isiloncluster1.lab.example.com are delegated to the
Service IP that will be defined on your Isilon cluster. The Service IP can be any unused static IP address in your
lab subnet.
(a) If using a Windows Server 2008 R2 server, see the screenshot below to create the delegation.
(b) Enter the DNS domain that will be delegated. This should identify your Isilon cluster. Click Next.
(c) Enter the DNS domain again. Then enter the Isilon SmartConnect Zone Service IP address. At this time,
you can ignore any warning messages about not being able to validate the server. Click OK.
(d) Click Next and then Finish.
2. If you are using your own lab DNS server, you should create a CNAME alias for your Isilon SmartConnect zone. For example, create a CNAME record for mycluster1-hdfs.lab.example.com that targets subnet0-pool0.isiloncluster1.lab.example.com.
3. Test name resolution.
[user@workstation ~]$ ping mycluster1-hdfs.lab.example.com
PING subnet0-pool0.isiloncluster1.lab.example.com (10.111.129.115) 56(84)
bytes of data.
64 bytes from 10.111.129.115: icmp_seq=1 ttl=64 time=1.15 ms
1.7 Prepare NFS Clients
To allow for scripts and other small files to be easily shared between all servers in your environment, it is highly
recommended to enable NFS (Unix Sharing) on your Isilon cluster. By default, the entire /ifs directory is already exported and this can remain unchanged. However, Isilon best-practices suggest creating an NFS mount for
/ifs/isiloncluster1/scripts.
1. On Isilon, create an NFS export for /ifs/isiloncluster1/scripts. Enable read/write and root access from all hosts
in your lab subnet.
2. Mount your NFS export on your workstation and the BDE server. (Note: The BDE post-deployment script can
automatically perform these steps for VMs deployed using BDE.)
[root@workstation ~]$ yum install nfs-utils
[root@workstation ~]$ mkdir /mnt/scripts
[root@workstation ~]$ echo \
subnet0-pool0.isiloncluster1.lab.example.com:/ifs/isiloncluster1/scripts \
/mnt/scripts nfs \
nolock,nfsvers=3,tcp,rw,hard,intr,timeo=600,retrans=2,rsize=131072,wsize=524288 \
>> /etc/fstab
[root@workstation ~]$ mount -a
3. If you want to locate any VMDKs on an Isilon cluster instead of on local disks, add an NFS datastore to each of
your ESXi hosts.
1.8 Prepare VMware Big Data Extensions
1.8.1 Installing Big Data Extensions
Refer to VMware Big Data Extensions documentation (http://pubs.vmware.com/bde-2/index.jsp) for complete details
for installing it. For a typical deployment, only the steps in the following topics need to be performed:
• Installing Big Data Extensions
– Deploy the Big Data Extensions vApp in the vSphere Web Client
– Install the Big Data Extensions Plug-In
– Connect to a Serengeti Management Server
• Managing vSphere Resources for Hadoop and HBase Clusters
– Add a Datastore in the vSphere Web Client
– Add a Network in the vSphere Web Client
1.8.2 How to Login to the Serengeti CLI
The Serengeti CLI is installed on your Big Data Extensions vApp server. You will first change directory to the path
that contains your BDE spec files which define the VMs to create.
[root@bde ~]# cd /mnt/scripts/isilon-hadoop-tools/etc
[root@bde etc]# java -jar /opt/serengeti/cli/serengeti-cli-*.jar
serengeti>connect --host localhost:8443
Enter the username: root
Enter the password: ***
Connected
1.8.3 View Resources
Use the commands below to view your BDE resources. Their names will be needed later.
serengeti>network list
serengeti>datastore list
serengeti>resourcepool list
1.9 Prepare Hadoop Virtual Machines
First you will create one virtual machine to run Pivotal Command Center. It can be provisioned using any tool you
like. This document will show how to use BDE to deploy a “cluster” consisting of a single server that will run Pivotal
Command Center.
Then you will create your actual Hadoop nodes (e.g. Resource Manager, Node Managers) using BDE.
1.9.1 Provision Virtual Machines Using Big Data Extensions
1. Copy the file isilon-hadoop-tools/etc/template-bde.json to isilon-hadoop-tools/etc/mycluster1-bde.json.
2. Edit the file mycluster1-bde.json as desired.
Table 1.1: Description of mycluster1-bde.json

nodeGroups.instanceNum
    The number of hosts that will be created in this node group. This should be 1 for all node groups except
    the worker group.
nodeGroups.storage.sizeGB
    Size of additional disks for each host. Note that the OS disk is always 20 GB and the swap disk will
    depend on the memory.
nodeGroups.storage.type
    If "shared", virtual disks will be located on an NFS export. If "local", virtual disks will be located on
    disks directly attached to the ESXi hosts.
nodeGroups.cpuNum
    Number of CPU cores assigned to the VM. This can be changed after the cluster has been deployed.
nodeGroups.memCapacityMB
    Amount of RAM assigned to the VM. This can be changed after the cluster has been deployed.
nodeGroups.rpNames
    Resource pool name. For available names, type "resourcepool list" in the Serengeti CLI.
For more details, refer to the BDE documentation.
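As an illustration of the fields in Table 1.1, a worker node group entry in mycluster1-bde.json might look like the sketch below. The values are examples only and the resource pool name defaultRP is hypothetical; start from the provided template-bde.json rather than typing this from scratch.
{
  "nodeGroups": [
    {
      "name": "worker",
      "roles": ["basic"],
      "instanceNum": 3,
      "cpuNum": 8,
      "memCapacityMB": 32768,
      "storage": {
        "type": "local",
        "sizeGB": 100
      },
      "rpNames": ["defaultRP"]
    }
  ]
}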
3. Login to the Serengeti CLI.
4. If you want to use a dedicated Pivotal Command Center server, create the Pivotal Command Center cluster.
serengeti>cluster create --name hadoopmanager \
--specFile basic-server.json --password 1 --networkName defaultNetwork
Hint: Password are from 8 to 128 characters, and can include
alphanumeric characters ([0-9, a-z, A-Z]) and the following special
characters: _, @, #, $, %, ^, &, *
Enter the password: ***
Confirm the password: ***
COMPLETED 100%
node group: server, instance number: 1
roles:[basic]
NAME                    IP              STATUS         TASK
-------------------------------------------------------------
hadoopmanager-server-0  10.111.128.240  Service Ready
cluster hadoopmanager created
Note: If any errors occur during cluster creation, you can try to resume the cluster creation with the
--resume parameter. If that does not work, you may need to delete and recreate the cluster. For example:
serengeti>cluster create --name mycluster1 --resume
serengeti>cluster delete --name mycluster1
5. Create the Hadoop cluster.
serengeti>cluster create --name mycluster1 --specFile mycluster1-bde.json \
--password 1 --networkName defaultNetwork
Hint: Password are from 8 to 128 characters, and can include
alphanumeric characters ([0-9, a-z, A-Z]) and the following special
characters: _, @, #, $, %, ^, &, *
Enter the password: ***
Confirm the password: ***
COMPLETED 100%
node group: namenode, instance number: 1
roles:[basic]
NAME                   IP              STATUS         TASK
------------------------------------------------------------
mycluster1-namenode-0  10.111.128.241  Service Ready

node group: worker, instance number: 3
roles:[basic]
NAME                 IP              STATUS         TASK
----------------------------------------------------------
mycluster1-worker-0  10.111.128.244  Service Ready
mycluster1-worker-1  10.111.128.242  Service Ready
mycluster1-worker-2  10.111.128.245  Service Ready

node group: master, instance number: 1
roles:[basic]
NAME                 IP              STATUS         TASK
----------------------------------------------------------
mycluster1-master-0  10.111.128.243  Service Ready
cluster mycluster1 created
6. After the cluster has been provisioned, you should check the following:
(a) The VMs should be distributed across your hosts evenly.
(b) The data disks (not including the boot and swap disk) should be an equal size and distributed across
datastores evenly.
1.9.2 BDE Post Deployment Script
Once the VMs come up, there are a few customizations that you will want to make. The Python script isilon-hadoop-tools/bde/bde_cluster_post_deploy.py automates this process.
This script performs the following actions:
• Retrieves the list of VMs in the BDE cluster.
• Writes a file consisting of the FQDNs of each VM. This will be used to script other cluster-wide activities.
• Enables password-less SSH to each VM.
• Updates /etc/sysconfig/network to set DHCP_HOSTNAME instead of HOSTNAME and restarts the network
service. With properly configured DHCP and DNS, this will result in correct forward and reverse DNS records
with FQDNs such as mycluster1-worker-0.lab.example.com.
• Installs several packages using Yum.
• Mounts NFS directories.
• Mounts each virtual data disk in /data/n.
• Overwrites /etc/rc.local and /etc/sysctl.conf with recommended parameters.
To run the script, follow these steps:
1. Login to your workstation (shown as user@workstation in the prompts below).
2. Ensure that you are running a Python version 2.6.6 or higher but less than 3.0.
[user@workstation ~]$ python --version
Python 2.6.6
3. If you do not have sshpass installed, you may install it on CentOS 6.x using the following commands:
[root@workstation ~]$ wget \
http://dl.fedoraproject.org/pub/epel/6/x86_64/sshpass-1.05-1.el6.x86_64.rpm
[root@workstation ~]$ rpm -i sshpass-1.05-1.el6.x86_64.rpm
4. If you have not previously created your SSH key, run the following.
[user@workstation ~]$ ssh-keygen -t rsa
5. Copy the file isilon-hadoop-tools/etc/template-phd-post.json to isilon-hadoop-tools/etc/mycluster1-post.json.
6. Edit the file mycluster1-post.json with parameters that apply to your environment.
Table 1.2: Description of mycluster1-post.json

ser_host
    The URL to the BDE web service. For example: https://bde.lab.example.com:8443
ser_username
    The user name used to authenticate to the BDE web service. This is the same account used to login using
    the Serengeti CLI.
ser_password
    The password for the above account.
skip_configure_network
    If "false" (the default), the script will set the DHCP_HOSTNAME parameter on the host. If "false", you
    must also set the dhcp_domain setting in this file. Set to "true" if you are using static IP addresses or
    DHCP without DNS integration.
dhcp_domain
    This is the DNS suffix that is appended to the host name to create a FQDN. It should begin with a dot.
    Ignored if skip_configure_network is true. For example: .lab.example.com
cluster_name
    This is the name of the BDE cluster.
host_file_name
    This file will be created and it will contain the FQDN of each host in the BDE cluster.
node_password
    This is the root password of the hosts created by BDE. This was specified when the cluster was created.
name_filter_regex
    If non-empty, specify the name of a single host in your BDE cluster to apply this script to just a single
    host.
tools_root
    This is the fully-qualified path to the Isilon Hadoop Tools. It must be in an NFS mount.
nfs_mounts
    This is a list of one or more NFS mounts that will be imported on each host in the BDE cluster.
nfs_mounts.mount_point
    NFS mount point. For example: /mnt/isiloncluster1
nfs_mounts.path
    NFS path. For example: host.domain.com:/directory
ssh_commands
    This is a list of commands that will be executed on each host. This can be used to run scripts that will
    create users, adjust mount points, etc.
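Putting the fields in Table 1.2 together, a mycluster1-post.json might look like the sketch below. All values are illustrative (passwords, paths, and the ssh_commands entry in particular are placeholders); start from the provided template-phd-post.json, which defines the exact structure expected by the script.
{
  "ser_host": "https://bde.lab.example.com:8443",
  "ser_username": "root",
  "ser_password": "changeme",
  "skip_configure_network": "false",
  "dhcp_domain": ".lab.example.com",
  "cluster_name": "mycluster1",
  "host_file_name": "mycluster1-hosts.txt",
  "node_password": "changeme",
  "name_filter_regex": "",
  "tools_root": "/mnt/scripts/isilon-hadoop-tools",
  "nfs_mounts": [
    {
      "mount_point": "/mnt/scripts",
      "path": "subnet0-pool0.isiloncluster1.lab.example.com:/ifs/isiloncluster1/scripts"
    }
  ],
  "ssh_commands": ["bash /mnt/scripts/isilon-hadoop-tools/bde/create_phd_users.sh"]
}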
7. Edit the file isilon-hadoop-tools/bde/create_phd_users.sh with the appropriate gid_base and uid_base values.
These should match the startgid and startuid values passed to isilon-hadoop-tools/onefs/isilon_create_users.sh in a previous step.
8. Run bde_cluster_post_deploy.py:
[user@workstation ~]$ cd /mnt/scripts/isilon-hadoop-tools
[user@workstation isilon-hadoop-tools]$ bde/bde_cluster_post_deploy.py \
etc/mycluster1-post.json
...
Success!
9. Repeat the above steps for your Pivotal Command Center cluster named hadoopmanager.
1.9.3 Resize Root Disk
By default, the / (root) partition size for a VM created by BDE is 20 GB. This is sufficient for a Hadoop worker
but should be increased for your Pivotal Command Center server (hadoopmanager-server-0) and the master node
(mycluster1-master-0). Follow the steps below on each of these nodes.
1. Remove old data disk.
(a) Edit /etc/fstab.
[root@mycluster1-master-0 ~]# vi /etc/fstab
(b) Remove line containing “/dev/sdc1”, save the file, and then unmount it.
[root@mycluster1-master-0 ~]# umount /dev/sdc1
[root@mycluster1-master-0 ~]# rmdir /data/1
(c) Shutdown the VM.
(d) Use the vSphere Web Client to remove virtual disk 3.
2. Use the vSphere Web Client to increase the size of virtual disk 1 to 250 GB.
3. Power on the VM and SSH into it.
4. Extend the partition.
Warning: Perform the steps below very carefully. Failure performing these steps may result in an unusable
system or lost data.
[root@mycluster1-master-0 ~]# fdisk /dev/sda
WARNING: DOS-compatible mode is deprecated. It’s strongly recommended to
switch off the mode (command ‘c’) and change display units to
sectors (command ‘u’).
Command (m for help): d
Partition number (1-4): 3
Command (m for help): n
Command action
   e   extended
   p   primary partition (1-4)
p
Partition number (1-4): 3
First cylinder (33-32635, default 33):
Using default value 33
Last cylinder, +cylinders or +size{K,M,G} (33-32635, default 32635):
Using default value 32635
Command (m for help): w
The partition table has been altered!
Calling ioctl() to re-read partition table.
WARNING: Re-reading the partition table failed with error 16: Device or
resource busy.
The kernel still uses the old table. The new table will be used at
the next reboot or after you run partprobe(8) or kpartx(8)
Syncing disks.
[root@mycluster1-master-0 ~]# reboot
5. After the server reboots, resize the file system.
[root@mycluster1-master-0 ~]# resize2fs /dev/sda3
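After the file system has been resized, you can verify the new size of the root file system; the figures below are illustrative and assume the virtual disk was grown to 250 GB in step 2.
[root@mycluster1-master-0 ~]# df -h /
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda3       246G  2.2G  231G   1% /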
1.9.4 Fill Disk
VMware Big Data Extensions creates VMDKs for each of the Hadoop servers. In an Isilon environment, these VMDKs
are not used for HDFS, of course, but large jobs that spill temporary intermediate data to local disks will utilize the
VMDKs. BDE creates VMDKs that are lazy-zeroed. This means that the VMDKs are created very quickly but the
drawback is that the first write to each sector of the virtual disk is significantly slower than subsequent writes to the
same sector. This means that optimal VMDK performance may not be achieved until after several days of normal
usage. To accelerate this, you can run the script fill_disk.py. This script will create a temporary file on each drive on
each Hadoop server. The file will grow until the disk runs out of space, then the file will be deleted.
To use the script, provide it with a file containing a list of Hadoop server FQDNs.
[user@workstation etc]$ /mnt/scripts/isilon-hadoop-tools/bde/fill_disk.py \
mycluster1-hosts.txt
1.9.5 Resizing Your Cluster
VMware Big Data Extensions can be used to resize an existing cluster. It will allow you to create additional VMs,
change the amount of RAM for each VM, and change the number of CPUs for each VM. This can be done through the
vSphere Web Client or from the Serengeti CLI using the “cluster resize” command. For instance:
serengeti>cluster resize --name mycluster1 --nodeGroup worker \
--instanceNum 20
serengeti>cluster resize --name mycluster1 --nodeGroup worker \
--cpuNumPerNode 16 --memCapacityMbPerNode 131072
After creating new VMs, you will want to run the BDE Post Deployment script and the Fill Disk script on the new
nodes. Then use Pivotal Command Center to deploy the appropriate Hadoop components.
When changing the CPUs and RAM, you will usually want to change the amount allocated for YARN or your other
services using your Hadoop cluster manager.
1.10 Install Pivotal Command Center
Note: The steps below will walk you through one way of installing Pivotal Command Center and Pivotal HD. This
is intended to guide you through your first POC. In particular, Kerberos security is not enabled. For complete details,
refer to the Pivotal HD documentation (http://pivotalhd.docs.pivotal.io/doc/2100/index.html).
1. Download the following files from https://network.pivotal.io/products/pivotal-hd. This document assumes they
have been downloaded to your Isilon cluster in /mnt/scripts/downloads.
• PHD 2.1.0
• PHD 2.1.0: Pivotal Command Center 2.3.0
• PHD 2.1.0: Pivotal HAWQ 1.2.1.0 (optional)
2. Extract Pivotal Command Center and install it.
[root@hadoopmanager-server-0 ~]# mkdir phd
[root@hadoopmanager-server-0 ~]# cd phd
[root@hadoopmanager-server-0 phd]# tar --no-same-owner -zxvf \
/mnt/scripts/downloads/PCC-2.3.0-443.x86_64.tgz
[root@hadoopmanager-server-0 phd]# cd PCC-2.3.0-443
[root@hadoopmanager-server-0 PCC-2.3.0-443]# ./install
You may find the logs in /usr/local/pivotal-cc/pcc_installation.log
Performing pre-install checks ...
[ OK ]
[INFO]: User gpadmin does not exist. It will be created as part of the
installation process.
[INFO]: By default, the directory ‘/home/gpadmin’ is created as the default
home directory for the gpadmin user. This directory must be consistent
across all nodes.
Do you wish to use the default home directory? [Yy|Nn]: y
[INFO]: Continuing with default /home/gpadmin as home directory for gpadmin
...
You have successfully installed Pivotal Command Center 2.3.0.
You now need to install a PHD cluster to monitor with Pivotal Command
Center.
You can view your cluster statuses here:
https://hadoopmanager-server-0.lab.example.com:5443/status
[root@hadoopmanager-server-0 PCC-2.3.0-443]# service commander status
nodeagent is running
Jetty is running
httpd is running
Pivotal Command Center HTTPS is running
Pivotal Command Center Background Worker is running
3. Import packages into Pivotal Command Center.
[root@hadoopmanager-server-0 ~]# su - gpadmin
[gpadmin@hadoopmanager-server-0 ~] -bash-4.1$ tar zxf \
/mnt/scripts/downloads/PHD-2.1.0.0-175.tgz
[gpadmin@hadoopmanager-server-0 ~] -bash-4.1$ tar zxf \
/mnt/scripts/downloads/PADS-1.2.1.0-10335.tgz
[gpadmin@hadoopmanager-server-0 ~] -bash-4.1$ icm_client import -s \
PHD-2.1.0.0-175/
stack: PHD-2.1.0.0
[INFO] Importing stack
[INFO] Finding and copying available stack rpms from PHD-2.1.0.0-175/... [OK]
[INFO] Creating local stack repository...
[OK]
[INFO] Import complete
[gpadmin@hadoopmanager-server-0 ~] -bash-4.1$ icm_client import -s \
PADS-1.2.1.0-10335/
stack: PADS-1.2.1.0
[INFO] Importing stack
[INFO] Finding and copying available stack rpms from PADS-1.2.1.0-10335/...
[OK]
[INFO] Creating local stack repository...
[OK]
[INFO] Import complete
1.11 Deploy a Pivotal HD Cluster
1. Make a copy of the cluster configuration template directory.
[gpadmin@hadoopmanager-server-0 ~] -bash-4.1$ cd \
/mnt/scripts/isilon-hadoop-tools/etc
[gpadmin@hadoopmanager-server-0 etc] -bash-4.1$ cp -rv template-pcc \
mycluster1-pcc
2. Edit mycluster1-pcc/clusterConfig.xml.
(a) Replace all occurrences of mycluster1 with your actual BDE cluster name. This will identify your host
names.
(b) Replace the value of the url element within isi-hdfs with your Isilon HDFS URL. For example,
hdfs://mycluster1-hdfs.lab.example.com:8020.
(c) Update yarn-nodemanager, hawq-segment, and pxf-service elements with the complete list of worker host
names. You may copy and paste from the file isilon-hadoop-tools/etc/mycluster1-hosts.txt. Do not include
your master node in these lists.
(d) Update datanode.disk.mount.points so that it refers to the local disks that you will use for temporary
files. For example, if you deployed your VMs to ESXi hosts with 4 local disks, the value will be
/data/1,/data/2,/data/3,/data/4. You should confirm that these mount points exist on
your workers using the df command (see the sketch after this step).
(e) Update yarn.nodemanager.resource.memory-mb with the maximum number of MB that you will allow
YARN applications to use on your worker nodes. You must ensure that the machine has enough RAM for
the OS, Hadoop services, YARN applications, and HAWQ. Usually 12 GB less than the machine RAM is
recommended. You may need to reduce this further if you are using multiple HAWQ segments.
(f) If you want to run more than one HAWQ segment per worker node, update hawq.segment.directory. For
example, to run two HAWQ segments, the value can be (/data/1/primary /data/2/primary).
For complete details, refer to http://pivotalhd.docs.pivotal.io/doc/2100/EditingtheClusterConfigurationFiles.html.
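The commands below are a minimal sketch of checking two of these edits from the shell before deploying. The grep
pattern and the worker host name are examples, the /data/N mount points are assumptions that must match your
actual VM layout, and the ssh check assumes root SSH access to the worker.
[gpadmin@hadoopmanager-server-0 etc] -bash-4.1$ grep -E 'hdfs://|mount.points' \
mycluster1-pcc/clusterConfig.xml
[gpadmin@hadoopmanager-server-0 etc] -bash-4.1$ ssh root@mycluster1-worker-0.lab.example.com \
df -h /data/1 /data/2 /data/3 /data/4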
3. Deploy your cluster.
[gpadmin@hadoopmanager-server-0 etc] -bash-4.1$ icm_client deploy \
--confdir mycluster1-pcc --selinuxoff --iptablesoff
Please enter the root password for the cluster nodes: ***
PCC creates a gpadmin user on the newly added cluster nodes (if any).
Please enter a non-empty password to be used for the gpadmin user: ***
Verifying input
Starting install
[==================================================================] 100%
Results:
mycluster1-hdfs.lab.example.com... [Success]
mycluster1-worker-1.lab.example.com... [Success]
mycluster1-master-0.lab.example.com... [Success]
mycluster1-worker-2.lab.example.com... [Success]
mycluster1-worker-0.lab.example.com... [Success]
Details at /var/log/gphd/gphdmgr/gphdmgr-webservices.log
Cluster ID: 5
4. Start your cluster.
[gpadmin@hadoopmanager-server-0 etc] -bash-4.1$ icm_client start \
--clustername mycluster1
Starting services
Starting cluster
[==================================================================] 100%
Results:
Details at /var/log/gphd/gphdmgr/gphdmgr-webservices.log
1.12 Initialize HAWQ
1. Initialize HAWQ.
[gpadmin@mycluster1-master-0 ~]$ source /usr/local/hawq/greenplum_path.sh
[gpadmin@mycluster1-master-0 ~]$ gpssh-exkeys -f \
/mnt/scripts/isilon-hadoop-tools/etc/mycluster1-hosts.txt
[STEP 1 of 5] create local ID and authorize on local host
[STEP 2 of 5] keyscan all hosts and update known_hosts file
[STEP 3 of 5] authorize current user on remote hosts
... send to mycluster1-worker-0.lab.example.com
*
* Enter password for mycluster1-worker-0.lab.example.com: ***
... send to mycluster1-worker-1.lab.example.com
... send to mycluster1-worker-2.lab.example.com
[STEP 4 of 5] determine common authentication file content
[STEP 5 of 5] copy authentication files to all remote hosts
... finished key exchange with mycluster1-worker-0.lab.example.com
... finished key exchange with mycluster1-worker-1.lab.example.com
... finished key exchange with mycluster1-worker-2.lab.example.com
[INFO] completed successfully
[gpadmin@mycluster1-master-0 ~]$ /etc/init.d/hawq init
20141009:06:05:35:009986 gpinitsystem:mycluster1-master-0:gpadmin-[INFO]:
-Checking configuration parameters, please wait...
...
20141009:06:06:55:009986 gpinitsystem:mycluster1-master-0:gpadmin-[INFO]:
-HAWQ instance successfully created
...
1.13 Adding a Hadoop User
You must add a user account for each Linux user that will submit MapReduce jobs. The procedure below can be used
to add a user named hduser1; a quick verification sketch follows the steps.
Warning: The steps below will create local user and group accounts on your Isilon cluster. If you are using
a directory service such as Active Directory, and you want these users and groups to be defined in your directory
service, then DO NOT run these steps. Instead, refer to the OneFS documentation and EMC Isilon Best
Practices for Hadoop Data Storage (http://www.emc.com/collateral/white-paper/h12877-wp-emc-isilon-hadoop-best-practices.pdf).
1. Add user to Isilon.
isiloncluster1-1# isi auth groups create hduser1 --zone zone1 \
--provider local
isiloncluster1-1# isi auth users create hduser1 --primary-group hduser1 \
--zone zone1 --provider local \
--home-directory /ifs/isiloncluster1/zone1/hadoop/user/hduser1
2. Add user to Hadoop nodes. Usually, this only needs to be performed on the master-0 node.
[root@mycluster1-master-0 ~]# adduser hduser1
3. Create the user’s home directory on HDFS.
[root@mycluster1-master-0 ~]# sudo -u hdfs hdfs dfs -mkdir -p /user/hduser1
[root@mycluster1-master-0 ~]# sudo -u hdfs hdfs dfs -chown hduser1:hduser1 \
/user/hduser1
[root@mycluster1-master-0 ~]# sudo -u hdfs hdfs dfs -chmod 755 /user/hduser1
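To sanity-check the new account before running any jobs, you can verify that the Linux user resolves on the master
node and that the HDFS home directory exists with the expected owner; a minimal sketch:
[root@mycluster1-master-0 ~]# id hduser1
[root@mycluster1-master-0 ~]# sudo -u hdfs hdfs dfs -ls /user | grep hduser1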
1.14 Functional Tests
The tests below should be performed to ensure a proper installation. Perform the tests in the order shown.
You must create the Hadoop user hduser1 before proceeding.
1.14.1 HDFS
[root@mycluster1-master-0 ~]# sudo -u hdfs hdfs dfs -ls /
Found 5 items
-rw-r--r--   1 root  hadoop          0 2014-08-05 05:59 /THIS_IS_ISILON
drwxr-xr-x   - hbase hbase         148 2014-08-05 06:06 /hbase
drwxrwxr-x   - solr  solr            0 2014-08-05 06:07 /solr
drwxrwxrwt   - hdfs  supergroup    107 2014-08-05 06:07 /tmp
drwxr-xr-x   - hdfs  supergroup    184 2014-08-05 06:07 /user
[root@mycluster1-master-0 ~]# sudo -u hdfs hdfs dfs -put -f /etc/hosts /tmp
[root@mycluster1-master-0 ~]# sudo -u hdfs hdfs dfs -cat /tmp/hosts
127.0.0.1 localhost
[root@mycluster1-master-0 ~]# sudo -u hdfs hdfs dfs -rm -skipTrash /tmp/hosts
[root@mycluster1-master-0 ~]# su - hduser1
[hduser1@mycluster1-master-0 ~]$ hdfs dfs -ls /
Found 5 items
-rw-r--r--   1 root  hadoop          0 2014-08-05 05:59 /THIS_IS_ISILON
drwxr-xr-x   - hbase hbase         148 2014-08-05 06:28 /hbase
drwxrwxr-x   - solr  solr            0 2014-08-05 06:07 /solr
drwxrwxrwt   - hdfs  supergroup    107 2014-08-05 06:07 /tmp
drwxr-xr-x   - hdfs  supergroup    209 2014-08-05 06:39 /user
[hduser1@mycluster1-master-0 ~]$ hdfs dfs -ls
...
1.14.2 YARN / MapReduce
[hduser1@mycluster1-master-0 ~]$ hadoop jar \
/usr/lib/gphd/hadoop-mapreduce/hadoop-mapreduce-examples.jar \
pi 10 1000
...
Estimated value of Pi is 3.14000000000000000000
[hduser1@mycluster1-master-0 ~]$ hadoop fs -mkdir in
You can put any file into the in directory. It will be used as the data source for subsequent tests.
[hduser1@mycluster1-master-0 ~]$ hadoop fs -put -f /etc/hosts in
[hduser1@mycluster1-master-0 ~]$ hadoop fs -ls in
...
[hduser1@mycluster1-master-0 ~]$ hadoop fs -rm -r out
[hduser1@mycluster1-master-0 ~]$ hadoop jar \
/usr/lib/gphd/hadoop-mapreduce/hadoop-mapreduce-examples.jar \
wordcount in out
...
[hduser1@mycluster1-master-0 ~]$ hadoop fs -ls out
Found 4 items
-rw-r--r--   1 hduser1 hduser1          0 2014-08-05 06:44 out/_SUCCESS
-rw-r--r--   1 hduser1 hduser1         24 2014-08-05 06:44 out/part-r-00000
-rw-r--r--   1 hduser1 hduser1          0 2014-08-05 06:44 out/part-r-00001
-rw-r--r--   1 hduser1 hduser1          0 2014-08-05 06:44 out/part-r-00002
[hduser1@mycluster1-master-0 ~]$ hadoop fs -cat out/part*
localhost    1
127.0.0.1    1
Browse to the YARN Resource Manager GUI http://mycluster1-master-0.lab.example.com:8088/.
Browse to the MapReduce History Server GUI http://mycluster1-master-0.lab.example.com:19888/.
In particular, confirm that you can view the complete logs for task attempts.
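If you prefer the command line to the History Server GUI, recently finished applications can usually also be listed
and their aggregated logs fetched with the yarn command, assuming log aggregation is enabled in your configuration
(the -appStates option depends on your Hadoop version). The application ID below is a placeholder for whatever ID
your pi or wordcount job reported.
[hduser1@mycluster1-master-0 ~]$ yarn application -list -appStates FINISHED
[hduser1@mycluster1-master-0 ~]$ yarn logs -applicationId application_1407542400000_0001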
1.14.3 Hive
[hduser1@mycluster1-master-0 ~]$ hadoop fs -mkdir -p sample_data/tab1
[hduser1@mycluster1-master-0 ~]$ cat - > tab1.csv
1,true,123.123,2012-10-24 08:55:00
2,false,1243.5,2012-10-25 13:40:00
3,false,24453.325,2008-08-22 09:33:21.123
4,false,243423.325,2007-05-12 22:32:21.33454
5,true,243.325,1953-04-22 09:11:33
Type <Control+D>.
[hduser1@mycluster1-master-0 ~]$ hadoop fs -put -f tab1.csv sample_data/tab1
[hduser1@mycluster1-master-0 ~]$ hive
hive>
DROP TABLE IF EXISTS tab1;
CREATE EXTERNAL TABLE tab1
(
id INT,
col_1 BOOLEAN,
col_2 DOUBLE,
col_3 TIMESTAMP
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/hduser1/sample_data/tab1';
DROP TABLE IF EXISTS tab2;
CREATE TABLE tab2
(
id INT,
col_1 BOOLEAN,
col_2 DOUBLE,
month INT,
day INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
INSERT OVERWRITE TABLE tab2
SELECT id, col_1, col_2, MONTH(col_3), DAYOFMONTH(col_3)
FROM tab1 WHERE YEAR(col_3) = 2012;
...
OK
Time taken: 28.256 seconds
hive> show tables;
OK
tab1
tab2
Time taken: 0.889 seconds, Fetched: 2 row(s)
hive> select * from tab1;
OK
1    true     123.123       2012-10-24 08:55:00
2    false    1243.5        2012-10-25 13:40:00
3    false    24453.325     2008-08-22 09:33:21.123
4    false    243423.325    2007-05-12 22:32:21.33454
5    true     243.325       1953-04-22 09:11:33
Time taken: 1.083 seconds, Fetched: 5 row(s)
hive> select * from tab2;
OK
1    true     123.123    10    24
2    false    1243.5     10    25
Time taken: 0.094 seconds, Fetched: 2 row(s)
hive> select * from tab1 where id=1;
OK
1    true     123.123    2012-10-24 08:55:00
Time taken: 15.083 seconds, Fetched: 1 row(s)
hive> select * from tab2 where id=1;
OK
1    true     123.123    10    24
Time taken: 13.094 seconds, Fetched: 1 row(s)
hive> exit;
1.14.4 Pig
[hduser1@mycluster1-master-0 ~]$ pig
grunt> a = load 'in';
grunt> dump a;
...
Success!
...
grunt> quit;
1.14.5 HBase
[hduser1@mycluster1-master-0 ~]$ hbase shell
hbase(main):001:0> create 'test', 'cf'
0 row(s) in 3.3680 seconds
=> Hbase::Table - test
hbase(main):002:0> list 'test'
TABLE
test
1 row(s) in 0.0210 seconds
=> ["test"]
hbase(main):003:0> put 'test', 'row1', 'cf:a', 'value1'
0 row(s) in 0.1320 seconds
hbase(main):004:0> put 'test', 'row2', 'cf:b', 'value2'
0 row(s) in 0.0120 seconds
hbase(main):005:0> scan 'test'
ROW                         COLUMN+CELL
 row1                       column=cf:a,timestamp=1407542488028,value=value1
 row2                       column=cf:b,timestamp=1407542499562,value=value2
2 row(s) in 0.0510 seconds
hbase(main):006:0> get 'test', 'row1'
COLUMN                      CELL
 cf:a                       timestamp=1407542488028,value=value1
1 row(s) in 0.0240 seconds
hbase(main):007:0> quit
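If you want to clean up the test table when you are finished, it can be disabled and dropped from a new HBase shell
session (a minimal sketch; your timing output will differ):
[hduser1@mycluster1-master-0 ~]$ hbase shell
hbase(main):001:0> disable 'test'
hbase(main):002:0> drop 'test'
hbase(main):003:0> quit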
1.14.6 HAWQ
[gpadmin@mycluster1-master-0 ~]$ psql
psql (8.2.15)
Type "help" for help.
gpadmin=# \l
                List of databases
   Name    |  Owner  | Encoding | Access privileges
-----------+---------+----------+-------------------
 gpadmin   | gpadmin | UTF8     |
 postgres  | gpadmin | UTF8     |
 template0 | gpadmin | UTF8     |
 template1 | gpadmin | UTF8     |
(4 rows)
gpadmin=# \c gpadmin
psql (8.4.20, server 8.2.15)
WARNING: psql version 8.4, server version 8.2.
Some psql features might not work.
You are now connected to database "gpadmin".
gpadmin=# create table test (a int, b text);
NOTICE: Table doesn't have 'DISTRIBUTED BY' clause -- Using column named 'a'
as the Greenplum Database data distribution key for this table.
HINT: The 'DISTRIBUTED BY' clause determines the distribution of data. Make
sure column(s) chosen are the optimal data distribution key to minimize skew.
CREATE TABLE
gpadmin=# insert into test values (1, '435252345');
INSERT 0 1
gpadmin=# select * from test;
 a |     b
---+-----------
 1 | 435252345
(1 row)
gpadmin=# \q
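Likewise, the HAWQ test table can be removed afterwards by reconnecting with psql and dropping it (a minimal
sketch):
[gpadmin@mycluster1-master-0 ~]$ psql
gpadmin=# drop table test;
DROP TABLE
gpadmin=# \q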
1.14.7 Pivotal Command Center
Browse to the Pivotal Command Center GUI https://hadoopmanager-server-0.lab.example.com:5443.
Log in using the following default account:
Username: gpadmin
Password: Gpadmin1
1.14.8 Additional Functional Tests
If you will be using other Hadoop components and applications, refer to
http://pivotalhd.docs.pivotal.io/doc/2100/RunningPHDSamplePrograms.html for additional functional tests.
1.15 Searching Wikipedia
One of the many unique features of Isilon is its multi-protocol support. This allows you, for instance, to write a file
using SMB (Windows) or NFS (Linux/Unix) and then read it using HDFS to perform Hadoop analytics on it.
In this section, we exercise this capability by downloading the entire Wikipedia database (excluding media) to Isilon
using your favorite web browser. As soon as the download completes, we'll run a Hadoop grep to search the entire text of
Wikipedia using our Hadoop cluster. As this search doesn’t rely on a word index, your regular expression can be as
complicated as you like.
1. First, let’s connect your client (with your favorite web browser) to your Isilon cluster.
(a) If you are using a Windows host or other SMB client:
i. Click Start -> Run.
ii. Enter: \\subnet0-pool0.isiloncluster1.lab.example.com\ifs
iii. You may authenticate as root with your Isilon root password.
iv. Browse to \ifs\isiloncluster1\zone1\hadoop\tmp.
v. Create a directory here called wikidata. This is where you will download the Wikipedia data to.
(b) If you are using a Linux host or other NFS client:
i. Mount your NFS export.
[root@workstation ~]$ mkdir /mnt/isiloncluster1
[root@workstation ~]$ echo \
subnet0-pool0.isiloncluster1.lab.example.com:/ifs \
/mnt/isiloncluster1 nfs \
nolock,nfsvers=3,tcp,rw,hard,intr,timeo=600,retrans=2,rsize=131072,wsize=524288
>> /etc/fstab
[root@workstation ~]$ mount -a
[root@workstation ~]$ mkdir -p \
/mnt/isiloncluster1/isiloncluster1/zone1/hadoop/tmp/wikidata
2. Open your favorite web browser and go to http://dumps.wikimedia.org/enwiki/latest.
3. Locate the file enwiki-latest-pages-articles.xml.bz2 and download it directly to the wikidata folder on Isilon.
Your web browser will be writing this file to the Isilon file system using SMB or NFS.
Note: This file is approximately 10 GB in size and contains the entire text of the English version of Wikipedia.
If this is too large, you may want to download one of the smaller files such as enwiki-latest-all-titles.gz.
Note: If you get an access-denied error, the quickest resolution is to SSH into any node in your Isilon cluster
as root and run chmod -R a+rwX /ifs/isiloncluster1/zone1/hadoop/tmp.
4. Now let’s run the Hadoop grep job. We’ll search for all two-word phrases that begin with EMC.
[hduser1@mycluster1-master-0 ~]$ hadoop fs -ls /tmp/wikidata
[hduser1@mycluster1-master-0 ~]$ hadoop fs -rm -r /tmp/wikigrep
[hduser1@mycluster1-master-0 ~]$ hadoop jar \
/usr/lib/gphd/hadoop-mapreduce/hadoop-mapreduce-examples.jar \
grep /tmp/wikidata /tmp/wikigrep 'EMC [^ ]*'
5. When the job completes, use your favorite text file viewer to view the output file /tmp/wikigrep/part-r-00000.
If you are using Windows, you may open the file in WordPad. A command-line alternative is sketched after these steps.
6. That’s it! Can you think of anything else to search for?
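As mentioned in step 5, you can also inspect the results from the Hadoop command line instead of an SMB or NFS
client; a minimal sketch (the output path matches the wikigrep job above):
[hduser1@mycluster1-master-0 ~]$ hadoop fs -cat /tmp/wikigrep/part-r-00000 | head -n 20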
1.16 Where To Go From Here
You are now ready to fine-tune the configuration and performance of your Isilon Hadoop environment. You should
consider the following areas for fine-tuning.
• YARN NodeManager container memory
1.17 Known Limitations
Although Kerberos is supported by OneFS, this document does not address a Hadoop environment secured with
Kerberos. Instead, refer to EMC Isilon Best Practices for Hadoop Data Storage
(http://www.emc.com/collateral/white-paper/h12877-wp-emc-isilon-hadoop-best-practices.pdf).
1.18 Legal
To learn more about how EMC products, services, and solutions can help solve your business and IT
challenges, contact your local representative or authorized reseller (http://www.emc.com/contact-us/contact-us.esp),
visit www.emc.com (http://www.emc.com), or explore and compare products in the EMC Store
(https://store.emc.com/?EMCSTORE_CPP).
Copyright © 2014 EMC Corporation. All Rights Reserved.
EMC believes the information in this publication is accurate as of its publication date. The information is subject to
change without notice.
The information in this publication is provided “as is.” EMC Corporation makes no representations or warranties of
any kind with respect to the information in this publication, and specifically disclaims implied warranties of merchantability or fitness for a particular purpose.
Use, copying, and distribution of any EMC software described in this publication requires an applicable software
license.
For the most up-to-date listing of EMC product names, see EMC Corporation Trademarks on EMC.com.
VMware and vSphere are registered trademarks or trademarks of VMware, Inc. in the United States and/or other
jurisdictions. All other trademarks used herein are the property of their respective owners.
1.19 References
EMC Community - Isilon Hadoop Starter Kit Home Page (https://community.emc.com/docs/DOC-26892)
EMC Community - Isilon Hadoop Info Hub (https://community.emc.com/docs/DOC-39529)
https://community.emc.com/community/connect/everything_big_data
http://bigdatablog.emc.com/
http://www.emc.com/collateral/white-paper/h12877-wp-emc-isilon-hadoop-best-practices.pdf
http://www.emc.com/collateral/white-paper/h13354-wp-security-compliance-scale-out-hadoop-data-lakes.pdf
https://github.com/claudiofahey/isilon-hadoop-tools
http://pubs.vmware.com/bde-2/index.jsp
http://hadoop.apache.org/
http://www.transparencymarketresearch.com
http://www.pivotal.io/