Platfora Deployment Planning Guide

Version 4.5
Copyright Platfora 2015
Last Updated: 10:14 p.m. June 28, 2015
Contents
Document Conventions
Contact Platfora Support
Copyright Notices
Chapter 1: About Platfora Deployments
    Platfora Deployment Architectures
        On-Premise Hadoop Deployments
        Amazon AWS Cloud Deployments
    Platfora Server Architecture
    FAQs - Platfora Deployments
Chapter 2: Supported Hadoop and Hive Versions
Chapter 3: System Requirements (On-Premise)
    Platfora Server Requirements
    Hadoop Resource Requirements
Chapter 4: System Requirements (AWS Cloud)
    Platfora EC2 Instance Requirements
    Amazon EMR Instance Requirements
    AWS Security Settings for Platfora
        Amazon AWS Virtual Private Cloud (VPC)
        IAM User and IAM Roles for Platfora
        EC2 Security Group Settings
Chapter 5: Port Configuration Requirements
    Ports to Open on Platfora Nodes
    Ports to Open on Hadoop Nodes
Chapter 6: Browser Requirements
Appendix A: Hardware Specifications for Platfora Nodes
Appendix B: EC2 Considerations for Platfora Instances
Preface
This guide provides information about what you need to consider when deploying a new Platfora®
cluster. This guide is intended for system and Hadoop administrators who are responsible for procuring
and managing server resources. Knowledge of Linux system administration, network administration and
Hadoop administration is recommended.
Document Conventions
This documentation uses certain text conventions for language syntax and code examples.
Convention: $
Usage: Command-line prompt that precedes a command to be entered in a command-line terminal session.
Example: $ ls

Convention: $ sudo
Usage: Command-line prompt for a command that requires root permissions (commands will be prefixed with sudo).
Example: $ sudo yum install open-jdk-1.7

Convention: UPPERCASE
Usage: Function names and keywords are shown in all uppercase for readability, but keywords are case-insensitive (can be written in upper or lower case).
Example: SUM(page_views)

Convention: italics
Usage: Italics indicate a user-supplied argument or variable.
Example: SUM(field_name)

Convention: [ ] (square brackets)
Usage: Square brackets denote optional syntax items.
Example: CONCAT(string_expression[,...])

Convention: ... (ellipsis)
Usage: An ellipsis denotes a syntax item that can be repeated any number of times.
Example: CONCAT(string_expression[,...])
Contact Platfora Support
For technical support, you can send an email to:
[email protected]
Or visit the Platfora support site for the most up-to-date product news, knowledge base articles, and
product tips.
http://support.platfora.com
To access the support portal, you must have a valid support agreement with Platfora. Please contact
your Platfora sales representative for details about obtaining a valid support agreement or with questions
about your account.
Copyright Notices
Copyright © 2012-15 Platfora Corporation. All rights reserved.
Platfora believes the information in this publication is accurate as of its publication date. The
information is subject to change without notice.
THE INFORMATION IN THIS PUBLICATION IS PROVIDED “AS IS.” PLATFORA
CORPORATION MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY KIND WITH
RESPECT TO THE INFORMATION IN THIS PUBLICATION, AND SPECIFICALLY DISCLAIMS
IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR
PURPOSE.
Use, copying, and distribution of any Platfora software described in this publication requires an
applicable software license. Platfora®, You Should Know™, Interest Driven Pipeline™, Fractal Cache™,
and Adaptive Job Synthesis™ are trademarks of the Platfora Corporation. Apache Hadoop™ and Apache
Hive™ are trademarks of the Apache Software Foundation. All other trademarks used herein are the
property of their respective owners.
Embedded Software Copyrights and License Agreements
Platfora contains the following open source and third-party proprietary software subject to their
respective copyrights and license agreements:
• Apache Hive PDK
• dom4j
• freemarker
• GeoNames
• Google Maps API
• javassist
• javax.servlet
• Mortbay Jetty 6.1.26
• OWASP CSRFGuard 3
• PostgreSQL JDBC 9.1-901
• Scala
• sjsxp : 1.0.1
• Unboundid
Chapter 1: About Platfora Deployments
Platfora runs on dedicated servers in the same network as your Hadoop deployment, which can be in an on-premise data center or in the cloud. Platfora uses the data processing services of Hadoop to process and prepare
data for analysis. Platfora uses the data storage services of Hadoop to access the raw data and to store the output
of the optimized data it prepares. This section explains how Platfora is deployed and the basics of the Platfora/
Hadoop server architecture.
Topics:
• Platfora Deployment Architectures
• Platfora Server Architecture
• FAQs - Platfora Deployments
Platfora Deployment Architectures
The Platfora software runs on a scale-out cluster of servers. These servers can be physical servers in an
on-premise data center or virtual server instances in the cloud. Platfora uses native Hadoop protocols
to connect to the distributed file system and data processing services of Hadoop. Platfora should be
deployed on dedicated machines with low-latency connections to these Hadoop cluster services. This
section explains how Platfora is deployed in your network environment, using either an on-premise or
AWS cloud deployment of Hadoop.
On-Premise Hadoop Deployments
An on-premise Hadoop deployment means that you already have an existing Hadoop installation in your
data center (either a physical data center or a virtual private cloud).
Platfora connects to the Hadoop cluster managed by your organization, and the majority of your
organization's data is stored in the distributed file system of this primary Hadoop cluster.
For on-premise Hadoop deployments, the Platfora servers should be on their own dedicated hardware
co-located in the same data center as your Hadoop cluster. A data center can be a physical location with
actual hardware resources, or a virtual private cloud environment with virtual server instances (such as
Rackspace or Amazon EC2). Platfora recommends putting the Platfora servers on a network with at least
1 Gbps connectivity to the Hadoop nodes.
Platfora users access the Platfora master node using an HTML5-compliant web browser. The Platfora
master node accesses the HDFS NameNode and the MapReduce JobTracker or YARN Resource
Manager using native Hadoop protocols. The Platfora worker nodes access the HDFS DataNodes
directly. If using a firewall, Platfora recommends placing the Platfora servers on the same side of the
firewall as your Hadoop cluster.
Platfora software can run on a wide variety of server configurations, from a single server to a cluster
scaled across multiple servers. Because Platfora runs best with all of the active lenses readily available in RAM,
Platfora recommends obtaining servers optimized for higher RAM capacity, with a minimum of 8 CPUs.
Amazon AWS Cloud Deployments
An Amazon Web Services (AWS) cloud deployment means that you do not have a persistent Hadoop
cluster. Instead, your organization uses Amazon S3 for raw data storage and Amazon EMR for on-demand Hadoop data processing.
In an Amazon AWS cloud deployment, the Platfora server instances are deployed on dedicated, high-memory EC2 instances. Your organization’s raw data is managed in Amazon's Simple Storage Service
(S3). Platfora uses Amazon Elastic MapReduce (EMR) to run its data processing jobs (lens builds). The
results of the lens build jobs are then written back to S3.
Platfora Server Architecture
Platfora connects to an existing Hadoop implementation, and makes the raw data residing in Hadoop
accessible to users. The Platfora server has a number of services that work together with Hadoop's
services to access the raw data, prepare it for analysis, and present the results to users. This topic helps
you understand the main components of the Platfora server architecture.
The Platfora Master Node
You can have a fully-functioning Platfora installation with just one node—the master node. The master
node manages the following Platfora services:
• Metadata Catalog - Platfora's metadata catalog holds all of the information about the data managed
by Platfora (the datasets, lenses, vizboards and so on). The metadata catalog is a relational database
that runs on the Platfora master node, but is accessed by all nodes in the Platfora cluster.
• Lens Builder - The lens builder interfaces with the data processing services of Hadoop. It translates
data requests from the Platfora application into a series of custom MapReduce jobs, which it then
submits to the Hadoop Job Tracker or Resource Manager for execution. After the requested data has
been extracted and transformed in Hadoop, the job results are written back to the Hadoop file system
in Platfora's proprietary file format called a lens.
• On-Disk Storage - Finished lenses are immediately copied from the Hadoop file system to on-disk
storage of the Platfora nodes. The data of a lens is distributed across all of the available worker nodes
in a Platfora cluster.
• In-Memory Query Engine - When users explore and analyze data in Platfora, they are actually
generating queries that run against a lens. The result of a lens query is rendered as a visualization
in Platfora. When users construct visualizations, they choose a lens to work with. Choosing a lens
loads its data into Platfora's in-memory query engine. The in-memory query engine has two kinds of
processes that work on a query:
1. Query Coordinator - The query coordinator process runs on the master node only, and translates
actions made in the Platfora application into queries. The coordinator sends the query to the
workers for processing, then consolidates the partial results from each worker into a final result.
2. Query Worker - The query worker process typically runs on the worker nodes, but the master
may also serve as a worker in some cases. A query worker process works on its portion of lens
data for a given query.
• Web Application Server - Platfora's user interface runs as a web application in your network. Users
connect to Platfora using any HTML5-compliant browser. Through the browser, users interact with
data in Hadoop as easily as browsing a web site.
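The coordinator/worker split described for the in-memory query engine follows a scatter-gather pattern. The following is only an illustrative sketch of that pattern, not Platfora's implementation: the function names, the use of threads to stand in for worker nodes, and the sample `page_views` data are all assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def worker_partial_sum(partition, field):
    """Query worker (hypothetical): aggregate only this worker's portion of the lens data."""
    return sum(row[field] for row in partition)


def coordinator_sum(partitions, field):
    """Query coordinator (hypothetical): send the query to every worker,
    then consolidate the partial results into a final result."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        partials = list(pool.map(lambda p: worker_partial_sum(p, field), partitions))
    return sum(partials)


# Made-up lens data, split across three workers.
lens_partitions = [
    [{"page_views": 10}, {"page_views": 5}],
    [{"page_views": 7}],
    [{"page_views": 3}, {"page_views": 1}],
]

print(coordinator_sum(lens_partitions, "page_views"))
```

Because each worker touches only its own partition, adding workers spreads both the stored data and the aggregation work, which is the reason given in this guide for scaling out with worker nodes.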
The Platfora Worker Nodes
The Platfora worker nodes are used to distribute lens storage capacity and query processing workload.
As users work with more and bigger lenses in Platfora, more memory and processing power is needed to
render visualizations quickly. Administrators can add worker nodes to scale up lens storage
capacity and performance. By using the resources of multiple machines to store and process lens data,
Platfora can handle true 'big data' query workloads.
FAQs - Platfora Deployments
Got questions about what you need to get Platfora up and running? Want to know how Platfora is
deployed in your data center environment and how it works with Hadoop? This topic answers the most
frequently asked questions (FAQs) about Platfora installation and deployment.
What do I need before I can install Platfora?
Before you can install Platfora, you will need:
• Hadoop - Platfora needs access to an installed and running Hadoop cluster, or to an Amazon Web
Services (AWS) account with Amazon S3 (Simple Storage Service) and EMR (Elastic MapReduce)
enabled.
• Linux Server(s) - You will need one or more dedicated servers running a supported Linux operating
system on which to install Platfora. The Platfora server(s) should be in the same data center (or
region) as your Hadoop distribution, but not on the same machines.
• Platfora Binaries - A Platfora customer support representative can give you the download link to the
Platfora installation package for your chosen Hadoop distribution. Platfora provides both rpm and tar
installer packages.
• Platfora License - A Platfora customer support representative must issue you a license file. Trial
period licenses are available upon request for pilot installations.
• Platfora Installation Guide - You will need the Platfora installation guide for your specific Hadoop
distribution. The setup steps vary slightly depending on the version of Hadoop you are using.
What are the high-level steps involved in installing Platfora?
Every Platfora installation involves these basic steps, although the details will vary slightly depending
on the Hadoop distribution you are using:
• Configure Hadoop for Platfora Access - Make sure that the Platfora server(s) can access your
Hadoop services over the network and that Platfora has write access to a designated directory in
the Hadoop file system. Obtain the required connection details for your Hadoop services (Platfora
connects to Hadoop during setup).
• Install Prerequisites on all Platfora Nodes - Make sure the Platfora servers have the required
dependencies before installing Platfora. If using the rpm installer, Platfora provides a base package
that includes the dependencies. If using the tar installer, you will need to manually install the
dependent software yourself.
• Install the Platfora Software on the Master - Install the Platfora binaries on the master node.
• Set Up the Platfora Master - Run the setup utility to configure the Platfora master server and connect
it to your Hadoop services.
• Start Platfora - After setup completes, start the Platfora server. You should now have a fully-functioning single-node Platfora installation.
• Run Tests and Load the Tutorial Data - After setup completes, you may want to run some tests to
make sure that Platfora is properly configured and can access your Hadoop cluster. One way to test
everything is to load the tutorial data that comes with your Platfora installation. This will put some
data in Hadoop and build a small lens to make sure everything is working.
• Add Platfora Worker Nodes - Once you have the Platfora master node up and running, you can use
it to add Platfora worker nodes to the cluster. The master node is always used to install and manage
the worker nodes.
Is there a trial version of Platfora?
Platfora does not currently have a trial version available for download. You can contact Platfora
Customer Support to arrange for a pilot or trial installation.
Why would I need multiple Platfora nodes?
When users work with lens data in Platfora, that data is loaded into memory so that queries (vizzes) are
fast and responsive. If there is more lens data than can fit into memory, then some queries may be slow
or not be able to run at all. Adding more nodes to your Platfora cluster makes more disk, memory and
CPU available to store and process lens data.
How many Platfora nodes would I need?
Platfora is intended for big data query workloads, and performs best when using the resources of
multiple machines. Although you can have a fully-functioning Platfora installation with just one node, a
multi-node installation is necessary for optimal performance and bigger lens sizes.
The ideal number of Platfora nodes really depends on a lot of factors: lens size, lens quantity, data
variety, and number of concurrent users (to name a few). Your Platfora account representative will help
you determine the number of nodes that best fits your unique data requirements. You can also scale up
your Platfora cluster as your data and usage grows.
How does Platfora interact with Hadoop?
Platfora uses the powerful distributed storage and processing features of Hadoop, but masks the
complexity of working with HDFS and MapReduce by providing an easy-to-use web interface.
Platfora uses Hadoop to access the raw data stored in its distributed file system (DFS) and makes the
data visible to Platfora users. It uses the data processing services of Hadoop (MapReduce) to pull
requested data and prepare it for analysis. The result of these processing jobs is the Platfora lens.
Platfora lenses are stored in the Hadoop distributed file system, as well as copied over to the Platfora
servers.
Can Platfora connect to more than one source system?
When you install Platfora, you connect it to one Hadoop distribution. This is the primary source system
that Platfora uses to access the source data and process its lens builds.
You can create data sources that point to external sources (such as a cloud storage service or a relational
database). However, this external data must be pulled over to the primary Hadoop source system during
lens build processing. To avoid moving large amounts of data over the network, Platfora recommends
using external data sources for smaller, supplemental datasets only.
What does Platfora do to the data in Hadoop?
Platfora reads the raw data, but does not edit, update, or delete it in place. It makes a copy of the
requested portion of the data when it builds a lens, and does its lens processing on the copied data. Your
original data remains intact and unaltered.
How does Platfora keep my data secure?
Platfora's role-based security allows you to control who can authenticate to the Platfora application and
what actions they can perform. You can maintain user credentials within the Platfora application, or
configure Platfora to use an external LDAP directory service to authenticate users.
To authorize access to the raw data, you can either manage data access permissions within the Platfora
application itself, or you can configure Platfora to use Kerberos authorization to check the HDFS file
system permissions.
How does Platfora handle redundancy and high availability?
Platfora relies on Hadoop for redundancy and high-availability of the raw data itself.
The Platfora worker nodes are fully redundant and highly available. The worker nodes process the lens
queries submitted to the Platfora application. Lens data is distributed and replicated across all of the
worker nodes in the Platfora cluster. Depending on the number of worker nodes you have, you can lose a
node and still continue processing queries without interruption of service.
Redundancy for the Platfora master node involves taking routine backups of the metadata catalog database so
you can restore the master node if needed.
Chapter 2: Supported Hadoop and Hive Versions
This section lists the Hadoop distributions and versions that are compatible with the Platfora installation
packages. If using Hive as a data source for Platfora, the version of Hive must be compatible with the version of
Hadoop you are using.
Hadoop Distro           Version        Hive Version  M/R Version  Platfora Package
Cloudera 5              CDH5.0         0.12          YARN         cdh5
Cloudera 5              CDH5.1         0.12          YARN         cdh5
Cloudera 5              CDH5.2         0.13          YARN         cdh52
Cloudera 5              CDH5.3         0.13.1        YARN         cdh52
Cloudera 5              CDH5.4         1.1           YARN         cdh54
Hortonworks             HDP 2.1.x      0.13.0        YARN         hadoop_2_4_0_hive_0_13_0
Hortonworks             HDP 2.2.x      0.14.0        YARN         hadoop_2_6_0_hive_0_14_0
MapR                    MapR 4.0.1     0.12.0        YARN         mapr4
MapR                    MapR 4.0.2     0.13.0        YARN         mapr402
MapR                    MapR 4.1.0     0.13.0        YARN         mapr402
Pivotal Labs            PivotalHD 3.0  0.14.0        YARN         hadoop_2_6_0_hive_0_14_0
Amazon EMR (AMI 3.7.x)  Hadoop 2.4.0   0.13.1        YARN         hadoop_2_4_0_hive_0_13_0
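When scripting an installation, the version matrix above can be captured as a simple lookup. This is a minimal sketch: the dictionary keys and the `platfora_package` helper are hypothetical names chosen for the example, and only the version-to-package pairs come from this guide.

```python
# Distribution version -> Platfora installation package, as listed in this chapter.
PLATFORA_PACKAGES = {
    "CDH5.0": "cdh5",
    "CDH5.1": "cdh5",
    "CDH5.2": "cdh52",
    "CDH5.3": "cdh52",
    "CDH5.4": "cdh54",
    "HDP 2.1.x": "hadoop_2_4_0_hive_0_13_0",
    "HDP 2.2.x": "hadoop_2_6_0_hive_0_14_0",
    "MapR 4.0.1": "mapr4",
    "MapR 4.0.2": "mapr402",
    "MapR 4.1.0": "mapr402",
    "PivotalHD 3.0": "hadoop_2_6_0_hive_0_14_0",
    "Amazon EMR (AMI 3.7.x)": "hadoop_2_4_0_hive_0_13_0",
}


def platfora_package(distro_version):
    """Return the Platfora package name for a supported distribution version."""
    try:
        return PLATFORA_PACKAGES[distro_version]
    except KeyError:
        raise ValueError(f"{distro_version!r} is not listed as a supported version")


print(platfora_package("CDH5.3"))
```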
Chapter 3: System Requirements (On-Premise)
The Platfora software runs on a scale-out cluster of servers. You can install Platfora on a single node to start,
and then scale up storage and processing capacity by adding additional nodes. Platfora requires access to an
existing, compatible Hadoop implementation in order to start. Users then access the Platfora application using a
compatible web browser client. This section describes the system requirements for on-premise deployments of
the Platfora servers, Hadoop source systems, network connectivity, and web browser clients.
Topics:
• Platfora Server Requirements
• Hadoop Resource Requirements
Platfora Server Requirements
Platfora recommends the following minimum system requirements for Platfora servers. For multi-node
installations, the master server and all worker servers must be the same operating system (OS) and
system configuration (same amount of memory, CPU, etc.).
64-bit Operating System or Amazon Machine Image (AMIs)
    CentOS 6.2-6.5 (7.0 is not supported)
    RHEL 6.2-6.5 (7.0 is not supported)
    Scientific Linux 6.2
    Amazon Linux AMI 2014.03+
    Oracle Enterprise Linux 6.x
    Ubuntu 12.04.1 LTS or higher
    Security-Enhanced Linux 6.2 (see note 1)
Software
    Java 1.7
    Python 2.6.8, 2.7.1, 2.7.3 through 2.7.6 (3.0 not supported)
    PostgreSQL 9.2.1-1, 9.2.5, 9.2.7 or 9.3 (master only)
    OpenSSL 1.0.1 or higher (see note 2)
Unix Utilities
    rsync, ssh, scp, cp, tar, tail, sysctl, ntp, wget
Note 1: If you wish to install Security-Enhanced Linux, refer to Platfora's Support site for
installation instructions.
Note 2: Only required if you want to enable SSL for secure communications between Platfora
servers.
Memory
    64 GB minimum, 256 GB recommended
    The server needs enough memory to accommodate actively used lens data. Additionally, it
    needs 1-2 GB reserved for normal operations and the lens query engine workspace.
CPU
    8 cores minimum, 16 recommended
Disk
    All Platfora nodes (master or worker) require 300 MB for the Platfora installation. Every
    node requires high-speed local storage and a local disk cache configured as a single
    logical volume. Hardware RAID is recommended for the best performance.
    All nodes combined require appropriate free space for aggregated data structures (Platfora
    lenses). At a minimum, you will need twice the amount of disk space as the amount of
    system memory.
    The Platfora master node requires an additional 700 MB (approximately) for the metadata
    catalog (dataset definitions, vizboard and visualization definitions, lens definitions, etc.).
Network
    1 Gbps reliable network connectivity between Platfora master server and query processing
    servers
    1 Gbps reliable network connectivity between Platfora master server and Hadoop NameNode
    and JobTracker/ResourceManager node
    Network bandwidth should be comparable to the amount of memory on the Platfora master
    server
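The disk guidance above (300 MB for the installation, roughly 700 MB more on the master for the metadata catalog, and at least twice the system memory for lens data) can be turned into a rough per-node estimate. This is a back-of-the-envelope sketch only: the function name is hypothetical, it assumes the "twice system memory" rule is applied to each node individually, and it assumes 1 GB = 1024 MB.

```python
INSTALL_MB = 300   # Platfora installation, every node
CATALOG_MB = 700   # approximate metadata catalog size, master node only


def min_disk_gb(memory_gb, is_master=False):
    """Estimate the minimum local disk (in GB) for one Platfora node."""
    lens_gb = 2 * memory_gb  # at least twice the system memory for lens data
    overhead_mb = INSTALL_MB + (CATALOG_MB if is_master else 0)
    return lens_gb + overhead_mb / 1024


# A master at the 64 GB memory minimum and a worker at the 256 GB recommendation.
print(round(min_disk_gb(64, is_master=True), 2))
print(round(min_disk_gb(256), 2))
```

Treat the result as a floor rather than a target; actual lens sizes determine how much free space is appropriate.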
Hadoop Resource Requirements
Platfora must be able to connect to an existing Hadoop installation. Platfora also requires permissions
and resources in the Hadoop source system. This section describes the Hadoop resource requirements for
Platfora.
Platfora uses the remote Distributed File System (DFS) of the Hadoop cluster for persistent storage and
as the primary data source. Optionally, you can also configure Platfora to use a Hive metastore server as
a data source.
Platfora uses the Hadoop MapReduce services to process data and build lenses. For larger lens builds to
succeed, Platfora requires minimum resources on the Hadoop cluster for MapReduce tasks.
DFS Disk Space
    Platfora requires a designated persistent storage directory in the remote distributed file
    system (DFS) with appropriate free space for Platfora system files and data structures
    (lenses). The location is configurable.
DFS Permissions
    The platfora system user needs read permissions to source data directories and files.
    The platfora system user needs write permissions to Platfora's persistent storage
    directory on DFS.
MapReduce Permissions
    The platfora system user needs to be added to the submit-jobs and administer-jobs access
    control list (or added to a group that has these permissions).
DFS Resources
    Minimum Open File Limit = 5000
MapReduce Resources
    Minimum Memory for Task Processes = 1 GB
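One way to spot-check the open file limit is to read the per-process file descriptor limit on a Hadoop node. The sketch below assumes that "Open File Limit" refers to the soft `RLIMIT_NOFILE` value (what `ulimit -n` reports) for the account running the DFS processes; this guide does not say so explicitly. The helper name is hypothetical, and the `resource` module is available on Unix-like systems only.

```python
import resource

MIN_OPEN_FILES = 5000  # minimum open file limit listed under DFS Resources


def meets_open_file_minimum(soft_limit, minimum=MIN_OPEN_FILES):
    """Return True if a soft open-file limit satisfies the stated minimum."""
    if soft_limit == resource.RLIM_INFINITY:  # unlimited always satisfies it
        return True
    return soft_limit >= minimum


# Check the limit of the current process (run this as the relevant service user).
soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("soft limit:", soft, "meets minimum:", meets_open_file_minimum(soft))
```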
Chapter 4: System Requirements (AWS Cloud)
This section describes the system requirements for customers who plan to use Amazon Web Services (AWS) as
their installation environment for Platfora, and Simple Storage Service (S3) and Elastic MapReduce (EMR)
as their Hadoop distributed data storage and processing services.
Topics:
• Platfora EC2 Instance Requirements
• Amazon EMR Instance Requirements
• AWS Security Settings for Platfora
Platfora EC2 Instance Requirements
Platfora recommends the following system requirements for Amazon EC2 instances that will serve as
Platfora server nodes. For multi-node installations, the master server instance and all worker server
instances must be the same configuration (same EC2 instance type, storage configuration, network
configuration, etc.).
Amazon Machine Images (AMIs)
    Amazon Linux AMI 2014.03.x or higher
    Red Hat Enterprise Linux 6.2 - 6.5
    Ubuntu Server 12.04.1 LTS or higher
EC2 Instance Type
    Small to Medium Lens Sizes: c3.8xlarge
    Medium to Large Lens Sizes, 10+ Platfora nodes: r3.8xlarge
    Medium to Large Lens Sizes, 1-9 Platfora nodes: i2.8xlarge
Root Device Volume (EBS)
    Recommended Size = 1 TB
    Type = General Purpose (SSD)
Additional EBS Volumes
    Optional. Additional EBS volumes can be attached to an EC2 instance after launch time,
    and can be used to increase lens cache storage capacity if needed. EBS volumes are less
    expensive than Instance Store volumes, and the data is persistent between shutdowns.
Instance Store Volume (Ephemeral)
    Optional. You may choose to add instance store volumes for the Platfora lens cache
    instead of using EBS volumes. This costs more, but offers slightly faster performance.
    Instance store volumes can only be attached to an EC2 instance at launch time, and the
    data is not saved when the instance shuts down. The size of an instance store volume
    depends on the instance type:
    c3.8xlarge: 2 x 320 GB SSD (640 GB)
    r3.8xlarge: 2 x 320 GB SSD (640 GB)
    i2.8xlarge: 8 x 800 GB SSD (6400 GB)
Enhanced Networking
    yes (requires use of VPC instead of EC2-Classic)
EBS Optimized Instance
    yes (the 8xlarge instance types are EBS optimized instances by default)
Availability Zone
    yes (use same zone for all nodes in the Platfora cluster)
Placement Group
    yes (use same placement group for all nodes in the Platfora cluster)
IAM User
    yes (create a dedicated Platfora IAM User in your AWS account)
Other Required Software
    Java 1.7
    Python 2.7.8 through 2.7.9 (3.0 not supported)
    PostgreSQL 9.2.1-1.28 (AMZN), 9.2.5, 9.2.7 or 9.3 (master node only)
    OpenSSL 1.0.1 or higher (see note 3)
Required Unix Utilities
    rsync, ssh, scp, cp, tar, tail, sysctl, ntp, wget
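The instance type recommendations and instance store sizes listed for Platfora EC2 nodes can be encoded as a small helper for provisioning scripts. This is a sketch under stated assumptions: the category labels "small-to-medium" and "medium-to-large" and the function name are hypothetical, since this guide does not define numeric boundaries for lens sizes.

```python
# Total instance store capacity per instance type, from the requirements above.
INSTANCE_STORE_GB = {
    "c3.8xlarge": 2 * 320,
    "r3.8xlarge": 2 * 320,
    "i2.8xlarge": 8 * 800,
}


def recommended_instance_type(lens_size, node_count):
    """Map a lens size category and Platfora node count to an EC2 instance type."""
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    if lens_size == "small-to-medium":
        return "c3.8xlarge"
    if lens_size == "medium-to-large":
        # 10+ nodes: r3.8xlarge; 1-9 nodes: i2.8xlarge
        return "r3.8xlarge" if node_count >= 10 else "i2.8xlarge"
    raise ValueError(f"unknown lens size category: {lens_size!r}")


choice = recommended_instance_type("medium-to-large", 4)
print(choice, INSTANCE_STORE_GB[choice], "GB instance store")
```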
Amazon EMR Instance Requirements
Platfora launches an Elastic MapReduce (EMR) cluster when it builds a lens. This section describes the
recommended requirements for the EMR instances that are launched by Platfora.
Amazon EMR is Hadoop as a web service. Platfora uses the EMR Hadoop cluster to process its lens
builds. Since the EMR Hadoop cluster is only instantiated as needed, the source data does not reside
in the Hadoop Distributed File System (HDFS) of the EMR Hadoop cluster. The source data is instead
stored on Amazon S3. Data is copied from S3 to EMR for data processing, then the results are written
back to S3 when the job completes.
Note 3: Only required if you want to enable SSL for secure communications between Platfora
servers.
At the start of a lens build job, the raw source data is copied from S3 to the local HDFS file system on
the EMR nodes. The EMR instances must have enough local instance storage to support the input source
dataset and the temporary workspace for intermediate lens build job results. Also consider that the local
HDFS of the EMR cluster replicates the data to ensure redundancy and high availability during lens
build processing.
Platfora recommends the i2.4xlarge instance type for EMR data nodes and the m3.xlarge for the EMR
name node. The i2.4xlarge offers a great balance between total local disk space, CPU power, and per-node memory size.
Hadoop Version
    2.4.0
AMI Version
    3.7.0
EMR NameNode Instance Type
    m3.xlarge
EMR DataNode Instance Type
    i2.4xlarge
Number of EMR DataNodes
    The number of nodes you will need to complete a lens build depends on the following
    factors:
    • The size of the raw dataset in S3 that is considered as input to the lens build.
    • The replication factor of HDFS. EMR clusters of 1-4 nodes have a replication factor of
      1, 5-9 nodes have a replication factor of 2, and 10 or more nodes have a replication
      factor of 3.
    • Temporary work space for intermediate lens build results: about 20-30% of total disk
      space.
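The three factors above can be combined into a rough estimate of how many EMR data nodes a lens build needs. This is an illustrative sketch, not a Platfora sizing tool. It assumes 3200 GB of local instance storage per i2.4xlarge node (4 x 800 GB SSD, a figure not stated in this guide), reserves 30% of disk for temporary work space, and treats clusters of 10 or more nodes as having a replication factor of 3. The function names are hypothetical.

```python
def replication_factor(nodes):
    """HDFS replication factor by EMR cluster size, per the list above."""
    if nodes <= 4:
        return 1
    if nodes <= 9:
        return 2
    return 3  # assumption: 10 or more nodes


def min_data_nodes(raw_gb, per_node_gb=3200, temp_fraction=0.3, max_nodes=1000):
    """Smallest node count whose usable disk holds the replicated raw dataset."""
    for nodes in range(1, max_nodes + 1):
        usable_gb = nodes * per_node_gb * (1 - temp_fraction)
        if usable_gb >= raw_gb * replication_factor(nodes):
            return nodes
    raise ValueError("dataset does not fit within max_nodes")


print(min_data_nodes(5000))   # 5 TB of raw input data
print(min_data_nodes(10000))  # 10 TB of raw input data
```

Note how the estimate jumps when the cluster crosses a replication threshold: a dataset that almost fits on 4 nodes needs far more than 5, because every block is then stored twice.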
AWS Security Settings for Platfora
Amazon Web Services (AWS) has a number of security features that you can use to protect your AWS
account and cloud server instances. This section contains security setting recommendations if you plan
to use Amazon Elastic MapReduce (EMR) as the Hadoop implementation for your Platfora cluster.
Amazon AWS Virtual Private Cloud (VPC)
To use Amazon EMR for Hadoop data processing, Platfora must be able to launch an EMR cluster in a
public subnet. Administrators do this by provisioning an Amazon VPC with a public subnet, and then
specifying the subnet identifier in Platfora. Platfora must create the EMR cluster on an Internet-facing
subnet to allow the AWS EMR Provisioning Service to reach the EMR cluster.
Additionally, you must ensure the Platfora server can communicate with the Amazon EMR cluster. If
the Platfora server is on the same subnet as the Amazon EMR cluster, this happens automatically. If
the Platfora server and the EMR cluster are on different VPC subnets, then a route between the subnets
needs to be added to the Route table(s) so that communication can occur between the two subnets. Also,
if the VPC uses Access Control Lists (ACLs), then those ACLs must be modified to allow traffic from
Platfora to Hadoop.
The subnet identifier cannot exceed 255 characters in length.
After the Amazon VPC has been provisioned, specify its subnet identifier in the
platfora.emr.subnet.id Platfora configuration property.
For more information on setting up and using an Amazon VPC with Amazon EMR, see
http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-plan-vpc-subnet.html.
IAM User and IAM Roles for Platfora
AWS Identity and Access Management (IAM) allows you to create users, groups, and roles to control
access to AWS services and resources. Platfora recommends creating an IAM User account and two
IAM Roles specifically for use by Platfora.
Platfora uses a combination of an IAM User and IAM Roles to communicate with Amazon AWS and to
create an EMR cluster. After an Amazon AWS administrator creates the platfora IAM User and the two
IAM Roles, a Platfora system administrator must enter information about that user and those roles
in Platfora.
The Platfora server uses security credentials of the platfora IAM User to request Amazon AWS to
create an Amazon EMR cluster. Once that request is approved, the platfora IAM User then passes an
IAM Role to actually launch an EMR cluster, and then uses another IAM Role to start EC2 instances in
the EMR cluster. You must specify these roles in Platfora.
For more details on creating the user and roles, see Create IAM User for Platfora and Create IAM Roles
for Platfora.
Create IAM User for Platfora
The Amazon AWS administrator can create a new platfora user in the IAM Management Console
of your AWS account. After creating the user, download the AWS credentials for this user. The Platfora
system administrator will need the Access Key Id and Secret Access Key to initialize Platfora
for use with Amazon EMR.
The security policy for the platfora IAM User must have (at a minimum) the permissions listed in the
following sample policy:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": [
        "iam:ListRoles",
        "iam:PassRole",
        "elasticmapreduce:*",
        "s3:GetBucketLocation",
        "s3:ListAllMyBuckets"
      ],
      "Effect": "Allow",
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::Bucket_defined_in_core-site.xml",
        "arn:aws:s3:::Datasource_Bucket_1",
        "arn:aws:s3:::Datasource_Bucket_n"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:PutObject",
        "s3:Get*",
        "s3:DeleteObject"
      ],
      "Resource": [
        "arn:aws:s3:::Bucket_defined_in_core-site.xml/*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:Get*"
      ],
      "Resource": [
        "arn:aws:s3:::Datasource_Bucket_1/path/to/files/*",
        "arn:aws:s3:::Datasource_Bucket_n/*"
      ]
    }
  ]
}
Under Permissions for this user, attach a security policy that contains the permissions listed above.
These permissions allow the platfora IAM User to pass an IAM Role to launch the EMR cluster,
start an EMR cluster, and access S3 for source data during data ingest.
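If you maintain several data source buckets, it can help to generate the policy document rather than edit it by hand. The following Python sketch builds the same four statements as the sample policy for the platfora IAM User and prints them as JSON. The function name and the bucket names are hypothetical; substitute the bucket defined in your core-site.xml and your own data source buckets.

```python
import json


def platfora_user_policy(core_site_bucket, datasource_buckets):
    """Build the sample platfora IAM User policy as a dict.

    datasource_buckets maps a bucket name to the key pattern that Platfora
    may read, for example "path/to/files/*" or "*" for the whole bucket.
    """
    def arn(bucket, key=None):
        return f"arn:aws:s3:::{bucket}" + (f"/{key}" if key else "")

    return {
        "Version": "2012-10-17",
        "Statement": [
            {   # EMR control plus role passing and bucket discovery
                "Effect": "Allow",
                "Action": ["iam:ListRoles", "iam:PassRole",
                           "elasticmapreduce:*", "s3:GetBucketLocation",
                           "s3:ListAllMyBuckets"],
                "Resource": "*",
            },
            {   # List the contents of every bucket Platfora touches
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [arn(core_site_bucket)]
                            + [arn(b) for b in datasource_buckets],
            },
            {   # Read and write access to the core-site.xml bucket
                "Effect": "Allow",
                "Action": ["s3:PutObject", "s3:Get*", "s3:DeleteObject"],
                "Resource": [arn(core_site_bucket, "*")],
            },
            {   # Read-only access to source data
                "Effect": "Allow",
                "Action": ["s3:Get*"],
                "Resource": [arn(b, key)
                             for b, key in datasource_buckets.items()],
            },
        ],
    }


if __name__ == "__main__":
    # Hypothetical bucket names.
    policy = platfora_user_policy(
        "my-platfora-bucket",
        {"sales-data": "path/to/files/*", "web-logs": "*"},
    )
    print(json.dumps(policy, indent=2))
```

The printed JSON can be pasted into the policy editor when you attach the policy to the user.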
Create IAM Roles for Platfora
Amazon requires all AWS users to use IAM Roles to launch EMR clusters. One IAM Role is used to
start the Amazon EMR service, and the other role is used by the EC2 instances in the EMR cluster.
Amazon AWS offers some default IAM Roles for these services. However, Platfora recommends
creating custom IAM Roles specifically for use by Platfora instead.
The Amazon AWS administrator can create the IAM Roles in the IAM Management Console of your
AWS account. Create a role for each of the following EMR cluster services, and specify them in Platfora
using the specified configuration properties:
• Amazon EMR service (service role). In Amazon AWS, create an IAM Role and attach a security
policy that contains at a minimum the permissions specified below. Enter this IAM Role name in
the platfora.emr.service.role Platfora configuration property. The custom role you define
corresponds to the default IAM Role Amazon offers called EMR_DefaultRole.
• EC2 instances (instance profile) in the Amazon EMR cluster. In Amazon AWS, create an IAM
Role and attach a security policy that contains at a minimum the permissions specified below.
Enter this IAM Role name in the platfora.emr.jobflow.role Platfora configuration
property. The custom role you define corresponds to the default IAM Role Amazon offers called
EMR_EC2_DefaultRole.
The security policy for the Amazon EMR service (service role) IAM Role must have (at a minimum) the
permissions listed in the following sample policy:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": [
        "ec2:AuthorizeSecurityGroupIngress",
        "ec2:CancelSpotInstanceRequests",
        "ec2:CreateSecurityGroup",
        "ec2:CreateTags",
        "ec2:DeleteTags",
        "ec2:Describe*",
        "ec2:ModifyImageAttribute",
        "ec2:ModifyInstanceAttribute",
        "ec2:RequestSpotInstances",
        "ec2:RunInstances",
        "ec2:TerminateInstances"
      ],
      "Effect": "Allow",
      "Resource": "*"
    },
    {
      "Action": [
        "iam:PassRole",
        "iam:ListRolePolicies",
        "iam:GetRole",
        "iam:GetRolePolicy",
        "iam:ListInstanceProfiles"
      ],
      "Effect": "Allow",
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:Get*"
      ],
      "Resource": "arn:aws:s3:::Bucket_defined_in_core-site.xml/*"
    }
  ]
}
The security policy for the EC2 instances (instance profile) IAM Role must have (at a minimum) the
permissions listed in the following sample policy:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Resource": "*",
      "Action": [
        "ec2:Describe*",
        "elasticmapreduce:Describe*",
        "elasticmapreduce:ListBootstrapActions",
        "elasticmapreduce:ListClusters",
        "elasticmapreduce:ListInstanceGroups",
        "elasticmapreduce:ListInstances",
        "elasticmapreduce:ListSteps",
        "s3:ListAllMyBuckets"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::Bucket_defined_in_core-site.xml",
        "arn:aws:s3:::Datasource_Bucket_1",
        "arn:aws:s3:::Datasource_Bucket_n"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:PutObject",
        "s3:Get*",
        "s3:DeleteObject"
      ],
      "Resource": [
        "arn:aws:s3:::Bucket_defined_in_core-site.xml/*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:Get*",
        "s3:List*"
      ],
      "Resource": [
        "arn:aws:s3:::Datasource_Bucket_1/path/to/files/*",
        "arn:aws:s3:::Datasource_Bucket_n/*",
        "arn:aws:s3:::*elasticmapreduce/*"
      ]
    }
  ]
}
Verify that the permissions for and access to Amazon resources (especially S3) for
the EC2 instances role are the same or greater than the permissions and access
assigned to the platfora IAM User. For example, if the platfora IAM User can
access an Amazon S3 bucket, but the EC2 instances role cannot, then lens builds
that rely on that S3 bucket will fail.
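A quick way to spot such a gap is to compare the S3 resources named in the two policy documents. The following Python sketch is a rough heuristic only: it compares resource patterns with shell-style wildcard matching, ignores which S3 action each statement grants, and does not reproduce the full IAM evaluation logic. The function names are hypothetical.

```python
from fnmatch import fnmatchcase


def s3_resources(policy):
    """Collect every resource pattern from Allow statements that grant
    at least one s3: action."""
    found = set()
    for stmt in policy["Statement"]:
        if stmt.get("Effect") != "Allow":
            continue
        actions = stmt["Action"]
        if isinstance(actions, str):
            actions = [actions]
        if not any(action.startswith("s3:") for action in actions):
            continue
        resources = stmt["Resource"]
        if isinstance(resources, str):
            resources = [resources]
        found.update(resources)
    return found


def uncovered_s3_resources(user_policy, role_policy):
    """Return user S3 resources that no role resource pattern covers."""
    role_patterns = s3_resources(role_policy)
    return sorted(
        resource for resource in s3_resources(user_policy)
        if not any(fnmatchcase(resource, pattern) for pattern in role_patterns)
    )
```

Load the two policy documents with json.load and pass them to uncovered_s3_resources. An empty list suggests the EC2 instances role names at least the same S3 resources as the platfora IAM User; any entries in the list are worth reviewing by hand.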
For more information on using IAM Roles for EMR, see
http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-iam-roles.html.
EC2 Security Group Settings
EC2 security groups allow you to specify firewall rules for your Amazon Elastic Compute Cloud
(EC2) server instances.
EC2 security group rules are independent of, and in addition to, the software firewalling provided by the
instance's operating system. Security groups must be defined before you create an EC2 instance.
The security group configured for the Platfora server instance must permit connections from your user
network to the Platfora web application server port (8001 by default). You also may want to open the
EMR Hadoop ResourceManager and JobHistory web ports so that you can monitor and troubleshoot
YARN jobs executed by Platfora.
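The inbound rules described above can be written down as data before you enter them in the AWS console. The following Python sketch builds rule dictionaries in the shape of the EC2 IpPermissions structure; it does not call AWS. The user network CIDR is a hypothetical example, and the monitoring ports are placeholders that you should confirm for your EMR release.

```python
def ingress_rule(port, cidr, description):
    """One TCP ingress rule in the shape of an EC2 IpPermissions entry."""
    return {
        "IpProtocol": "tcp",
        "FromPort": port,
        "ToPort": port,
        "IpRanges": [{"CidrIp": cidr, "Description": description}],
    }


def platfora_server_rules(user_cidr, web_port=8001, monitoring_ports=()):
    """Rules that let the user network reach the Platfora web application
    and, optionally, the Hadoop monitoring web interfaces."""
    rules = [ingress_rule(web_port, user_cidr, "Platfora web application")]
    rules += [ingress_rule(port, user_cidr, "Hadoop monitoring web UI")
              for port in monitoring_ports]
    return rules


if __name__ == "__main__":
    # Hypothetical user network; 8088 and 19888 are placeholder web UI ports.
    for rule in platfora_server_rules("203.0.113.0/24",
                                      monitoring_ports=(8088, 19888)):
        print(rule["FromPort"], rule["IpRanges"][0]["CidrIp"])
```

Restricting the source range to your user network, rather than 0.0.0.0/0, keeps the Platfora web port from being reachable by the whole Internet.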
An example security group configuration for a Platfora server instance would look something like the
following:
[Figure: example security group configuration for a Platfora server instance]
Chapter 5: Port Configuration Requirements
You must open ports in the firewall of your Platfora nodes to allow client access and intra-cluster
communications. You also must open ports within your Hadoop cluster to allow access from Platfora. This
section lists the default ports required.
Topics:
• Ports to Open on Platfora Nodes
• Ports to Open on Hadoop Nodes
Ports to Open on Platfora Nodes
Your Platfora master node must allow HTTP connections from your user network. All nodes must allow
connections from the other Platfora nodes in a multi-node cluster.
On Amazon EC2 instances, you must configure the port firewall rules on the
Platfora server instances in addition to the EC2 Security Group Settings.
Platfora Service                          Default Port   Allow connections from…
Master Web Services Port (HTTP)           8001           External user network, Platfora worker servers, localhost
Secure Master Web Services Port (HTTPS)   8443           External user network, Platfora worker servers, localhost
Master Server Management Port             8002           Platfora worker servers, localhost
Worker Server Management Port             8002           Platfora master server, other Platfora worker servers, localhost
Master Data Port                          8003           Platfora worker servers, localhost
Worker Data Port                          8003           Platfora master server, other Platfora worker servers, localhost
Master PostgreSQL Database Port           5432           Platfora worker servers, localhost
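After you open the ports, you can confirm from another node that each one accepts connections. The following Python sketch attempts a plain TCP connection to each default port listed in this section. It only shows that something is listening and reachable; it does not verify that the listener is a Platfora service. The host name in the usage example is hypothetical.

```python
import socket

# Default Platfora ports from this section.
PLATFORA_PORTS = {
    "Master Web Services (HTTP)": 8001,
    "Secure Master Web Services (HTTPS)": 8443,
    "Server Management": 8002,
    "Data": 8003,
    "Master PostgreSQL Database": 5432,
}


def is_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_node(host, ports=PLATFORA_PORTS):
    """Map each service name to whether its port accepted a connection."""
    return {name: is_port_open(host, port) for name, port in ports.items()}


if __name__ == "__main__":
    # Hypothetical host name for the Platfora master node.
    for name, reachable in check_node("platfora-master.example.com").items():
        print(f"{name}: {'open' if reachable else 'blocked or closed'}")
```

Run the check from a worker node to test intra-cluster ports, and from a machine on the user network to test the web services ports.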
Ports to Open on Hadoop Nodes
Platfora must be able to access certain services of your Hadoop cluster. This section lists the Hadoop
services Platfora needs to access and the default ports for those services.
Note that this only applies to on-premise Hadoop deployments or to self-managed Hadoop deployments
in a virtual private cloud, not to Amazon Elastic MapReduce (EMR).
Default Ports by Hadoop Distro

Hadoop Service                   CDH, HDP, Pivotal   Apache Hadoop   MapR    Allow connections from…
HDFS NameNode                    8020                9000            N/A     Platfora master and worker servers
HDFS DataNodes                   50010               50010           N/A     Platfora master and worker servers
MapRFS CLDB                      N/A                 N/A             7222    Platfora master and worker servers
MapRFS DataNodes                 N/A                 N/A             5660    Platfora master and worker servers
MRv1 JobTracker                  8021                9001            9001    Platfora master server
MRv1 JobTracker Web UI           50030               50030           50030   External user network (optional)
YARN ResourceManager             8032                8032            8032    Platfora master server
YARN ResourceManager Web UI      8088                8088            8088    External user network (optional)
YARN Job History Server          10020               10020           10020   Platfora master server
YARN Job History Server Web UI   19888               19888           19888   External user network (optional)
HiveServer Thrift Port           10000               10000           10000   Platfora master server
Hive Metastore DB Port [4]       9083, 9933 (HDP2)   N/A             9083    Platfora master server

[4] If connecting to Hive directly using JDBC
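When you prepare firewall requests for a specific distribution, it can help to hold the table as data and filter it. The following Python sketch encodes the default ports listed in this section and returns only the services that apply to one distribution. The dictionary layout and function name are illustrative assumptions, and the values are defaults that your cluster may override.

```python
# service: (CDH/HDP/Pivotal, Apache Hadoop, MapR); None means not applicable
HADOOP_PORTS = {
    "HDFS NameNode": (8020, 9000, None),
    "HDFS DataNodes": (50010, 50010, None),
    "MapRFS CLDB": (None, None, 7222),
    "MapRFS DataNodes": (None, None, 5660),
    "MRv1 JobTracker": (8021, 9001, 9001),
    "MRv1 JobTracker Web UI": (50030, 50030, 50030),
    "YARN ResourceManager": (8032, 8032, 8032),
    "YARN ResourceManager Web UI": (8088, 8088, 8088),
    "YARN Job History Server": (10020, 10020, 10020),
    "YARN Job History Server Web UI": (19888, 19888, 19888),
    "HiveServer Thrift Port": (10000, 10000, 10000),
    # The table also lists 9933 for HDP2; adjust if that applies to you.
    "Hive Metastore DB Port": (9083, None, 9083),
}

DISTRO_COLUMN = {"cdh": 0, "hdp": 0, "pivotal": 0, "apache": 1, "mapr": 2}


def ports_for(distro):
    """Return {service: default port} for the services that apply to distro."""
    column = DISTRO_COLUMN[distro.lower()]
    return {service: ports[column]
            for service, ports in HADOOP_PORTS.items()
            if ports[column] is not None}


if __name__ == "__main__":
    for service, port in ports_for("mapr").items():
        print(f"{service}: {port}")
```

An unknown distribution name raises KeyError, which is intentional: a typo should not silently produce an empty firewall request.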
Chapter 6: Browser Requirements
Users can connect to the Platfora web application using the latest HTML5-compliant web browsers. Platfora
supports the latest releases of the following web browsers:
• Chrome (preferred browser)
• Firefox
• Safari
• Internet Explorer with the Compatibility View feature disabled (versions prior to IE 10 are not
supported)
Platfora supports these web browsers on desktop machines only.
Appendix A: Hardware Specifications for Platfora Nodes
This section shows some example hardware configurations that have worked well in other Platfora
deployments.
To achieve the best performance and lowest operating cost, Platfora recommends that all servers in the Platfora
cluster have the same configuration. At a minimum, all servers in the Platfora cluster should have an identical
RAM capacity and the same number of CPU cores.
Platfora software can be deployed on either rack or blade servers. Typical Platfora server configurations have
specifications similar to:
Rack Server Specs                    Blade Server Specs
CPU: 2x E5-2440 2.40GHz 6-cores      CPU: 2x E5-2470 2.30GHz 8-cores
RAM: 12x 16GB RAM (192GB total)      RAM: 12x 16GB RAM (192GB total)
Disk: 8x 300GB 10K SAS 2.5” HDDs     Disk: 2x 900GB 10K SATA 2.5” HDDs
Network: 1x Gbps NIC
Appendix B: EC2 Considerations for Platfora Instances
This section explains what to consider when using Amazon Elastic Compute Cloud (EC2) instances to deploy a
production Platfora cluster.
EC2 Storage Considerations
When you launch an Amazon EC2 instance, you have several choices with regard to the storage that
you can attach to the instance. There are two main types of storage available: Elastic Block Store (EBS)
and Instance Store (Ephemeral). The type and capacity of storage available depends on the instance type
you choose.
• The Root Device Volume - All instances have a root device volume, which is backed by either EBS
or Instance storage. Platfora recommends EBS-backed instance types; they launch faster and use
persistent storage.
Root device volumes for Platfora nodes should always be increased to the maximum size (1
TB). This ensures adequate space for the Platfora installation and logs. When using the Platfora
recommended 8xlarge instance types, general purpose (SSD) EBS volumes also guarantee 3,000
IOPS.
• EBS Volumes - Amazon EBS volumes are highly available and reliable storage volumes that can be
attached to any running instance that is in the same Availability Zone. Amazon EBS volumes that are
attached to an Amazon EC2 instance are exposed as storage volumes that persist independently from
the life of the instance. Also, with Amazon EBS you pay only for what you use, making it a cost-effective choice.
Platfora recommends General Purpose (SSD) EBS volumes. For maximum performance, you can
choose Provisioned IOPS EBS volumes instead.
If you choose an instance type that is not EBS optimized by default, make sure to choose EBS
Optimized Instance at launch time. This ensures that the instance has a dedicated connection to the
EBS volume, which reduces overall latency and maximizes throughput. The Platfora recommended
8xlarge instance types are already EBS optimized instances.
• Instance Store Volumes - Ephemeral storage is ideal for temporary storage of information that
changes frequently, such as caches, or for data that is replicated across multiple instances. Instances
that use EBS for the root device do not, by default, have instance store volumes available at boot
time. Also, you can't attach instance store volumes after you've launched an instance. Therefore, if
you want your Amazon EBS-backed instance to use instance store volumes, you must specify them
when you first launch your instance.
The choice to add instance store volumes to Platfora nodes depends on price, performance, and
persistence of the data. Ephemeral storage allows data to be read faster from disk, but is also more
expensive. Also, the data stored on these volumes is not persistent; it is lost if the instance is
shut down or terminated.
If you do decide to use ephemeral drives for the Platfora cache directories, use RAID 0 (Stripe).
This ensures Platfora has access to the maximum possible disk space and will also yield the highest
performance. Remember, ephemeral drives are temporary storage, so there is no need to use RAID 1.
When the instance is stopped, the data is not saved.
In Platfora, the PLATFORA_DATA/dfscache and PLATFORA_DATA/fsCache directories can
be mapped to instance store volumes (if you decide to use them). These are the only directories of a
Platfora installation that should use ephemeral storage. Lens data is backed up in S3, so the loss of
any cached data is temporary.
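The RAID recommendation above comes down to simple arithmetic. The following Python sketch compares usable capacity for striping (RAID 0) and mirroring (RAID 1) across a set of volumes. The volume sizes in the example are hypothetical and do not describe any particular instance type.

```python
def raid_usable_gb(volume_sizes_gb, level):
    """Usable capacity in GB for a RAID 0 or RAID 1 array.

    RAID 0 stripes across all volumes (limited by the smallest volume);
    RAID 1 mirrors, so capacity equals the smallest single volume.
    """
    if not volume_sizes_gb:
        raise ValueError("at least one volume is required")
    smallest = min(volume_sizes_gb)
    if level == 0:
        return smallest * len(volume_sizes_gb)
    if level == 1:
        return smallest
    raise ValueError("only RAID levels 0 and 1 are handled in this sketch")


if __name__ == "__main__":
    volumes = [320, 320]  # hypothetical instance store volumes, in GB
    print("RAID 0:", raid_usable_gb(volumes, 0), "GB")
    print("RAID 1:", raid_usable_gb(volumes, 1), "GB")
```

With two 320 GB volumes, striping exposes 640 GB to the cache directories while mirroring exposes 320 GB, which is why mirroring gains nothing for data that is already disposable.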
EC2 Network Considerations
• Placement Groups - All Platfora server instances should be launched within the same Amazon EC2
Placement Group. A placement group is a logical grouping of instances within a single Availability
Zone. Using placement groups enables applications to participate in a low-latency, 10 Gbps network.
Placement groups are recommended for applications that benefit from low network
latency, high network throughput, or both. See the Amazon EC2 Documentation on Placement
Groups.
• Enhanced Networking - To enable enhanced networking, you must launch each instance in the same
Amazon EC2 virtual private cloud (VPC). You can't enable enhanced networking if the instance
is in EC2-Classic. For more information, see the Amazon VPC User Guide and the Amazon EC2
Documentation on Enhanced Networking.