Platfora Installation Guide

Version 4.5
For Amazon EMR Cloud Deployments
Copyright Platfora 2015
Last Updated: 10:14 p.m. June 28, 2015
Contents
Document Conventions
Contact Platfora Support
Copyright Notices
Chapter 1: Installation Overview (Amazon EMR)
  Amazon AWS Cloud Deployments
  Master vs Worker Node Installations
  Preinstall Checklist
  High-Level Install Steps
Chapter 2: System Requirements (AWS Cloud)
  Supported Hadoop and Hive Versions
  Platfora EC2 Instance Requirements
  Amazon EMR Instance Requirements
  AWS Security Settings for Platfora
    Amazon AWS Virtual Private Cloud (VPC)
    IAM User and IAM Roles for Platfora
    EC2 Security Group Settings
  Port Configuration Requirements
    Ports to Open on Platfora Nodes
  Browser Requirements
Chapter 3: Install Platfora Software and Dependencies
  About the Platfora Installer Packages
  Install Using RPM Packages
    Install Dependencies RPM Package
    Install Optional Security RPM Package
    Install Platfora RPM Package (Master Only)
  Install Using the TAR Package
    Create the Platfora System User
    Set OS Kernel Parameters
    Install Dependent Software
    Install Platfora TAR Package (Master Only)
    Install PDF Dependencies (Master Only)
Chapter 4: Configure Environment on Platfora Nodes
  Install the MapR Client Software (MapR Only)
  Configure Network Environment
    Configure /etc/hosts File
    Verify Connectivity Between Platfora Nodes
    Verify Connectivity to Hadoop Nodes
    Open Firewall Ports
  Configure Passwordless SSH
    Verify Local SSH Access
    Exchange SSH Keys (Multi-Node Only)
  Synchronize the System Clocks
  Create Local Storage Directories
  Verify Environment Variables
Chapter 5: Initialize Platfora Master Node
  Connect Platfora to Your Hadoop Services
    Understand How Platfora Connects to Hadoop
    Create Local Hadoop Configuration Directory
  Initialize the Platfora Master
    Configure SSL for Client Connections
    Configure SSL for Catalog Connections
    About System Diagnostic Data
  Configure Platfora for Amazon EMR
  Troubleshoot Setup Issues
    View the Platfora Log Files
    Setup Fails Setting up Catalog Metadata Service
    TEST FAILED: Checking integrity of binaries
Chapter 6: Start Platfora
  Start the Platfora Server
  Log in to the Platfora Web Application
  Add a License Key
  Change the Default Admin Password
  Load the Tutorial Data
Chapter 7: Initialize a Worker Node
Appendix A: Command Line Utility Reference
  setup.py
  hadoop-check
  hadoopcp
  hadoopfs
  install-node
  platfora-catalog
    platfora-catalog ssl
  platfora-config
  platfora-export
  platfora-import
  platfora-license
    platfora-license install
    platfora-license uninstall
    platfora-license view
  platfora-node
    platfora-node add
    platfora-node config
  platfora-services
    platfora-services start
    platfora-services stop
    platfora-services restart
    platfora-services status
    platfora-services sync
  platfora-syscapture
  platfora-syscheck
Appendix B: Glossary
Preface
This guide provides information and instructions for installing and initializing a Platfora® cluster. This
guide is intended for system administrators with knowledge of Linux/Unix system administration and
basic Hadoop administration.
This Amazon Web Services (AWS) cloud installation guide is for organizations that do not have a
persistent Hadoop cluster. Instead, your organization uses Amazon S3 for raw data storage and Amazon
Elastic MapReduce (EMR) for on-demand Hadoop data processing.
Document Conventions
This documentation uses certain text conventions for language syntax and code examples.
Convention: $
Usage: Command-line prompt. Precedes a command to be entered in a command-line terminal session.
Example: $ ls

Convention: $ sudo
Usage: Command-line prompt for a command that requires root permissions (commands will be prefixed with sudo).
Example: $ sudo yum install open-jdk-1.7

Convention: UPPERCASE
Usage: Function names and keywords are shown in all uppercase for readability, but keywords are case-insensitive (can be written in upper or lower case).
Example: SUM(page_views)

Convention: italics
Usage: Italics indicate a user-supplied argument or variable.
Example: SUM(field_name)

Convention: [ ] (square brackets)
Usage: Square brackets denote optional syntax items.
Example: CONCAT(string_expression[,...])

Convention: ... (ellipsis)
Usage: An ellipsis denotes a syntax item that can be repeated any number of times.
Example: CONCAT(string_expression[,...])
Contact Platfora Support
For technical support, you can send an email to:
[email protected]
Or visit the Platfora support site for the most up-to-date product news, knowledge base articles, and
product tips.
http://support.platfora.com
To access the support portal, you must have a valid support agreement with Platfora. Please contact
your Platfora sales representative for details about obtaining a valid support agreement or with questions
about your account.
Copyright Notices
Copyright © 2012-15 Platfora Corporation. All rights reserved.
Platfora believes the information in this publication is accurate as of its publication date. The
information is subject to change without notice.
THE INFORMATION IN THIS PUBLICATION IS PROVIDED “AS IS.” PLATFORA
CORPORATION MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY KIND WITH
RESPECT TO THE INFORMATION IN THIS PUBLICATION, AND SPECIFICALLY DISCLAIMS
IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR
PURPOSE.
Use, copying, and distribution of any Platfora software described in this publication requires an
applicable software license. Platfora®, You Should Know™, Interest Driven Pipeline™, Fractal Cache™,
and Adaptive Job Synthesis™ are trademarks of the Platfora Corporation. Apache Hadoop™ and Apache
Hive™ are trademarks of the Apache Software Foundation. All other trademarks used herein are the
property of their respective owners.
Embedded Software Copyrights and License Agreements
Platfora contains the following open source and third-party proprietary software subject to their
respective copyrights and license agreements:
• Apache Hive PDK
• dom4j
• freemarker
• GeoNames
• Google Maps API
• javassist
• javax.servlet
• Mortbay Jetty 6.1.26
• OWASP CSRFGuard 3
• PostgreSQL JDBC 9.1-901
• Scala
• sjsxp : 1.0.1
• Unboundid
Chapter 1: Installation Overview (Amazon EMR)
This section provides an overview of the Platfora installation process for Amazon AWS cloud environments that
will use Amazon Elastic MapReduce (EMR) as their primary Hadoop deployment for Platfora.
Topics:
• Amazon AWS Cloud Deployments
• Master vs Worker Node Installations
• Preinstall Checklist
• High-Level Install Steps
Amazon AWS Cloud Deployments
An Amazon Web Services (AWS) cloud deployment means that you do not have a persistent Hadoop
cluster. Instead, your organization uses Amazon S3 for raw data storage and Amazon EMR for on-demand Hadoop data processing.
In an Amazon AWS cloud deployment, the Platfora server instances are deployed on dedicated, high-memory EC2 instances. Your organization’s raw data is managed in Amazon's Simple Storage Service
(S3). Platfora uses Amazon Elastic MapReduce (EMR) to run its data processing jobs (lens builds). The
results of the lens build jobs are then written back to S3.
Master vs Worker Node Installations
If you are installing Platfora for the very first time, you begin by installing, configuring and initializing
the Platfora master node. Once you have the master node up and running, you can then add
worker nodes as needed.
All nodes in a Platfora cluster (master and workers) must meet the minimum system requirements and
have the required prerequisite software installed. If you are using the RPM installer packages, you can
use the base installer package to install the required software on each Platfora node. If you are using the
TAR installer packages, you must manually install the required software on each Platfora node.
However, you only need to install the Platfora server software on the master node. Platfora copies the
server software from the master to the worker nodes during the worker node initialization process.
All nodes in a Platfora cluster also require you to configure the network environment so that all the
nodes can talk to each other, as well as to the Hadoop cluster nodes. If you are adding additional worker
nodes to an existing Platfora cluster, make sure to follow the instructions for installing dependencies
and configuring the environment. You can skip any tasks denoted as 'Master Only'; these tasks are only
required for first-time installations of the Platfora master node.
Preinstall Checklist
Here is a list of items and information you will need in order to install a new Platfora cluster with an
Amazon Elastic MapReduce (EMR) cloud deployment. Platfora must be able to connect to various
Amazon Web Services (AWS) during setup, so you will also need information about your AWS
account.
Platfora Checklist
This is a list of things you will need in order to install Platfora nodes.
What You Need
Description
Platfora License
Platfora Customer Support must issue you a license file.
Trial period licenses are available upon request for pilot
installations.
Platfora Software
A Platfora customer support representative can give you
the download link to the Platfora installation package
for your chosen EC2 operating system and Amazon EMR
Hadoop version. Platfora provides both rpm and tar
installer packages.
(MapR Only) MapR Client
Software
If you are using a MapR Hadoop cluster with Platfora, you
will need the MapR client software for the version of MapR
you are using. The MapR client software must be installed
on all Platfora nodes.
Amazon Web Services Checklist
This is a list of things you will need to create or obtain from your Amazon Web Services (AWS)
environment in order to install Platfora.
What You Need
Description
AWS VPC Subnet ID
Platfora must be able to launch an Amazon EMR cluster
in a public subnet in AWS. An Amazon AWS administrator
should provision an Amazon VPC with a public subnet.
You must ensure the Platfora server can communicate
with the subnet in the VPC. If the Platfora server is on the
same subnet as the Amazon EMR cluster, this happens
automatically.
After the AWS VPC is provisioned, you will need the subnet
identifier when configuring the Platfora configuration
properties.
IAM User
AWS Identity and Access Management (IAM) allows you to
create users, groups, and roles to control access to AWS
services and resources. Platfora recommends creating an
IAM User account specifically for use by Platfora.
This user must have (at a minimum) the permissions
specified in IAM User and IAM Roles for Platfora.
AWS Access Key
After you have created the Platfora IAM user, download the
AWS credentials for this user. You will need the Access Key
Id and Secret Access Key when you initialize Platfora for use
with Amazon EMR.
IAM Roles
Amazon requires all AWS users to use IAM Roles to launch
EMR clusters. Platfora recommends creating custom IAM
Roles specifically for use by Platfora.
Create a role for each of the following EMR cluster services:
• Amazon EMR service (service role). In Amazon AWS,
create a custom IAM Role and attach a security policy
that contains at a minimum the permissions specified
in IAM User and IAM Roles for Platfora. The custom role
you define corresponds to the default IAM Role Amazon
offers called EMR_DefaultRole.
• EC2 instances (instance profile) in the Amazon EMR
cluster. In Amazon AWS, create a custom IAM Role
and attach a security policy that contains at a
minimum the permissions specified in IAM User and
IAM Roles for Platfora. The custom role you define
corresponds to the default IAM Role Amazon offers
called EMR_EC2_DefaultRole.
You need the role names when configuring the Platfora
configuration properties.
EC2 Security Group
EC2 security groups allow you to specify firewall rules
for your Amazon Elastic Compute Cloud (EC2) server
instances. You should create a set of Security Group rules
to apply to your Platfora instances.
EC2 Instances
You will need to launch the EC2 instances on which to
install the Platfora master and worker servers.
S3 Bucket
You will need to provide the name of an Amazon S3 bucket
to use for Platfora.
High-Level Install Steps
This section lists the high-level steps involved in installing Platfora to work with an Amazon Elastic
MapReduce (EMR) Hadoop cluster. Note that there are different procedures if you are installing a new
Platfora cluster versus adding a worker node to an existing Platfora cluster.
New Platfora Installation
When installing Platfora for the first time, you begin by installing and configuring the Platfora master
node. After the master node is installed, initialized, and connected to the Hadoop services it needs,
you can use the master node to add worker nodes to the cluster.
These are the high-level steps for installing Platfora for the first time:
1. Configure your Amazon Web Services account for Platfora. See AWS Security Settings for Platfora.
2. Initialize the Amazon EC2 Instances for your Platfora nodes. See Platfora EC2 Instance
Requirements.
3. Install Platfora Software and Dependencies.
4. Configure Environment on Platfora Nodes.
5. Configure the Connection to Amazon S3.
6. Initialize the Platfora Master.
7. Configure the Connection to Amazon EMR.
8. Start Platfora.
9. Log in to the Platfora Application.
10. Install the License File.
11. (Optional) Load the Tutorial Data (as a quick way to test that everything works).
12. Add Worker Nodes.
Additional Worker Node Installation
Once you have a Platfora master node up and running, you can use it to initialize additional worker
nodes. Before you can initialize a worker node, however, you must make sure that it has the required
dependencies installed.
These are the high-level steps for adding a worker node to an existing Platfora cluster:
1. Initialize the Amazon EC2 Instance for the new worker node. See Platfora EC2 Instance
Requirements.
2. Install only the prerequisite software directly on the worker node instance.
• If using the RPM installer packages, Install Dependencies RPM Package.
• If using the TAR installer packages, you must manually Create the Platfora System User, Set OS
Kernel Parameters, and Install Dependent Software.
3. Configure Environment on Platfora Nodes.
4. Add Worker Node to Platfora Cluster.
Chapter 2: System Requirements (AWS Cloud)
This section describes the system requirements for customers who plan to use Amazon Web Services (AWS) as
their installation environment for Platfora, with Simple Storage Service (S3) and Elastic MapReduce (EMR)
as their Hadoop distributed data storage and processing services.
Topics:
• Supported Hadoop and Hive Versions
• Platfora EC2 Instance Requirements
• Amazon EMR Instance Requirements
• AWS Security Settings for Platfora
• Port Configuration Requirements
• Browser Requirements
Supported Hadoop and Hive Versions
This section lists the Hadoop distributions and versions that are compatible with the Platfora installation
packages. If using Hive as a data source for Platfora, the version of Hive must be compatible with the
version of Hadoop you are using.
Hadoop Distro            Version         Hive Version   M/R Version   Platfora Package
Cloudera 5               CDH5.0          0.12           YARN          cdh5
Cloudera 5               CDH5.1          0.12           YARN          cdh5
Cloudera 5               CDH5.2          0.13           YARN          cdh52
Cloudera 5               CDH5.3          0.13.1         YARN          cdh52
Cloudera 5               CDH5.4          1.1            YARN          cdh54
Hortonworks              HDP 2.1.x       0.13.0         YARN          hadoop_2_4_0_hive_0_13_0
Hortonworks              HDP 2.2.x       0.14.0         YARN          hadoop_2_6_0_hive_0_14_0
MapR                     MapR 4.0.1      0.12.0         YARN          mapr4
MapR                     MapR 4.0.2      0.13.0         YARN          mapr402
MapR                     MapR 4.1.0      0.13.0         YARN          mapr402
Pivotal Labs             PivotalHD 3.0   0.14.0         YARN          hadoop_2_6_0_hive_0_14_0
Amazon EMR (AMI 3.7.x)   Hadoop 2.4.0    0.13.1         YARN          hadoop_2_4_0_hive_0_13_0
Platfora EC2 Instance Requirements
Platfora recommends the following system requirements for Amazon EC2 instances that will serve as
Platfora server nodes. For multi-node installations, the master server instance and all worker server
instances must be the same configuration (same EC2 instance type, storage configuration, network
configuration, etc.).
Amazon Machine Images (AMIs):
  Amazon Linux AMI 2014.03.x or higher
  Red Hat Enterprise Linux 6.2 - 6.5
  Ubuntu Server 12.04.1 LTS or higher
EC2 Instance Type:
  Small to Medium Lens Sizes: c3.8xlarge
  Medium to Large Lens Sizes, 10+ Platfora nodes: r3.8xlarge
  Medium to Large Lens Sizes, 1-9 Platfora nodes: i2.8xlarge
Root Device Volume (EBS):
  Recommended Size = 1 TB
  Type = General Purpose (SSD)
Additional EBS Volumes:
  Optional. Additional EBS volumes can be attached to an EC2
  instance after launch time, and can be used to increase lens
  cache storage capacity if needed. EBS volumes are less expensive
  than Instance Store volumes, and the data is persistent between
  shutdowns.
Instance Store Volume (Ephemeral):
  Optional. You may choose to add instance store volumes for the
  Platfora lens cache instead of using EBS volumes. This costs more,
  but offers slightly faster performance. Instance store volumes can
  only be attached to an EC2 instance at launch time, and the data
  is not saved when the instance shuts down. The size of an instance
  store volume depends on the instance type:
  c3.8xlarge: 2 x 320 GB SSD (640 GB)
  r3.8xlarge: 2 x 320 GB SSD (640 GB)
  i2.8xlarge: 8 x 800 GB SSD (6400 GB)
Enhanced Networking:
  yes (requires use of VPC instead of EC2-Classic)
EBS Optimized Instance:
  yes (the 8xlarge instance types are EBS optimized instances by default)
Availability Zone:
  yes (use same zone for all nodes in the Platfora cluster)
Placement Group:
  yes (use same placement group for all nodes in the Platfora cluster)
IAM User:
  yes (create a dedicated Platfora IAM User in your AWS account)
Other Required Software:
  Java 1.7
  Python 2.7.8 through 2.7.9 (3.0 not supported)
  (master node only) PostgreSQL 9.2.1-1.28 (AMZN), 9.2.5, 9.2.7 or 9.3
  OpenSSL 1.0.1 or higher [1]
Required Unix Utilities:
  rsync, ssh, scp, cp, tar, tail, sysctl, ntp, wget
Amazon EMR Instance Requirements
Platfora launches an Elastic MapReduce (EMR) cluster when it builds a lens. This section describes the
recommended requirements for the EMR instances that are launched by Platfora.
Amazon EMR is Hadoop as a web service. Platfora uses the EMR Hadoop cluster to process its lens
builds. Since the EMR Hadoop cluster is only instantiated as needed, the source data does not reside
in the Hadoop Distributed File System (HDFS) of the EMR Hadoop cluster. The source data is instead
stored on Amazon S3. Data is copied from S3 to EMR for data processing, then the results are written
back to S3 when the job completes.
[1] OpenSSL is only required if you want to enable SSL for secure communications between Platfora
servers.
At the start of a lens build job, the raw source data is copied from S3 to the local HDFS file system on
the EMR nodes. The EMR instances must have enough local instance storage to support the input source
dataset and the temporary workspace for intermediate lens build job results. Also consider that the local
HDFS of the EMR cluster replicates the data to ensure redundancy and high availability during lens
build processing.
Platfora recommends the i2.4xlarge instance type for EMR data nodes and the m3.xlarge for the EMR
name node. The i2.4xlarge offers a great balance between total local disk space, CPU power, and per-node memory size.
Hadoop Version:
  2.4.0
AMI Version:
  3.7.0
EMR NameNode Instance Type:
  m3.xlarge
EMR DataNode Instance Type:
  i2.4xlarge
Number of EMR DataNodes:
  The number of nodes you will need to complete a lens build
  depends on the following factors:
  • The size of the raw dataset in S3 that is considered as input to
  the lens build.
  • The replication factor of HDFS. EMR clusters of 1-4 nodes have
  a replication factor of 1, 5-9 nodes have a replication factor of
  2, and 10 or more nodes have a replication factor of 3.
  • Temporary work space for intermediate lens build results: about 20-30% of total disk space.
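The sizing factors above can be combined into a rough estimate. The following Python sketch is illustrative only and is not a Platfora utility: the function names are hypothetical, the replication thresholds follow the list above (with exactly 10 nodes treated as replication factor 3), and the per-node storage figure is something you must supply for your chosen instance type.

```python
def hdfs_replication_factor(num_nodes):
    """Replication factor by cluster size, per the thresholds listed above."""
    if num_nodes <= 4:
        return 1
    if num_nodes <= 9:
        return 2
    return 3


def emr_datanodes_needed(dataset_gb, node_storage_gb, temp_fraction=0.3,
                         max_nodes=1000):
    """Return the smallest DataNode count whose usable disk holds the
    replicated input dataset.

    temp_fraction is the share of total disk reserved for intermediate
    lens build results (0.2 to 0.3 according to the text above).
    """
    if dataset_gb <= 0 or node_storage_gb <= 0:
        raise ValueError("sizes must be positive")
    if not 0 <= temp_fraction < 1:
        raise ValueError("temp_fraction must be in [0, 1)")
    for nodes in range(1, max_nodes + 1):
        usable_gb = nodes * node_storage_gb * (1 - temp_fraction)
        required_gb = dataset_gb * hdfs_replication_factor(nodes)
        if usable_gb >= required_gb:
            return nodes
    raise ValueError("dataset does not fit within max_nodes")


if __name__ == "__main__":
    # Assumed figures for illustration: a 10 TB input dataset and
    # 3200 GB of local storage per DataNode.
    print(emr_datanodes_needed(10000, 3200))
```

Because the replication factor increases at 5 and at 10 nodes, adding a node can temporarily raise the storage required, which is why the sketch searches upward from one node rather than dividing once.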
AWS Security Settings for Platfora
Amazon Web Services (AWS) has a number of security features that you can use to protect your AWS
account and cloud server instances. This section contains security setting recommendations if you plan
to use Amazon Elastic MapReduce (EMR) as the Hadoop implementation for your Platfora cluster.
Amazon AWS Virtual Private Cloud (VPC)
To use Amazon EMR for Hadoop data processing, Platfora must be able to launch an EMR cluster in a
public subnet. Administrators do this by provisioning an Amazon VPC with a public subnet, and then
specifying the subnet identifier in Platfora. Platfora must create the EMR cluster on an Internet-facing
subnet to allow the AWS EMR Provisioning Service to reach the EMR cluster.
Additionally, you must ensure the Platfora server can communicate with the Amazon EMR cluster. If
the Platfora server is on the same subnet as the Amazon EMR cluster, this happens automatically. If
the Platfora server and the EMR cluster are on different VPC subnets, then a route between the subnets
needs to be added to the Route table(s) so that communication can occur between the two subnets. Also,
if the VPC uses Access Control Lists (ACLs), then those ACLs must be modified to allow traffic from
Platfora to Hadoop.
The subnet identifier cannot exceed 255 characters in length.
After the Amazon VPC has been provisioned, specify its subnet identifier in the
platfora.emr.subnet.id Platfora configuration property.
For more information on setting up and using an Amazon VPC with Amazon EMR, see
http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-plan-vpc-subnet.html.
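As a rough pre-flight check for the points in this section, the Python sketch below validates the length of a subnet identifier and tests whether the Platfora server address falls inside the EMR subnet. It uses only the standard library; the function names, the sample identifier, and the sample addresses are hypothetical, and a result of "route needed" only tells you to review the VPC route tables and ACLs, it does not inspect them.

```python
import ipaddress

MAX_SUBNET_ID_LENGTH = 255  # limit stated in this section


def check_subnet_id(subnet_id):
    """Return the trimmed subnet identifier, or raise if it is unusable."""
    subnet_id = subnet_id.strip()
    if not subnet_id:
        raise ValueError("subnet identifier is empty")
    if len(subnet_id) > MAX_SUBNET_ID_LENGTH:
        raise ValueError("subnet identifier exceeds 255 characters")
    return subnet_id


def needs_route(platfora_ip, emr_subnet_cidr):
    """True when the Platfora server is outside the EMR subnet, in which
    case a route between the subnets (and matching ACL rules) is needed."""
    server = ipaddress.ip_address(platfora_ip)
    subnet = ipaddress.ip_network(emr_subnet_cidr)
    return server not in subnet


if __name__ == "__main__":
    # Hypothetical values for illustration only.
    print(check_subnet_id("subnet-0abc1234"))
    print(needs_route("10.0.2.15", "10.0.1.0/24"))
```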
IAM User and IAM Roles for Platfora
AWS Identity and Access Management (IAM) allows you to create users, groups, and roles to control
access to AWS services and resources. Platfora recommends creating an IAM User account and two
IAM Roles specifically for use by Platfora.
Platfora uses a combination of an IAM User and IAM Roles to communicate with Amazon AWS and to
create an EMR cluster. An Amazon AWS administrator needs to create a platfora IAM User and two
IAM Roles specifically for use by Platfora. Then a Platfora system administrator needs to enter some
information about that user and those roles in Platfora.
The Platfora server uses security credentials of the platfora IAM User to request Amazon AWS to
create an Amazon EMR cluster. Once that request is approved, the platfora IAM User then passes an
IAM Role to actually launch an EMR cluster, and then uses another IAM Role to start EC2 instances in
the EMR cluster. You must specify these roles in Platfora.
For more details on creating the user and roles, see Create IAM User for Platfora and Create IAM Roles
for Platfora.
Create IAM User for Platfora
The Amazon AWS administrator can create a new platfora user in the IAM Management Console
of your AWS account. After creating the user, download the AWS credentials for this user. The Platfora
system administrator will need the Access Key Id and Secret Access Key when initializing Platfora
for use with Amazon EMR.
The security policy for the platfora IAM User must have (at a minimum) the permissions listed in the
following sample policy:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": [
        "iam:ListRoles",
        "iam:PassRole",
        "elasticmapreduce:*",
        "s3:GetBucketLocation",
        "s3:ListAllMyBuckets"
      ],
      "Effect": "Allow",
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::Bucket_defined_in_core-site.xml",
        "arn:aws:s3:::Datasource_Bucket_1",
        "arn:aws:s3:::Datasource_Bucket_n"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:PutObject",
        "s3:Get*",
        "s3:DeleteObject"
      ],
      "Resource": [
        "arn:aws:s3:::Bucket_defined_in_core-site.xml/*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:Get*"
      ],
      "Resource": [
        "arn:aws:s3:::Datasource_Bucket_1/path/to/files/*",
        "arn:aws:s3:::Datasource_Bucket_n/*"
      ]
    }
  ]
}
Under Permissions for this user, attach a security policy that contains the permissions listed above.
These permissions allow the platfora IAM User to pass an IAM Role to launch the EMR cluster,
start an EMR cluster, and access S3 for source data during data ingest.
Create IAM Roles for Platfora
Amazon requires all AWS users to use IAM Roles to launch EMR clusters. One IAM Role is used to
start the Amazon EMR service, and the other role is used by the EC2 instances in the EMR cluster.
Amazon AWS offers some default IAM Roles for these services. However, Platfora recommends
creating custom IAM Roles specifically for use by Platfora instead.
The Amazon AWS administrator can create the IAM Roles in the IAM Management Console of your
AWS account. Create a role for each of the following EMR cluster services, and specify them in Platfora
using the specified configuration properties:
• Amazon EMR service (service role). In Amazon AWS, create an IAM Role and attach a security
policy that contains at a minimum the permissions specified below. Enter this IAM Role name in
the platfora.emr.service.role Platfora configuration property. The custom role you define
corresponds to the default IAM Role Amazon offers called EMR_DefaultRole.
• EC2 instances (instance profile) in the Amazon EMR cluster. In Amazon AWS, create an IAM
Role and attach a security policy that contains at a minimum the permissions specified below.
Enter this IAM Role name in the platfora.emr.jobflow.role Platfora configuration
property. The custom role you define corresponds to the default IAM Role Amazon offers called
EMR_EC2_DefaultRole.
The security policy for the Amazon EMR service (service role) IAM Role must have (at a minimum) the
permissions listed in the following sample policy:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": [
        "ec2:AuthorizeSecurityGroupIngress",
        "ec2:CancelSpotInstanceRequests",
        "ec2:CreateSecurityGroup",
        "ec2:CreateTags",
        "ec2:DeleteTags",
        "ec2:Describe*",
        "ec2:ModifyImageAttribute",
        "ec2:ModifyInstanceAttribute",
        "ec2:RequestSpotInstances",
        "ec2:RunInstances",
        "ec2:TerminateInstances"
      ],
      "Effect": "Allow",
      "Resource": "*"
    },
    {
      "Action": [
        "iam:PassRole",
        "iam:ListRolePolicies",
        "iam:GetRole",
        "iam:GetRolePolicy",
        "iam:ListInstanceProfiles"
      ],
      "Effect": "Allow",
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:Get*"
      ],
      "Resource": "arn:aws:s3:::Bucket_defined_in_core-site.xml/*"
    }
  ]
}
The security policy for the EC2 instances (instance profile) IAM Role must have (at a minimum) the
permissions listed in the following sample policy:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Resource": "*",
      "Action": [
        "ec2:Describe*",
        "elasticmapreduce:Describe*",
        "elasticmapreduce:ListBootstrapActions",
        "elasticmapreduce:ListClusters",
        "elasticmapreduce:ListInstanceGroups",
        "elasticmapreduce:ListInstances",
        "elasticmapreduce:ListSteps",
        "s3:ListAllMyBuckets"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::Bucket_defined_in_core-site.xml",
        "arn:aws:s3:::Datasource_Bucket_1",
        "arn:aws:s3:::Datasource_Bucket_n"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:PutObject",
        "s3:Get*",
        "s3:DeleteObject"
      ],
      "Resource": [
        "arn:aws:s3:::Bucket_defined_in_core-site.xml/*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:Get*",
        "s3:List*"
      ],
      "Resource": [
        "arn:aws:s3:::Datasource_Bucket_1/path/to/files/*",
        "arn:aws:s3:::Datasource_Bucket_n/*",
        "arn:aws:s3:::*elasticmapreduce/*"
      ]
    }
  ]
}
Verify that the permissions for and access to Amazon resources (especially S3) for
the EC2 instances role are the same as or greater than the permissions and access
assigned to the platfora IAM User. For example, if the platfora IAM User can
access an Amazon S3 bucket, but the EC2 instances role cannot, then lens builds
that rely on that S3 bucket will fail.
For more information on using IAM Roles for EMR, see
http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-iam-roles.html.
EC2 Security Group Settings
EC2 security groups allow you to specify firewall rules for your Amazon Elastic Compute Cloud
(EC2) server instances.
EC2 security group rules are independent of, and in addition to, the software firewalling provided by the
instance's operating system. Security groups must be defined before you create an EC2 instance.
The security group configured for the Platfora server instance must permit connections from your user
network to the Platfora web application server port (8001 by default). You also may want to open the
EMR Hadoop ResourceManager and JobHistory web ports so that you can monitor and troubleshoot
YARN jobs executed by Platfora.
An example security group configuration for a Platfora server instance would look something like the
following:
Port Configuration Requirements
You must open ports in the firewall of your Platfora nodes to allow client access and intra-cluster
communications. You also must open ports within your Hadoop cluster to allow access from Platfora.
This section lists the default ports required.
Ports to Open on Platfora Nodes
Your Platfora master node must allow HTTP connections from your user network. All nodes must allow
connections from the other Platfora nodes in a multi-node cluster.
On Amazon EC2 instances, you must configure the port firewall rules on the
Platfora server instances in addition to the EC2 Security Group Settings.
Platfora Service                          Default Port   Allow connections from…
Master Web Services Port (HTTP)           8001           External user network, Platfora worker servers, localhost
Secure Master Web Services Port (HTTPS)   8443           External user network, Platfora worker servers, localhost
Master Server Management Port             8002           Platfora worker servers, localhost
Worker Server Management Port             8002           Platfora master server, other Platfora worker servers, localhost
Master Data Port                          8003           Platfora worker servers, localhost
Worker Data Port                          8003           Platfora master server, other Platfora worker servers, localhost
Master PostgreSQL Database Port           5432           Platfora worker servers, localhost
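As a rough way to confirm that the default ports listed above are reachable from another node, the following sketch uses the /dev/tcp redirection built into bash. The helper names port_open and check_platfora_ports are illustrative and are not Platfora utilities. A port reports as closed if no service is listening on it yet, and not every port is in use on every node (for example, 5432 applies to the master only).

```shell
# Sketch: test whether a TCP port accepts connections from this node.
# port_open is an illustrative helper, not a Platfora utility.
port_open() {
  local host="$1" port="$2"
  # bash opens a TCP connection when redirecting to /dev/tcp/HOST/PORT;
  # timeout (GNU coreutils) caps the wait at 3 seconds.
  timeout 3 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null
}

# Walk the default Platfora ports and report the state of each one.
check_platfora_ports() {
  local host="$1" port
  for port in 8001 8443 8002 8003 5432; do
    if port_open "$host" "$port"; then
      echo "$host:$port open"
    else
      echo "$host:$port closed or filtered"
    fi
  done
}
```

For example, running check_platfora_ports with the host name of your master node from a worker node prints one line per port.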
Browser Requirements
Users can connect to the Platfora web application using the latest HTML5-compliant web browsers.
Platfora supports the latest releases of the following web browsers:
• Chrome (preferred browser)
• Firefox
• Safari
• Internet Explorer with the Compatibility View feature disabled (versions prior to IE 10 are not
supported)
Platfora supports these web browsers on desktop machines only.
Chapter 3: Install Platfora Software and Dependencies
This section describes how to provision a Platfora node with the required prerequisites and Platfora
software. If you are installing a new Platfora cluster, the master node needs everything (prerequisites and
Platfora software). Worker nodes only need the prerequisite software installed prior to initialization.
Most of the tasks in this section require root permissions. The example commands in the
documentation use sudo to denote the commands that require root permissions.
Topics:
• About the Platfora Installer Packages
• Install Using RPM Packages
• Install Using the TAR Package
About the Platfora Installer Packages
Platfora provides RPM or TAR installer packages that are specific to the Hadoop distribution you are
using. Platfora Customer Support can provide you with the link to download the installer packages for
your environment.
Make sure to download the correct Platfora installer packages for your Hadoop distribution and version.
See Supported Hadoop and Hive Versions if you are not sure which Platfora package to use for your
chosen Hadoop distribution.
RPM Packages
If you plan to install Platfora on a Linux operating system that supports the RPM package manager,
such as RedHat or CentOS, Platfora recommends using the RPM packages to install Platfora and its
required dependencies.
The platfora-base RPM package includes all the prerequisite software that Platfora needs, plus
automates the OS configurations needed by Platfora. This package should be installed on all Platfora
nodes (master and workers).
The platfora-server package includes the Platfora software only, which only needs to be installed
on the master node. The Platfora software is copied to the worker nodes during initialization or upgrade,
so you don't need to install it on the worker nodes ahead of time.
TAR Package
If you plan to install Platfora on a Linux operating system that does not support the RPM package
manager, such as Ubuntu, you have to use the TAR package. You may also use the TAR package if you
prefer to install and manage the dependent software in your environment yourself.
The TAR package contains the Platfora server software only, which only needs to be installed on the
master node.
The TAR package does not contain the prerequisite software that Platfora needs. You must manually
install the required prerequisite software and do the required OS configurations on all Platfora nodes
prior to installing and initializing Platfora.
Install Using RPM Packages
Follow the instructions in this section to install the Platfora dependencies and server software using the
RPM packages. Install the platfora-base RPM package on all Platfora nodes, and the platfora-server
RPM package on the master node only.
Install Dependencies RPM Package
The platfora-base RPM package contains all of the dependent software required by Platfora, and
also automates several OS configuration tasks. This package is installed on all Platfora nodes.
This task requires root permissions. Commands that begin with sudo denote root
commands.
The platfora-base RPM package does the following:
• Creates a /usr/local/platfora/base directory containing Platfora's third-party dependencies.
• Creates the platfora system user. The platfora user has no password set.
• Generates an SSH key for the platfora system user and adds the key to the user's
authorized_keys file.
• Adds the platfora system user to the sudoers file. This allows you to execute commands as root
while logged in as the platfora user.
• Ensures the OS kernel parameters are appropriate for Platfora and sets them if they are not.
• Creates a .bashrc file for the platfora system user.
The platfora-base package uses the following file naming convention, where version-build
is the version and build number of the base package only, and x86_64 is the supported system
architecture. The base and Platfora server packages use different versioning schemes.
platfora-base-version-build-x86_64.rpm
The base package is not updated every Platfora release. It is only updated when
the Platfora dependencies change, which is not as often. When upgrading Platfora,
check the release notes to see if upgrade of the base package is required.
1. Log on to the machine on which you are installing Platfora.
2. Using the download link provided by Platfora Customer Support, download the base package. For
example:
$ wget http://downloads.platfora.com/release/platfora-base-version-build-x86_64.rpm
3. Install the package using the yum package manager (requires root permission). For example:
$ sudo yum --nogpgcheck localinstall platfora-base-version-build-x86_64.rpm
Confirm that the /usr/local/platfora/base directory was created.
$ sudo ls -a /usr/local/platfora/base
Install Optional Security RPM Package
The platfora-security RPM package contains SSL-enabled PostgreSQL and the OpenSSL
package it depends on. This package is only needed if you plan to enable SSL communications between
the Platfora worker nodes and the Platfora metadata catalog database.
This task requires root permissions. Commands that begin with sudo denote root
commands.
The platfora-security package is installed after the platfora-base package. The
platfora-security RPM package does the following:
• Creates a /usr/local/platfora/security directory containing the SSL-enabled version of
PostgreSQL.
• Checks if OpenSSL version 1.0.1 or later is installed, and if not downloads and installs the openssl
package dependency from the OpenSSL public repo.
• Edits the .bashrc file for the platfora system user and changes the PATH environment variable
so that secure PostgreSQL is listed before the default PostgreSQL installed by the platfora-base
package.
The platfora-security package uses the following file naming convention, where
version-build is the version and build number of the base package only, and x86_64 is the
supported system architecture. The base, security and Platfora server packages use different versioning
schemes.
platfora-security-version-build-x86_64.rpm
The security package only needs to be upgraded when the base package is
upgraded, which is not every release. When upgrading Platfora, check the release
notes to see if upgrade of the base and security packages is required.
1. Log on to the machine on which you are installing Platfora.
2. Using the download link provided by Platfora Customer Support, download the security package. For
example:
$ wget http://downloads.platfora.com/release/platfora-security-version-build-x86_64.rpm
3. Install the package using the yum package manager (requires root permission). For example:
$ sudo yum --nogpgcheck localinstall platfora-security-version-build-x86_64.rpm
Confirm that the /usr/local/platfora/security directory was created.
$ sudo ls -a /usr/local/platfora/security
Install Platfora RPM Package (Master Only)
The platfora-server RPM package contains the Platfora server software. This package is installed
on the Platfora master node only.
The platfora-server RPM package creates a /usr/local/platfora/platfora-server
directory containing the Platfora software.
The platfora-server package uses the following file naming convention, where hadoop_distro
corresponds to the Hadoop distribution you are using, version-build is the version and build number
of the Platfora software, and x86_64 is the supported system architecture.
platfora-server-hadoop_distro-version-build-x86_64.rpm
Make sure to download the correct Platfora installer packages for your Hadoop
distribution and version. See Supported Hadoop and Hive Versions if you are not
sure which Platfora package to use for your chosen Hadoop distribution.
This task requires root permissions. Commands that begin with sudo denote root commands.
1. Log on to the machine on which you are installing the Platfora master.
2. Using the download link provided by Platfora Customer Support, download the Platfora server
package. For example:
$ wget http://downloads.platfora.com/release/platfora-server-hadoop_distro-version-build-x86_64.rpm
3. Install the package using the yum package manager (requires root permission). For example:
$ sudo yum --nogpgcheck localinstall platfora-server-hadoop_distro-version-build-x86_64.rpm
Confirm that the /usr/local/platfora/platfora-server directory was created.
$ sudo ls -a /usr/local/platfora/platfora-server
Install Using the TAR Package
Follow the instructions in this section to install the Platfora dependencies and server software using
the TAR packages. The TAR package contains the Platfora server software only. You must install all
dependencies yourself.
For the Platfora master node, do all the tasks described in this section.
For a Platfora worker node, do all the tasks described in this section except for:
• Install PostgreSQL
• Install Platfora TAR Package
• Install PDF Dependencies
Create the Platfora System User
Platfora requires a platfora system user account to own the Platfora installation and run the Platfora
server processes. This same system user must be created on all Platfora nodes.
This task requires root permissions. Commands that begin with sudo denote root commands.
(MapR Only) If you are using MapR as your Hadoop distribution with Platfora, make sure to follow the
additional steps for MapR. The platfora system user must exist on all Platfora nodes and all MapR
nodes. The UID/GID must also be the same on the MapR nodes as on Platfora nodes.
1. Create the platfora system user:
$ sudo useradd -s /bin/bash -m -d /home/platfora platfora
2. Set a password for the platfora user:
$ sudo passwd platfora
3. (MapR Only) Check the /etc/passwd file on your MapR CLDB node, and find the entry for the
platfora user. Note the user and group id numbers that are used.
For example:
platfora:x:1002:1002::/home/platfora:/bin/bash
4. (MapR Only) Check the /etc/passwd file on your Platfora master node. If the user and group id
numbers for the platfora user are different, update them so that they are the same as on the MapR
nodes.
For example:
$ sudo usermod -u 1002 platfora
$ sudo groupmod -g 1002 platfora
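The comparison in steps 3 and 4 can be sketched as a small script. This is only an illustration: passwd_ids is a hypothetical helper name, and the MapR line below is the example entry from step 3, which you would replace with the line copied from your own CLDB node.

```shell
# Sketch: extract "uid:gid" for a user from passwd-format input on stdin.
# passwd_ids is an illustrative helper name.
passwd_ids() {
  local user="$1"
  # Fields in /etc/passwd are name:password:UID:GID:comment:home:shell.
  awk -F: -v u="$user" '$1 == u { print $3 ":" $4 }'
}

# Compare the local entry with a line copied from the MapR CLDB node.
local_ids="$(passwd_ids platfora < /etc/passwd)"
mapr_ids="$(echo 'platfora:x:1002:1002::/home/platfora:/bin/bash' | passwd_ids platfora)"
if [ "$local_ids" = "$mapr_ids" ]; then
  echo "UID/GID match: $mapr_ids"
else
  echo "UID/GID differ: local='$local_ids' mapr='$mapr_ids'"
fi
```

If the values differ, apply the usermod and groupmod commands shown in step 4.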
Configure sudo for the platfora User
This is an optional task. Configuring sudo access for the platfora system user is a convenient way to
run commands as root while logged in as the platfora user.
If you do not configure sudo access for the platfora user, then you must change to the root user to
execute the system commands that require root permissions.
This documentation assumes that you have sudo access configured. If you do not, every time you see
sudo at the beginning of a command, it means you need to be root to run the command.
1. Edit the /etc/sudoers file using the visudo command.
$ sudo visudo
2. Add a line such as the following in this file:
# User privilege specification
platfora ALL=(ALL:ALL) ALL
3. Save your changes and exit the visudo editor.
Generate and Authorize an SSH Key
Generating and authorizing an SSH key for the platfora system user on the localhost is required
by the Platfora management utilities. This task should be performed on all Platfora nodes.
The Platfora management utilities require a trusted-host environment (the ability to SSH to a remote
system in the Platfora cluster without a password prompt). Even in single-node installations, you must
exchange SSH keys for the localhost.
1. Make sure that SELinux is disabled using either the sestatus or getenforce command.
$ sestatus
If SELinux is enabled, disable it using the recommended procedure for the node's operating system.
2. Make sure you are logged in to the Platfora server as the platfora system user.
$ su - platfora
3. Go to the ~/.ssh directory (create it if it does not exist):
$ mkdir -p ~/.ssh
$ cd ~/.ssh
4. Generate a public/private key pair that is NOT passphrase-protected.
Press the ENTER or RETURN key for each prompt:
$ ssh-keygen -C 'platfora key for node 0' -t rsa
Enter file in which to save the key (/home/platfora/.ssh/id_rsa): ENTER
Enter passphrase (empty for no passphrase): ENTER
Enter same passphrase again: ENTER
5. Append the public key to the ~/.ssh/authorized_keys file (this allows SSH access from the
current host to itself):
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
6. Make sure the home directory, .ssh directory, and the files it contains have the correct permissions:
$ chmod 700 $HOME && chmod 700 ~/.ssh && chmod 600 ~/.ssh/*
7. Test that you can SSH to localhost without a password prompt.
If prompted to add localhost to the list of known hosts, enter yes :
$ ssh localhost
The authenticity of host 'localhost (127.0.0.1)' can't be
established...
Are you sure you want to continue connecting (yes/no)? yes
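If the passwordless login in step 7 still prompts for a password, incorrect permissions are a common cause. The following sketch re-checks the modes set in step 6. The helper names check_mode and check_ssh_perms are illustrative, and stat -c is the GNU coreutils form of the command, so other platforms may need a different flag.

```shell
# Sketch: verify the permission modes set in step 6.
# check_mode is an illustrative helper; stat -c '%a' prints the octal mode.
check_mode() {
  local path="$1" want="$2" have
  have="$(stat -c '%a' "$path")" || return 1
  if [ "$have" = "$want" ]; then
    echo "ok   $path ($have)"
  else
    echo "FIX  $path is $have, expected $want"
    return 1
  fi
}

# Check the home directory, the .ssh directory, and every file inside it.
check_ssh_perms() {
  local home_dir="$1" f rc=0
  check_mode "$home_dir" 700 || rc=1
  check_mode "$home_dir/.ssh" 700 || rc=1
  for f in "$home_dir"/.ssh/*; do
    check_mode "$f" 600 || rc=1
  done
  return $rc
}
```

For example, run check_ssh_perms /home/platfora as the platfora user. To repeat the login test without any interactive prompt, ssh -o BatchMode=yes localhost true fails immediately instead of asking for a password.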
Set OS Kernel Parameters
This section has the Linux OS kernel settings required for Platfora. You must have root or sudo
permissions to change kernel parameter settings. Changing kernel settings requires a system reboot in
order for the changes to take effect.
Kernel ulimit Setting
Linux operating systems set limits on the number of open files and connections a process can have. For
some applications, such as Platfora and Hadoop, having a lot of open file handles during processing is
normal. Having the limit set too low can cause Platfora lens builds to fail.
There are two places file limits are set in the Linux operating system:
• A global limit for the entire system (set in /etc/sysctl.conf)
• A per-user process limit (set in /etc/security/limits.conf)
You must have root or sudo permissions to change OS ulimit settings.
You can check the global limit by running the command:
$ cat /proc/sys/fs/file-nr
This should return a set of three numbers like this:
704 0 294180
The first number is the number of allocated file handles. The second number is the number of
allocated but unused file handles. The third number is the maximum number of file handles for the
whole system. This limit should be at least 250000.
To increase the global limit, edit /etc/sysctl.conf (as root) and set the property:
fs.file-max = 294180
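The global limit check can be sketched as a helper that reads the three-number line and compares the third value with the 250000 minimum. The name file_max_ok is illustrative, and the optional argument is an assumption added for flexibility.

```shell
# Sketch: compare the system-wide file limit with the 250000 minimum.
# file_max_ok is an illustrative helper; it reads one file-nr style line
# on stdin and only uses the third number (the system-wide maximum).
file_max_ok() {
  local min="${1:-250000}" _first _second max
  read -r _first _second max || return 2
  if [ "$max" -ge "$min" ]; then
    echo "fs.file-max is $max (>= $min)"
  else
    echo "fs.file-max is $max, below $min; raise it in /etc/sysctl.conf"
    return 1
  fi
}
```

To check the running system, redirect the kernel file into the helper: file_max_ok < /proc/sys/fs/file-nr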
You can check the per-user process limit by running the command:
$ ulimit -n
This should return the file limit for the currently logged in user, for example:
1024
This limit should be at least 20000 for the platfora user (or whatever user runs the Platfora server).
To increase the limit, edit /etc/security/limits.conf (as root) and add the following lines (the *
increases the limit for all system users):
*    hard nofile 65536
*    soft nofile 65536
root hard nofile 65536
root soft nofile 65536
Reboot the server for the changes to take effect.
$ sudo reboot
Kernel Memory Overcommit Setting
Linux operating systems allow memory to be overcommitted, meaning the OS will allow an application
to reserve more memory than actually exists within the system. Allowing overcommit prevents the OS
from killing processes when a process requests more memory than is available.
If you are using a version 1.6 Java Runtime Environment (JRE), you must configure your OS to allow
memory overcommit. If you are using a version 1.7 JRE, overcommit is not necessary.
You must have root or sudo permissions to change kernel memory overcommit settings.
1. Check your version of Java.
$ java -version
If you are running a 1.6 version, proceed to the next steps. If you are running a 1.7 version, you do
not need to make any further changes.
2. Edit the /etc/sysctl.conf file.
$ sudo vi /etc/sysctl.conf
3. Set the following value:
vm.overcommit_memory=1
4. Save and close the file.
5. Reboot your system for the change to take effect:
$ sudo reboot
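The decision in step 1 can be sketched as follows. Both helper names are illustrative, and the parsing assumes the usual java -version output, in which the version appears in double quotes on a line containing the word "version".

```shell
# Sketch: decide whether the overcommit setting is needed for this JRE.
# java_major_minor is an illustrative helper: it reads `java -version`
# output on stdin and prints, for example, "1.6" for version "1.6.0_45".
java_major_minor() {
  awk -F'"' '/version/ { split($2, v, "."); print v[1] "." v[2]; exit }'
}

# needs_overcommit succeeds only for a 1.6 JRE, per this section.
needs_overcommit() {
  [ "$1" = "1.6" ]
}
```

Because java -version writes to standard error, pipe it as java -version 2>&1 | java_major_minor. After the reboot, cat /proc/sys/vm/overcommit_memory should print 1 if the setting took effect.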
Kernel Shared Memory Settings
Some default OS installations have the system shared memory values set too low for Platfora. You may
need to increase the shared memory settings if they are set too low.
You must have root or sudo permissions to set the system shared memory parameters.
1. In /etc/sysctl.conf, make sure the shared memory parameters have the minimum values or
higher.
If your settings are lower than these minimum values, you will need to change them. If they are
higher than the minimum, leave them as is.
kernel.shmmax=17179869184
kernel.shmall=4194304
2. If you made changes to /etc/sysctl.conf, reboot the server for the changes to take effect.
$ sudo reboot
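A minimal sketch for comparing the current shared memory values against the minimums above follows. The helper names num_lt and shm_ok are illustrative. The comparison is done in awk rather than with shell integer tests because some kernels report a kernel.shmmax value too large for shell arithmetic.

```shell
# Minimum values from this section.
MIN_SHMMAX=17179869184
MIN_SHMALL=4194304

# num_lt A B succeeds when A < B; awk compares the values as floating point.
num_lt() {
  awk -v a="$1" -v b="$2" 'BEGIN { exit !(a + 0 < b + 0) }'
}

# shm_ok is an illustrative helper: shm_ok CURRENT_SHMMAX CURRENT_SHMALL
shm_ok() {
  local shmmax="$1" shmall="$2" rc=0
  if num_lt "$shmmax" "$MIN_SHMMAX"; then
    echo "kernel.shmmax=$shmmax is below $MIN_SHMMAX"
    rc=1
  fi
  if num_lt "$shmall" "$MIN_SHMALL"; then
    echo "kernel.shmall=$shmall is below $MIN_SHMALL"
    rc=1
  fi
  return $rc
}
```

To check the running system, pass in the live values: shm_ok "$(sysctl -n kernel.shmmax)" "$(sysctl -n kernel.shmall)"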
Install Dependent Software
If using the TAR installation package to install Platfora, you must install all of the dependencies
yourself. This section provides instructions for manually installing the prerequisite software on a
Platfora node.
If you are provisioning a Platfora master node, you must install all dependencies.
If you are provisioning a Platfora worker node, you can skip the task for installing PostgreSQL.
PostgreSQL is only needed on the Platfora master node.
Confirm Linux OS Utilities
Platfora requires several standard Linux utilities to be installed on your system and in your environment
PATH. Check your system for the required utilities before installing Platfora.
Most Linux operating systems already have these utilities installed by default.
• rsync
• ssh
• scp
• tail
• tar
• cp
• wget
• ntp
• sysctl (/usr/sbin must be in your PATH)
To verify that a utility is installed and can be found in the PATH, you can check its location using the
which command. For example:
$ which rsync
$ which tar
$ which sysctl
If a utility is not installed, you will need to install it before installing Platfora. Check your OS
documentation for instructions on installing these utilities.
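Rather than running which once per utility, the whole list can be checked in one pass. In this sketch, missing_utils is an illustrative helper name. The list names ntp, which is a package rather than a command, so the sketch assumes ntpd is the binary to look for; adjust this for your distribution.

```shell
# Sketch: report which required commands are missing from PATH.
# missing_utils is an illustrative helper name.
missing_utils() {
  local cmd
  for cmd in "$@"; do
    command -v "$cmd" >/dev/null 2>&1 || echo "$cmd"
  done
}

# ntpd is an assumed binary name for the ntp entry in the list above.
missing="$(missing_utils rsync ssh scp tail tar cp wget ntpd sysctl)"
if [ -n "$missing" ]; then
  echo "Missing from PATH:" $missing
else
  echo "All required utilities found"
fi
```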
Install Java
The Platfora server requires a Java Runtime Environment (JRE) version 1.7 or higher. Platfora
recommends installing the full Java Development Kit (JDK) for access to the latest Java features and
diagnostic tools.
The instructions in this section are for installing version 1.7 of the Open Java Development Kit
(OpenJDK).
You must have root or sudo permissions to install Java.
1. Check if Java 1.7 or higher is already installed.
$ java -version
If java is not found, you will need to install it.
2. Install OpenJDK using your OS package manager.
On Ubuntu Systems:
$ sudo apt-get install openjdk-7-jdk
On RedHat/CentOS Systems:
$ su -c "yum install java-1.7.0-openjdk"
3. Set the JAVA_HOME environment variable in the platfora user’s profile file. For example, where
java_directory is the versioned directory where Java is installed:
$ echo "export JAVA_HOME=/usr/lib/jvm/java_directory/jre" >> /home/platfora/.bashrc
$ echo 'export PATH=$JAVA_HOME/bin:$PATH' >> /home/platfora/.bashrc
$ source /home/platfora/.bashrc
4. Make sure JAVA_HOME is set correctly for the platfora user:
$ su - platfora
$ echo $JAVA_HOME
Confirm Python Installation
The Platfora management utilities require Python version 2.6.8, 2.7.1, or 2.7.3 through 2.7.6. Python
version 3.0 is not supported. Most Linux operating systems already have Python installed by default, but
you need to make sure the version is compatible with Platfora.
To check if the correct version of Python is installed:
$ python -V
If Python is not installed (or you have an incompatible version of Python) you will need to install or
upgrade/downgrade it before installing Platfora. Check your OS documentation for instructions on
installing or upgrading/downgrading Python to version 2.6.8 or higher 2.x version.
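The supported versions listed above can be encoded as a small check. The helper name python_supported is illustrative, and the pattern covers only the versions named in this section (2.6.8, 2.7.1, and 2.7.3 through 2.7.6).

```shell
# Sketch: test a Python version string against the supported list.
# python_supported is an illustrative helper: python_supported 2.7.5
python_supported() {
  case "$1" in
    2.6.8|2.7.1|2.7.[3-6]) return 0 ;;
    *) return 1 ;;
  esac
}
```

Python 2 prints its version to standard error, so one way to feed the helper is python_supported "$(python -V 2>&1 | awk '{ print $2 }')".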
Install PostgreSQL (Master Only)
Platfora stores its metadata catalog in a PostgreSQL relational database. PostgreSQL version 9.2 or 9.3
must be installed (but not running) on the Platfora master server before you start Platfora for the first
time. Platfora worker nodes do not require a PostgreSQL installation.
You must have root or sudo permissions to install PostgreSQL.
Install PostgreSQL 9.2 on Ubuntu Systems
These instructions are for installing PostgreSQL 9.2 on Linux Ubuntu operating systems.
1. Install the dependent libraries:
$ sudo apt-get install libpq-dev
2. Add the PostgreSQL repository to your system configuration:
$ sudo add-apt-repository ppa:pitti/postgresql
$ sudo apt-get update
3. Install PostgreSQL 9.2:
$ sudo apt-get install postgresql-9.2
4. Stop the PostgreSQL service.
$ sudo service postgresql stop
5. Remove the PostgreSQL automatic start-up scripts:
$ sudo rm /etc/rc*/*postgresql
6. Create and change the ownership on the directory where PostgreSQL writes its lock files:
$ sudo mkdir /var/run/postgresql
$ sudo chown platfora /var/run/postgresql
7. Update the platfora user’s PATH environment variable to include the PostgreSQL executable
directory and /usr/sbin:
$ echo "export PATH=/usr/lib/postgresql/9.2/bin:/usr/sbin:$PATH" >> /home/platfora/.bashrc
$ source /home/platfora/.bashrc
Install PostgreSQL 9.2 on RedHat/CentOS Systems
These instructions are for installing PostgreSQL 9.2 on RedHat Enterprise Linux (RHEL) or CentOS
operating systems.
1. Download the appropriate PostgreSQL 9.2 YUM repository for your operating system.
Go to the PostgreSQL yum repository website, copy the URL link for the appropriate YUM
repository configuration, and download it using wget.
For example, to download the YUM repository configuration for PostgreSQL 9.2 on a 64-bit RHEL 6
operating system:
$ wget http://yum.pgrpms.org/9.2/redhat/rhel-6-x86_64/pgdg-redhat92-9.2-7.noarch.rpm
2. Add the PostgreSQL YUM repository to your system configuration:
$ sudo rpm -i pgdg-redhat92-9.2-7.noarch.rpm
3. Install PostgreSQL:
$ sudo yum install postgresql92 postgresql92-server
4. If it is enabled, disable the PostgreSQL automatic start-up.
Each operating system has its own technique for auto starting PostgreSQL. If your system uses
chkconfig to manage init scripts, you can remove PostgreSQL from the chkconfig control using
the following command:
$ sudo chkconfig --del postgresql
For some operating systems, the PostgreSQL start.conf file configures the auto-start of a
specific PostgreSQL cluster.
5. Create and change the ownership on the directory where PostgreSQL writes its lock files:
$ sudo mkdir /var/run/postgresql
$ sudo chown platfora /var/run/postgresql
6. Update the platfora user’s PATH environment variable to include the PostgreSQL executable
directory and /usr/sbin:
$ echo "export PATH=/usr/pgsql-9.2/bin:/usr/sbin:$PATH" >> /home/platfora/.bashrc
$ source /home/platfora/.bashrc
Confirm OpenSSL Installation
Platfora uses OpenSSL for secure communications between the Platfora worker servers and its metadata
catalog database. If you decide to enable SSL for the Platfora catalog, which is optional, you will need
OpenSSL version 1.0.1 or higher on your Platfora nodes.
As an optional security feature, you can choose to enable SSL communications between the Platfora
metadata catalog and the Platfora worker nodes. If you decide to enable this, you will need to have:
• SSL-enabled PostgreSQL. If using the RPM installation packages, Platfora provides an optional
platfora-security package that contains SSL-enabled PostgreSQL. If using the TAR
installation packages, the packages provided in the PostgreSQL public repo come with SSL enabled.
• OpenSSL. If using the RPM installation packages, Platfora provides an optional platfora-security
RPM package that pulls this dependency from the public repo. If using the TAR
installation packages, you will have to install the openssl package yourself.
Many Linux operating systems already have OpenSSL installed by default, but you need to make sure
the version is compatible with the version that PostgreSQL uses.
1. Check that OpenSSL version 1.0.1 or higher is installed.
$ openssl version
2. If OpenSSL is not installed (or you have an incompatible version) you will need to install or upgrade
it before enabling SSL for the Platfora catalog. Check your OS documentation for instructions on
installing or upgrading the openssl package.
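The version check in step 1 can be sketched as a comparison against 1.0.1. Both helper names are illustrative. The sketch assumes output of the form "OpenSSL 1.0.1e-fips 11 Feb 2013" and relies on sort -V, which is a GNU coreutils option.

```shell
# Sketch: compare the installed OpenSSL version with the 1.0.1 minimum.
# openssl_numeric_version is an illustrative helper: it reads
# `openssl version` output on stdin and prints the numeric part, so that
# "OpenSSL 1.0.1e-fips 11 Feb 2013" becomes "1.0.1".
openssl_numeric_version() {
  awk '{ print $2; exit }' | sed 's/[^0-9.].*$//'
}

# version_ge A B succeeds when version A is greater than or equal to B.
version_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}
```

For example, version_ge "$(openssl version | openssl_numeric_version)" 1.0.1 succeeds when the installed version meets the minimum.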
Install Platfora TAR Package (Master Only)
The TAR installation package contains the Platfora server software only. You only need to install this
package on the Platfora master node. You can skip this task if you are provisioning a Platfora worker
node.
The Platfora TAR package uses the following file naming convention, where version-build.num is
the version and build number of the Platfora software and hadoop_distro corresponds to the Hadoop
distribution you are using.
platfora-version-build.num-hadoop_distro.tgz
Make sure to download the correct Platfora installer package for your Hadoop
distribution and version. See Supported Hadoop and Hive Versions if you are not
sure which Platfora package to use for your chosen Hadoop distribution.
This task requires root permissions. Commands that begin with sudo denote root commands.
1. Log on to the machine on which you are installing the Platfora master.
2. Create a Platfora installation directory and ensure that it is owned by the platfora system user.
For example:
$ sudo mkdir /usr/local/platfora
$ sudo chown platfora /usr/local/platfora -R
3. Log in as the platfora user and go to the installation directory that you just created:
$ su - platfora
$ cd /usr/local/platfora
4. Download the 4.5.0 release package and checksum file using the URLs provided by Platfora
Customer Support.
Make sure to download the correct packages for your Hadoop distribution version. For example:
$ wget http://downloads.platfora.com/release/platfora-version-build.num-hadoop_distro.tgz
$ wget http://downloads.platfora.com/release/platfora-version-build.num-hadoop_distro.tgz.sha
5. After downloading the package and checksum file, make sure the package is valid using the shasum
command.
For example:
$ shasum -c platfora-version-build.num-hadoop_distro.tgz.sha
If the package is valid, you should see a message such as:
platfora-version-build.num-hadoop_distro.tgz: OK
6. Unpack the package within the installation directory.
For example:
$ tar -zxvf platfora-version-build.num-hadoop_distro.tgz
7. Create a symbolic link named platfora-server that points to the actual installation directory.
For example:
$ ln -s platfora-version-build.num-hadoop_distro platfora-server
8. Set the PLATFORA_HOME environment variable for the platfora system user.
$ echo "export PLATFORA_HOME=/usr/local/platfora/platfora-server" >> $HOME/.bashrc
9. Set the PATH environment variable for the platfora system user.
The PATH should include /usr/sbin, $PLATFORA_HOME/bin, and the PostgreSQL executable
directories. If your system has more than one version of PostgreSQL installed, make sure that 9.2 is
listed first in the PATH of the platfora user.
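For example (Ubuntu):
$ echo 'export PATH=/usr/lib/postgresql/9.2/bin:/usr/sbin:$PLATFORA_HOME/bin:$PATH' >> $HOME/.bashrc
$ source $HOME/.bashrc
For example (RedHat/CentOS):
$ echo 'export PATH=/usr/pgsql-9.2/bin:/usr/sbin:$PLATFORA_HOME/bin:$PATH' >> $HOME/.bashrc
$ source $HOME/.bashrc
The single quotes keep $PLATFORA_HOME and $PATH unexpanded until .bashrc is sourced, so the PATH
picks up the PLATFORA_HOME value set in the previous step.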
10. Make sure the JAVA_HOME environment variable is set (if it's not, see Install Java).
$ echo $JAVA_HOME
Install PDF Dependencies (Master Only)
One feature of Platfora is the ability to save a vizboard as a PDF document. In order for the Platfora
server to render PDFs, it needs PhantomJS and the OpenSans font to be installed on the Platfora master
node. You can skip this task if you are provisioning a Platfora worker node.
The PhantomJS installation relies on several fonts that ship with the Platfora software. For this reason,
the PhantomJS installation must be done after installing the Platfora software.
To install PhantomJS, do the following:
1. Log into the Platfora master node as the platfora user.
2. Install the PhantomJS dependencies.
On Ubuntu
$ sudo apt-get install fontconfig
$ sudo apt-get install libfreetype6
$ sudo apt-get install libfontconfig1
$ sudo apt-get install libstdc++6
On Redhat/CentOS
$ sudo yum install fontconfig
$ sudo yum install freetype
$ sudo yum install libfreetype.so.6
$ sudo yum install libfontconfig.so.1
$ sudo yum install libstdc++.so.6
3. Download the compiled PhantomJS executable.
$ sudo wget https://bitbucket.org/ariya/phantomjs/downloads/phantomjs-1.9.7-linux-x86_64.tar.bz2
4. Extract the files.
$ sudo tar xjf phantomjs-1.9.7-linux-x86_64.tar.bz2
5. Copy the PhantomJS binary to an accessible bin directory.
You should choose a bin directory that is common to most user environments.
$ sudo cp phantomjs-1.9.7-linux-x86_64/bin/phantomjs /usr/local/bin
6. Verify the phantomjs command is accessible to the platfora user.
$ which phantomjs
/usr/local/bin/phantomjs
If the command is not found, add the bin directory to the platfora user's environment:
$ echo "export PATH=/usr/local/bin:/usr/sbin:$PATH" >> /home/platfora/.bashrc
$ source /home/platfora/.bashrc
7. Install the OpenSans font for use by the PDF feature.
a) Make a directory to contain the typeface.
$ sudo mkdir -p /usr/share/fonts/truetype
b) Copy the font to the truetype directory.
$ sudo cp -r $PLATFORA_HOME/server/webapps/proton/dist/fonts/OpenSans /usr/share/fonts/truetype
c) Refresh the font cache.
$ sudo fc-cache -f
After installing, you'll want to verify the installation is running correctly. One easy way to do this is
using examples that came with the PhantomJS tarball:
$ phantomjs phantomjs-1.9.7-linux-x86_64/examples/hello.js
Hello, world!
You can also output a PDF to verify that the fonts were installed correctly. To output a PDF, choose
Share > Prepare PDF for Download from an open vizboard. In the example PDF output below, the left
side shows the output when the fonts are installed. The right side was rendered without the proper fonts
installed:
[Figure: example PDF output rendered with the fonts installed (left) and without them (right)]
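Before rendering a test PDF, you can also check from the command line whether the font is visible to fontconfig. This is a sketch only: has_font is a hypothetical helper, and the family name "Open Sans" is an assumption about how the copied font files identify themselves.

```shell
# Hypothetical helper: reads fc-list style output on stdin and succeeds
# if any listed font matches the family name given as $1 (case-insensitive).
has_font() {
  grep -qi -- "$1"
}

# Usage on the master node (assumes the fontconfig fc-list tool is installed):
#   fc-list | has_font "Open Sans" && echo "OpenSans is visible to fontconfig"
```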
Chapter 4: Configure Environment on Platfora Nodes
This section describes how to configure a Platfora node's operating system and network environment. You
should perform these tasks on every node in the Platfora cluster (master and workers) after you have installed the
Platfora dependencies and software, but before you initialize Platfora (or initialize a new worker node).
Topics:
• Install the MapR Client Software (MapR Only)
• Configure Network Environment
• Configure Passwordless SSH
• Synchronize the System Clocks
• Create Local Storage Directories
• Verify Environment Variables
Install the MapR Client Software (MapR Only)
If you are using MapR as your Hadoop distribution, you must install the MapR client software on all
Platfora nodes (master and workers). If you are not using MapR with Platfora, you can skip this task.
Platfora uses the MapR client to submit MapReduce jobs and file system commands directly to the
MapR cluster. For more information about the MapR client, see the MapR documentation.
If you use MapR 4.1, Platfora requires that you install the MapR 4.0.2 client
software.
You must have root or sudo permissions to install the MapR client.
Installing the MapR Client on Ubuntu
1. Add the following line to the /etc/apt/sources.list file:
deb http://package.mapr.com/releases/version/ubuntu/ mapr optional
Platfora supports MapR client versions: v3.0.x, v3.1.1, v4.0.x.
2. Update the repository and install the MapR client:
$ sudo apt-get update
$ sudo apt-get install mapr-client
3. Configure the MapR client, where clusterName is the name of your MapR cluster and cldbhost
is the hostname of the MapR CLDB node (7222 is the default CLDB port):
$ sudo /opt/mapr/server/configure.sh -N clusterName -c -C cldbhost:7222
4. Check if the /opt/mapr/hostname file exists on the node.
$ sudo ls /opt/mapr
If the file doesn't exist, create it:
$ sudo hostname -f | sudo tee /opt/mapr/hostname
5. Set the PLATFORA_HADOOP_LIB environment variable. For example (check the path for your
version of the MapR client):
$ echo "export PLATFORA_HADOOP_LIB=/opt/mapr/hadoop/lib" >> $HOME/.bashrc
Installing the MapR Client on RedHat/CentOS
1. Create the file /etc/yum.repos.d/maprtech.repo with the following contents:
[maprtech]
name=MapR Technologies
baseurl=http://package.mapr.com/releases/version/redhat/
enabled=1
gpgcheck=0
protect=1
Platfora supports MapR client versions: v3.0.x, v3.1.1, v4.0.x.
2. Install the MapR client. For example, on 64-bit operating systems:
$ sudo yum install mapr-client.x86_64
3. Configure the MapR client where clusterName is the name of your MapR cluster and
cldbhost:port is the hostname and port of the MapR CLDB node:
$ sudo /opt/mapr/server/configure.sh -N clusterName -c -C cldbhost:port
4. Check if the /opt/mapr/hostname file exists on the node.
$ sudo ls /opt/mapr
If the file doesn't exist, create it:
$ sudo hostname -f | sudo tee /opt/mapr/hostname
5. Set the PLATFORA_HADOOP_LIB environment variable. For example (check the path for your
version of the MapR client):
$ echo "export PLATFORA_HADOOP_LIB=/opt/mapr/hadoop/lib" >> $HOME/.bashrc
Configure Network Environment
A Platfora node needs to be able to connect to other Platfora nodes over the network, and to the Hadoop
services it needs. This section describes how to check the network connections between nodes, and make
sure the required ports are open to connections from a Platfora node.
Configure /etc/hosts File
The /etc/hosts file is a system file that identifies the hostnames and IP addresses of other machines
in the network so that they can find each other.
Platfora uses the /etc/hosts system file to find other nodes over the network. This means that each
node in a Platfora cluster must have the same entries. When you add, change, or remove a node, you
should update the /etc/hosts file on all Platfora nodes. For on-premise Hadoop installations, you
will also need to specify the address of your Hadoop NameNode.
A typical /etc/hosts file on a Platfora node might look something like this:
# Platfora IP    Hostname            Alias
127.0.0.1        localhost
10.202.123.45    ip-10-202-123-45    platfora-master
10.202.123.46    ip-10-202-123-46    platfora-worker-1
10.202.123.47    ip-10-202-123-47    platfora-worker-2
10.202.123.48    ip-10-202-123-48    platfora-worker-3
# Hadoop IP      Hostname            Alias
10.202.123.55    ip-10-202-123-55    hadoop-namenode
Platfora relies on the IP address associated with a node's network interface.
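Because every node must carry the same entries, it can help to check a hosts file for a given name before initializing Platfora. The following sketch uses a hypothetical helper, hosts_lookup, that prints the IP address recorded for a hostname or alias in a hosts-format file. It is an illustration and not a Platfora utility.

```shell
# Hypothetical helper: print the IP address that a hosts-format file ($1)
# records for a hostname or alias ($2). Fails if the name is not listed.
hosts_lookup() {
  awk -v name="$2" '
    { sub(/#.*/, "") }                      # drop comments
    NF >= 2 {
      for (i = 2; i <= NF; i++)
        if ($i == name) { print $1; found = 1; exit }
    }
    END { exit !found }
  ' "$1"
}

# Usage on a Platfora node (the aliases are the ones from the example file above):
#   for n in platfora-master platfora-worker-1; do
#     hosts_lookup /etc/hosts "$n" || echo "missing entry: $n"
#   done
```

Running the same loop on each node and comparing the output is one simple way to spot a node whose file was not updated.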
Host File Configuration on Amazon EC2 Instances
If you are installing your Platfora nodes on Amazon EC2 instances, the entries in the /etc/hosts file
should use the Amazon internal IP addresses and hostnames.
If you are using standard EC2 instances, the internal IP address is associated with the network interface
of the instance. When you stop or restart a standard EC2 instance, its internal IP address and hostname
change. This means that whenever you stop and restart an instance, you'll need to update the /etc/hosts
files to reflect the new internal IP addresses and hostnames that are assigned to the instance.
Platfora recommends using virtual private cloud (VPC) EC2 instances to run your Platfora nodes. EC2-VPC
instances maintain their assigned internal IP address and hostname through restarts.
Amazon Elastic MapReduce (EMR) Hadoop instances are launched on EC2-VPC instances by default.
You do not need to put Hadoop node entries in your Platfora node /etc/hosts files if you are using
EMR as your Hadoop distribution. You only need Hadoop entries if you are running your own managed
Hadoop cluster on EC2.
Verify Connectivity Between Platfora Nodes
The Platfora master and worker nodes must be able to accept incoming network connections from each
other on the ports designated for Platfora intra-node communications. This section explains how you
can test network connectivity between Platfora nodes and verify that the required ports are open.
In a multi-node Platfora cluster, all of the nodes must be able to connect to each other over the network.
Platfora services use certain ports for intra-node communications. Before you initialize Platfora, you
should decide what ports to use for these services, and make sure that they are open to connections from
other Platfora nodes.
The following table shows the default Platfora ports:
Platfora Service                          Default Port   Allow connections from…
Master Web Services Port (HTTP)           8001           External user network, Platfora worker servers, localhost
Secure Master Web Services Port (HTTPS)   8443           External user network, Platfora worker servers, localhost
Master Server Management Port             8002           Platfora worker servers, localhost
Worker Server Management Port             8002           Platfora master server, other Platfora worker servers, localhost
Master Data Port                          8003           Platfora worker servers, localhost
Worker Data Port                          8003           Platfora master server, other Platfora worker servers, localhost
Master PostgreSQL Database Port           5432           Platfora worker servers, localhost
One way to verify that these ports are open to connections from another Platfora node is to use the
telnet command.
For example, to test if port 8002 was open on a remote node with the IP address 10.10.10.9, you
could run the following command to test the connection:
$ telnet 10.10.10.9 8002
If a connection is not allowed, you will need to configure the firewall on your Platfora nodes to open the
appropriate ports and allow incoming connections from the other Platfora nodes.
On Amazon EC2 instances, you may need to configure the port firewall rules on the Platfora server
instances in addition to the EC2 Security Group Settings.
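If telnet is not installed, or you want to test several ports at once, the check can be scripted. This sketch assumes bash (for its /dev/tcp pseudo-device) and the coreutils timeout command are available. The function names port_open and check_ports are hypothetical.

```shell
# Hypothetical helper: succeeds if a TCP connection to host $1, port $2
# can be opened within 3 seconds. Relies on the /dev/tcp feature of bash.
port_open() {
  timeout 3 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}

# Hypothetical helper: report the state of each listed port on one host.
check_ports() {
  host="$1"; shift
  for port in "$@"; do
    if port_open "$host" "$port"; then
      echo "$host:$port open"
    else
      echo "$host:$port CLOSED"
    fi
  done
}

# Usage from a worker node, testing the default master ports listed above
# (platfora-master is the alias from the example /etc/hosts file):
#   check_ports platfora-master 8001 8002 8003 5432
```

A port is reported as closed both when a firewall blocks it and when no service is listening yet, so before Platfora is initialized a closed result does not by itself point to the firewall.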
Verify Connectivity to Hadoop Nodes
The Platfora master and worker nodes must be able to connect to certain Hadoop services. This topic
explains how you can test network connectivity between Platfora nodes and an on-premise Hadoop
installation to verify that the required ports are open.
The following table shows the default Hadoop service ports that Platfora needs open:
                                  Default Ports by Hadoop Distro
Hadoop Service                    CDH, HDP, Pivotal    Apache Hadoop   MapR    Allow connections from…
HDFS NameNode                     8020                 9000            N/A     Platfora master and worker servers
HDFS DataNodes                    50010                50010           N/A     Platfora master and worker servers
MapRFS CLDB                       N/A                  N/A             7222    Platfora master and worker servers
MapRFS DataNodes                  N/A                  N/A             5660    Platfora master and worker servers
MRv1 JobTracker                   8021                 9001            9001    Platfora master server
MRv1 JobTracker Web UI            50030                50030           50030   External user network (optional)
YARN ResourceManager              8032                 8032            8032    Platfora master server
YARN ResourceManager Web UI       8088                 8088            8088    External user network (optional)
YARN Job History Server           10020                10020           10020   Platfora master server
YARN Job History Server Web UI    19888                19888           19888   External user network (optional)
HiveServer Thrift Port            10000                10000           10000   Platfora master server
Hive Metastore DB Port [2]        9083, 9933 (HDP2)    N/A             9083    Platfora master server
[2] If connecting to Hive directly using JDBC.
To determine the interfaces and ports your particular Hadoop cluster is using for its file system and data
processing services, look at the core-site.xml and mapred-site.xml or yarn-site.xml
configuration files on your Hadoop NameNode (typically located in Hadoop's conf directory).
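Reading a single property out of one of these files can be scripted. The sketch below uses a hypothetical helper, get_hadoop_prop, and makes a simplifying assumption: each <name> and <value> element sits on one line, as in the examples in this guide. It is not a general XML parser.

```shell
# Hypothetical helper: print the <value> that follows <name>$2</name>
# in the Hadoop configuration file $1.
get_hadoop_prop() {
  awk -v want="$2" '
    /<name>/ {
      n = $0; gsub(/.*<name>|<\/name>.*/, "", n)
      hit = (n == want)
    }
    /<value>/ && hit {
      v = $0; gsub(/.*<value>|<\/value>.*/, "", v)
      print v; exit
    }
  ' "$1"
}

# Usage (the conf path is an assumption; use the directory for your distribution):
#   get_hadoop_prop /etc/hadoop/conf/core-site.xml fs.defaultFS
```

The host and port in the printed value are what you would then test from a Platfora node.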
One way to verify that these ports are open to connections from a Platfora node is to use the telnet
command.
For example, to test if port 8020 was open on the Hadoop NameNode with the IP address
10.10.10.9, you could run the following command to test the connection:
$ telnet 10.10.10.9 8020
If a connection is not allowed, you will need to configure the firewall on your Hadoop nodes to open the
appropriate ports and allow incoming connections from the Platfora nodes.
Also, make sure your Hadoop services are actually up and running.
Note for Amazon Users: If you are using Amazon Elastic MapReduce as your Hadoop cluster, the
EC2 Security Group Settings are sufficient to allow connectivity between Platfora instances on EC2 and
the EMR instances. If you are running your own Hadoop cluster on designated Amazon EC2 instances,
you may need to configure the port firewall rules on the Hadoop server instances in addition to the EC2
Security Group Settings.
Open Firewall Ports
If using firewall software in your network, you must open the required ports in the firewall software to
allow incoming connections from the other servers in your Platfora and Hadoop clusters. On Amazon
EC2 clusters, this is in addition to configuring your EC2 security group settings.
For a list of the default Platfora and Hadoop ports, see Port Configuration Requirements.
The process to open a firewall port depends on your server's operating system.
For RedHat / CentOS Servers:
Add a line to the /etc/sysconfig/iptables file to open the required port. For example (for port
8001):
-A INPUT -m state --state NEW -m tcp -p tcp --dport 8001 -j ACCEPT
Restart the firewall for your changes to take effect. For example:
$ sudo /etc/init.d/iptables restart
For Ubuntu Servers:
Open the required port in the firewall. For example (for port 8001):
$ sudo ufw allow 8001
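When several ports need to be opened on RedHat / CentOS servers, the iptables lines can be generated instead of typed by hand. This is a sketch: iptables_rules is a hypothetical helper that only prints rules in the format shown above. Review the output before adding it to /etc/sysconfig/iptables.

```shell
# Hypothetical helper: print one ACCEPT rule per TCP port given as an argument,
# in the same format as the example rule for port 8001.
iptables_rules() {
  for port in "$@"; do
    printf '%s\n' "-A INPUT -m state --state NEW -m tcp -p tcp --dport $port -j ACCEPT"
  done
}

# Usage with the default Platfora master ports:
#   iptables_rules 8001 8443 8002 8003 5432
```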
Configure Passwordless SSH
The Platfora management utilities require a trusted-host environment (the ability to SSH to a remote
system in the Platfora cluster without a password prompt). Even in single-node installations, you must
exchange SSH keys for the localhost.
Verify Local SSH Access
This task confirms that local SSH access was set up correctly during installation. If it wasn't, then you'll
have to configure it before initializing Platfora.
If you installed Platfora using the RPM packages, the package installer creates the platfora user,
generates an SSH key, and authorizes it for the localhost. If you installed using the TAR package, you
should have done these steps manually as part of installing the dependencies.
Test that you can SSH to localhost without a password prompt.
$ su - platfora
$ ssh localhost
If prompted to add localhost to the list of known hosts, enter yes:
The authenticity of host 'localhost (127.0.0.1)' can't be established...
Are you sure you want to continue connecting (yes/no)? yes
If you get a password prompt, see Generate and Authorize an SSH Key.
Exchange SSH Keys (Multi-Node Only)
In multi-node installations of Platfora, each Platfora node must have the public SSH key for itself and
all other nodes in the Platfora cluster in its list of authorized keys. This task applies only when adding a
worker node to a Platfora cluster.
You must exchange SSH keys between all Platfora nodes as the platfora user (master and all worker
nodes). This procedure should be executed from each new worker node that you add to the Platfora
cluster.
Before you can exchange an SSH key, you have to generate and authorize it. If you installed Platfora
using the RPM packages, the installer should have done this for you automatically. See Verify Local SSH
Access to confirm this was set up correctly.
If you installed using the TAR package, you should have done this prior to installing the Platfora
software. See Generate and Authorize an SSH Key.
1. Make sure you are logged in to the Platfora worker node as the platfora system user.
$ su - platfora
2. Copy the public key of the current worker node to the other Platfora nodes in the cluster (master and
other worker nodes).
If you have password authentication enabled between the Platfora hosts, you can add the public key
to each of the remote hosts as follows:
$ ssh-copy-id platfora@master_hostname
$ ssh-copy-id platfora@worker1_hostname
$ ssh-copy-id platfora@worker2_hostname
If password authentication is not enabled between hosts (such as on Amazon EC2 instances), log in to
each remote server in a separate terminal session and copy/paste the public key of the current worker
host into each remote server's authorized_keys file.
3. Copy the public keys from all other Platfora nodes to the current worker node (master and other
worker nodes).
One way to do this is to copy the entire contents of the master’s authorized_keys file (which
should have the keys of all nodes in the cluster) into the current node’s authorized_keys file.
If you have password authentication enabled between the Platfora hosts, you can copy the master's
authorized_keys file to the current node's authorized_keys file as follows:
$ scp platfora@master_hostname:/home/platfora/.ssh/authorized_keys ~/.ssh/authorized_keys
If password authentication is not enabled between hosts (such as on Amazon EC2 instances), log in
to the master server in a separate terminal session, copy the contents of its authorized_keys file,
and paste them into the ~/.ssh/authorized_keys file of the current node.
4. Test that you can ssh to the other Platfora nodes without a password prompt.
For example (if prompted to add the other host to the list of known hosts, enter yes):
$ ssh worker_hostname
The authenticity of host 'worker_hostname (110.123.4.5)' can't be
established...
Are you sure you want to continue connecting (yes/no)? yes
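Copying the master's authorized_keys file over the local one, as in step 3, replaces whatever keys the local file already held. If you would rather merge the two lists, the following sketch appends only the keys that are missing. The function name merge_authorized_keys and the temporary file path in the usage note are hypothetical.

```shell
# Hypothetical helper: merge the keys in file $1 into file $2 without
# duplicating entries. Existing lines in $2 keep their order.
merge_authorized_keys() {
  src="$1"; dest="$2"
  touch "$dest" || return 1
  tmp=$(mktemp) || return 1
  # awk keeps the first occurrence of each non-empty line.
  cat "$dest" "$src" | awk 'NF && !seen[$0]++' > "$tmp" && cat "$tmp" > "$dest"
  rm -f "$tmp"
  chmod 600 "$dest"
}

# Usage on the new worker node, as the platfora user:
#   scp platfora@master_hostname:/home/platfora/.ssh/authorized_keys /tmp/master_keys
#   merge_authorized_keys /tmp/master_keys ~/.ssh/authorized_keys
# A non-interactive check for step 4 (BatchMode makes ssh fail instead of prompting):
#   ssh -o BatchMode=yes worker_hostname true && echo "passwordless SSH works"
```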
Synchronize the System Clocks
Platfora uses NTP (Network Time Protocol) to synchronize the system clocks on the Platfora servers.
Network Time Protocol (NTP) ensures that the system clocks on your Platfora servers stay accurate.
Accurate system clocks are important for consistent timestamps in your Platfora log files and for
accurate scheduling of lens builds. See www.ntp.org for more information about using NTP.
Synchronizing the system clock involves installing the NTP software, making sure all Platfora servers
are using the same list of NTP time servers (as configured in the /etc/ntp.conf), and starting the
NTP daemon (ntpd).
1. Install the NTP software.
On RedHat/CentOS
$ sudo yum install ntp
On Ubuntu
$ sudo apt-get install ntp
2. Verify that NTP is configured to use the correct time server for your network in /etc/ntp.conf.
3. Start the NTP daemon service.
On RedHat/CentOS
$ sudo service ntpd start
On Ubuntu (the service is named ntp)
$ sudo service ntp start
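Once the daemon is running, you can confirm that it has selected a time source. The sketch below reads the peer table printed by ntpq -pn, where the currently selected system peer is marked with a leading asterisk. The function name ntp_system_peer is hypothetical.

```shell
# Hypothetical helper: read "ntpq -pn" output on stdin and print the address
# of the selected system peer (the line marked with a leading "*").
# Fails if no peer has been selected yet.
ntp_system_peer() {
  awk '/^\*/ { print substr($1, 2); found = 1; exit } END { exit !found }'
}

# Usage on a Platfora node (assumes the ntpq utility from the NTP package):
#   ntpq -pn | ntp_system_peer || echo "ntpd has not synchronized yet"
```

It can take several minutes after the daemon starts before a peer is selected.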
Create Local Storage Directories
The Platfora server needs local file system locations for its data files and configuration files. These must
be the same locations on all Platfora servers. When you add a worker node, the locations used on the
master are created on the worker node for you (provided the platfora system user has write access to
these locations). If not, you'll have to create these locations on the worker nodes ahead of time.
Create the Platfora Data Directory
Each Platfora server needs a location where it can store its data and work files. This location should
have enough disk space to accommodate the Platfora server log files, the metadata catalog database, and
materialized lens data. This directory must be writable by the platfora system user.
For example:
$ mkdir /data/platfora_data
Set the PLATFORA_DATA_DIR environment variable for the platfora system user, for example:
$ echo "export PLATFORA_DATA_DIR=/data/platfora_data" >> $HOME/.bashrc
Create the Platfora Configuration Directory
Each Platfora server needs a location where it can store its configuration files. This directory must be
writable by the platfora system user. For example:
$ mkdir /home/platfora/platfora_conf
Set the PLATFORA_CONF_DIR environment variable for the platfora system user, for example:
$ echo "export PLATFORA_CONF_DIR=/home/platfora/platfora_conf" >> $HOME/.bashrc
Source the ~/.bashrc file.
$ source $HOME/.bashrc
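If you repeat these steps on several nodes, or rerun them, the echo commands above append a duplicate export line to .bashrc each time. The following sketch shows one way to make the directory setup repeatable. The helper names ensure_dir and append_once are hypothetical.

```shell
# Hypothetical helper: create directory $1 if needed and confirm that the
# current user (expected to be platfora) can write to it.
ensure_dir() {
  mkdir -p "$1" && [ -d "$1" ] && [ -w "$1" ]
}

# Hypothetical helper: append line $2 to file $1 only if it is not already there.
append_once() {
  touch "$1" || return 1
  grep -qxF -- "$2" "$1" || printf '%s\n' "$2" >> "$1"
}

# Usage as the platfora user, with the example locations from this section:
#   ensure_dir /data/platfora_data || echo "cannot write to /data/platfora_data"
#   append_once "$HOME/.bashrc" 'export PLATFORA_DATA_DIR=/data/platfora_data'
#   ensure_dir /home/platfora/platfora_conf
#   append_once "$HOME/.bashrc" 'export PLATFORA_CONF_DIR=/home/platfora/platfora_conf'
```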
Verify Environment Variables
The Platfora installation uses several system environment variables which are typically set during the
installation process. These environment variables are used by the Platfora software to determine the
location of various directories and files.
Verify the platfora user environment by looking at the .bashrc file in the platfora user's home
directory.
Variable Name                      Description
PLATFORA_HOME                      Location of the Platfora installation files.
PLATFORA_DATA_DIR                  Location of the Platfora data directory containing the metadata
                                   catalog, lens data, and work files.
PLATFORA_CONF_DIR                  Local directory where Platfora stores its configuration files.
HADOOP_CONF_DIR                    Location of the local Hadoop configuration files that Platfora uses
                                   to connect to the various Hadoop services.
JAVA_HOME                          Location of the Java installation on your system.
PATH                               Locations of system executables.
LD_LIBRARY_PATH                    Locations of system library files. If you use data compression,
                                   make sure that LD_LIBRARY_PATH also contains the paths to the
                                   compression libraries you are using.
PLATFORA_HADOOP_LIB (MapR Only)    Location of the MapR client library files for Hadoop. Only needed
                                   if you are using MapR.
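Instead of reading .bashrc by eye, you can check the variables in a login shell. This is a minimal sketch. The function name check_env is hypothetical, and it expects plain variable names as arguments.

```shell
# Hypothetical helper: report whether each named environment variable is set
# to a non-empty value. Returns non-zero if any of them is missing.
check_env() {
  missing=0
  for name in "$@"; do
    eval "val=\${$name:-}"
    if [ -z "$val" ]; then
      echo "MISSING: $name"
      missing=1
    else
      echo "ok: $name=$val"
    fi
  done
  return "$missing"
}

# Usage as the platfora user (add PLATFORA_HADOOP_LIB if you are using MapR):
#   check_env PLATFORA_HOME PLATFORA_DATA_DIR PLATFORA_CONF_DIR \
#             HADOOP_CONF_DIR JAVA_HOME PATH
```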
Chapter 5: Initialize Platfora Master Node
This section describes how to set up a new Platfora cluster by initializing the master node. Once the Platfora
master node is up and running, you will have a fully functioning single-node Platfora cluster. You can then use
the master node to add the worker nodes into the cluster.
Topics:
• Connect Platfora to Your Hadoop Services
• Initialize the Platfora Master
• Configure Platfora for Amazon EMR
• Troubleshoot Setup Issues
Before you initialize the Platfora master, make sure you have done all the tasks described in Install Platfora
Software and Dependencies and Configure Environment on Platfora Nodes.
Connect Platfora to Your Hadoop Services
In order to initialize a new Platfora cluster, the master node must be able to connect to the Hadoop
services it needs. This section explains how to configure Platfora to connect to your Hadoop file system
and data processing services. This process is different depending on the type of Hadoop deployment you
have.
Understand How Platfora Connects to Hadoop
The Platfora servers use native Hadoop protocols to connect to Hadoop services using remote procedure
calls (RPC). Platfora is a client of Hadoop, and uses the standard Hadoop configuration files to connect
to its services.
Platfora uses the Hadoop configuration files to connect to Hadoop. These files must be in a local
directory on the Platfora master node. You can either obtain a copy of these files from your Hadoop
environment or recreate these files with the minimum required properties.
If you are using Amazon Elastic MapReduce (EMR) as your primary Hadoop distribution, you only
need the core-site.xml file to connect to Amazon S3. You then set Platfora configuration properties
to connect to EMR for data processing services.
Hadoop File: core-site.xml
Description: Platfora uses the core-site.xml configuration file to connect to the distributed file
system service for your Hadoop deployment. For example: HDFS for Cloudera and Hortonworks, MapRFS
for MapR, or S3 for Amazon EMR.
Connects to...
On-Premise Hadoop: HDFS NameNode
Amazon EMR: S3 Bucket

Hadoop File: hdfs-site.xml
Description: Platfora uses the hdfs-site.xml configuration file to configure how Platfora data is
stored in the remote Hadoop distributed file system (HDFS).
Connects to...
On-Premise Hadoop: HDFS NameNode
Amazon EMR: not used

Hadoop File: mapred-site.xml
Description: Platfora uses the mapred-site.xml configuration file to connect to the MapReduce
JobTracker service and to pass in runtime properties for lens build MapReduce jobs. This file is
required for Hadoop deployments using MapReduce v1 or YARN.
Connects to...
On-Premise Hadoop: MapReduce JobTracker
Amazon EMR: not used

Hadoop File: yarn-site.xml
Description: Platfora uses the yarn-site.xml configuration file to connect to the YARN
ResourceManager service and to pass in runtime properties for map and reduce task containers. This
file is required for Hadoop deployments using YARN.
Connects to...
On-Premise Hadoop: YARN ResourceManager
Amazon EMR: not used

Hadoop File: hive-site.xml
Description: You only need to configure a hive-site.xml file if you plan to use Hive as a data
source for Platfora.
Connects to...
Hive Metastore
Create Local Hadoop Configuration Directory
This section describes the minimum Hadoop configuration properties that Platfora needs as a client of
Hadoop's services.
The Platfora master node machine requires a local directory where it can find copies of the standard
Hadoop configuration files. When you initialize the Platfora master, you must provide the location of a
local Hadoop configuration directory.
1. Create a configuration directory location owned by the platfora system user.
$ su - platfora
$ mkdir /home/platfora/hadoop_conf
2. Set the HADOOP_CONF_DIR environment variable for the platfora system user, for example:
$ echo "export HADOOP_CONF_DIR=/home/platfora/hadoop_conf" >> $HOME/.bashrc
3. In this directory, copy or recreate the Hadoop configuration files needed for your Hadoop
distribution.
core-site.xml (Amazon S3)
Platfora uses the core-site.xml configuration file to connect to the distributed file system service
for your Hadoop deployment. For Amazon Elastic MapReduce deployments, the primary file system
service is Amazon S3.
To configure Platfora to connect to Amazon S3, you will need the Amazon Web Services (AWS)
security credentials for the IAM user that you created for Platfora. These can be found on the AWS
Management Console page under Users. See Creating an IAM User for Platfora.
You also need to provide the name of an Amazon S3 bucket to use for Platfora MapReduce output
(lenses). If you go to your AWS Management Console S3 Home Page, you can see the list of buckets
you have created for your account.
Amazon EMR
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>s3n://your_s3_bucket_name</value>
  </property>
  <property>
    <name>fs.s3.awsAccessKeyId</name>
    <value>platfora_iam_user_access_key_id</value>
  </property>
  <property>
    <name>fs.s3.awsSecretAccessKey</name>
    <value>platfora_iam_user_secret_access_key</value>
  </property>
  <property>
    <name>fs.s3n.awsAccessKeyId</name>
    <value>platfora_iam_user_access_key_id</value>
  </property>
  <property>
    <name>fs.s3n.awsSecretAccessKey</name>
    <value>platfora_iam_user_secret_access_key</value>
  </property>
</configuration>
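Because this file holds an AWS secret access key, you may want to create it so that only the platfora user can read it. The following sketch writes the same five properties from shell arguments and restricts the file permissions. The function name write_core_site and its argument order are hypothetical, and the values are inserted as-is without XML escaping.

```shell
# Hypothetical helper: write a core-site.xml for S3 into directory $1.
# $2 = S3 bucket name, $3 = IAM access key ID, $4 = IAM secret access key.
write_core_site() {
  conf_dir="$1"; bucket="$2"; key_id="$3"; secret="$4"
  mkdir -p "$conf_dir" || return 1
  (
    umask 077   # new files are readable by the owner only
    cat > "$conf_dir/core-site.xml" <<EOF
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property><name>fs.defaultFS</name><value>s3n://$bucket</value></property>
  <property><name>fs.s3.awsAccessKeyId</name><value>$key_id</value></property>
  <property><name>fs.s3.awsSecretAccessKey</name><value>$secret</value></property>
  <property><name>fs.s3n.awsAccessKeyId</name><value>$key_id</value></property>
  <property><name>fs.s3n.awsSecretAccessKey</name><value>$secret</value></property>
</configuration>
EOF
  ) || return 1
  chmod 600 "$conf_dir/core-site.xml"   # also tightens a pre-existing file
}

# Usage as the platfora user (all three values are placeholders):
#   write_core_site "$HADOOP_CONF_DIR" your_s3_bucket_name \
#       platfora_iam_user_access_key_id platfora_iam_user_secret_access_key
```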
hive-site.xml
Platfora uses a local hive-site.xml configuration file to connect to the Hive metastore service. You
only need a local hive-site.xml file if you plan to use Hive as a data source for Platfora.
There are two ways to configure how clients connect to the Hive metastore service in your Hadoop
environment. You can set up the HiveServer or HiveServer2 Thrift service, which allows various remote
clients to connect to the Hive metastore indirectly. This is called a remote metastore client configuration,
and is the recommended configuration by the Hadoop vendors. If you add a Hive datasource through the
Platfora web application, you can connect to the Hive Thrift service without the need for a Platfora copy
of the hive-site.xml file.
Optionally, you can connect directly to the Hive metastore database using a JDBC connection. This
requires that you have the login credentials for the Hive metastore database. This is called a local
metastore configuration because you are connecting directly to the metastore database rather than
through a service. If you want to connect to the Hive metastore database directly using JDBC, then you
must specify the connection information in a hive-site.xml.
Platfora can only connect to a single Hive instance via a remote or a local metastore configuration.
Remote Metastore (Thrift) Server Configuration
If you are using the Hive Thrift remote metastore, in addition to the URI, you may want to include the
following performance properties:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>hive.metastore.uris</name>
<value>thrift://hostname:hiveserver_thrift_port</value>
</property>
<property>
<name>hive.metastore.client.socket.timeout</name>
<value>120</value>
<description>
Number of seconds to wait for the client to retieve all of
the objects (tables and partitions) from Hive. For tables
with thousands of partitions, you may need to increase.
</description>
</property>
<property>
<name>hive.metastore.batch.retrieve.max</name>
<value>100</value>
<description>
Maximum number of objects to get from metastore in one batch.
A higher number means fewer round trips to the Hive metastore
server, but may also require more memory on the client side.
</description>
</property>
</configuration>
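If you maintain several environments, a file like the one above can be generated rather than edited by hand. The following is a minimal sketch in Python, assuming only the property names shown in the example; the helper name build_hive_site is hypothetical and is not part of Platfora or Hive, and the host and port remain placeholders.

```python
import xml.etree.ElementTree as ET


def build_hive_site(properties):
    """Return hive-site.xml text for a dict of property name -> value.

    Hypothetical helper for illustration; not a Platfora or Hive utility.
    """
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    # ET.indent is only available in Python 3.9 and later.
    if hasattr(ET, "indent"):
        ET.indent(root)
    return '<?xml version="1.0"?>\n' + ET.tostring(root, encoding="unicode")


# The host and port are placeholders; substitute your Hive Thrift service values.
remote_site = build_hive_site({
    "hive.metastore.uris": "thrift://hostname:hiveserver_thrift_port",
    "hive.metastore.client.socket.timeout": 120,
    "hive.metastore.batch.retrieve.max": 100,
})
print(remote_site)
```

The stylesheet processing instruction is omitted here for brevity; add it after the XML declaration if your tooling expects it.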
Local JDBC Configuration
Connecting Platfora directly to a local JDBC metastore requires additional configuration on the
Platfora servers. Each Platfora server requires a hive-site.xml file with the correct connection
information, as well as the appropriate JDBC driver installed. Here is an example hive-site.xml
to connect to a MySQL local metastore:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://hive_hostname:metastore_db_port/metastore</value>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>hive_username</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>password</value>
</property>
<property>
<name>hive.metastore.client.socket.timeout</name>
<value>120</value>
</property>
<property>
<name>hive.metastore.batch.retrieve.max</name>
<value>100</value>
</property>
</configuration>
The Platfora server also needs the MySQL JDBC driver installed in order to use this
configuration. You can place the JDBC driver .jar files in $PLATFORA_DATA_DIR/extlib to
install them (requires a restart of the Platfora server).
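To check which of the two styles a given hive-site.xml describes, you can inspect which connection property it sets. The sketch below is an illustration only: the function names are hypothetical, and the rule it applies (a non-empty hive.metastore.uris means remote, otherwise a JDBC connection URL means local) is a simplification based on the two examples above.

```python
import xml.etree.ElementTree as ET


def read_properties(xml_text):
    """Return a dict of property name -> value from hive-site.xml text."""
    root = ET.fromstring(xml_text)
    return {
        prop.findtext("name"): (prop.findtext("value") or "").strip()
        for prop in root.findall("property")
    }


def metastore_mode(xml_text):
    """Classify a hive-site.xml as 'remote', 'local', or 'unknown'."""
    props = read_properties(xml_text)
    if props.get("hive.metastore.uris"):
        return "remote"
    if props.get("javax.jdo.option.ConnectionURL"):
        return "local"
    return "unknown"


# Placeholder values taken from the example above.
sample = (
    "<configuration>"
    "<property><name>javax.jdo.option.ConnectionURL</name>"
    "<value>jdbc:mysql://hive_hostname:metastore_db_port/metastore</value>"
    "</property>"
    "</configuration>"
)
print(metastore_mode(sample))  # prints: local
```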
Initialize the Platfora Master
The Platfora setup utility (setup.py) verifies your operating system environment, configures the
Platfora software, and initializes the platfora metadata catalog database. You must run this setup
utility successfully before starting Platfora for the first time.
To run the setup utility:
$ $PLATFORA_HOME/setup.py
The setup.py utility prompts you for the following information about your environment.
Information Requested / Description

Platfora Configuration Directory
This is the local $PLATFORA_CONF_DIR directory location that
you created earlier where Platfora will store its configuration
files.
For example: /home/platfora/platfora_conf.

Hadoop Distribution and Version
This tells Platfora what distribution of Hadoop you are using.
Choose the number that corresponds to your Hadoop distribution
and version.

Platfora Web Services Port
This sets the port number for the Platfora web application server.
This is the port used for HTTP client connections to the Platfora
application.
Defaults to 8001.

Platfora Server Management Port
This sets the port number for TCP management connections
between Platfora servers. This is the port used for
server-to-server heartbeat and management utility connections.
Defaults to 8002.

Platfora Data Transfer Port
This sets the port number for TCP data connections between
Platfora servers. This is the port used for server-to-server data
transfers during query processing.
Defaults to 8003.

Hadoop Configuration File Directory
This is the local directory containing your Hadoop configuration
files that you created earlier.
For example: /home/platfora/hadoop_conf.

Platfora Data Directory
This is the local $PLATFORA_DATA_DIR directory location that
you created earlier where Platfora will store its metadata catalog
database, lens data, and log files.
For example: /data/platfora_data.

Platfora Catalog Database Port
This is the port of the PostgreSQL database server instance
where the Platfora metadata catalog database will be initialized.
Defaults to 5432.

Company Name
Used as an identifier for system diagnostic bundles. If you
encounter issues or problems, Platfora Support may request
that you generate a system diagnostic bundle. Enter the correct
company name to aid possible troubleshooting in the future.

Setup Platfora for Secure Connection
If yes, configures Platfora to use HTTPS for secure
communications between the Platfora master server and web
browser clients. If no, uses regular HTTP connections. See
Configure SSL for Client Connections.
Send Metrics to Platfora
If yes, configures the Platfora server to send anonymous system
diagnostic data to Platfora over an HTTPS connection. See About
System Diagnostic Data for details.

Remote DFS Data Directory
This is the remote data directory location in the configured
Hadoop file system. Setup will make sure that Platfora has write
permissions to this location before proceeding.

Maximum Java Virtual Machine (JVM) Memory
The maximum JVM size allocated to the Platfora server process.
On a dedicated machine, this should be about 80 percent of
total system memory. Setup will use this guideline to suggest a
default (M=megabytes, G=gigabytes).

Relative Platfora Disk Cache Size
When a lens is built in Hadoop, lens data files are copied over
to Platfora local disk in order to improve the performance of
lens queries. This sets the maximum amount of local disk space
on the Platfora server to use for storing lens data. The limit
is determined by taking a percentage of the total disk space
capacity on the Platfora server. The default is 0.8 or 80% of total
disk space.
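The two sizing guidelines above (about 80 percent of system memory for the JVM, and a relative disk cache size of 0.8) are simple proportions. The following sketch works through the arithmetic; the function names are hypothetical and the rounding is an assumption, so the default that setup.py actually suggests may differ.

```python
def suggest_jvm_size(total_memory_bytes, fraction=0.8):
    """Return a JVM size string (M or G suffix) for a fraction of total memory.

    Illustrative only; rounds down to whole megabytes or gigabytes.
    """
    megabytes = int(total_memory_bytes * fraction) // (1024 * 1024)
    if megabytes >= 1024:
        return f"{megabytes // 1024}G"
    return f"{megabytes}M"


def disk_cache_limit(total_disk_bytes, relative_size=0.8):
    """Return the byte limit for lens data given a relative cache size."""
    return int(total_disk_bytes * relative_size)


# A dedicated machine with 64 GiB of memory and a 1 TiB data disk.
print(suggest_jvm_size(64 * 1024**3))   # prints: 51G
print(disk_cache_limit(1024**4))
```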
After running the setup.py command, run the hadoop-check command to check your Hadoop
settings. The hadoop-check utility verifies that Hadoop is correctly configured for use with Platfora.
It also collects system information from the Hadoop cluster environment. If you are using MapR, you
should run this utility as it collects important information, but be aware that it can report misleading
configuration information.
Configure SSL for Client Connections
When running setup, you have the option to configure SSL connections between the Platfora master
server and browser clients. If you do not have your own certificate, you can have the setup utility
generate a self-signed certificate for you.
Data sent over an HTTPS connection will be encrypted regardless of whether the server certificate
is CA-signed or self-signed. However, most web browsers will only trust certificates signed by a
trusted certificate authority (CA) and will display security warnings when presented with self-signed
certificates.
For production installations, you may want to replace the self-signed certificate with one signed by a
trusted CA.
If you choose to configure SSL during setup, the setup utility will ask the following additional questions
to configure SSL:
Information Requested at Setup / Description

Platfora Secure Connection
If yes, configures Platfora to use HTTPS for secure communications
between the Platfora master server and web browser clients. If no,
uses regular HTTP connections.
If you choose to enable secure communications, you can use your
own server certificate (if you have one), or you can have Platfora
generate one for you.

TCP Port for HTTPS Connections
Enter the TCP port the Platfora master server should use when web
browsers connect to the Platfora web application using HTTPS. Note
that when telemetry is enabled, the Platfora server uses this port
to securely send telemetry data to Platfora using HTTPS. Default is
8443.

KeyStore Location, Password and Type
The keystore contains the master server’s private key, and its
certificate with the corresponding public key. The keystore is used
to provide credentials.
Location: Accept the default location if you want Platfora to generate
a keystore for you, otherwise enter the path to your own keystore.
Password: If using your own keystore, enter the password to access
that keystore, otherwise set a password for the keystore that
Platfora will create.
Type: If using your own keystore, enter the type of keystore format
you are using. Allowed types are JKS (Java Keystore) or PKCS12
(Public Key Cryptography Standards #12 Keystore). If you plan to
have Platfora generate the keystore for you, use JKS (the default).

Generate Self-Signed SSL Certificate
Enter Y if you do not have a server certificate and want Platfora to
generate one for you. Enter N if you already have a certificate that
you want to use.
TrustStore Location, Password and Type
A truststore contains certificates to trust. The truststore is used to
verify certificate authority (CA) credentials.
Location: By default, Platfora uses the default truststore that
comes with your Java installation. This default truststore is already
configured to trust all of the recognized certificate authorities
(Verisign, Symantec, Thawte, etc.). If you have your own
truststore, you can enter the path to that truststore instead.
Password: If using the default truststore that ships with Java, the
default password is changeit. If you have changed this password or
are using your own truststore, enter the correct password for the
truststore.
Type: If using your own truststore, enter the type of truststore
format you are using. Allowed types are JKS or PKCS12. If you use
the default truststore that comes with your Java installation, use
JKS (the default).
When SSL is enabled, ensure that the keystore and truststore passwords entered
in Platfora always match the passwords configured at the keystore and truststore
locations. Changing the passwords in Platfora does not change the passwords at
the keystore or truststore locations. If the passwords entered in Platfora do not
match the passwords at the keystore and truststore locations, the Platfora server
fails to start.
Configure SSL for Catalog Connections
For added security, you can encrypt the communications between the Platfora worker nodes and the
metadata catalog on the Platfora master node.
If you decide to enable SSL for the Platfora catalog, you must have SSL-enabled PostgreSQL installed
on your Platfora master node, and OpenSSL version 1.0.1 or higher installed on all Platfora nodes
(master and worker nodes).
If you are enabling this optional security feature at installation time, you would do so after running
setup.py but before starting the Platfora servers.
1. On the Platfora master node, log in as the platfora system user.
$ su - platfora
2. Make sure the Platfora servers are not running.
$ platfora-services stop
3. Run the platfora-catalog ssl utility to configure secure connections to the catalog. For
example, if using a self-signed server certificate and private key:
$ platfora-catalog ssl --enable --self
About System Diagnostic Data
During setup, you have the option to enable collection of system diagnostic data. This collects
anonymous statistics about product usage and performance, and will help Platfora improve the product
in future releases. Sending system diagnostic data to Platfora is optional. A system administrator can
choose to enable or disable diagnostic data collection at any time by running setup.py.
What Data is Collected?
Platfora respects your privacy and security. We do not collect any business data,
only diagnostic system metrics. The Platfora server sends system metrics data
over the configured SSL port.
System diagnostic data is completely anonymous. Platfora does not collect any names (data source,
dataset, lens, vizboard, or user names), permissions used, or any personally identifiable information.
Here is a list of some of the diagnostic data that Platfora does collect:
• Actions taken in the UI
• Dataset size
• Lens size (estimated and actual)
• Build duration (how long did a lens build take)
• Scheduled lens build times
• Client browser type
• Screen resolution
• Page load times
• REST API call duration (how long did an API call take to return)
• Help files viewed
• User metrics (number of users and groups in Platfora)
• Number of logins
• Server startup time (how long did it take for the Platfora server to start)
• Permissions performance metrics (how many times the system used cached permissions versus
having to look up permissions in the catalog)
To see a sample of the data collected, you can look at the system diagnostic logs in
$PLATFORA_DATA_DIR/telemetry.
How to Configure System Diagnostic Collection
If you decide to enable the collection of system diagnostic data, Platfora will log usage information and
send the log files to Platfora Customer Support every 15 minutes by default. If an attempt to send the
data fails, Platfora will only keep the logs for an hour by default to conserve disk space. The following
server configuration properties can be used to configure the system diagnostics feature. Changing any of
these properties requires a system restart.
• platfora.support.identifier - The name used to identify a system diagnostic bundle sent
to Platfora support.
• platfora.telemetry.aggregate.frequency - The number of attempts per send interval to
aggregate and send the logs to Platfora's telemetry server. Default is 1.
• platfora.telemetry.enabled - Whether or not to send collected diagnostic data to Platfora.
• platfora.telemetry.file.lifespan - The number of seconds to keep historical diagnostic
data between send attempts. The default is 3600 (one hour).
• platfora.telemetry.send.frequency - The number of seconds between send intervals. The
default is 900 (15 minutes).
• platfora.telemetry.url - The URL of Platfora's telemetry server where the diagnostic data is
sent. Default is https://telemetry.platfora.com.
• platfora.telemetry.logparser.enabled - Turns on the ability to parse the log files using
the Platfora application.
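The relationship between the send interval and the file lifespan determines how many send intervals pass before unsent logs are discarded. The sketch below is only arithmetic on the documented defaults; the function name is hypothetical, and the assumption that expiry is counted in whole send intervals is ours, not a statement about Platfora internals.

```python
def send_intervals_before_expiry(file_lifespan=3600, send_frequency=900):
    """Return how many whole send intervals fit within the file lifespan.

    Both arguments are in seconds and default to the documented values of
    platfora.telemetry.file.lifespan and platfora.telemetry.send.frequency.
    """
    return file_lifespan // send_frequency


print(send_intervals_before_expiry())  # prints: 4
```

With the defaults, logs that fail to send are kept for about four send intervals before they are removed.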
Configure Platfora for Amazon EMR
If you are using Amazon Elastic MapReduce (EMR) as your Hadoop deployment, you must set the
Platfora server configuration properties to enable access to EMR. When EMR is enabled, Platfora
initializes an EMR cluster for each lens build and (by default) terminates the cluster instances after the
lens build has completed.
To set these configuration properties, you must start the Platfora server and set them using the
platfora-config utility. For example:
$ platfora-services start
$ platfora-config set --key platfora.emr.instance.count --value 6
$ platfora-config set --key platfora.dfs.intermediatedir --value /platfora-job
$ platfora-config set --key platfora.emr.log.uri --value s3n://platfora-logs
$ platfora-config set --key platfora.emr.jobflow.role --value role_name
$ platfora-config set --key platfora.emr.service.role --value role_name
$ platfora-config set --key platfora.emr.subnet.id --value VPC_subnet_identifier
$ platfora-services stop
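If you keep your EMR settings in one place, the platfora-config commands can be generated from a dictionary instead of typed individually. This is a minimal sketch, assuming only the "platfora-config set --key ... --value ..." form shown in this section; the helper name config_commands is hypothetical, and the values are the example values from this guide.

```python
import shlex


def config_commands(properties):
    """Return one 'platfora-config set' command line per property."""
    return [
        shlex.join(["platfora-config", "set", "--key", key, "--value", str(value)])
        for key, value in properties.items()
    ]


# Example values only; role names and the subnet ID are site-specific.
emr_properties = {
    "platfora.emr.instance.count": 6,
    "platfora.dfs.intermediatedir": "/platfora-job",
    "platfora.emr.log.uri": "s3n://platfora-logs",
}

for command in config_commands(emr_properties):
    print(command)
```

The sketch only prints the commands; review them before running them on the Platfora master while the server is started.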
Amazon Elastic MapReduce (Amazon EMR) defines a default configuration for
Hadoop, depending on the Amazon Machine Image (AMI) that you specify when
you launch the cluster. Platfora recommends using the AMI default Hadoop
configurations to avoid lens build failures due to misconfiguration.
Required or Recommended EMR Properties
platfora.emr.instance.count
Sets the number of instances to use for the EMR
cluster. Required to enable the use of EMR. This
must be a number greater than 0; otherwise, lens
builds will not use EMR and will fail.
platfora.dfs.intermediatedir
Specifies an absolute directory on the local HDFS
of the EMR job flow, for example /platfora-job.
When set, Platfora will write intermediate
results of a lens build job to the local HDFS file
system rather than to S3 in order to reduce calls to
the remote file system and improve performance.
Only the final results of a lens build are copied
over to S3. If not set, lens build jobs will write all
output to the final S3 data directory.
platfora.emr.instancetype.master
The EC2 instance type to use for the EMR
JobTracker instance. Defaults to m1.large which
is insufficient for Platfora. Review Amazon EMR
Instance Requirements for appropriate sizing.
platfora.emr.instancetype.slave
The EC2 instance type to use for the EMR
TaskTracker instances. Defaults to m1.large
which is insufficient for Platfora. Review Amazon
EMR Instance Requirements for appropriate
sizing.
platfora.emr.log.uri
The S3 Native URI where EMR periodically
synchronizes its job log files. If not set, logs are
lost when the EMR instance terminates. Example
URI: s3n://platfora-logs.
platfora.emr.jobflow.role
The Amazon IAM Role to allow the EMR cluster
to access other AWS services, such as EC2, on
behalf of the IAM User. The role name you enter
here should have the proper permissions to allow
Platfora to start an EMR cluster by accessing the
required services, including starting some EC2
instances.
platfora.emr.service.role
The Amazon IAM Role to allow EC2 instances
in an EMR cluster to access other AWS services,
such as S3. The role name you enter here should
have the proper permissions an EMR cluster needs
to access processes required during lens builds,
including access to S3.
platfora.emr.subnet.id
Specifies the Amazon VPC subnet identifier in
which to launch the Amazon EMR cluster. You
must ensure the Platfora server can communicate
with the subnet in the VPC. If the Platfora server
is on the same subnet as the Amazon EMR
cluster, this happens automatically. The value
must not exceed 255 characters in length. For
more information on setting up and using an
Amazon VPC with Amazon EMR, see http://
docs.aws.amazon.com/ElasticMapReduce/latest/
DeveloperGuide/emr-plan-vpc-subnet.html.
platfora.reduce.tasks
The maximum number of Hadoop reduce tasks
Platfora uses for MapReduce jobs. A value of 0
causes Platfora to use the maximum number of
reduce slots in the Hadoop cluster.
EMR installations should always set this to a nonzero value. To calculate the proper value for an
EMR cluster, use the formula:
(#_core_nodes + #_task_nodes) *
(total reduce slots per node)
For example, if your installation uses
2 cc2.8xlarge nodes (core and task)
and each node has 6 reduce slots, set
platfora.reduce.tasks to 12. Refer to the
Amazon documentation to find out the total reduce
slots per node.
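The platfora.reduce.tasks formula above can be written as a small function. The function name is hypothetical, and the split of the two example nodes into one core node and one task node is an assumption made for illustration.

```python
def reduce_tasks(core_nodes, task_nodes, reduce_slots_per_node):
    """Return (#_core_nodes + #_task_nodes) * (total reduce slots per node)."""
    return (core_nodes + task_nodes) * reduce_slots_per_node


# Two cc2.8xlarge nodes in total, each with 6 reduce slots.
print(reduce_tasks(core_nodes=1, task_nodes=1, reduce_slots_per_node=6))  # prints: 12
```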
Optional EMR Properties
platfora.emr.ami.version
The EMR AMI to use to initialize EC2 instances
with Hadoop on them. Defaults to latest, meaning
use the latest AMI available for the configured
Hadoop version.
platfora.emr.hadoop.version
The Hadoop version to use. Defaults to 2.4.0.
platfora.emr.cluster.name
Specifies the name of the Amazon EMR cluster
that Platfora creates when running lens builds.
The name you enter here appears in the Amazon
AWS Cluster List. When no name is specified
here, Platfora generates a name. The value must
not exceed 255 characters in length.
platfora.emr.termination.protected
False means EMR instances are terminated after
each Platfora lens build. True means instances are
not terminated (requires you to manually terminate
them in the AWS console). Setting to true can be
helpful when debugging lens build issues. Defaults
to false.
platfora.emr.visibility.allusers
If set to true, all users sharing the Amazon Web
Services (AWS) account can see the EMR job
flows generated by Platfora in the AWS console.
Defaults to false.
platfora.emr.jobflow.reuse
If set to true, EMR job flows will not terminate
after a lens build completes, but will stay
running until the next lens build or until
platfora.emr.shutdown.timeout
is reached. New lens builds will look for a
compatible EMR job flow before launching a new
one. Defaults to true.
platfora.emr.shutdown.timeout
When platfora.emr.jobflow.reuse=true,
if a Platfora-initiated job flow is inactive for this
number of seconds, it will shut down before the
next EMR payment increment hits. Defaults to
1800 seconds.
platfora.emr.shutdown.runway
When platfora.emr.jobflow.reuse=true,
configures the amount of time in seconds to
begin initiating a shutdown before an EMR hour
payment increment event. Defaults to 300 seconds.
platfora.emr.status.timeout
When platfora.emr.jobflow.reuse=true,
sets the time in seconds to wait for a response
from the EMR job flow before shutting it down.
Defaults to 1800 seconds.
platfora.emr.accesskey
Your AWS account access key ID. If not set
in platfora.properties, uses the same
credentials as configured for S3 in core-site.xml.
platfora.emr.secretkey
Your AWS account secret key. If not set in
platfora.properties, uses the same
credentials as configured for S3 in core-site.xml.
platfora.emr.ec2key.name
The name of the EC2 key pair used to SSH into
EMR EC2 instances. If not set, SSH access is
disabled.
platfora.emr.endpoint
The Amazon EMR endpoint to use. See Regions
and Endpoints in the AWS documentation for
possible values. If not specified, uses the endpoint
of the default region configured for your AWS
account.
platfora.emr.bootstrap.action
Bootstrap actions allow you to pass a reference
to a script stored in Amazon S3. Platfora invokes
the script without arguments when initializing the
EMR job flow to build a lens. Defaults to: none.
platfora.emr.jardir
S3 directory where Platfora writes its jar files.
Defaults to /emrjar in the configured S3 bucket
for Platfora.
platfora.emr.jobconfdir
S3 directory where Platfora writes its job
metadata files. Defaults to /emrjobconfs in the
configured S3 bucket for Platfora.
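For the job flow reuse properties above, it can help to see how a shutdown runway relates to hourly payment increments. The sketch below is one possible reading of platfora.emr.shutdown.runway, not a description of Platfora's actual scheduling logic; the function name and the assumption that increments are counted from cluster start are ours.

```python
def shutdown_initiation_time(uptime_seconds, runway=300, increment=3600):
    """Return the uptime (in seconds) at which shutdown would begin.

    Assumes payment increments fall every `increment` seconds after the
    cluster starts, and that shutdown begins `runway` seconds before the
    next increment.
    """
    next_increment = (uptime_seconds // increment + 1) * increment
    return next_increment - runway


# A job flow that has been up for 4000 seconds is in its second hour.
print(shutdown_initiation_time(4000))  # prints: 6900
```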
Troubleshoot Setup Issues
This section describes typical errors encountered during installation and setup, and how to resolve them.
View the Platfora Log Files
If you encounter errors when initializing, starting or running Platfora, check the Platfora log files. The
logs can provide more information about the cause of the error. You can view the logs on the Platfora
master or in the Platfora web application.
The Platfora master server log file is located at $PLATFORA_DATA_DIR/logs/platfora-server.log.
If the Platfora server is running, you can also access the Platfora server log file in the browser. Go to:
http://hostname:port/debug/view-log/125
Where hostname:port is the Platfora server hostname and web application port (8001 is the default
port) and 125 is the number of bytes to display, in thousands (the default is 50, or 50,000 bytes).
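The log viewer URL can be assembled from its three parts. The helper below is a hypothetical convenience, assuming only the URL pattern given above.

```python
def log_view_url(hostname, port=8001, thousand_bytes=50):
    """Return the browser URL for viewing the tail of the Platfora server log."""
    return f"http://{hostname}:{port}/debug/view-log/{thousand_bytes}"


# Show the last 125,000 bytes of the log on the default web port.
print(log_view_url("hostname", thousand_bytes=125))
# prints: http://hostname:8001/debug/view-log/125
```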
Setup Fails Setting up Catalog Metadata Service
Platfora uses a PostgreSQL database to store its metadata catalog. If PostgreSQL is already running,
setup will fail when it tries to start PostgreSQL. You must stop PostgreSQL, clean up the Platfora data
directory, and then try again. You may also get an error if the /var/run/postgresql directory is
missing or has the wrong file permissions.
When this error occurs, you might see errors such as the following from setup.py:
Command failed due to: Error occurred command: /usr/lib/postgresql/9.2/bin/pg_ctl start -D
In the PostgreSQL log file (located in $PLATFORA_DATA_DIR/logs/pg.log), you may see an error
such as:
LOG: could not bind IPv4 socket: Address already in use
HINT: Is another postmaster already running on port 5432?
In the PostgreSQL log file, you may also see an error such as this:
FATAL: could not create lock file "/var/run/postgresql/.s.PGSQL.5432.lock": No such file or directory
This means that the platfora system user does not have permission to write to the location where
PostgreSQL writes its lock files. Make sure to create /var/run/postgresql and give ownership
to the platfora user. Note that a system reboot sometimes clears /var/run, so you may need to
recreate this directory if you have rebooted your server.
1. Check if the PostgreSQL data process is running.
$ ps ax | grep postgres
2. If it is running, kill the process.
$ kill process_id
3. Make sure you have removed the automatic startup scripts for PostgreSQL, otherwise you will
probably hit this error again.
On RedHat/CentOS:
$ sudo rm /etc/init.d/postgresql-9.2
On Ubuntu:
$ sudo rm /etc/rc*/*postgresql
4. Clean out the Platfora data directory location before trying setup.py again. For example:
$ rm -rf /data/PLATFORA_DATA/*
5. Make sure the /var/run/postgresql directory exists and has the correct permissions.
$ sudo mkdir /var/run/postgresql
$ sudo chown platfora /var/run/postgresql
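Before rerunning setup.py, you can confirm that nothing is still listening on the catalog port. The following sketch uses only the Python standard library; the function name is hypothetical, and port 5432 is the default catalog port mentioned in this guide.

```python
import socket


def port_in_use(port, host="127.0.0.1", timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an error code otherwise.
        return sock.connect_ex((host, port)) == 0


if __name__ == "__main__":
    if port_in_use(5432):
        print("Port 5432 is in use; another PostgreSQL may still be running.")
    else:
        print("Port 5432 is free.")
```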
TEST FAILED: Checking integrity of binaries
When you run the setup.py utility (which runs the platfora-syscheck utility by default), it computes a
checksum of all of the files in the installation package to make sure the package is not corrupt. If you
add, remove, or change any files inside the Platfora installation directory, this checksum test will fail.
When this error occurs, you might see an error such as the following when you try to initialize or
upgrade Platfora using setup.py (or run a system verification check using platfora-syscheck):
Verifying System Requirements
Checking integrity of binaries......
-=-=-=-=-=-=-=-=-=-=- TEST FAILED -=-=-=-=-=-=-=-=-=-=-
Reason: ....
To avoid this error, you should not add, remove, or modify any files inside $PLATFORA_HOME after
you have downloaded and unpacked the installation package. If you have not made any changes to the
Platfora installation files, this error means that the package you downloaded may be corrupt. Contact
Platfora Customer Support to obtain a new installation package.
If you have intentionally made changes to your Platfora installation and want to bypass this check when
running setup.py (and you have successfully run platfora-syscheck in the past), you can skip
the system checks using the --skip_syscheck option. For example:
$ setup.py --skip_syscheck
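To see why any added, removed, or modified file causes an integrity check to fail, consider the following sketch of a directory checksum comparison. It illustrates the general technique only: the use of SHA-256 and both function names are assumptions, and this is not the algorithm that platfora-syscheck uses.

```python
import hashlib
from pathlib import Path


def directory_checksums(root):
    """Return a dict of relative file path -> SHA-256 hex digest."""
    root = Path(root)
    sums = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            sums[str(path.relative_to(root))] = digest
    return sums


def changed_files(before, after):
    """Return the sorted names of files added, removed, or modified."""
    names = set(before) | set(after)
    return sorted(name for name in names if before.get(name) != after.get(name))
```

Recording the checksums right after unpacking the package and comparing them later would list exactly which files differ.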
Chapter 6: Start Platfora
After installing and initializing the Platfora master server, you are ready to start Platfora. After Platfora is started,
log in to the Platfora web application, upload your license, and change the default administrator password.
Optionally, you may want to load the tutorial data to make sure everything is working as expected.
Topics:
• Start the Platfora Server
• Log in to the Platfora Web Application
• Add a License Key
• Change the Default Admin Password
• Load the Tutorial Data
Start the Platfora Server
After you have successfully completed setup, you are ready to start the Platfora server for the first time.
Starting the Platfora server also starts the metadata catalog service (PostgreSQL).
Before you can start Platfora, make sure your Hadoop services are up and running. Platfora will not start
if it cannot connect to the Hadoop file system and data processing services you have configured.
PostgreSQL must be installed and in your PATH, but not running.
To start the Platfora server:
$ $PLATFORA_HOME/bin/platfora-services start
To confirm the master server has started correctly (it should be Enabled, Available, and Running):
$ $PLATFORA_HOME/bin/platfora-services status
ID  TYPE    HOST               PORT  ENABLED  STATUS     PROCESS
--------------------------------------------------------------------------
0   Master  ip-10-xxx-xxx-xxx  8002  Enabled  Available  Running
Log in to the Platfora Web Application
After Platfora is started, you can open a web browser, and go to the URL of the Platfora master server
process. To log in for the first time, use admin and admin as the username and password.
Enter the following URL in your browser location field, where hostname is the IP address or public
DNS hostname of the Platfora master server and port is the HTTP web services port entered during
setup (the default port is 8001):
http://hostname:port
If SSL is enabled, the Platfora web server redirects the browser to use the HTTPS port instead (8443 by
default).
When prompted for a username and password, use admin and admin to log in for the first time. These
are the default credentials for the Platfora System Administrator account.
After logging in for the first time, you will be prompted to accept the Platfora license agreement. You
must accept the license agreement to continue.
Add a License Key
When the Platfora software is in an unlicensed state, a system administrator must upload a valid license
key to activate the product functionality.
1. Go to the System > License page.
2. Click Upload.
3. Navigate to the license key file stored on your local machine and select the license key file.
4. Click OK in the message window after the license is successfully installed.
Change the Default Admin Password
After logging in to the Platfora web application for the first time, it is a good idea to change the default
Platfora System Administrator password from admin to something more secure.
You can change the default System Administrator (admin) user's password and profile picture in
the Platfora web application.
1. In the top right corner of the page header, open the System pull-down menu and select User
Profile.
2. In the user profile dialog, click Change Password.
3. Enter a new password. Type carefully (there is no password confirmation).
4. Click Update Password.
Load the Tutorial Data
Platfora installs with some sample data that you can load to see examples of how datasets and lenses are
created. Loading the sample data is also a good way to test that Platfora is working correctly with your
configured Hadoop implementation. The Platfora server has a client load utility that you can run via the
command-line to automatically load the sample data. This client utility creates four sample datasets and
one sample lens in the Platfora web application.
If you have not received a valid license file from Platfora Customer Support, and
enabled it within the Platfora web application, you will not be able to load the
tutorial data. You must have a valid license in order to create datasets and lenses.
Log in to the Platfora master server in a terminal session, and run the following command:
$PLATFORA_HOME/client/bin/run_python \
$PLATFORA_HOME/client/examples/flights/load_flights.py -u admin -p admin -s localhost:8001
If you have changed the default Platfora administrator password (admin) or web
server port (8001), you will need to alter the load command to supply the correct
connection information for your Platfora server.
The command-line does not return until the lens build job completes, which can take several minutes.
In the meantime, you can access the Platfora application in a web browser using the following URL
(replace hostname with the actual IP or hostname of your Platfora master server):
http://hostname:8001
Chapter 7: Initialize a Worker Node
Worker nodes are initialized and added to a Platfora cluster by running a utility on the Platfora master node.
Before you can initialize a worker node, make sure you have provisioned and configured the worker node
machine.
Before you initialize a Platfora worker, you must do the following tasks on the worker node machine:
1. Install the prerequisite software directly on the worker node.
• If using the RPM installer packages, Install Dependencies RPM Package.
• If using the TAR installer packages, you must manually Create the Platfora System User, Set OS
Kernel Parameters, and Install Dependent Software.
2. Configure Environment on Platfora Nodes.
After the worker node has been correctly provisioned, you can add it to the Platfora cluster from the
master. The platfora-node add utility will copy the Platfora software and configurations from the
master over to the worker node, start it, and bring the node into the Platfora cluster.
1. On the master node, add the worker node to the cluster:
$ platfora-node add --host worker_hostname
2. After the command completes, check the status of the cluster.
When the new child node is Enabled and Available, it is ready to serve viz queries. For example:
$ platfora-services status
ID  TYPE    HOST               MGMT_PORT  WEB_PORT  ENABLED  STATUS     PROCESS
--------------------------------------------------------------------------------
0   Master  ip-10-xxx-xxx-xxx  8002       8001      Enabled  Available  Running
1   Child   ip-10-xxx-xxx-xxx  8002       8001      Enabled  Available  Running
A newly added node may have a status of Not Ready until it is finished copying the lens data blocks it needs
over from the Hadoop file system.
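The readiness check described above can be scripted. The following is a sketch only: it assumes the status output is whitespace-separated with the column order of the sample (ID, TYPE, HOST, MGMT_PORT, WEB_PORT, ENABLED, STATUS, PROCESS). The nodes_not_ready function and the host names in the sample input are hypothetical.

```shell
# Reads `platfora-services status` output on stdin and prints the ID of
# every node that is not yet both Enabled and Available.
# Assumed column order: ID TYPE HOST MGMT_PORT WEB_PORT ENABLED STATUS PROCESS
nodes_not_ready() {
  awk '$1 ~ /^[0-9]+$/ && !($6 == "Enabled" && $7 == "Available") { print $1 }'
}

# Sample input (made-up hosts) with one node still copying lens data.
printf '%s\n' \
  'ID TYPE HOST MGMT_PORT WEB_PORT ENABLED STATUS PROCESS' \
  '0 Master ip-10-0-0-1 8002 8001 Enabled Available Running' \
  '1 Child ip-10-0-0-2 8002 8001 Enabled Not Ready Running' |
  nodes_not_ready
# prints: 1
```

On a live cluster you might run platfora-services status | nodes_not_ready and repeat it until it prints nothing.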
Appendix A: Platfora Utilities Reference
The Platfora command-line management utilities are located in $PLATFORA_HOME/bin of your Platfora server
installation. All utility commands should be executed from the Platfora master node.
Topics:
• setup.py
• hadoop-check
• hadoopcp
• hadoopfs
• install-node
• platfora-catalog
• platfora-config
• platfora-export
• platfora-import
• platfora-license
• platfora-node
• platfora-services
• platfora-syscapture
• platfora-syscheck
setup.py
Initializes a new Platfora instance or upgrades an existing one. Can also be used to reset bootstrap
system configuration properties.
Synopsis
setup.py [-h] [-q] [-v] [-V]
setup.py [--hadoop_conf path] [--platfora_conf path] [--datadir path]
[--dfs_dir dfs_path] [--port admin_port] [--data_port data_port]
[--websvc_port http_port] [--ssl_port https_port] [--jvmsize jvm_size]
[--hadoop_version string] [--extraclasspath path] [--extrajavalib path]
[--skip_checks] [--skip_syscheck] [--skip_sync] [--skip_setup_ssl]
[--skip_setup_dfscachesize] [--skip_setup_telemetry] [--upgrade_catalog]
[--nochanges] [--verbose]
Description
The setup.py utility is run on the Platfora master node after installing the Platfora software, but
before starting the Platfora server for the first time.
For new installations, setup.py:
• Runs platfora-syscheck to verify that all system prerequisites have been met.
• Confirms that you have installed the correct Platfora software package for your intended Hadoop
distribution.
• Prompts for bootstrap configuration information, such as port numbers, directory locations, memory
resources, secure connections, and diagnostic data collection.
• Verifies that the supplied ports are open and that permissions and disk space are sufficient on both
the local and remote DFS file systems.
• Initializes the Platfora metadata catalog database in PostgreSQL.
• Creates the default System Administrator user account.
• Copies setup files to the Platfora storage location in the configured Hadoop DFS.
For upgrade installations, setup.py:
• Runs platfora-syscheck to verify that all system prerequisites have been met.
• Confirms that you have installed the correct Platfora software package for your intended Hadoop
distribution.
• Displays your current bootstrap configuration settings and prompts if you want to make changes.
• Upgrades the Platfora metadata catalog database in PostgreSQL if necessary.
• Copies any updated library files to the Platfora storage location in the configured Hadoop DFS.
• Synchronizes the Platfora software and configuration files on the worker nodes in a multi-node
installation.
Required Arguments
No required arguments.
Optional Arguments
-c | --hadoop_conf path
This is the local directory containing your Hadoop configuration files (such as core-site.xml and
mapred-site.xml). Platfora uses the information in these files to connect to your Hadoop cluster.
-C | --platfora_conf path
This is the local directory where Platfora will store its configuration files. Defaults to
$PLATFORA_CONF_DIR if set.
-d | --datadir path
This is the local directory where Platfora will store its metadata catalog database, lens data, and log files.
Defaults to $PLATFORA_DATA_DIR if set.
--data_port data_port
This is the data transfer port used during query processing on multi-node Platfora clusters. By default,
uses the same port number as the master node.
--db_port port
This is the port of the PostgreSQL database instance where the Platfora metadata catalog database
resides. The default PostgreSQL port is 5432.
--db_dump_path path
This is the path where the backup SQL file of the Platfora metadata catalog database will be created
prior to upgrading the catalog. Defaults to the current directory.
-g | --dfs_dir dfs_path
This is the remote directory in the configured Hadoop distributed file system (DFS) where Platfora will
store its library files and MapReduce output (lens data).
-j | --extraclasspath path
This is the path where the Platfora server will look for additional custom Java classes (.jar files), such as
those for Hive JDBC connectors, custom Hive SerDes, or user-defined functions. These are not included
in Lens Building in Hadoop. This option is deprecated; use $PLATFORA_DATA_DIR/extlib instead.
-l | --extrajavalib path
This is the path where the Platfora server should look for native Java libraries. These are not included in
Lens Building in Hadoop. This option is deprecated; use $PLATFORA_DATA_DIR/extlib instead.
-n | --nochanges
On upgrade, do not prompt the user if they want to make changes to their current Platfora bootstrap
configuration settings.
-p | --port admin_port
This is the server administration port used for management utility and API calls to the Platfora server.
This is also the port that multi-node Platfora servers use to connect to each other. The default is 8002.
-s | --jvmsize jvm_size
The maximum amount of Java virtual machine (JVM) memory allocated to a Platfora server process. On a
dedicated machine, this should be about 80 percent of total system memory. You can specify size using
M for megabytes or G for gigabytes.
--skip_checks
Do not do safety checks, such as verifying ports, disk space, and file permissions.
--skip_setup_dfscachesize
Do not prompt to configure the maximum local disk space utilization for storing lens data. If
this question is skipped, Platfora will set the maximum to 80 percent of the available space in
$PLATFORA_DATA_DIR. When this limit is reached, lens builds will fail during the pre-fetch stage.
--skip_setup_ssl
Do not prompt to configure secure connections (SSL) between browser clients and the Platfora server. If
these questions are skipped, the default is no (do not use SSL).
--skip_sync
Do not sync the installation directory to the worker nodes.
--skip_syscheck
Do not run the platfora-syscheck utility prior to setup.
--skip_setup_telemetry
Do not prompt to disable/enable diagnostic data collection. If these questions are skipped, the default is
yes (enable diagnostic data collection), and the company name is set to default (anonymous).
-t | --hadoop_version version_string
The version string corresponding to the Hadoop distribution you are using with Platfora. Valid values
are cdh5 (Cloudera 5.0.x and 5.1.x), cdh52 (Cloudera 5.2.x and 5.3.x), cdh54 (Cloudera 5.4.x), mapr4
(MapR 4.0.1), mapr402 (MapR 4.0.2 and 4.1.x), emr3 (Amazon Elastic Map Reduce), HDP_2.1
(Hortonworks 2.1.x), HDP_2.2 (Hortonworks 2.2.x), pivotal_3 (PivotalHD 3.0).
--upgrade_catalog
Automatically upgrade the metadata catalog schema if necessary. The catalog update check is run by
default.
-v | --verbose
Runs in verbose mode. Show all output messages.
-w | --websvc_port http_port
This is the HTTP listener port for the Platfora web application server. This is the port that browser
clients use to connect to Platfora. The default is 8001.
-W | --ssl_port https_port
This is the HTTPS listener port for the Platfora web application server. This is the SSL port that browser
clients use to connect to Platfora. The default is 8443.
Examples
Run setup without doing the prerequisite checks first:
$ setup.py --skip_syscheck
Run initial setup without any prompts using the specified bootstrap configuration settings (or use the
default settings when not specified):
$ setup.py --hadoop_conf /home/platfora/hadoop_conf \
--platfora_conf /home/platfora/platfora_conf \
--datadir /data/platfora --dfs_dir /user/platfora --jvmsize 12G --hadoop_version cdh5 \
--skip_setup_ssl --skip_setup_dfscachesize --skip_setup_telemetry
Run upgrade setup without any prompts and keep all previous configuration settings:
$ setup.py --upgrade_catalog --nochanges
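The --jvmsize guidance above (about 80 percent of total system memory on a dedicated machine) can be turned into a concrete value with a little arithmetic. This is a minimal sketch: jvm_size_from_kb is a hypothetical helper, and the 16 GB figure is only an example.

```shell
# Prints a --jvmsize value in megabytes equal to 80 percent of the
# given total memory, which is expressed in kilobytes.
# Integer arithmetic truncates any fractional megabyte.
jvm_size_from_kb() {
  echo "$(( $1 * 80 / 100 / 1024 ))M"
}

# Example: a machine with 16 GB (16777216 kB) of RAM.
jvm_size_from_kb 16777216
# prints: 13107M
```

On Linux, the total memory in kB is typically available as the MemTotal line of /proc/meminfo, so jvm_size_from_kb "$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)" would give a starting value to pass to setup.py --jvmsize. Treat the result as a starting point rather than a tuned setting.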
hadoop-check
Checks the Hadoop cluster connected to Platfora to make sure it is not misconfigured. Collects
information about the Hadoop environment for troubleshooting purposes.
Synopsis
hadoop-check [-h] [-v] [-vv] [-V]
Description
The hadoop-check utility verifies that Hadoop is correctly configured for use with Platfora. It also
collects system information from the Hadoop cluster environment. You must complete setup.py
before running this utility.
Output from this utility is logged in $PLATFORA_DATA_DIR/logs/hadoop-check.log.
It performs the following checks:
• Root DFS Test. This test makes sure that Platfora can connect to the configured Hadoop file system,
and that file permissions are correct on the directories that Platfora needs to write to. It also makes
sure that any jar files that have been placed in $PLATFORA_DATA_DIR/extlib have the correct
file permissions.
• File Codec Test. This test makes sure that Platfora has the codecs (file compression libraries) it
needs to recognize and read the compression types supported in Hadoop. If Hadoop is configured
to support a compression type that Platfora does not recognize, then this test will fail. You can put
the jar files for any additional codecs in $PLATFORA_DATA_DIR/extlib of the Platfora server
(requires a restart).
• Hadoop Host Configuration Test. This test runs a small MapReduce job on the Hadoop cluster
and reports back information from the Hadoop environment. It makes sure that memory is not oversubscribed on the Hadoop MapReduce cluster. These tests assume that all nodes in the Hadoop
cluster have the same resource configuration (same amount of memory, CPU cores, etc.).
The check returns an RC (return code) value. A return code of 0 means all tests passed. A return code of 1
means one or more tests failed.
Root DFS Test
This test is skipped if Platfora is configured to use Amazon S3.
Tests DFS file system information and returns the following:
Total
The total disk space in the Platfora storage directory on the
Hadoop file system.
Used
The used disk space in the Platfora storage directory on the
Hadoop file system.
Available
The available disk space in the Platfora storage directory on
the Hadoop file system.
Permissions on the Platfora DFS Directory
The platfora system user has write permissions to the Platfora storage directory
on the Hadoop file system (PASSED or FAILED).
File Codec Test
Codecs Installed
The file compression libraries that are installed in Hadoop.
Output compression in Hadoop Conf
Checks if the mapred-site.xml property mapred.output.compress is enabled and, if it is,
makes sure the compression library specified in
mapred.output.compression.codec is also installed in Platfora.
Hadoop Host Configuration Test
JobTracker Status (ResourceManager for YARN)
Ensures the server is up and running.
Black Listed Tasktrackers (NodeManagers for YARN)
Lists the number of servers marked unavailable in the
Hadoop cluster.
Total Cluster Map Tasks
Total number of map task slots available. This is the
value of mapred.tasktracker.map.tasks.maximum in the
JobTracker for pre-YARN distributions. This is the value
of mapreduce.tasktracker.map.tasks.maximum in the
ResourceManager for YARN distributions.
Map Tasks Occupied
The number of map task slots that were occupied at the
time of the test.
Total Cluster Reduce Tasks
Total number of reduce task slots available. This is the value of
mapred.tasktracker.reduce.tasks.maximum in the JobTracker for pre-YARN
distributions, or the value of mapreduce.tasktracker.reduce.tasks.maximum in
the ResourceManager for YARN distributions.
Reduce Tasks Occupied
The number of reduce task slots that were occupied at the
time of the test.
Job Submission Took
How long it took for Platfora to submit the test MapReduce
job.
Hadoop Host
The host name of the JobTracker. The host name of the
ResourceManager node for YARN distributions.
Hadoop Version
The version of Hadoop that is running.
CPUs
Number of CPUs per TaskTracker node. Number of CPUs for the
NodeManager in YARN distributions.
RAM
The available memory per TaskTracker. The available memory per
NodeManager in YARN distributions.
Map Slots
Maximum map task slots available.
Reduce Slots
Maximum reduce task slots available.
Hadoop Configured Memory
The configured amount of memory available to
MapReduce processes. Looks at maximum JVM size
per task (mapred.child.java.opts) times the total
number of task slots. The total number of task slots is
equal to mapred.tasktracker.map.tasks.maximum plus
mapred.tasktracker.reduce.tasks.maximum for pre-YARN
distributions. The total number of task slots is equal
to mapreduce.tasktracker.map.tasks.maximum plus
mapreduce.tasktracker.reduce.tasks.maximum on YARN
distributions.
This test will fail if the Hadoop configured memory exceeds
available RAM.
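The Hadoop Configured Memory calculation described above can be sketched as follows. This illustrates the arithmetic only and is not how hadoop-check itself is implemented. The function names, the use of megabytes as the unit, and the sample figures are all assumptions.

```shell
# Hadoop configured memory: max JVM size per task times total task slots.
# $1 = max JVM size per task (MB), $2 = map slots, $3 = reduce slots
configured_memory_mb() {
  echo $(( $1 * ($2 + $3) ))
}

# Succeeds (exit status 0) when configured memory exceeds available RAM.
# $4 = available RAM in MB
memory_oversubscribed() {
  [ "$(configured_memory_mb "$1" "$2" "$3")" -gt "$4" ]
}

# Example: 1024 MB per task, 8 map + 4 reduce slots, 8192 MB of RAM.
configured_memory_mb 1024 8 4
# prints: 12288
if memory_oversubscribed 1024 8 4 8192; then
  echo "FAILED: configured memory exceeds available RAM"
fi
```

In this example 12288 MB of configured memory exceeds 8192 MB of RAM, which is the condition under which the test is reported as failed.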
Required Arguments
No required arguments.
Optional Arguments
-h | --help
Shows the command-line syntax help and then exits.
-v | --verbose
Runs in verbose mode. Show all output messages.
-V | --version
Shows the software version information and then exits.
-vv
Runs in extra verbose mode.
Examples
Test and collect information from the Hadoop cluster that Platfora is configured to use:
$ hadoop-check
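Because hadoop-check reports its result as a return code (0 when all tests pass, 1 when one or more fail), a wrapper script can act on it. The sketch below uses a hypothetical helper named describe_hadoop_check_rc. The handling of codes other than 0 and 1 is an assumption, since the guide documents only those two values.

```shell
# Maps a hadoop-check return code to a short message.
describe_hadoop_check_rc() {
  case "$1" in
    0) echo "all tests passed" ;;
    1) echo "one or more tests failed" ;;
    *) echo "unexpected return code: $1" ;;  # not documented in the guide
  esac
}

describe_hadoop_check_rc 1
# prints: one or more tests failed
```

A typical use would be to run hadoop-check, pass $? to this function, and consult $PLATFORA_DATA_DIR/logs/hadoop-check.log when a failure is reported.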
hadoopcp
Copies a file from one location in the configured DFS to another location in the configured DFS with the
ability to transcode files.
Synopsis
hadoopcp source_dfs_uri destination_dfs_uri
Description
The hadoopcp utility allows you to copy a file residing in the remote Hadoop DFS from one location
to another and optionally transcode the file.
File paths must be specified in URI format using the appropriate DFS file system protocol. For example,
hdfs:// for Cloudera, Apache, or Hortonworks Hadoop, maprfs:// for MapR, s3n:// for
Amazon S3.
This command executes as the currently logged in system user (the platfora user, for example). The
target directory location must exist, and this user must have write permissions to the directory.
Required Arguments
source_dfs_uri
The source location in a remote Hadoop file system in URI format. For example:
hdfs://hostname:[port]/dfs_path
destination_dfs_uri
The target location in a remote Hadoop file system in URI format. For example:
hdfs://hostname:[port]/dfs_path
Optional Arguments
-h
Shows the command-line syntax help and then exits.
Examples
Copy the file /mydata/foo.csv residing in HDFS to the same location in HDFS but transcode it to a gzip
compressed file:
$ hadoopcp hdfs://localhost/mydata/foo.csv hdfs://localhost/mydata/foo.csv.gz
hadoopfs
Executes the specified hadoop fs command on the remote Hadoop file system.
Synopsis
hadoopfs -command
Description
The hadoopfs utility allows you to run Hadoop file system commands from the Platfora server. This is
analogous to running the specified hadoop fs command on the Hadoop NameNode server.
The command executes as the currently logged in system user (the platfora user, for example). This
user must have sufficient Hadoop file system permissions to perform the command.
Required Arguments
-command
A Hadoop file system shell command. See the Hadoop Shell Command Documentation for the list of
possible commands.
Optional Arguments
No optional arguments.
Examples
List the contents of the /platfora/uploads directory in the configured Hadoop file system:
$ hadoopfs -ls /platfora/uploads
Remove the file /platfora/uploads/test.csv in the configured Hadoop file system:
$ hadoopfs -rm /platfora/uploads/test.csv
install-node
Copies the Platfora software and configuration directories from the current node to the specified remote
node(s).
Synopsis
install-node --host hostname | --hostsfile filename [-h] [-q] [-v] [-V]
Description
The install-node utility copies the $PLATFORA_HOME directory from the current node to the
specified remote nodes. It also synchronizes the configuration files in the $PLATFORA_CONF_DIR
directory. You can use the install-node utility to copy a Platfora software installation to a remote
node that has not yet been added to your Platfora cluster configuration.
This utility is also called indirectly by the platfora-services sync, platfora-node add,
platfora-node sync, and setup.py upgrade utilities. Platfora recommends using these utilities
when adding new nodes or upgrading existing nodes in your Platfora cluster configuration.
Files are copied to the remote node as the currently logged in system user. The $PLATFORA_HOME and
$PLATFORA_CONF_DIR directory locations must exist on the remote node, and the current system
user must have sufficient file system permissions to write to these locations.
Required Arguments
One of either --host or --hostsfile is required.
--host hostname
Copies the Platfora software and configuration directories to the specified host name or IP address.
--hostsfile filename
Copies the Platfora software and configuration directories to the host names or IP addresses specified in
the named file, one host per line.
Optional Arguments
-h | --help
Shows the command-line syntax help and then exits.
-q | --quiet
Runs in quiet mode. Do not send output messages to STDOUT.
-v, -vv, -vvv | --verbose
Runs in verbose mode. Show all output messages.
-V | --version
Shows the software version information and then exits.
Examples
Install the Platfora software on the remote host named myremotehost by copying over the Platfora
installation installed on the local host:
$ install-node --host myremotehost
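To copy the installation to several nodes at once, the --hostsfile argument takes a file with one host per line. The following sketch shows one way to produce such a file. The write_hostsfile function and the example host names are hypothetical.

```shell
# Writes each host argument to the named file, one host per line.
# $1 = output file, remaining arguments = host names or IP addresses
write_hostsfile() {
  file=$1
  shift
  printf '%s\n' "$@" > "$file"
}

# Example with two made-up worker host names.
hosts=$(mktemp)
write_hostsfile "$hosts" worker1.example.com worker2.example.com
cat "$hosts"
rm -f "$hosts"
```

The resulting file can then be passed to the utility as install-node --hostsfile followed by the file name.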
platfora-catalog
Manages the Platfora metadata catalog database in PostgreSQL.
Synopsis
platfora-catalog [-h] [-q] [-v] [-V] init | start | stop | status | backup
| restore | upgrade | pswd | keygen | ssl [sub-command options]
Description
Use the platfora-catalog utility to manage the Platfora metadata catalog database in PostgreSQL.
When you first install and initialize Platfora using setup.py, it initializes a PostgreSQL database
instance using the default PostgreSQL port (5432) and creates a platfora database in the
$PLATFORA_DATA_DIR location. You run this utility by passing one of its subcommands either directly
or indirectly through the setup.py and platfora-services utilities. You can call the following
subcommands directly:
Subcommand
Description
backup
Dumps the contents of the platfora catalog database to a backup file.
restore
Restores the platfora catalog database using a backup file.
pswd
Creates a new encrypted superuser password for the platfora
metadata catalog database. Platfora encrypts the stored password using
128-bit AES encryption. This command is called by setup.py during
new installations (in 4.1.3 and later releases). You must run platfora-services
stop before running this command.
keygen
Generates a new key that is used to encrypt the password used to
access the platfora metadata catalog database and re-encrypts the
password using the new key. You must run platfora-services stop
before running this command.
ssl
Controls whether or not worker nodes use an SSL connection to
communicate with the metadata catalog database. You must run
platfora-services stop before running this command.
These subcommands are called indirectly, but you can also call them directly:
Subcommand
Description
init
Initializes a new Platfora metadata catalog database. This command is
called by setup.py during new installations.
start
Starts the PostgreSQL database server. This command is called by
platfora-services start.
stop
Stops the PostgreSQL database server. This command is called by
platfora-services stop.
status
Shows the status of the PostgreSQL database server process. This
command is called by platfora-services status.
migrate
Migrates individual elements in the platfora metadata catalog
database from one DFS location to another.
upgrade
Upgrades the schema in the platfora catalog to the latest installed
Platfora version. This command is called by setup.py during upgrade.
Required Arguments
Requires one of the following sub-commands: init, start, stop, status, backup, restore,
pswd, keygen, ssl, or upgrade. To see the arguments available with a sub-command, enter the
following command-line string:
platfora-catalog sub-command --help
Optional Arguments
-h | --help
Shows the command-line syntax help and then exits.
-q | --quiet
Runs in quiet mode. Do not send output messages to STDOUT.
-v, -vv, -vvv | --verbose
Runs in verbose mode. Show all output messages.
-V | --version
Shows the software version information and then exits.
platfora-catalog ssl
Controls whether or not worker nodes use an SSL connection to communicate with the metadata catalog
database in PostgreSQL.
Synopsis
platfora-catalog ssl [-h] [--enable] [--disable] [--self] [--manual]
[--cert_file certificate_file] [--key_file private_key_file]
Description
The platfora-catalog ssl command controls whether or not worker nodes use an SSL
connection to communicate with the metadata catalog database in PostgreSQL. By default, SSL
connections are not enabled. Note that the Platfora server must be stopped to run this command.
To enable SSL connections between worker nodes and the metadata database on
the master node, the PostgreSQL database that Platfora uses must support and
enable SSL.
Required Arguments
No required arguments.
Optional Arguments
-h | --help
Shows the command-line syntax help and then exits.
--enable
Specifies that worker nodes should use an SSL connection when communicating with the metadata
database on the master node. When enabled, Platfora distributes the server certificate to the worker
nodes every time the server starts. Enabling SSL may increase lens build times. Platfora only
recommends enabling this feature if your organization's security requirements deem it necessary.
--disable
Disables SSL connections between worker nodes and the metadata database on the master node.
--self
Specifies to use a self-signed server certificate and key when enabling SSL connections. When you use
this argument, Platfora generates and signs its own server certificate and private key.
--manual
Specifies to use a server certificate and private key uploaded to Platfora, typically generated by a
certificate authority (CA). You must specify the certificate and private key using the --cert_file and
--key_file arguments.
--cert_file certificate_file
The path and file name of the server certificate to use.
--key_file private_key_file
The path and file name of the server private key to use.
Examples
Use SSL connections between worker nodes and the PostgreSQL database using a self-signed server
certificate and private key:
$ platfora-catalog ssl --enable --self
Use SSL connections between worker nodes and the PostgreSQL database using a server certificate and
private key generated by a certificate authority (CA):
$ platfora-catalog ssl --enable --manual --cert_file file.crt --key_file file.key
Disable SSL connections between the worker node and the PostgreSQL database:
$ platfora-catalog ssl --disable
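Because the Platfora server must be stopped before platfora-catalog ssl is run, a change to the SSL setting is in practice a short sequence of commands. The sketch below only prints that sequence for review. The catalog_ssl_plan function is hypothetical, and restarting with platfora-services start afterwards is an assumption about the usual next step rather than something this section states.

```shell
# Prints, in order, the commands for switching catalog SSL on or off.
# $1 = "enable" or "disable"; any further arguments (for example
# --self) are appended to the platfora-catalog ssl command.
catalog_ssl_plan() {
  mode=$1
  shift
  echo "platfora-services stop"
  if [ $# -gt 0 ]; then
    echo "platfora-catalog ssl --$mode $*"
  else
    echo "platfora-catalog ssl --$mode"
  fi
  echo "platfora-services start"   # assumed follow-up step
}

catalog_ssl_plan enable --self
```

Printing the plan rather than executing it keeps the sketch safe to run anywhere; on a master node the same three commands would be entered one at a time.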
platfora-config
Displays the current settings of Platfora configuration properties, and allows you to update property
settings. Requires one of the following sub-commands: get, reset, list, set, load, server, get_dfs_path,
or set_dfs_path.
Synopsis
platfora-config [-h] [-q] [-v] [-V] get | reset | list | set |
load | server | get_dfs_path | set_dfs_path [sub-command options]
Description
The platfora-config command is used to manage Platfora server configuration properties. The
Platfora server does not need to be running to use this utility. After resetting a property, you must restart
Platfora for your changes to take effect.
platfora-config must be run with one of the following sub-commands:
• get - Display all configuration properties and their current settings on the Platfora master or on the
specified worker node.
• reset - Reset a configuration property to its default value on the Platfora master or on the specified
worker node.
• list - Display all configuration properties and their current settings on the Platfora master or on the
specified worker node. Same functionality as get.
• set - Change the value of the specified configuration property.
• load - Sets the properties specified in a configuration file on the specified Platfora worker node.
• server - List the client-side Hadoop configuration property settings.
• get_dfs_path - Get the current URI path of the given datasource in the remote file system.
• set_dfs_path - Update the URI path of the given datasource in the remote file system.
Required Arguments
Requires either --help or one of the following sub-commands: get, reset, list, set, load, server,
get_dfs_path, or set_dfs_path.
Optional Arguments
-h | --help
Shows the command-line syntax help and then exits.
-q | --quiet
Runs in quiet mode. Do not send output messages to STDOUT.
-v, -vv, -vvv | --verbose
Runs in verbose mode. Show all output messages.
-V | --version
Shows the software version information and then exits.
Examples
Show all configuration properties and their currently set values:
$ platfora-config get
Set a configuration property:
$ platfora-config set --key platfora.license.expirationwarningdays --value 30
Update the datasource path of the Uploads and System data sources when you are migrating Platfora to a
new Hadoop NameNode:
# To get the old paths
$ platfora-config get_dfs_path --datasource System
$ platfora-config get_dfs_path --datasource Uploads
# To set the new paths
$ platfora-config set_dfs_path --datasource System \
--old_path 'protocol://old_namenode_host:port/platfora/system' \
--new_path 'protocol://new_namenode_host:port/platfora/system'
$ platfora-config set_dfs_path --datasource Uploads \
--old_path 'protocol://old_namenode_host:port/platfora/uploads' \
--new_path 'protocol://new_namenode_host:port/platfora/uploads'
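When migrating to a new NameNode, the new path usually differs from the old one only in its host and port. The following sketch derives the --new_path value from the one reported by get_dfs_path. The swap_namenode function, the host names, and port 8020 are hypothetical examples.

```shell
# Replaces the host[:port] part of a DFS URI, keeping scheme and path.
# $1 = existing URI, $2 = new host[:port]
swap_namenode() {
  printf '%s\n' "$1" | sed "s|^\([a-z0-9]*://\)[^/]*|\1$2|"
}

swap_namenode 'hdfs://old-nn.example.com:8020/platfora/system' 'new-nn.example.com:8020'
# prints: hdfs://new-nn.example.com:8020/platfora/system
```

The original URI and the rewritten one would then be supplied as the --old_path and --new_path values of platfora-config set_dfs_path.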
platfora-export
Exports Platfora object metadata from the catalog database to one JSON file per object.
Synopsis
platfora-export [-h] [-q] [-v] [-V] --username username
--password password [--server server_name] [--port port]
[--protocol http|https] [--all] [--namespace namespace_name]
[--export-datasources data_source_name [...]]
[--export-datasets dataset_name [...]] [--export-lenses lens_name [...]]
[--export-vizboards vizboard_title [...]] [--export-users user_name [...]]
[--export-groups group_name [...]] [--include-referenced-datasources]
[--include-referenced-datasets] [--include-referenced-lenses]
[--include-referenced-segments] [--include-permissions] [--lazy-fetch]
[--skip-objects-by-name object_name]
Description
The platfora-export command exports Platfora object metadata from the catalog database. You
can export one or more object types. When specifying an object, use its name or, for vizboards, its
title. For names or titles with spaces, enclose the name in quotes. You can also export multiple objects of
each type. Separate each object with a space or use an * (asterisk) to export everything of that type.
The command exports objects to .json files in a subdirectory of the current directory. The command
labels the subdirectory with the object type. Exported file names are URL-encoded along with the exported
object's current version. For example, if you export the Web Logs data source, the command
creates the file datasources/Web%20Logs%20.json. If a particular filename already exists, the
command silently overwrites it.
Vizboards are the exception. Vizboard names need not be unique. For this reason,
the export utility appends a unique identifier to the exported vizboard filename.
When using one of the --include arguments to export referenced objects of a particular type, you
must include all object types in between. For example, if you export a vizboard and want to include
data sources (--include-referenced-datasources), you must also include lenses and datasets.
If you forget to provide the proper includes, the command produces the exported object(s) you requested
but none of the objects referenced by them.
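The "include everything in between" rule can be sketched for vizboard exports as follows. The reference chain used here (vizboards to lenses to datasets to data sources) is inferred from the example in the preceding paragraph, and the vizboard_include_flags function is a hypothetical helper.

```shell
# Prints the --include-referenced-* flags needed to reach the target
# object type when exporting vizboards.
# Assumed chain: vizboards -> lenses -> datasets -> data sources.
vizboard_include_flags() {
  case "$1" in
    lenses)
      echo "--include-referenced-lenses" ;;
    datasets)
      echo "--include-referenced-lenses --include-referenced-datasets" ;;
    datasources)
      echo "--include-referenced-lenses --include-referenced-datasets --include-referenced-datasources" ;;
    *)
      echo "unknown object type: $1" >&2
      return 1 ;;
  esac
}

vizboard_include_flags datasources
```

The printed flags can be appended to a platfora-export command that uses --export-vizboards, so that no intermediate object type is left out.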
Required Arguments
--username username
Username of a Platfora user account that has the appropriate object permissions on the objects to export.
For example, to export an object, the user must be able to view the object in the web application.
--password password
Password for the specified user account.
Optional Arguments
-h | --help
Shows the command-line syntax help and then exits.
-q | --quiet
Runs in quiet mode. Do not send output messages to STDOUT.
-v, -vv, -vvv | --verbose
Runs in verbose mode. Show all output messages.
-V | --version
Shows the software version information and then exits.
--server server_name
Hostname or IP address for the Platfora master node. Defaults to localhost.
--port port
Port for the Platfora master node. Defaults to 8001.
--protocol http|https
Specify which protocol to use to access the Platfora server, either http or https. Defaults to https when
the port ends with 443, otherwise defaults to http.
--namespace namespace_name
Export objects from the specified namespace. You can only export objects from one namespace in a
single call. Defaults to default.
--export-datasources data_source_name [...]
Export the specified data source. You can list multiple names to export multiple objects. Include names
in double quotes if they contain spaces or other special characters.
--export-datasets dataset_name [...]
Export the specified dataset. Use this flag to export segments, which are a special kind of dataset.
Segments have two supporting lenses: segment members and segment refresh prerequisites. Include
these using the --include-referenced-lenses flag. Include names in double quotes if they contain
spaces or other special characters.
--export-lenses lens_name [...]
Export the specified lens. You can list multiple names to export multiple objects. Include names in
double quotes if they contain spaces or other special characters.
--export-vizboards vizboard_title [...]
Export the specified vizboard by title. A vizboard title is the name users assign the vizboard in the
Platfora web application. You can list multiple titles to export multiple objects. Include titles in double
quotes if they contain spaces or other special characters. Vizboard title names may not be unique. If
multiple vizboards use the same title, all vizboards with that title are exported, and each one is assigned
a unique identifier.
--export-users user_name [...]
Export one or more users. Only administrators can export users and groups.
--export-groups group_name [...]
Export one or more groups. Only administrators can export users and groups.
--include-permissions
Export all permissions for all exported Platfora objects such as lenses or datasets. Users and groups do
not have permissions.
--include-referenced-datasources
Use this argument to export all data sources referenced by a specified object.
--include-referenced-datasets
Use this argument to export all datasets referenced by a specified object.
--include-referenced-lenses
Use this argument to export all lenses referenced by a specified object. This option applies to lens
references from segments and vizboards. This option does not support following references from
datasets to the lenses that use them.
--include-referenced-segments
Use this argument to export all segment datasets and segment lenses referenced by a specified vizboard
object. This argument only works when exporting vizboards.
--include-permissions
Use this argument to export all permissions for all exported objects.
--lazy-fetch
When the number of objects to export is significantly smaller than the total number of objects in the
catalog, use this argument to improve export performance. Defaults to false.
--skip-objects-by-name object_name [...]
When exporting multiple objects, use this argument to skip exporting objects with the specified names.
This applies to exporting all objects with the * wildcard as well as referenced objects when using one
of the --include-referenced-* arguments. By default, this command does not export the following
system-created objects:
Object Type     Excluded by Default
group           Everyone
user            system
                admin
data sources    System
                Uploads
datasets        Date
                Time
                Latitude, Longitude with Name
                Latitude, Longitude
To override the defaults, provide an * (asterisk) or specify an object name to skip.
--all
Export the entire catalog. This flag's behavior is equivalent to:
• --export-datasources "*"
• --export-datasets "*"
• --export-lenses "*"
You must explicitly export permissions, users, and groups.
Examples
Export the "event log" vizboard and the lenses, data sources, and datasets that are used by that vizboard:
$ platfora-export --username admin --password password --export-vizboards "event log"
--include-referenced-lenses --include-referenced-datasets --include-referenced-datasources
Export vizboards together with their permissions:
$ platfora-export -vvv --username admin --password admin --export-vizboards "o_viz" --include-permissions
Export all data sources and the datasets used by those data sources:
$ platfora-export --username admin --password password --export-datasources "*" --include-referenced-datasets
Export several datasets together with the data sources they reference. The output lists each file that
is written:
$ platfora-export --username admin --password admin --server localhost
--export-datasets "airports" "batting" "Carriers" --include-referenced-datasources
About to export datasets: [airports, batting, Carriers]
Exporting Dataset: "airports" to file: "datasets/airports.json"
Exporting Dataset: "batting" to file: "datasets/batting.json"
Exporting datasource: "hive on cdh1" to file: "datasources/hive%20on%20cdh1.json"
Exporting Dataset: "Carriers" to file: "datasets/Carriers.json"
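In the output above, the data source name "hive on cdh1" is written to a file whose name contains %20 in place of each space, which is consistent with URL percent-encoding. Assuming that convention holds in general (it is inferred from this single example and is not stated in this guide), a script could predict the exported file path as sketched below. The function name and its arguments are hypothetical.

```python
from urllib.parse import quote


def exported_file_path(directory: str, object_name: str) -> str:
    """Guess the path of the JSON file written for an exported object."""
    # Assumption: object names are percent-encoded in file names, as suggested
    # by the "hive%20on%20cdh1.json" file name in the example output.
    return f"{directory}/{quote(object_name, safe='')}.json"


if __name__ == "__main__":
    print(exported_file_path("datasources", "hive on cdh1"))
    print(exported_file_path("datasets", "airports"))
```

Check the actual command output before relying on a predicted path, since the encoding of other special characters is not documented here.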
platfora-import
Imports Platfora object metadata from one or more JSON files into the catalog database.
Synopsis
platfora-import [-h] [-q] [-v] [-V] --username username
--password password [--server server_name] [--port port] [--protocol http|https]
[--import-files file_name [...]] [--handle-conflicts reuse|fail] [-s] [-m]
Description
The platfora-import command is used to import Platfora object metadata from one or more JSON
formatted files into the catalog database. You can obtain these files using the platfora-export
command. Only import objects from files that were exported from the same minor release. Platfora does
not support importing objects exported from a different minor release.
If your system uses HDFS Delegated Authorization, the importing user must
have READ permission on the underlying DFS data. If the user does not have this
permission, the catalog import succeeds but the Platfora instance is unable to
access the underlying data.
Each JSON file should contain a single object definition. After importing an object, the object owner is
assigned the username given in the --username argument. If an object exists in both the catalog and
in one of the imported JSON files, then the --handle-conflicts argument determines whether the
import fails or uses the object in the catalog instead of importing the object from the JSON file.
When importing an object that references another object, the referenced object must exist either in one
of the imported JSON files or in the Platfora catalog. If any referenced object doesn't exist in either
location, the entire import fails.
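The conflict and reference rules above can be summarized in a short Python sketch. It models objects as plain names and ignores object types, which is a simplification for illustration only. The function plan_import and its data shapes are hypothetical and do not reflect how platfora-import is implemented.

```python
def plan_import(catalog, files, references, handle_conflicts="fail"):
    """Return the set of object names that would be imported, or raise ValueError.

    catalog:    set of object names already in the catalog
    files:      set of object names defined in the JSON files
    references: dict mapping an object name in `files` to the names it references
    """
    if handle_conflicts not in ("reuse", "fail"):
        raise ValueError("handle_conflicts must be 'reuse' or 'fail'")

    # An object present in both places is a conflict.
    conflicts = catalog & files
    if conflicts and handle_conflicts == "fail":
        raise ValueError(f"objects already exist: {sorted(conflicts)}")

    # Every referenced object must exist in the files or in the catalog;
    # otherwise the entire import fails.
    available = catalog | files
    for name in sorted(files):
        missing = [ref for ref in references.get(name, []) if ref not in available]
        if missing:
            raise ValueError(f"{name} references missing objects: {missing}")

    # With reuse, the catalog copy is kept and the copy in the file is ignored.
    return files - conflicts
```

For example, importing a lens file that references a dataset already in the catalog succeeds, while the same import fails if the dataset is in neither place.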
Vizboards are a special case. They have both a title visible through the user interface (UI) and a unique
name, which is only used internally and is not visible in the UI. When you import a vizboard, the system
assigns the vizboard a unique name and keeps the visible title unchanged. Vizboard permissions are tied
to the unique name Platfora uses internally. Therefore, if you want to ensure that imported vizboards
inherit the same object permissions as they did in the original Platfora catalog, you must export both
the vizboards and their permissions. Then you must import both the vizboards and their permissions in a
single call using platfora-import.
Required Arguments
--username username
Username of a Platfora user account that has the appropriate object permissions on the objects to import.
For example, to import an object the user must have Own or Edit permission on the object type.
--password password
Password for the specified user account.
Optional Arguments
-h | --help
Shows the command-line syntax help and then exits.
-q | --quiet
Runs in quiet mode. Do not send output messages to STDOUT.
-v, -vv, -vvv | --verbose
Runs in verbose mode. Show all output messages.
-V | --version
Shows the software version information and then exits.
--server server_name
Hostname or IP address for the Platfora master node. Defaults to localhost.
--port port
Port for the Platfora master node. Defaults to 8001.
--protocol http|https
Specify which protocol to use to access the Platfora server, either http or https. Defaults to https when
the port ends with 443, otherwise defaults to http.
--handle-conflicts reuse|fail
Specifies how to handle objects that already exist in the catalog with the same name. Choose reuse to
keep the existing object in the catalog and ignore the imported object with the same name. Choose fail to
stop the import process without importing any object. Defaults to fail.
--import-files file_name [...]
Import the object in the file. You can list multiple names to import multiple objects. When listing
multiple objects, the order does not matter.
Always import users and groups together. This is because the two object types are interdependent.
Importing groups fails if all the members do not also exist. Similarly, users are not imported unless their
corresponding group exists.
--skip_checks
Skips version checks between imported data and the Platfora instance. Set this when importing JSON
without metadata fields.
-s | --run-as-super-admin
Run the import job in Super Administrator mode. The specified --username must be eligible to switch to
Super Administrator mode.
-m | --skip-objects-with-missing-references
Skips importing any objects that reference other objects that cannot be found. Platfora lists which objects
were not imported because they reference objects that can't be found. Search for "Warning: Removing"
in the command response to find the objects that were not imported. This does not apply to users and
groups.
Examples
Import the lens in the flights_lens.json file. If a lens with the same name already exists, then
keep the existing lens:
$ platfora-import --username admin --password password --import-files
flights_lens.json --handle-conflicts reuse
Import vizboards and the permissions associated with them:
$ platfora-import -vvv --username admin --password admin --import-files
vizboards/* permissions/vizboards/*
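Because vizboards and their permissions must be exported together and then imported in a single call, a script that automates the round trip might assemble the two command lines as follows. This is a minimal sketch that only builds argument lists from flags documented in this section. The helper names are hypothetical, and the file paths passed to the import helper are placeholders that you would replace with the files actually written by the export.

```python
import shlex


def vizboard_export_command(username, password, titles):
    """Build a platfora-export call that exports vizboards with their permissions."""
    return ["platfora-export", "--username", username, "--password", password,
            "--export-vizboards", *titles, "--include-permissions"]


def vizboard_import_command(username, password, files):
    """Build a single platfora-import call covering vizboard and permission files."""
    return ["platfora-import", "--username", username, "--password", password,
            "--import-files", *files]


if __name__ == "__main__":
    # shlex.join quotes arguments that contain spaces, such as vizboard titles.
    print(shlex.join(vizboard_export_command("admin", "password", ["event log"])))
    print(shlex.join(vizboard_import_command(
        "admin", "password", ["vizboards/a.json", "permissions/vizboards/a.json"])))
```

Building the argument list first, rather than a single string, avoids quoting mistakes when a title contains spaces.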
platfora-license
Installs, uninstalls, or views a Platfora license. Requires one of the following sub-commands: install,
uninstall, or view.
Synopsis
platfora-license [-h] [-q] [-v] [-V] install | uninstall | view [sub-command options]
Description
The platfora-license command is used to manage the license on Platfora. The Platfora server
must be running to use this utility.
Required Arguments
Requires one of the following sub-commands: install, uninstall, or view.
Optional Arguments
-h | --help
Shows the command-line syntax help and then exits.
-q | --quiet
Runs in quiet mode. Do not send output messages to STDOUT.
-v, -vv, -vvv | --verbose
Runs in verbose mode. Show all output messages.
-V | --version
Shows the software version information and then exits.
platfora-license install
Installs a Platfora license by uploading a license key file.
Synopsis
platfora-license install [--license license_file]
[-h]
Description
The platfora-license install command is used to upload a license key file to Platfora. The
Platfora server must be running to use this utility.
Required Arguments
--license license_file
The path and license key file name to upload to the Platfora server. If no directory is specified, Platfora
looks for the file in the current directory.
Optional Arguments
-h | --help
Shows the command-line syntax help and then exits.
Examples
Upload the license key file named licensekey.license to the Platfora server:
$ platfora-license install --license licensekey.license
platfora-license uninstall
Uninstalls the current Platfora license.
Synopsis
platfora-license uninstall
[-h]
Description
The platfora-license uninstall command is used to uninstall the license currently installed on
Platfora. The Platfora server enters the unlicensed state after you run this command. The Platfora
server must be running to use this utility.
Required Arguments
No required arguments.
Optional Arguments
-h | --help
Shows the command-line syntax help and then exits.
Examples
Uninstall the current license from the Platfora server:
$ platfora-license uninstall
platfora-license view
Displays the details of the currently installed license.
Synopsis
platfora-license view
[-h]
Description
The platfora-license view command is used to view the details of the currently installed
Platfora license. The Platfora server must be running to use this utility.
Required Arguments
No required arguments.
Optional Arguments
-h | --help
Shows the command-line syntax help and then exits.
Examples
View the current Platfora license:
$ platfora-license view
platfora-node
Starts, stops, restarts, checks, updates, disables, enables, or removes a worker node in a multi-node
Platfora cluster. Requires one of the following sub-commands: add, remove, start, stop,
restart, status, sync, config, enable, or disable.
Synopsis
platfora-node [-h] [-q] [-v] [-V] status | enable | stop | sync | remove | start
| add | disable | config | restart [sub-command options]
Description
The platfora-node utility is used to manage worker nodes in a multi-node Platfora cluster, and is
always executed from the Platfora master. It must be run with one of the following sub-commands:
• add - Adds and initializes a new worker node to a Platfora cluster.
• remove - Removes an existing worker node from a Platfora cluster.
• start - Starts the Platfora server process on the designated worker node(s).
• stop - Stops the Platfora server process on the designated worker node(s).
• restart - Issues a stop immediately followed by a start.
• status - Shows the status of the Platfora server process on the designated worker node(s).
• disable - Takes a worker node out of operation. Disabled nodes remain in the cluster
configuration, but are not available to process queries. Typically you would disable a node to do
server maintenance, and then enable it again after maintenance is complete. When a node is disabled,
other nodes in the cluster will take over the lens data and processing work it was responsible for
serving.
• enable - Brings a disabled worker node back into operation. When a node comes up, it must
retrieve the latest lens data it is responsible for serving before it will be fully available to work on
queries.
• sync - Copies the Platfora software binaries from the master to the designated worker node(s).
• config - Configures the management port and host name of an existing node.
You can also use the platfora-services utility to run the start, stop, restart, status,
sync, and config commands on all nodes at once. This utility is mainly used for adding new worker
nodes, or taking nodes in and out of the cluster for server maintenance.
A node is identified by its unique node ID. This corresponds to the order that the node was added to the
Platfora cluster configuration. Usually the master node is 0, the first worker node added is 1, the second
worker node added is 2, and so on. You can run platfora-services status to see the IDs of all
nodes in the Platfora cluster.
Finally, this utility checks that the clock on each remote worker node is within the acceptable tolerance of the
master node clock. The tolerance is 60 seconds. If a node is not within the acceptable tolerance, the
utility logs an error and, depending on the context, the node is not started, added, or enabled.
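The clock check described above amounts to comparing two timestamps against a fixed tolerance. The following Python sketch illustrates that comparison. It is not the utility's own code, the function name is hypothetical, and whether a difference of exactly 60 seconds is accepted is an assumption.

```python
from datetime import datetime, timedelta

# Tolerance stated in the text above.
TOLERANCE = timedelta(seconds=60)


def clock_within_tolerance(master_time, worker_time, tolerance=TOLERANCE):
    """Return True if the worker clock is close enough to the master clock."""
    # Assumption: a difference exactly equal to the tolerance is accepted.
    return abs(master_time - worker_time) <= tolerance


if __name__ == "__main__":
    master = datetime(2015, 6, 28, 22, 14, 0)
    print(clock_within_tolerance(master, master + timedelta(seconds=30)))
    print(clock_within_tolerance(master, master - timedelta(seconds=90)))
```

Keeping all nodes synchronized to the same time source avoids this error when starting, adding, or enabling nodes.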
Required Arguments
Requires one of the following sub-commands: add, remove, start, stop, restart, status,
sync, config, enable, or disable.
Optional Arguments
-h | --help
Shows the command-line syntax help and then exits.
-q | --quiet
Runs in quiet mode. Do not send output messages to STDOUT.
-v, -vv, -vvv | --verbose
Runs in verbose mode. Show all output messages.
-V | --version
Shows the software version information and then exits.
platfora-node add
Adds a new worker node to a Platfora cluster configuration.
Synopsis
platfora-node add --host hostname [--port admin_port] [--data_port data_port]
[--websvc_port http_port] [--datadir path] [--disabled] [--skip_syscheck]
| [-h]
Description
The platfora-node add command checks the remote node for the required software, registers a
new worker node in the Platfora metadata catalog, copies the Platfora installation files to the remote
node, starts the Platfora server on the new node, and enables the node to begin serving query requests.
This command is run from the Platfora master.
Before you can add a node to the Platfora cluster, the remote server has to be
correctly provisioned with the required prerequisite software and OS configuration
settings. See the Provisioning a Platfora Server section of the Platfora Installation
Guide for more information.
Required Arguments
--host hostname
The host name or IP address of the new worker node to add to the Platfora cluster.
Optional Arguments
-h | --help
Shows the command-line syntax help and then exits.
--datadir path
The local directory where Platfora will store lens data and log files on the worker node. By default, uses
the same $PLATFORA_DATA_DIR location as the master node.
--data_port data_port
This is the data transfer port used during query processing on multi-node Platfora clusters. By default,
uses the same port number as the master node.
--disabled
Adds the node to the cluster configuration but in a disabled state. The node will not participate in query
processing until it is enabled.
--port admin_port
This is the server administration port used for management utility and API calls to the Platfora server.
This is also the port that multi-node Platfora servers use to connect to each other. By default, uses the
same port number as the master node.
--skip_syscheck
Do not run the platfora-syscheck utility prior to adding the node.
--websvc_port web_service_port
The web service port of the Platfora application server. By default, uses the same port number as the
master node.
Examples
Add a new worker node with the host name of platfora-worker-1 to the Platfora cluster:
$ platfora-node add --host platfora-worker-1
platfora-node config
Changes the configured host name and/or server administration port for a Platfora worker node.
Synopsis
platfora-node config --id number [--host hostname] [--port admin_port]
| [-h]
Description
The platfora-node config command changes the configured management port and/or host name of an
existing Platfora worker node.
Required Arguments
--id number
The node ID number in the Platfora catalog database. Usually the master node is 0, the first worker node
added is 1, the second worker node added is 2, and so on. You can run platfora-services status to
see the IDs of all nodes in the Platfora cluster.
Optional Arguments
-h | --help
Shows the command-line syntax help and then exits.
--host hostname
The updated host name or IP address of the worker node.
-p | --port admin_port
The updated server administration port.
Examples
Update the port of the worker node ID number 2:
$ platfora-node config --id 2 --port 8004
platfora-services
Starts, stops, restarts, or checks the status of Platfora server processes. Can also be used to synchronize
Platfora software and configuration files in multi-node installations. Requires one of the following
sub-commands: start, stop, restart, status, or sync.
Synopsis
platfora-services [-h] [-q] [-v] [-V] start | stop | restart | status | sync
[sub-command options]
Description
The platfora-services utility is used to manage Platfora server processes. It must be run with one
of the following sub-commands:
• start - Starts the Platfora server processes. In multi-node installations, starts the master server first
and then the worker servers in sequential order.
• stop - Stops the Platfora server processes. In multi-node installations, sequentially stops the worker
servers first and then the master server.
• restart - Issues a stop immediately followed by a start.
• status - Shows the status of the Platfora server processes.
• sync - Copies the Platfora software binaries and global configuration settings from the master to the
worker nodes.
The following sub-commands are issued internally by the platfora-services utility. DO NOT
USE without explicit directions from Platfora customer support.
• watchdog - Starts the watch dog daemon for the Platfora server process.
• launch - Includes the specified Java class in the Platfora environment.
Finally, this utility checks that the clock on the master node is not skewed. The tolerance is 60 seconds. If
the master node clock is not within the acceptable tolerance, the utility logs an error and, depending on the
context, the node process is not started, added, or enabled.
Required Arguments
Requires one of the following sub-commands: start, stop, restart, status, or sync.
Optional Arguments
-h | --help
Shows the command-line syntax help and then exits.
-q | --quiet
Runs in quiet mode. Do not send output messages to STDOUT.
-v, -vv, -vvv | --verbose
Runs in verbose mode. Show all output messages.
-V | --version
Shows the software version information and then exits.
platfora-services start
Starts the Platfora server processes. In multi-node installations, starts the master server first and then the
worker servers in sequential order.
Synopsis
platfora-services start [-h] [-d [DEBUG]] [--hadoop_conf path] [--platfora_conf path]
[--logdir path] [--profile] [-n node_id] [-p management_port] [-w web_port]
[--datadir path] [-P pid_path] [-s jvm_size] [--no_watchdog] [--nowait] [--heapdump]
[--gc] [--gclogging] [--jvmopts options]
Description
The platfora-services start command starts the Platfora server processes. If you do not
specify any arguments, the command uses the configuration information specified during setup.
This configuration is stored in the Platfora metadata catalog. To view your current configuration, see
your Global Settings in Platfora or the platfora.properties configuration file located in the
$PLATFORA_CONF_DIR.
Required Arguments
No required arguments.
Optional Arguments
-h | --help
Shows the command-line syntax help and then exits.
-d | --debug
Starts the Platfora server with the Java debugger listener enabled.
--datadir path
The path of the Platfora data directory where the catalog database, lens data, and logs reside. Defaults to
$PLATFORA_DATA_DIR or what was specified during setup.
--hadoop_conf path
Local directory path where the Hadoop configuration files reside. Defaults to what is specified for the
env.platfora.hadoopconf property.
--heapdump
Enables the JVM to provide a heap dump to $PLATFORA_DATA/log/platfora-heapdump.hprof when
an out of memory error occurs.
--gc G1|SmallHeap
Sets the garbage collection algorithm.
--gclogging
Enables JVM garbage collection logging.
--jvmopts options
Adds additional JVM options to the server process.
--logdir path
The directory of the Platfora server log files. Defaults to $PLATFORA_DATA_DIR/logs.
-n | --nodeid node_id
Starts the Platfora server process on the given node. The master node id is usually 0. Worker node ids
can be determined by running platfora-services status.
--no_watchdog
Do not start a watch dog daemon process to monitor and restart the server process if needed.
--nowait
Do not wait for the server startup tasks to complete before returning the command prompt.
-P | --piddir pid_path
The path of the server PID file. Defaults to $PLATFORA_DATA_DIR/platfora.pid.
--platfora_conf path
The directory that contains the Platfora server configuration files. Defaults to $PLATFORA_CONF_DIR
or what was specified during setup.
-p | --port management_port
The API port of the Platfora server used by the management utilities. Defaults to what is specified for
the platfora.server.management.port property.
--profile
Starts server with the Java profiler listener enabled.
-s | --jvmsize jvm_size
The amount of Java Virtual Machine (JVM) memory to allocate to the Platfora server process (M=megabytes,
G=gigabytes). Defaults to what is set for the env.platfora.jvm.maxsize property.
-w | --websvc_port web_service_port
The web service port of the Platfora application server. Defaults to what is specified for the
platfora.webservice.port property.
Examples
Start the Platfora server on all nodes in the cluster (master and workers) using the default settings:
$ platfora-services start
Start the Platfora server on worker node 3 only with an 8 GB JVM:
$ platfora-services start -n 3 -s 8G
platfora-services stop
Stops the Platfora server processes.
Synopsis
platfora-services stop [-h] [--datadir path] [--logdir path] [--master_only]
[-n node_id] [-P pid_path] [--no_watchdog] [--force]
Description
The platfora-services stop command is used to stop the Platfora server processes. If no
arguments are given, it uses the configuration information specified during startup.
Required Arguments
No required arguments.
Optional Arguments
-h | --help
Shows the command-line syntax help and then exits.
--datadir path
The path of the Platfora data directory where the catalog database, lens data, and logs reside. Defaults to
$PLATFORA_DATA_DIR or what was specified during setup.
--logdir path
The directory of the Platfora server log files. Defaults to $PLATFORA_DATA_DIR/logs.
--master-only
Stop the master server process only.
-n | --node node_id
Stops the Platfora server process on the given node. The master node id is usually 0. Worker node ids
can be determined by running platfora-services status.
-P | --piddir pid_path
The path of the server PID file. Defaults to $PLATFORA_DATA_DIR/platfora.pid.
--no_watchdog
Do not stop the watch dog daemon process that monitors and restarts the server process if needed.
--force
Stop all nodes in the cluster immediately without waiting for processes to finish gracefully. This is
similar to the kill -9 UNIX command.
Examples
Stop the Platfora server on all nodes in the cluster (master and workers) using the default settings:
$ platfora-services stop
Stop the Platfora server on worker node 3 only:
$ platfora-services stop -n 3
platfora-services restart
Stops the Platfora server processes and then immediately starts them again.
Synopsis
platfora-services restart [-h] [-d [DEBUG]] [--hadoop_conf path] [--profile]
[-n node_id] [-p management_port] [-w web_port] [-P pid_path]
[-s jvm_size] [--no_watchdog]
Description
The platfora-services restart command restarts the Platfora server processes. If you do
not specify any arguments, the command uses the configuration information specified during setup.
This configuration is stored in the Platfora metadata catalog. To view your current configuration, see
your Global Settings in Platfora or the platfora.properties configuration file located in the
$PLATFORA_CONF_DIR.
Required Arguments
No required arguments.
Optional Arguments
-h | --help
Shows the command-line syntax help and then exits.
-d | --debug
Starts the Platfora server with the Java debugger listener enabled.
--hadoop_conf path
Local directory path where the Hadoop configuration files reside. Defaults to what is specified for the
env.platfora.hadoopconf property.
-n | --node node_id
Restarts the Platfora server process on the given node. The master node id is usually 0. Worker node ids
can be determined by running platfora-services status.
--no_watchdog
Do not start a watch dog daemon process to monitor and restart the server process if needed.
-P | --piddir pid_path
The path of the server PID file. Defaults to $PLATFORA_DATA_DIR/platfora.pid.
-p | --port management_port
The API port of the Platfora server used by the management utilities. Defaults to what is specified for
the platfora.server.management.port property.
--profile
Starts server with the Java profiler listener enabled.
-s | --jvmsize jvm_size
The amount of Java Virtual Machine (JVM) memory to allocate to the Platfora server process (M=megabytes,
G=gigabytes). Defaults to what is set for the env.platfora.jvm.maxsize property.
-w | --websvc_port web_service_port
The web service port of the Platfora application server. Defaults to what is specified for the
platfora.webservice.port property.
Examples
Restart the Platfora server on all nodes in the cluster (master and workers) using the default settings:
$ platfora-services restart
Restart the Platfora server on worker node 3 only:
$ platfora-services restart -n 3
platfora-services status
Shows the status of the Platfora server processes.
Synopsis
platfora-services status [-h] [-P pid_path] [-n node_id] [-p management_port]
[-w web_port] [--logdir path] [--datadir path]
Description
The platfora-services status command is used to query the status and availability of the
Platfora server processes. If no arguments are given, it uses the configuration information specified at
startup. It reports the following information about the servers in a Platfora cluster:
Information   Description
ID            The system-assigned node ID.
Type          The type of node: Master or Child (worker).
Host          The host name of the node.
Port          The management port of the node.
Enabled       The cluster status of the node: Enabled, Disabled, or Not Ready.
              A node is disabled when an administrator takes it offline from
              query processing.
Status        The network status of the node: Available, Unavailable, or Not
              Ready. A node is unavailable when it cannot be reached by the
              master or is not responding. A node is not ready when it has been
              newly added or re-enabled, but has not yet finished copying the
              data blocks it needs to answer queries.
Process       The status of the Platfora server process on a node: Running,
              Stopped, or Unhealthy. A node is Unhealthy if Platfora cannot
              determine the process status. For example, a node is Unhealthy if
              the server is Running but not processing ping messages.
Required Arguments
No required arguments.
Optional Arguments
-h | --help
Shows the command-line syntax help and then exits.
--datadir path
The path of the Platfora data directory where the catalog database, lens data, and logs reside. Defaults
to $PLATFORA_DATA_DIR or what is specified for the platfora.data.dir in Platfora's Global
Settings.
--logdir path
The directory of the Platfora server log files. Defaults to $PLATFORA_DATA_DIR/logs.
-n | --node node_id
Shows the status of the Platfora server process on the given node. The master node id is usually 0. Worker node ids
can be determined by running platfora-services status.
-P | --piddir pid_path
The path of the server PID file. Defaults to $PLATFORA_DATA_DIR/platfora.pid.
-p | --port management_port
The API port of the Platfora server used by the management utilities. Defaults to what is specified for
the platfora.server.management.port property in $PLATFORA_CONF_DIR/platfora.properties.
-w | --websvc_port web_service_port
The web service port of the Platfora application server. Defaults to what is specified for the
platfora.webservice.port property in $PLATFORA_CONF_DIR/platfora.properties.
Examples
Check the status of all nodes in a Platfora cluster:
$ platfora-services status
ID  TYPE    HOST               PORT  ENABLED  STATUS     PROCESS
-----------------------------------------------------------------
0   Master  ip-10-212-123-456  8002  Enabled  Available  Running
1   Child   ip-10-212-123-567  8002  Enabled  Available  Running
2   Child   ip-10-212-123-678  8002  Enabled  Not Ready  Running
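A monitoring script could scan this status report for nodes that are not fully operational. The Python sketch below assumes the report is a seven-column table (ID, TYPE, HOST, PORT, ENABLED, STATUS, PROCESS) with one node per row and columns separated by two or more spaces. That layout is an assumption based on the example above and may differ in your release. The function names and the SAMPLE text are illustrative only.

```python
import re

COLUMNS = ("id", "type", "host", "port", "enabled", "status", "process")

# Illustrative sample, assuming columns are separated by two or more spaces.
SAMPLE = """\
ID  TYPE    HOST               PORT  ENABLED  STATUS     PROCESS
-----------------------------------------------------------------
0   Master  ip-10-212-123-456  8002  Enabled  Available  Running
1   Child   ip-10-212-123-567  8002  Enabled  Available  Running
2   Child   ip-10-212-123-678  8002  Enabled  Not Ready  Running
"""


def parse_status(output):
    """Turn the status table into a list of dicts, one per node."""
    rows = []
    for line in output.splitlines():
        # Split on runs of 2+ spaces so that "Not Ready" stays one field.
        fields = re.split(r"\s{2,}", line.strip())
        # Keep only data rows: seven fields starting with a numeric node ID.
        if len(fields) == len(COLUMNS) and fields[0].isdigit():
            rows.append(dict(zip(COLUMNS, fields)))
    return rows


def nodes_needing_attention(rows):
    """Return the IDs of nodes that are not Enabled, Available, and Running."""
    healthy = ("Enabled", "Available", "Running")
    return [r["id"] for r in rows
            if (r["enabled"], r["status"], r["process"]) != healthy]


if __name__ == "__main__":
    print(nodes_needing_attention(parse_status(SAMPLE)))
```

In the sample, only node 2 is reported, because its status is Not Ready.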
platfora-services sync
Synchronizes the Platfora software binaries and global configuration settings of the master to the worker
nodes.
Synopsis
platfora-services sync [-h]
Description
The platfora-services sync command is used to push software and configuration file settings
from the master node to the worker nodes.
Required Arguments
No required arguments.
Optional Arguments
-h | --help
Shows the command-line syntax help and then exits.
Examples
Push configuration file changes and software binaries from the master to the workers:
$ platfora-services sync
platfora-syscapture
Captures the Platfora log files, configuration files, metadata catalog, and system environment
information needed by Platfora Customer Support to troubleshoot issues.
Synopsis
platfora-syscapture [--all | --last "number time_units"] [ --hostsfile
filename [--child] ] [--tempdir] [--with-catalog] [-h] [-q] [-v] [-V]
Description
The platfora-syscapture utility captures files needed by Platfora Customer Support, and creates
a compressed tar file in the current directory. It captures the following information from your Platfora
installation:
• The Platfora server log files. By default only master log files from the past 7 days are captured.
• The Platfora configuration files.
• The Hadoop configuration files used by Platfora.
• The OS settings on the master host (provided that /sbin/sysctl is in your PATH).
• System resource information such as memory and CPU.
• The version of Java you are using.
• Optionally, a database dump of the Platfora metadata catalog database.
• Optionally, the list of files in the Platfora directory of DFS.
• Optionally, your Platfora data directory.
Required Arguments
No required arguments.
Optional Arguments
--all
Captures all log files. By default, only log files that have changed within the past 7 days are captured.
--last number time_units
Captures only the log files within the specified time period (relative to now). By default, only log files
that have changed within the past 7 days are captured. Allowed time units are weeks, days, hours,
or minutes.
--outfile filename
The file where you want to store the syscapture data.
--child
If --hostsfile is used, also captures worker node configuration files in addition to the log files.
--tempdir
Specifies a temporary directory for writing interim results. Defaults to the $PLATFORA_DATA_DIR
directory. The utility automatically cleans up the temporary directory whether the capture succeeds or fails.
--with-catalog
Captures the contents of the Platfora metadata catalog database.
--with-telemetry
Captures the telemetry data for your Platfora instance.
--with-dfs-ls
Includes a DFS directory listing with the capture.
--datadir
Captures the contents of the Platfora data directory.
-h | --help
Shows the command-line syntax help and then exits.
-q | --quiet
Runs in quiet mode. Does not send output messages to STDOUT.
-v, -vv, -vvv | --verbose
Runs in verbose mode. Shows all output messages.
-V | --version
Shows the software version information and then exits.
Examples
Capture log files on the Platfora master for the last 2 days:
$ platfora-syscapture --last "2 days"
Capture log files for the last 36 hours on the Platfora master and on the worker nodes (as named in the
hosts file):
$ platfora-syscapture --last "36 hours" --hostsfile /home/platfora/worker_nodes.txt
platfora-syscheck
Checks the operating system on the master and worker nodes.
Synopsis
platfora-syscheck [--skipdb] [-h] [-q] [-v] [-V]
Description
The platfora-syscheck utility verifies that the operating system environment on each Platfora node
(master and workers) meets the requirements needed to run the Platfora server software. It performs the
following checks:
• Verifies that the installation package is not corrupt by doing a checksum of the files in $PLATFORA_HOME.
• Verifies that the required Unix OS utilities are installed and can be found in the $PATH.
• Verifies that ulimit is sized appropriately.
• Verifies that ssh keys were correctly configured. This checks against the local host, the fully qualified
domain name, and the hostname.
• Verifies that a compatible Java Runtime Environment (JRE) is installed.
• Verifies that a compatible version of PostgreSQL is installed, and the system shared memory settings
are sized appropriately for PostgreSQL.
• Reports the amount of free disk space in the directory specified by the $PLATFORA_DATA_DIR
environment variable. If $PLATFORA_DATA_DIR is not set, checks the disk space of the current
user's home directory.
The utility does not check disabled nodes.
Required Arguments
No required arguments.
Optional Arguments
--skipdb
Skips the database-related checks. This option can be used when verifying the operating system
environment of a Platfora worker node, since the PostgreSQL database software is only required on the
Platfora master.
-h | --help
Shows the command-line syntax help and then exits.
-q | --quiet
Runs in quiet mode. Does not send output messages to STDOUT.
-v, -vv, -vvv | --verbose
Runs in verbose mode. Shows all output messages.
-V | --version
Shows the software version information and then exits.
Examples
Run a system check on the Platfora master:
$ export PLATFORA_DATA_DIR=/home/platfora/PLATFORA_DATA
$ platfora-syscheck
cmd line: /usr/local/platfora/current/bin/platfora-syscheck
Verifying System Requirements
Checking integrity of binaries......[SUCCESS]
Checking unix utilities......[SUCCESS]
Checking file and directory permissions......[SUCCESS]
Checking ssh to localhost......[SUCCESS]
Checking java version......[SUCCESS]
Checking postgres version......[SUCCESS]
Checking shared memory settings......[SUCCESS]
System Resources:
Platfora Data Directory fs has xx GB free space.
System Memory total: xxMB used: xxMB free: xxMB
Appendix B: Glossary
The glossary defines Platfora product terminology and concepts.
Topics:
• aggregate lens
• aggregation
• Amazon EMR
• Amazon S3
• categorical data
• column
• computed field
• CSV
• data catalog
• dataset
• data source
• derived dataset
• dimension dataset
• dimension
• distributed file system
• drill down
• elastic dataset
• entity-centric data model
• event
• event series lens
• expression
• fact dataset
• fact-centric data model
• field
• filter
• focus
• funnel
• geographic analysis
• geo map
• geo reference
• granularity
• Hadoop
• HDFS
• Hive
• key
• location field
• lens
• MapReduce
• measure
• quantitative data
• reference
• regular expressions
• ROLLUP measure
• row
• segment
• visualization (viz)
• vizboard
aggregate lens
An aggregate lens contains a selection of measure and dimension fields chosen from the focal point of a
single transactional (or fact) dataset. A completed or built lens can be thought of as a table that contains
aggregated measure data values grouped by the selected dimension values. An aggregate lens can be
built from any dataset. There are no special data modeling requirements to build an aggregate lens.
aggregation
An aggregation is the result of a function that takes all values of a numeric column, and returns a single
value of more significant meaning or measurement. An aggregate function groups the values of multiple
rows together based on some defined input expression.
Examples of aggregate functions include SUM, COUNT, DISTINCT, MIN, MAX, and STDDEV. In
Platfora, measure fields are always the result of an aggregation.
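The idea of grouping rows and reducing each group to a single value can be sketched in a few lines of Python. This is only an illustration of the general concept, not Platfora's implementation; the sample rows and the aggregate helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical sales rows: (region, amount). Not taken from any Platfora dataset.
rows = [
    ("west", 120.0),
    ("west", 80.0),
    ("east", 50.0),
    ("east", 25.0),
    ("east", 75.0),
]


def aggregate(rows):
    """Group amounts by region and return SUM, COUNT, MIN, and MAX per group."""
    groups = defaultdict(list)
    for region, amount in rows:
        groups[region].append(amount)
    return {
        region: {
            "sum": sum(values),
            "count": len(values),
            "min": min(values),
            "max": max(values),
        }
        for region, values in groups.items()
    }


result = aggregate(rows)
```

Each entry in result holds one aggregated value per function for a region, which is the shape a measure takes once it is grouped by a dimension.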
Amazon EMR
Amazon Elastic MapReduce (Amazon EMR) is a Hadoop framework hosted by Amazon Web Services
(AWS). It utilizes Amazon Elastic Compute Cloud (Amazon EC2) for compute resources and Amazon
Simple Storage Service (Amazon S3) for data storage.
Platfora can be configured to use Amazon EMR as its backend Hadoop processing framework, and
Amazon S3 as its primary data source and storage system.
Amazon S3
Amazon Simple Storage Service (Amazon S3) is a data storage service provided by Amazon Web
Services (AWS).
It is a distributed file system hosted by Amazon where you pay a monthly fee for storage space and data
transfer bandwidth. Data transfer is free between S3 and Amazon Elastic Compute Cloud (EC2) clusters,
making S3 an attractive choice for users who run Hadoop clusters on AWS or utilize the Amazon EMR
service.
Hadoop supports two S3 file system protocols as an alternative to HDFS: S3 Native File System (s3n)
and S3 Block File System (s3). Platfora supports the S3 Native File System (s3n) only.
categorical data
Categorical data is data with unconnected data points that can be represented in a visualization as
a categorical grouping or individual data point. Categorical data is countable and often finite (for
example, the number of products sold or the number of people in a city). In Platfora, categorical values
in a visualization are evenly spaced by sort order. By default, dimension fields in a visualization are
categorical, but numeric or datetime dimensions can be changed to quantitative. Categorical data is
sometimes referred to as discrete data.
column
A column is a set of data values of a particular data type, with one value for each row in the dataset.
Columns provide the structure for composing a row. The terms column and field are often used
interchangeably, although many consider it more correct to use field to refer specifically to the single
item that exists at the intersection of one row and one column.
computed field
A computed field generates its values based on a calculation or condition, and returns a value for
each input row. Values are computed based on expressions that can contain values from other fields,
constants, mathematical operators, comparison operators, or built-in row functions.
Computed fields are useful for deriving meaningful values from base fields (such as calculating
someone's age based on their birthday), doing data cleansing and pre-processing (such as grouping
similar values together or substituting one value for another), or for computing new data values based
on a number of input variables (such as calculating a profit margin value based on revenue and costs).
A computed field that does an aggregate calculation is called a measure, which is a special kind of
computed field in Platfora.
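As a rough illustration of row-level computed fields, the following Python sketch derives an age and a profit margin for each input row. It does not use Platfora's expression language; the function names, field names, and sample values are assumptions made for the example.

```python
from datetime import date


def age_in_years(birthday, today):
    """Row-level calculation: whole years elapsed between birthday and today."""
    had_birthday = (today.month, today.day) >= (birthday.month, birthday.day)
    return today.year - birthday.year - (0 if had_birthday else 1)


def profit_margin(revenue, cost):
    """Row-level calculation: margin expressed as a fraction of revenue."""
    return (revenue - cost) / revenue


# Hypothetical input rows with base fields only.
rows = [
    {"birthday": date(1980, 7, 15), "revenue": 200.0, "cost": 150.0},
    {"birthday": date(1995, 1, 2), "revenue": 80.0, "cost": 20.0},
]

as_of = date(2015, 6, 28)
for row in rows:
    # Each computed field returns exactly one value per input row.
    row["age"] = age_in_years(row["birthday"], as_of)
    row["margin"] = profit_margin(row["revenue"], row["cost"])
```

Unlike an aggregate, these calculations never combine values across rows; every output value depends only on the fields of its own row.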
CSV
Comma-separated values (CSV) is a plain text file format for describing tabular data. CSV, in general,
refers to any file that is plain text (typically ASCII or Unicode characters), has one record per line, has
records divided into fields separated by delimiters (typically a comma), and has the same sequence of
fields for every record.
Within these general constraints, there are many variations of CSV in use. For example, some CSV
formats use quotation marks around field values, some use delimiters other than a comma (such as a
tab or a semi-colon), and some reserve the very first line of the file as a header of field names. Platfora
supports the typical CSV formatting conventions, and allows for some configuration to support different
variations.
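To make the variations concrete, the sketch below parses two small CSV samples with Python's standard csv module: one comma-delimited with a quoted field value, and one semicolon-delimited. Both samples treat the first line as a header. The sample data and the read_records helper are hypothetical and say nothing about how Platfora itself parses files.

```python
import csv
import io

# Comma-delimited, with quotation marks protecting a comma inside a value.
comma_text = 'name,city\n"Smith, Ann",San Mateo\n'
# Same structure, but using a semicolon as the delimiter.
semicolon_text = "name;city\nBob;Oakland\n"


def read_records(text, delimiter=","):
    """Parse delimited text whose first line is a header of field names."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))


comma_rows = read_records(comma_text)
semicolon_rows = read_records(semicolon_text, delimiter=";")
```

Both calls yield one record with the same two fields, which shows why the delimiter and quoting conventions need to be known before a file can be mapped to rows and columns.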
data catalog
The data catalog is a collection of data items available and visible to Platfora users. Data administrators
build the data catalog by defining and modeling datasets in Platfora that point to source data in Hadoop.
When users request data from a dataset, that request is materialized in Platfora as a lens. The data
catalog shows all of the datasets (data available for request) and lenses (data that is ready for analysis)
that have been created by Platfora users.
dataset
A dataset is a collection of external data files residing in a data source that can be described in table
form (rows and columns). Source data is mapped into Platfora by creating a dataset definition.
A dataset definition describes the rows and columns, the base fields and their associated data types,
computed fields, measure aggregations, and references (or joins) to other related datasets. The collection
of dataset definitions make up the data catalog (the data items available to Platfora users).
data source
A data source is a connection to a mount point or directory on an external data server, such as a file
system or database server. Platfora currently provides data source adapters for Hive, HDFS, Amazon S3,
and MapR FS.
Platfora has one default data source named Uploads (for data files that you upload from your local file
system). This default data source resides in the distributed file system (DFS) that the Platfora server is
configured to use as its primary data source.
derived dataset
A dataset whose underlying data is produced from the results of a Platfora lens query or visualization.
There are two types of derived datasets: static (lens query results are saved to a static file) and dynamic
(lens query results are refreshed each time the lens is rebuilt).
A derived dataset allows you to save the query results from a lens as a new dataset in Platfora. Once a
derived dataset is saved, you can use it as you would any other dataset in Platfora: you can edit it, add
additional computed fields, and join it by reference to other datasets in the Platfora data catalog.
dimension dataset
A type of dataset that has a primary key and contains attributes (additional dimension fields) that
describe some aspect of a fact or event record (such as a person, item, date, etc.). Dimension datasets are
referenced by a fact dataset.
dimension
A dimension is a type of field (or a collection of fields) that allows you to analyze a measure from
different perspectives to derive meaning from the data. Dimensions are used to summarize, filter,
categorize, and group quantitative measure data in order to answer business questions.
For example, a product dimension can help you understand which products generate the most revenue
for your business. A date dimension can show you the breakdown of sales by year, quarter, month, or
day.
Dimension fields can be character-type data (such as product categories), datetime-type data (such as
months, days, or hours), or categorical numeric-type data (such as customer ratings on a scale of 1-10).
distributed file system
A distributed file system (DFS) is any file system that allows access to files from multiple hosts over
a computer network. It makes it possible for multiple machines and users to share files and storage
resources. HDFS is the primary distributed file system for Hadoop; however, Hadoop supports other
distributed file systems as well, such as Amazon S3.
drill down
Drill down (or drill up) is a data analysis technique for navigating from the most summarized to the most
detailed categorization of a particular dimension.
Drill down allows exploration of multi-dimensional data by moving from one level of detail to the next.
A drill-down path is defined by specifying a hierarchy of categories for a dimension or between related
dimensions. For example, a date dimension might have categories defined for year, quarter, month,
week, day, and so on. A product dimension might have categories defined for division, type, and model.
Drill-down levels depend on the granularity of the fields available in the source data.
elastic dataset
Elastic datasets are a special kind of dataset used for entity-centric data modeling in Platfora. They are
used to consolidate unique key values from other datasets into one place for the purpose of defining
segments or event series lenses. They are elastic because the data they contain is dynamically generated
at lens build time.
Elastic datasets are not backed by source files like regular datasets. Instead, they consolidate the unique
foreign keys from any dataset that points to them via a reference. Because they do not contain any records of
their own, elastic datasets cannot be used as the focus for an aggregate lens or event.
entity-centric data model
An entity-centric data model 'pivots' a fact-centric data model to focus an analysis around a particular
dimension (or entity). Modeling the data in this way allows you to do event series analysis and segment
analysis in Platfora.
For example, modeling different fact datasets around a central customer dataset allows you to analyze
different aspects of a customer's behavior. Instead of asking "how many customers visited
my web site?" (fact-centric), you could ask questions like "which customers visit my site more than once
a day?" (entity-centric).
event
An event is similar to a reference, but the direction of the join is reversed. An event joins the primary
key field(s) of a dimension dataset to the corresponding foreign key field(s) in a fact dataset, and
designates a timestamp field for ordering the event records.
event series lens
An event series lens contains a selection of dimension fields chosen from the focal point of a single
entity dataset, including any fields from event datasets associated with that entity. A completed or built
lens can be thought of as a table that contains individual event records partitioned by the primary key of
the entity dataset, and ordered by time.
An event series lens can only be built from datasets that have at least one event reference defined in
them. It contains non-aggregated event records of various types, partitioned by some common entity
(typically a user id), and sorted by the time the events occurred. Choose this lens type if you specifically
want to do funnel analysis.
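The partition-then-order structure described above can be sketched in Python as follows. This is a conceptual illustration only; the event tuples, the integer timestamps, and the event_series helper are hypothetical and do not reflect how Platfora stores a lens.

```python
from collections import defaultdict

# Hypothetical event records: (user_id, timestamp, event_type).
events = [
    ("u2", 30, "purchase"),
    ("u1", 20, "page_view"),
    ("u1", 10, "registration"),
    ("u2", 5, "page_view"),
]


def event_series(events):
    """Partition events by user_id and order each partition by timestamp."""
    partitions = defaultdict(list)
    for user_id, timestamp, event_type in events:
        partitions[user_id].append((timestamp, event_type))
    # Sorting the (timestamp, event_type) tuples orders each partition by time.
    return {user: sorted(items) for user, items in partitions.items()}


series = event_series(events)
```

The records stay non-aggregated: every input event appears exactly once in the output, grouped under its entity key.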
expression
An expression computes or produces a value by combining fields (or columns), constant values,
operators, and functions.
An expression's result can be any data type, such as numeric, string, datetime, or Boolean (true/false)
values. Simple expressions can be a single constant value, field (or column), or a function call. You can
use operators to join two or more simple expressions into a complex expression.
fact dataset
In multidimensional data models, a fact dataset (or table) contains records (or rows), each of which represents a
single real-world event that has occurred (such as a sales transaction, a page view, a user registration, an
airline flight, and so on).
A fact record contains the quantitative measure data (such as the dollar amount of a sale), and several
descriptive attributes (or dimensions) that give the measure context (such as the date, the customer, the
product, and so on). Facts are stored at a uniform level of detail (or grain) within a fact dataset.
fact-centric data model
A fact-centric data model is centered around a particular real-world event that has happened, such as
web page views or sales transactions. Datasets are modeled so that a central fact dataset is the focus of
an analysis, and dimension datasets are referenced to provide more information about the fact. In data
warehousing and business intelligence (BI) applications, this type of data model is often referred to as a
star schema.
For example, you may have web server logs that serve as the source of your central fact data about pages
viewed on your web site. Additional dimension datasets can then be related (or joined) to the central fact
to provide more in-depth analysis opportunities.
field
A field is an atomic unit of data that has a name, a value, a data type, and a role of either dimension or
measure. When working with visualizations, fields are the same thing as the dimensions and measures
used to analyze the data.
Fields describe a single aspect of a record (or row) in a dataset. An order record, for example, might
contain an order date field, a product name field, a quantity field, and so on. All records in a dataset have
exactly the same fields, although the values in each field vary from record to record.
filter
A filter is a field value or expression used as a condition for limiting the data that is selected from a lens
and shown in a visualization. A filter can be applied to a visualization to exclude (or include) data that
meets the filter criteria.
For example, if you wanted to show only the sales for the US west coast, you could use the state field
as a filter and just include the values for California, Oregon and Washington. All of the other values for
state would then be filtered out (not shown in the visualization).
focus
A focus sets the central topic for a data exploration and analysis. You set a focus by choosing a single
dataset from the data browser.
For example, if you wanted to explore the characteristics of users who registered on your web site in the
past month, you might choose the user dataset or the registration dataset as the focus of your analysis.
Choosing a focus allows you to find or build a lens of optimized data to work with in a visualization.
funnel
A funnel is a visual analysis type that tracks users' (entities') behavior across a sequence of events, with
each step in the sequence defined as a stage.
Each funnel stage shows progressively decreasing proportions of the original set of users. The first stage
has 100% of the original group of users by definition.
A funnel is always based on an event series lens. The users in the funnel are from the focus dimension
dataset in the lens, and their behaviors are from one or more fact datasets in the lens. The funnel
analyzes their behaviors performed sequentially and counts the number of users that meet the criteria
defined in each stage.
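A simplified way to picture sequential stage counting is shown below in Python. It assumes each user's events are already ordered by time and that a stage is matched by event type alone. Both are assumptions made for this sketch, and the funnel_counts helper is hypothetical rather than part of Platfora.

```python
def funnel_counts(series, stages):
    """Count users who reach each stage, requiring stages to occur in order."""
    counts = [0] * len(stages)
    for events in series.values():
        position = 0  # index of the next stage this user must match
        for event_type in events:
            if position < len(stages) and event_type == stages[position]:
                counts[position] += 1
                position += 1
    return counts


# Hypothetical time-ordered event types per user.
series = {
    "u1": ["visit", "add_to_cart", "purchase"],
    "u2": ["visit", "purchase"],
    "u3": ["visit", "add_to_cart"],
    "u4": ["add_to_cart", "visit"],
}
stages = ["visit", "add_to_cart", "purchase"]

counts = funnel_counts(series, stages)
# Express each stage as a proportion of the users in the first stage.
proportions = [count / counts[0] for count in counts]
```

Because a user can only advance to a stage after matching the previous one, the counts can never increase from one stage to the next.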
geographic analysis
Geographic analysis is a type of data analysis that involves understanding the role that location plays in
the occurrence of other factors. By looking at the geo-spatial distribution of data on a map, analysts can
see how location impacts different variables.
In Platfora, geographic analysis is performed in a geo map viz type.
geo map
A geo map is a viz type that allows analysts to perform geographic analysis on a lens that contains
location data. It includes the Geography drop zone that places positions (using a location field) on a
map background.
Geo map visualizations can be made from an aggregate lens that has at least one location field included.
geo reference
A geo reference is a special type of reference to a dataset that contains a location field.
In addition to at least one location field, the dataset referenced in a geo reference typically contains
primarily location data. For example, this might include population, voting district information, or
government data.
granularity
The granularity of data refers to the fineness with which data fields are subdivided, and the level of
detail at which data is stored within a dataset or lens. For example, a postal address can be recorded with
low granularity, with the entire address in one field (address=123 Main St. San Mateo, CA 94403),
or with higher granularity, with the fields broken out (address=123 Main St., city=San Mateo, state=CA,
zipcode=94403).
Hadoop
Hadoop is an open-source software framework designed for storing and processing large amounts of
complex, structured, and semi-structured data. It is a distributed system, meaning it runs on a collection
of commodity, shared-nothing servers. Hadoop consists of two key services: a distributed file system (HDFS)
for data storage and MapReduce for parallel data processing.
HDFS
Hadoop Distributed File System (HDFS) is the primary storage system for Hadoop applications. It is a
distributed file system, meaning it runs on a collection of commodity servers.
An HDFS cluster usually consists of a NameNode (the metadata management node that manages access
to files and directories) and multiple DataNodes (the storage nodes where file data resides). HDFS
creates multiple replicas of a file's data storage blocks and distributes them throughout the cluster to
enable extremely fast data processing. Platfora can be configured to use HDFS as its primary data
source.
Hive
Hive is an execution engine for Hadoop that lets you write data queries in an SQL-like language called
Hive Query Language (HQL). Hive allows you to create tables by describing the structure of files
residing in HDFS.
Platfora can use a Hive metastore server as a data source, and map a Hive table definition to a Platfora
dataset definition. Platfora uses the Hive table definition to obtain metadata about the source data, such
as which files to process, the parsing logic for rows and columns, and the field names and data types
contained in the source data. It is important to note that Platfora does not execute queries through Hive;
it only uses Hive tables to obtain the metadata needed for defining datasets. Platfora generates and runs
its own MapReduce jobs directly in Hadoop.
key
A key is a single field (or combination of fields) that uniquely identifies a row in a dataset, similar to a
primary key in a relational database. A dataset must have a key defined to be the target of a reference.
location field
A location field is a dataset field encoded with a complex datatype that includes geo coordinate
information (latitude and longitude) and a label that associates a location name with the coordinates.
Location fields are defined in the dataset. When defining a location field in a dataset, you can
optionally use the values in an existing dataset field as the location field label. If no label is defined,
Platfora creates a unique string from the coordinates as the label name (for example @(122.33063°W,
37.541886°N)). Use a location field in a geo map viz to place positions on a map.
lens
A lens is a type of data storage that is specific to Platfora. Platfora uses Hadoop as its data source and
processing engine to build and store its lenses. Once a lens is built, this prepared data is copied to
Platfora, where it is available for analysis. A lens can be thought of as a dynamic, on-demand data mart
purpose-built for a specific analysis project.
Platfora generates MapReduce jobs to pull the requested data from the Hadoop source system, and
prepares the data for fast, ad-hoc visual analysis. As users build visualizations, lens data is loaded into
memory on a column-by-column basis as it is needed.
Platfora has two types of lenses you can build: an aggregate lens or an event series lens. The type of lens
you build determines what kinds of visualizations you can create and what kinds of analyses you can
perform when using the lens in a vizboard.
MapReduce
MapReduce is a data-flow programming model for processing large amounts of data on a cluster of
commodity servers. It passes data items from one stage of processing to the next using user-defined
criteria (or jobs).
The MapReduce engine acts as an abstraction, allowing programmers to focus on their desired data
computations. The details of parallelism, distribution, load balancing and fault tolerance are all handled
by the MapReduce framework. Platfora defines and runs MapReduce jobs on the source data in Hadoop
based on the dataset and lens definitions created by Platfora users. The output of the MapReduce jobs executed
by Platfora is stored both in HDFS and in Platfora.
MapReduce jobs typically start with a large data file that is broken down into smaller pieces called
splits, which are similar to database rows. Each split is parsed into key/value pairs (similar to fields)
and processed by the user-defined map criteria. The output of the map processing stage is then passed to
the reduce processing stage, which does final grouping and aggregation. Each stage of processing uses
parallelism to enable many map and reduce tasks to run at the same time on multiple machines.
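The map, group, and reduce flow can be imitated in plain Python to show how the stages fit together. This sketch runs sequentially in a single process and does not use Hadoop; the function names and the page-view records are hypothetical.

```python
from collections import defaultdict


def map_stage(split):
    """Emit one (key, value) pair per record in a split."""
    for page, seconds in split:
        yield page, seconds


def reduce_stage(key, values):
    """Aggregate all values that share a key."""
    return key, sum(values)


def run_job(splits):
    """Run the map stage over every split, group by key, then reduce."""
    grouped = defaultdict(list)
    for split in splits:  # in Hadoop, each split could go to a separate map task
        for key, value in map_stage(split):
            grouped[key].append(value)  # group map output by key before reducing
    return dict(reduce_stage(key, values) for key, values in grouped.items())


# Hypothetical input, already divided into two splits of (page, seconds) records.
splits = [
    [("home", 5), ("cart", 7)],
    [("home", 3)],
]

totals = run_job(splits)
```

In a real cluster the framework, not the programmer, decides where each map and reduce task runs; only the two stage functions carry the job-specific logic.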
measure
A measure is a numeric value representing an aggregation of some dataset metric (such as total dollars
sold, average number of users, and so on). To create measures, you add computed fields to a dataset or a
lens.
When a lens is built, the build calculates any measures and stores them in the lens. In a visualization,
measures provide the basis for quantitative analysis.
Measures represent a set of real-world events (or facts) and typically answer "how" questions about data
such as how many or how long? If you are familiar with SQL, measure values come from the aggregate
functions such as SUM(), COUNT(), MAX(), MIN(). Measure fields are typically derived from numeric
fields in a dataset, and their values are always the result of an aggregation (average, count, sum, min,
max, and so on).
quantitative data
Quantitative data can be characterized as a sequence or progression of values with connected data points
that can be represented as an unbroken line in a visualization. Quantitative fields usually have values
that can be shown in ordered progression, such as height, speed, or duration measurements. Quantitative
values are placed on a continuous axis, always displayed from low to high. In Platfora, measure data
is always quantitative, but numeric or datetime dimensions can be either quantitative or categorical.
Quantitative data is sometimes referred to as continuous data.
reference
A reference allows two datasets to be joined together on one or more fields that they share in common.
A reference creates a link from a field in one dataset to the primary key of another dataset.
Reference fields are typically created in a fact dataset, and point to a dimension dataset. Creating a
reference allows the datasets to be joined when building lenses or segments, similar to a foreign key to
primary key relationship in a relational database.
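The foreign key to primary key relationship can be sketched in Python as a lookup from a fact row into a keyed dimension. The datasets and the join_by_reference helper are hypothetical, and dropping fact rows that have no match is an assumption of this example, not a statement about Platfora's join behavior.

```python
# Dimension dataset, keyed by its primary key (customer_id).
customers = {
    1: {"name": "Ann", "state": "CA"},
    2: {"name": "Bob", "state": "OR"},
}

# Fact dataset; customer_id is the reference field pointing at the dimension.
sales = [
    {"sale_id": 10, "customer_id": 1, "amount": 25.0},
    {"sale_id": 11, "customer_id": 2, "amount": 40.0},
    {"sale_id": 12, "customer_id": 1, "amount": 15.0},
]


def join_by_reference(facts, dimension, ref_field):
    """Attach the dimension fields to each fact row that has a matching key."""
    joined = []
    for fact in facts:
        match = dimension.get(fact[ref_field])
        if match is not None:  # assumption: unmatched fact rows are skipped
            joined.append({**fact, **match})
    return joined


joined = join_by_reference(sales, customers, "customer_id")
```

After the join, each sale row also carries the customer's name and state, so sales can be grouped or filtered by those dimension fields.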
regular expressions
Regular expressions, also referred to as regex or regexp, are a standardized collection of special
characters and constructs used for matching strings of text. They provide a flexible and precise language
for matching particular characters, words, or patterns of characters.
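As a small illustration, the following Python snippet uses the standard re module to find a five-digit ZIP code in an address string. Regular expression syntax varies somewhat between engines, so treat this as an example of the general idea rather than of the exact syntax any particular product accepts; the pattern and sample string are chosen for the example.

```python
import re

# \b marks a word boundary and \d{5} matches exactly five digits,
# so shorter digit runs such as the street number are not matched.
ZIP_PATTERN = re.compile(r"\b\d{5}\b")

address = "123 Main St. San Mateo, CA 94403"
zip_codes = ZIP_PATTERN.findall(address)
```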
ROLLUP measure
ROLLUP is a modifier to an aggregate expression that allows you to define complex measure
expressions, such as windowed, partitioned, or adaptive measure expressions. This is useful when you
want to compute an aggregation for a subset of rows within the overall result of a viz query. It allows
you to compute things such as running totals, moving averages, benchmark comparisons, rank ordering,
percentiles, and so on.
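Two of the calculations mentioned above, a running total and a moving average, can be sketched in Python to show what "an aggregation over a subset of rows within the overall result" means. This does not use or imitate the ROLLUP syntax; the sample values and the moving_average helper are hypothetical.

```python
from itertools import accumulate

# Hypothetical query result: one aggregated sales value per month, in order.
monthly_sales = [100, 150, 50, 200]

# Running total: each position sums every value up to and including itself.
running_total = list(accumulate(monthly_sales))


def moving_average(values, window):
    """Average each value with up to (window - 1) preceding values."""
    result = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        result.append(sum(chunk) / len(chunk))
    return result


two_month_average = moving_average(monthly_sales, 2)
```

In both cases the output has one value per input row, but each value is computed from a window of neighboring rows rather than from the row alone.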
row
A row represents a single object or record in a dataset. A dataset or lens consists of rows of columns
(or fields).
Each row represents a set of related data, and every row has the same structure. For example, in a dataset
that represents customers, each row would represent a single customer. Columns might represent things
like customer name, email address, gender, age, and so on.
segment
A segment is a special type of dimension field that you can create to group together members of a
population that meet some defined common criteria. A segment is based on members of a dimension
dataset (such as customers) that have some behavior in common (such as purchasing a particular
product).
In Platfora, a segment is always based on a dimension (or referenced) dataset, and must include at
least one condition from a fact or event dataset. For example, customers who are female would not be
considered a valid segment; however, customers who are female and who made a purchase would be. The
members of a segment do not just share common attributes; they also share a common behavior or
action.
Behind the scenes, segments are saved as a special type of lens that can be used and updated
independently of the lens that they were created from. For example, you can create a segment from a
customer purchases lens but then use that segment in a different customer support calls lens. As long as
the lenses have a conforming dimension in common (such as customer), then segments can be used to
compare behaviors of a group of individuals across multiple fact or event datasets.
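The rule that a segment combines a dimension attribute with a behavior from a fact dataset can be sketched in Python. The datasets and the build_segment helper are hypothetical and only illustrate the membership logic, not how Platfora saves segments.

```python
# Dimension dataset, keyed by customer_id.
customers = {
    1: {"name": "Ann", "gender": "F"},
    2: {"name": "Bob", "gender": "M"},
    3: {"name": "Cal", "gender": "F"},
}

# Fact dataset recording a behavior (a purchase) per customer.
purchases = [
    {"customer_id": 1, "product": "widget"},
    {"customer_id": 2, "product": "widget"},
]


def build_segment(customers, purchases, gender):
    """Return the keys of customers with the given attribute and a purchase."""
    purchasers = {purchase["customer_id"] for purchase in purchases}
    return {
        customer_id
        for customer_id, customer in customers.items()
        if customer["gender"] == gender and customer_id in purchasers
    }


segment = build_segment(customers, purchases, "F")
```

Customer 3 shares the attribute but not the behavior, so only customer 1 is a member. Because the segment is a set of dimension keys, it could be reused against any other fact dataset that references the same customer dimension.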
visualization (viz)
A visualization (or viz for short) is a graphical representation of certain data fields chosen from the
perspective of a single Platfora lens. It is a query of lens data that is visually rendered based on the types
of fields chosen (measure or dimension), their order and placement in the Builder drop zones, and the
various appearance encodings applied to the data (color, size, shape, and so on).
A viz shows aggregated measure data grouped and filtered by the chosen dimensions. A chart in Platfora
can best be described as a recipe of dimension and measure fields, plus axis placement (X-axis and
Y-axis), plus appearance encodings (Color, Size, Shape, Opacity, Labels), plus mark type (Point, Line, Bar,
Area, and so on).
vizboard
A vizboard is the starting point for data analysis, and can be thought of as a dashboard or project
workspace. The vizboard is the canvas for discovering and sharing data insights.
A vizboard contains one or more pages of visualizations that together are meant to tell a data story. The
individual visualizations on a vizboard page can be related (use the same underlying data), or unrelated
(use completely different data). A vizboard can be saved, versioned, and shared with others.