Platfora Installation Guide
Version 4.5
For On-Premise Hadoop Deployments
Copyright Platfora 2015
Last Updated: 10:14 p.m. June 28, 2015

Contents

Document Conventions
Contact Platfora Support
Copyright Notices
Chapter 1: Installation Overview (On-Premise)
  On-Premise Hadoop Deployments
  Master vs Worker Node Installations
  Preinstall Check List
  High-Level Install Steps
Chapter 2: System Requirements (On-Premise)
  Platfora Server Requirements
  Port Configuration Requirements
    Ports to Open on Platfora Nodes
    Ports to Open on Hadoop Nodes
  Supported Hadoop and Hive Versions
  Hadoop Resource Requirements
  Browser Requirements
Chapter 3: Configure Hadoop for Platfora Access
  Create Platfora User on Hadoop Nodes
  Create Platfora Directories and Permissions in Hadoop
  HDFS Tuning for Platfora
    Increase Open File Limits
    Increase Platfora User Limits
    Increase DataNode File Limits
    Allow Platfora Local Access
  MapReduce Tuning for Platfora
  YARN Tuning for Platfora
Chapter 4: Install Platfora Software and Dependencies
  About the Platfora Installer Packages
  Install Using RPM Packages
    Install Dependencies RPM Package
    Install Optional Security RPM Package
    Install Platfora RPM Package (Master Only)
  Install Using the TAR Package
    Create the Platfora System User
    Set OS Kernel Parameters
    Install Dependent Software
    Install Platfora TAR Package (Master Only)
    Install PDF Dependencies (Master Only)
Chapter 5: Configure Environment on Platfora Nodes
  Install the MapR Client Software (MapR Only)
  Configure Network Environment
    Configure /etc/hosts File
    Verify Connectivity Between Platfora Nodes
    Verify Connectivity to Hadoop Nodes
    Open Firewall Ports
  Configure Passwordless SSH
    Verify Local SSH Access
    Exchange SSH Keys (Multi-Node Only)
  Synchronize the System Clocks
  Create Local Storage Directories
  Verify Environment Variables
Chapter 6: Configure Platfora for Secure Hadoop Access
  About Secure Hadoop
  Configure Kerberos Authentication to Hadoop
    Obtain Kerberos Tickets for a Platfora Server
    Auto-Renew Kerberos Tickets for a Platfora Server
  Configure Secure Impersonation in Hadoop
Chapter 7: Initialize Platfora Master Node
  Connect Platfora to Your Hadoop Services
    Understand How Platfora Connects to Hadoop
    Obtain Hadoop Configuration Files
    Create Local Hadoop Configuration Directory
  Initialize the Platfora Master
    Configure SSL for Client Connections
    Configure SSL for Catalog Connections
    About System Diagnostic Data
  Troubleshoot Setup Issues
    View the Platfora Log Files
    Setup Fails Setting up Catalog Metadata Service
    TEST FAILED: Checking integrity of binaries
Chapter 8: Start Platfora
  Start the Platfora Server
  Log in to the Platfora Web Application
  Add a License Key
  Change the Default Admin Password
  Load the Tutorial Data
Chapter 9: Initialize a Worker Node
Appendix A: Command Line Utility Reference
  setup.py
  hadoop-check
  hadoopcp
  hadoopfs
  install-node
  platfora-catalog
    platfora-catalog ssl
  platfora-config
  platfora-export
  platfora-import
  platfora-license
    platfora-license install
    platfora-license uninstall
    platfora-license view
  platfora-node
    platfora-node add
    platfora-node config
  platfora-services
    platfora-services start
    platfora-services stop
    platfora-services restart
    platfora-services status
    platfora-services sync
  platfora-syscapture
  platfora-syscheck
Appendix B: Glossary

Preface

This guide provides information and instructions for installing and initializing a Platfora® cluster. It is intended for system administrators with knowledge of Linux/Unix system administration and basic Hadoop administration.

This on-premise installation guide is for data center environments (either physical or virtual data centers) that have a permanent, managed Hadoop cluster. Platfora is installed in the same network as your Hadoop cluster.

Document Conventions

This documentation uses certain text conventions for language syntax and code examples.

Convention | Usage | Example
$ | Command-line prompt that precedes a command to be entered in a command-line terminal session. | $ ls
$ sudo | Command-line prompt for a command that requires root permissions (commands will be prefixed with sudo). | $ sudo yum install open-jdk-1.7
UPPERCASE | Function names and keywords are shown in all uppercase for readability, but keywords are case-insensitive (can be written in upper or lower case). | SUM(page_views)
italics | Italics indicate a user-supplied argument or variable. | SUM(field_name)
[ ] (square brackets) | Square brackets denote optional syntax items. | CONCAT(string_expression[,...])
... (ellipsis) | An ellipsis denotes a syntax item that can be repeated any number of times. | CONCAT(string_expression[,...])

Contact Platfora Support

For technical support, you can send an email to: [email protected]

Or visit the Platfora support site for the most up-to-date product news, knowledge base articles, and product tips: http://support.platfora.com

To access the support portal, you must have a valid support agreement with Platfora. Please contact your Platfora sales representative for details about obtaining a valid support agreement or with questions about your account.

Copyright Notices

Copyright © 2012-15 Platfora Corporation. All rights reserved.

Platfora believes the information in this publication is accurate as of its publication date. The information is subject to change without notice.

THE INFORMATION IN THIS PUBLICATION IS PROVIDED “AS IS.” PLATFORA CORPORATION MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY KIND WITH RESPECT TO THE INFORMATION IN THIS PUBLICATION, AND SPECIFICALLY DISCLAIMS IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Use, copying, and distribution of any Platfora software described in this publication requires an applicable software license.

Platfora®, You Should Know™, Interest Driven Pipeline™, Fractal Cache™, and Adaptive Job Synthesis™ are trademarks of the Platfora Corporation. Apache Hadoop™ and Apache Hive™ are trademarks of the Apache Software Foundation. All other trademarks used herein are the property of their respective owners.
Embedded Software Copyrights and License Agreements

Platfora contains the following open source and third-party proprietary software subject to their respective copyrights and license agreements:
• Apache Hive PDK
• dom4j
• freemarker
• GeoNames
• Google Maps API
• javassist
• javax.servlet
• Mortbay Jetty 6.1.26
• OWASP CSRFGuard 3
• PostgreSQL JDBC 9.1-901
• Scala
• sjsxp : 1.0.1
• Unboundid

Chapter 1: Installation Overview (On-Premise)

This section provides an overview of the Platfora installation process for environments that will use an on-premise deployment of Hadoop with Platfora.

Topics:
• On-Premise Hadoop Deployments
• Master vs Worker Node Installations
• Preinstall Check List
• High-Level Install Steps

On-Premise Hadoop Deployments

An on-premise Hadoop deployment means that you already have a Hadoop installation in your data center (either a physical data center or a virtual private cloud). Platfora connects to the Hadoop cluster managed by your organization, and the majority of your organization's data is stored in the distributed file system of this primary Hadoop cluster.

For on-premise Hadoop deployments, the Platfora servers should be on their own dedicated hardware, co-located in the same data center as your Hadoop cluster. A data center can be a physical location with actual hardware resources, or a virtual private cloud environment with virtual server instances (such as Rackspace or Amazon EC2). Platfora recommends putting the Platfora servers on a network with at least 1 Gbps connectivity to the Hadoop nodes.

Platfora users access the Platfora master node using an HTML5-compliant web browser. The Platfora master node accesses the HDFS NameNode and the MapReduce JobTracker or YARN Resource Manager using native Hadoop protocols. The Platfora worker nodes access the HDFS DataNodes directly.
If using a firewall, Platfora recommends placing the Platfora servers on the same side of the firewall as your Hadoop cluster.

Platfora software can run on a wide variety of server configurations, from a single server to a cluster scaled across multiple servers. Since Platfora runs best with all of the active lenses readily available in RAM, Platfora recommends obtaining servers optimized for higher RAM capacity and a minimum of 8 CPUs.

Master vs Worker Node Installations

If you are installing Platfora for the very first time, you begin by installing, configuring, and initializing the Platfora master node. Once you have the master node up and running, you can then add worker nodes as needed.

All nodes in a Platfora cluster (master and workers) must meet the minimum system requirements and have the required prerequisite software installed. If you are using the RPM installer packages, you can use the base installer package to install the required software on each Platfora node. If you are using the TAR installer packages, you must manually install the required software on each Platfora node.

You only need to install the Platfora server software, however, on the master node. Platfora copies the server software from the master to the worker nodes during the worker node initialization process.

All nodes in a Platfora cluster also require you to configure the network environment so that all the nodes can talk to each other, as well as to the Hadoop cluster nodes.

If you are adding worker nodes to an existing Platfora cluster, make sure to follow the instructions for installing dependencies and configuring the environment. You can skip any tasks denoted as 'Master Only'. These tasks are only required for first-time installations of the Platfora master node.
Preinstall Check List

Here is a list of items and information you will need in order to install a new Platfora cluster with an on-premise Hadoop deployment. Platfora must be able to connect to Hadoop services during setup, so you will also need information from your Hadoop installation.

Platfora Checklist

This is a list of things you will need in order to install Platfora nodes.

What You Need | Description
Platfora License | Platfora Customer Support must issue you a license file. Trial period licenses are available upon request for pilot installations.
Platfora Software | A Platfora customer support representative can give you the download link to the Platfora installation package for your chosen operating system and Hadoop distribution and version. Platfora provides both rpm and tar installer packages.
(MapR Only) MapR Client Software | If you are using a MapR Hadoop cluster with Platfora, you will need the MapR client software for the version of MapR you are using. The MapR client software must be installed on all Platfora nodes.

Hadoop Checklist

This is a list of things you will need from your Hadoop environment in order to install Platfora.

What You Need | Description
Hadoop Distribution and Version Number | When you install Platfora, you need to specify what Hadoop distribution you have (Cloudera, Hortonworks, MapR, etc.) and what version you are running.
Hadoop Hostnames and Ports | You will need to know the hostnames and ports of your Hadoop services (NameNode, Resource Manager or JobTracker, Hive Server, DataNodes, etc.).
Hadoop Configuration Files | Platfora requires local versions of Hadoop's configuration files. It uses these files to connect to Hadoop services: core-site.xml and hdfs-site.xml for HDFS; mapred-site.xml and yarn-site.xml for data processing; hive-site.xml for the Hive metastore. The locations of these files vary based on your Hadoop distribution.
Platfora Data Directory Location in HDFS | Platfora requires a directory location in HDFS to store its library files and output (lenses).

High-Level Install Steps

This section lists the high-level steps involved in installing Platfora to work with an on-premise Hadoop cluster. Note that there are different procedures if you are installing a new Platfora cluster versus adding a worker node to an existing Platfora cluster.

New Platfora Installation

When installing Platfora for the first time, you begin by installing and configuring the Platfora master node. After the master node is installed, initialized, and connected to the Hadoop services it needs, you can use the master node to add worker nodes to the cluster.

These are the high-level steps for installing Platfora for the first time:
1. Make sure your systems meet the minimum System Requirements.
2. Configure Hadoop for Platfora Access.
3. Install Platfora Software and Dependencies.
4. Configure Environment on Platfora Nodes.
5. (Secure Hadoop Only) Configure Platfora for Secure Hadoop Access.
6. Obtain a Copy of Your Hadoop Configuration Files.
7. Configure Access to Your Hadoop Services.
8. Initialize the Platfora Master.
9. Start Platfora.
10. Log in to the Platfora Application.
11. Install the License File.
12. (Optional) Load the Tutorial Data (as a quick way to test that everything works).
13. Add Worker Nodes.

Additional Worker Node Installation

Once you have a Platfora master node up and running, you can use it to initialize additional worker nodes. Before you can initialize a worker node, however, you must make sure that it has the required dependencies installed.

These are the high-level steps for adding a worker node to an existing Platfora cluster:
1. Install only the prerequisite software directly on the worker node.
   • If using the RPM installer packages, Install Dependencies RPM Package.
   • If using the TAR installer packages, you must manually Create the Platfora System User, Set OS Kernel Parameters, and Install Dependent Software.
2. Configure Environment on Platfora Nodes.
3. (Secure Hadoop Only) Configure Kerberos Authentication to Hadoop.
4. Add Worker Node to Platfora Cluster.

Chapter 2: System Requirements (On-Premise)

The Platfora software runs on a scale-out cluster of servers. You can install Platfora on a single node to start, and then scale up storage and processing capacity by adding nodes. Platfora requires access to an existing, compatible Hadoop implementation in order to start. Users then access the Platfora application using a compatible web browser client.

This section describes the system requirements for on-premise deployments of the Platfora servers, Hadoop source systems, network connectivity, and web browser clients.

Topics:
• Platfora Server Requirements
• Port Configuration Requirements
• Supported Hadoop and Hive Versions
• Hadoop Resource Requirements
• Browser Requirements

Platfora Server Requirements

Platfora recommends the following minimum system requirements for Platfora servers. For multi-node installations, the master server and all worker servers must run the same operating system (OS) and have the same system configuration (same amount of memory, CPU, etc.).

64-bit Operating System or Amazon Machine Image (AMI):
• CentOS 6.2-6.5 (7.0 is not supported)
• RHEL 6.2-6.5 (7.0 is not supported)
• Scientific Linux 6.2
• Amazon Linux AMI 2014.03+
• Oracle Enterprise Linux 6.x
• Ubuntu 12.04.1 LTS or higher
• Security-Enhanced Linux 6.2 (1)

(1) If you wish to install Security-Enhanced Linux, refer to Platfora's Support site for installation instructions.
Software:
• Java 1.7
• Python 2.6.8, 2.7.1, or 2.7.3 through 2.7.6 (3.0 is not supported)
• PostgreSQL 9.2.1-1, 9.2.5, 9.2.7, or 9.3 (master only)
• OpenSSL 1.0.1 or higher (2)

Unix Utilities: rsync, ssh, scp, cp, tar, tail, sysctl, ntp, wget

Memory: 64 GB minimum, 256 GB recommended. The server needs enough memory to accommodate actively used lens data. Additionally, it needs 1-2 GB reserved for normal operations and the lens query engine workspace.

CPU: 8 cores minimum, 16 recommended

Disk:
• All Platfora nodes (master or worker) require 300 MB for the Platfora installation.
• Every node requires high-speed local storage and a local disk cache configured as a single logical volume. Hardware RAID is recommended for the best performance.
• All nodes combined require appropriate free space for aggregated data structures (Platfora lenses). At a minimum, you will need twice as much disk space as system memory.
• The Platfora master node requires approximately 700 MB of additional space for the metadata catalog (dataset definitions, vizboard and visualization definitions, lens definitions, etc.).

Network:
• 1 Gbps reliable network connectivity between the Platfora master server and the query processing servers
• 1 Gbps reliable network connectivity between the Platfora master server and the Hadoop NameNode and JobTracker/ResourceManager node
• Network bandwidth should be comparable to the amount of memory on the Platfora master server

Port Configuration Requirements

You must open ports in the firewall of your Platfora nodes to allow client access and intra-cluster communications. You also must open ports within your Hadoop cluster to allow access from Platfora. This section lists the default ports required.
(2) Only required if you want to enable SSL for secure communications between Platfora servers.

Ports to Open on Platfora Nodes

Your Platfora master node must allow HTTP connections from your user network. All nodes must allow connections from the other Platfora nodes in a multi-node cluster.

On Amazon EC2 instances, you must configure the port firewall rules on the Platfora server instances in addition to the EC2 Security Group Settings.

Platfora Service | Default Port | Allow connections from
Master Web Services Port (HTTP) | 8001 | External user network; Platfora worker servers; localhost
Secure Master Web Services Port (HTTPS) | 8443 | External user network; Platfora worker servers; localhost
Master Server Management Port | 8002 | Platfora worker servers; localhost
Worker Server Management Port | 8002 | Platfora master server; other Platfora worker servers; localhost
Master Data Port | 8003 | Platfora worker servers; localhost
Worker Data Port | 8003 | Platfora master server; other Platfora worker servers; localhost
Master PostgreSQL Database Port | 5432 | Platfora worker servers; localhost

Ports to Open on Hadoop Nodes

Platfora must be able to access certain services of your Hadoop cluster. This section lists the Hadoop services Platfora needs to access and the default ports for those services.

Note that this only applies to on-premise Hadoop deployments or to self-managed Hadoop deployments in a virtual private cloud, not to Amazon Elastic MapReduce (EMR).
Default Ports by Hadoop Distro

Hadoop Service | CDH, HDP, Pivotal | Apache Hadoop | MapR | Allow connections from
HDFS NameNode | 8020 | 9000 | N/A | Platfora master and worker servers
HDFS DataNodes | 50010 | 50010 | N/A | Platfora master and worker servers
MapRFS CLDB | N/A | N/A | 7222 | Platfora master and worker servers
MapRFS DataNodes | N/A | N/A | 5660 | Platfora master and worker servers
MRv1 JobTracker | 8021 | 9001 | 9001 | Platfora master server
MRv1 JobTracker Web UI | 50030 | 50030 | 50030 | External user network (optional)
YARN ResourceManager | 8032 | 8032 | 8032 | Platfora master server
YARN ResourceManager Web UI | 8088 | 8088 | 8088 | External user network (optional)
YARN Job History Server | 10020 | 10020 | 10020 | Platfora master server
YARN Job History Server Web UI | 19888 | 19888 | 19888 | External user network (optional)
HiveServer Thrift Port | 10000 | 10000 | 10000 | Platfora master server
Hive Metastore DB Port (3) | 9083, 9933 (HDP2) | N/A | 9083 | Platfora master server

(3) If connecting to Hive directly using JDBC.

Supported Hadoop and Hive Versions

This section lists the Hadoop distributions and versions that are compatible with the Platfora installation packages. If using Hive as a data source for Platfora, the version of Hive must be compatible with the version of Hadoop you are using.

Hadoop Distro | Version | Hive Version | M/R Version | Platfora Package
Cloudera 5 | CDH5.0 | 0.12 | YARN | cdh5
Cloudera 5 | CDH5.1 | 0.12 | YARN | cdh5
Cloudera 5 | CDH5.2 | 0.13 | YARN | cdh52
Cloudera 5 | CDH5.3 | 0.13.1 | YARN | cdh52
Cloudera 5 | CDH5.4 | 1.1 | YARN | cdh54
Hortonworks | HDP 2.1.x | 0.13.0 | YARN | hadoop_2_4_0_hive_0_13_0
Hortonworks | HDP 2.2.x | 0.14.0 | YARN | hadoop_2_6_0_hive_0_14_0
MapR | MapR 4.0.1 | 0.12.0 | YARN | mapr4
MapR | MapR 4.0.2 | 0.13.0 | YARN | mapr402
MapR | MapR 4.1.0 | 0.13.0 | YARN | mapr402
Pivotal Labs | PivotalHD 3.0 | 0.14.0 | YARN | hadoop_2_6_0_hive_0_14_0
Amazon EMR (AMI 3.7.x) | Hadoop 2.4.0 | 0.13.1 | YARN | hadoop_2_4_0_hive_0_13_0

Hadoop Resource Requirements

Platfora must be able to connect to an existing Hadoop installation.
Platfora also requires permissions and resources in the Hadoop source system. This section describes the Hadoop resource requirements for Platfora.

Platfora uses the remote Distributed File System (DFS) of the Hadoop cluster for persistent storage and as the primary data source. Optionally, you can also configure Platfora to use a Hive metastore server as a data source.

Platfora uses the Hadoop MapReduce services to process data and build lenses. For larger lens builds to succeed, Platfora requires minimum resources on the Hadoop cluster for MapReduce tasks.

DFS Disk Space: Platfora requires a designated persistent storage directory in the remote distributed file system (DFS) with appropriate free space for Platfora system files and data structures (lenses). The location is configurable.

DFS Permissions: The platfora system user needs read permissions to source data directories and files. The platfora system user needs write permissions to Platfora's persistent storage directory on DFS.

MapReduce Permissions: The platfora system user needs to be added to the submit-jobs and administer-jobs access control lists (or added to a group that has these permissions).

DFS Resources: Minimum Open File Limit = 5000

MapReduce Resources: Minimum Memory for Task Processes = 1 GB

Browser Requirements

Users can connect to the Platfora web application using the latest HTML5-compliant web browsers. Platfora supports the latest releases of the following web browsers:
• Chrome (preferred browser)
• Firefox
• Safari
• Internet Explorer with the Compatibility View feature disabled (versions prior to IE 10 are not supported)

Platfora supports these web browsers on desktop machines only.

Chapter 3: Configure Hadoop for Platfora Access

Before initializing and starting Platfora for the first time, you must make sure that Platfora can connect to Hadoop and access the directories and services it needs.
The tasks in this section are performed in your Hadoop environment, and apply to on-premise Hadoop installations only (not to Amazon EMR).

Topics:
• Create Platfora User on Hadoop Nodes
• Create Platfora Directories and Permissions in Hadoop
• HDFS Tuning for Platfora
• MapReduce Tuning for Platfora
• YARN Tuning for Platfora

Create Platfora User on Hadoop Nodes

Platfora requires a platfora system user account on each node in your Hadoop cluster. The Platfora server uses this system user account to submit jobs to the Hadoop cluster and to access the necessary files and directories in the Hadoop distributed file system (HDFS).

Creating a system user requires root or sudo permissions.

1. Create the platfora user:
   $ sudo useradd -s /bin/bash -m -d /home/platfora platfora
2. Set a password for the platfora user:
   $ sudo passwd platfora

Create Platfora Directories and Permissions in Hadoop

Platfora requires read and write permissions to a designated directory in the Hadoop file system where it can store its metadata and MapReduce output. Platfora connects to HDFS as the platfora user and also runs its MapReduce jobs as this same user.

Create a data directory for Platfora and set the platfora system user as its owner. In the example below, the Hadoop file system has a user called hdfs, the directory is called /platfora, and the Platfora server is running as the platfora system user:

$ sudo -u hdfs hadoop fs -mkdir /platfora
$ sudo -u hdfs hadoop fs -chown platfora /platfora
$ sudo -u hdfs hadoop fs -chmod 711 /platfora

For MapR, run the commands as the mapr user:

$ sudo -u mapr hadoop fs -mkdir /platfora
$ sudo -u mapr hadoop fs -chown platfora /platfora
$ sudo -u mapr hadoop fs -chmod 711 /platfora

The platfora system user needs access to the location where MapReduce writes its staging files. Depending on your Hadoop distribution, the location of the staging area is different.
In Cloudera, MapR, Pivotal, and Hortonworks, the default location is /user/username. In Apache, the location is /tmp/xxx/xxx/username. Make sure this location exists and is writable by the platfora system user.

For example, on Cloudera:
$ sudo -u hdfs hadoop fs -mkdir /user/platfora
$ sudo -u hdfs hadoop fs -chown platfora /user/platfora

For example, on MapR:
$ sudo -u mapr hadoop fs -mkdir /user/platfora
$ sudo -u mapr hadoop fs -chown platfora /user/platfora

During lens build processing, the platfora system user needs to be able to write to the intermediate and log directories on the Hadoop nodes. Check the following Hadoop configuration properties and make sure the specified locations exist in HDFS and are writable by the platfora system user.

Property                          Hadoop Configuration File   Description
mapreduce.cluster.local.dir       mapred-site.xml             Tells the MapReduce servers where to store intermediate files for a job.
mapreduce.jobtracker.system.dir   mapred-site.xml             The directory where MapReduce stores control files.
mapreduce.cluster.temp.dir        mapred-site.xml             A shared directory for temporary files.
mapr.centrallog.dir (MapR Only)   mapred-site.xml             The central job log directory for MapR Hadoop.

The platfora system user also needs to be added to the submit-jobs and administer-jobs access control lists (or added to a group that has these permissions).

The platfora system user also needs read permissions to the source data directories and files that you want to analyze in Platfora.

HDFS Tuning for Platfora

Platfora opens files on the Hadoop NameNode and DataNodes as it does its work to build the lens. This section describes how to ensure your Hadoop cluster has file limits that support lens build operations.

Increase Open File Limits

Platfora opens files on the Hadoop NameNode and DataNodes as it builds the lens.
For multiple lens builds or for lenses that have a lot of fields selected, a lens build can cause your Hadoop nodes to exceed the maximum open file limit. When this limit is exceeded, Platfora lens builds will fail with a "Too many open files..." exception.

Linux operating systems limit the number of open files and connections a process can have. This prevents one application from slowing down the entire system by requesting too many file handles. When an application exceeds the limit, the operating system prevents the application from requesting more file handles, causing the process to fail with a "Too many open files..." error.

Verify that the file limits are adequate on each Hadoop node, and increase them on any node where they are too low.

There are two places file limits are set in the Linux operating system:
• A global limit for the entire system (set in /etc/sysctl.conf)
• A per-user process limit (set in /etc/security/limits.conf)

You can check the global limit by running the command:
$ cat /proc/sys/fs/file-nr

This should return a set of three numbers like this:
704 0 294180

The first number is the number of allocated file handles. The second number is the number of allocated but unused file handles. The third number is the maximum number of file handles for the whole system. The maximum should be at least 250000.

To increase the global limit, edit /etc/sysctl.conf (as root) and set the property:
fs.file-max = 294180

Increase Platfora User Limits

You can check the per-user process limit by running the command:
$ ulimit -n

This should return the file limit for the currently logged in user, for example:
1024

This limit should be at least 5000 for the platfora system user (or whatever user runs Platfora lens build jobs).
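The two checks above can be combined into one small script to run on each Hadoop node. This is a minimal sketch, not part of the Platfora tooling: the function name limit_status and the variable names are hypothetical, and the minimums (250000 global, 5000 per user) are the values stated in this section.

```shell
#!/bin/sh
# Minimums taken from this section of the guide.
GLOBAL_MIN=250000
USER_MIN=5000

# limit_status ACTUAL MINIMUM (hypothetical helper): prints "ok" or "low".
# "unlimited" is a possible ulimit result and is treated as ok.
limit_status() {
    if [ "$1" = "unlimited" ] || [ "$1" -ge "$2" ]; then
        echo "ok"
    else
        echo "low"
    fi
}

# The third field of file-nr is the system-wide maximum.
if [ -r /proc/sys/fs/file-nr ]; then
    global_max=$(awk '{print $3}' /proc/sys/fs/file-nr)
    echo "global max: $global_max ($(limit_status "$global_max" "$GLOBAL_MIN"))"
fi

# ulimit -n reports the limit of the user running this script.
user_max=$(ulimit -n)
echo "per-user limit: $user_max ($(limit_status "$user_max" "$USER_MIN"))"
```

Because ulimit -n reports the limit of the calling user, the script would need to be run as the platfora system user (for example from a login shell opened with su - platfora) for the per-user result to be meaningful.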
To increase the limit, edit /etc/security/limits.conf (as root) and add the following lines (the * increases the limit for all system users):
*     hard  nofile  65536
*     soft  nofile  65536
root  hard  nofile  65536
root  soft  nofile  65536

You must reboot the server whenever you change OS kernel settings.

Increase DataNode File Limits

A Hadoop HDFS DataNode has an upper bound on the number of files that it can serve at any one time. In your Hadoop configuration, make sure the DataNodes are tuned to have an upper bound of at least 5000 by setting the following properties in the hdfs-site.xml file (located in the conf directory on your Hadoop NameNode):

Framework      hdfs-site.xml Property              Minimum Value
MapReduce v1   dfs.datanode.max.xcievers           5000
YARN           dfs.datanode.max.transfer.threads   5000

Allow Platfora Local Access

If the platfora system user is not able to make HDFS calls during lens build processing, lens build jobs in Platfora will stall at 0% progress. To prevent this, make sure your hdfs-site.xml files contain the dfs.block.local-path-access.user parameter and that its value includes the platfora system user. For example:
<property>
  <name>dfs.block.local-path-access.user</name>
  <value>gpadmin,hdfs,mapred,yarn,hbase,hive,platfora</value>
</property>

MapReduce Tuning for Platfora

It is common in Hadoop to customize configuration file properties to suit a specific MapReduce workload. This section lists the mapred-site.xml properties that Platfora needs for its lens builds.

Platfora can pass in certain properties at runtime for its lens build jobs. Other properties must be set on the Hadoop nodes themselves. Runtime properties can be set in the Platfora local copy of the mapred-site.xml file, and are then passed to Hadoop with the lens build job configuration. Non-runtime properties must be configured in your Hadoop environment directly.
Consult your Hadoop vendor's documentation for recommended memory configuration settings for Hadoop task/container nodes. These settings depend on the node hardware specifications, and can vary for each environment.

Required Properties for MapReduce v1 Hadoop Clusters

These properties must be set in order for lens build jobs to succeed. You can set these in the local mapred-site.xml file on the Platfora master, and they will be passed to Hadoop at runtime.

Property: mapred.child.java.opts
    Recommended Value: At least -Xmx1024m. Can be set higher based on the amount of memory on your Hadoop nodes and the number of simultaneous task slots available per node.
    Default Value: -Xmx200m
    Runtime? YES

Property: mapred.job.shuffle.input.buffer.percent
    Recommended Value: 0.30
    Default Value: 0.70
    Runtime? YES

Required Properties for YARN Hadoop Clusters

These properties must be set in order for lens build jobs to succeed. You can set the runtime properties in the local mapred-site.xml file on the Platfora master, and they will be passed to Hadoop at runtime. Non-runtime properties must be configured in your Hadoop environment directly.

Property: mapreduce.map.java.opts and mapreduce.reduce.java.opts
    Recommended Value: At least -Xmx1024m. Can be set higher based on the amount of memory on your Hadoop nodes and the number of simultaneous task slots available per node.
    Default Value: -Xmx200m
    Runtime? YES

Property: mapreduce.map.shuffle.input.buffer.percent
    Recommended Value: 0.30. The percentage of total JVM heap size to allocate to storing map outputs during the shuffle phase.
    Default Value: 0.70
    Runtime? YES

Property: mapreduce.reduce.shuffle.input.buffer.percent
    Recommended Value: 0.30. The percentage of total JVM heap size to allocate to storing reduce outputs during the shuffle phase.
    Default Value: 0.70
    Runtime? YES

Property: mapreduce.map.memory.mb
    Recommended Value: The calculated RAM per container size for your hardware specifications. Platfora requires at least 1024.
    Default Value: 512
    Runtime? NO

Property: mapreduce.reduce.memory.mb
    Recommended Value: The calculated RAM per container size for your hardware specifications. Platfora requires at least 1024.
    Default Value: 512
    Runtime? NO

Property: mapreduce.framework.name
    Recommended Value: yarn. Make sure this is set to yarn to prevent jobs from running in local mode.
    Default Value: local
    Runtime? NO

Optional Sort Tuning Properties

These properties increase the number of streams to merge at once when sorting files and set a higher memory limit for sort operations. If the sort phase can fit the data in memory, performance will be better than if it spills to disk. You may decide to increase these if you notice that records are spilling when you look at the lens build job details. However, setting them too high can result in job failures. If too much of the JVM memory is reserved for sorting, not enough will be left for other task operations.

The following optional mapred-site.xml properties apply to both MapReduce v1 and YARN Hadoop clusters.

Property: io.sort.factor
    Recommended Value: 100
    Default Value: 10
    Runtime? YES

Property: io.sort.mb
    Recommended Value: 25-30% of the *.java.opts values. For example, if the java.opts properties are set to 1024MB, this should be about 256MB.
    Default Value: 100
    Runtime? YES

Property: io.sort.record.percent
    Recommended Value: 0.15
    Default Value: 0.05
    Runtime? YES

YARN Tuning for Platfora

This configuration is only required for Hadoop MapReduce v2 clusters with YARN. This section lists the yarn-site.xml properties that Platfora needs for its lens builds.

Platfora can pass in certain properties at runtime for its lens build jobs. Other properties must be set on the Hadoop nodes themselves. Runtime properties can be set in the Platfora local copy of the yarn-site.xml file, and are then passed to Hadoop with the lens build job configuration. Non-runtime properties must be configured in your Hadoop environment directly.

Consult your Hadoop vendor's documentation for recommended memory configuration settings for Hadoop task/container nodes. These settings depend on the Hadoop node's hardware specifications, and can vary for each environment.
Required Properties for YARN Hadoop Clusters

Tuning these properties properly on your Hadoop nodes will optimize Platfora lens build jobs.

Property: yarn.nodemanager.resource.memory-mb
    Recommended Value: The total memory size for all containers on a node (in MB). Should be the total amount of RAM on the node, minus 15-20% for reserved system memory space.
    Default Value: 8192
    Runtime? NO

Property: yarn.scheduler.minimum-allocation-mb
    Recommended Value: The minimum memory size per container. Depends on the amount of total memory on a node:
    • 512 MB (on nodes with 4-8 GB total RAM)
    • 1024 MB (on nodes with 8-24 GB total RAM)
    • 2048 MB (on nodes with more than 24 GB total RAM)
    Default Value: 1024
    Runtime? YES

Property: yarn.scheduler.maximum-allocation-mb
    Recommended Value: The maximum memory size per container. Same as yarn.nodemanager.resource.memory-mb.
    Default Value: 8192
    Runtime? YES

Determine Maximum Reduce Tasks for Platfora

In addition to these YARN settings in Hadoop, you will need to determine the maximum number of MapReduce reduce tasks allowed for a Platfora lens build job. This number is then configured in Platfora after you initialize the Platfora master by setting the Platfora server configuration property: platfora.reduce.tasks.

The number of reducer tasks can be determined using the following formula:
(yarn.nodemanager.resource.memory-mb / mapreduce.reduce.memory.mb) * number_of_hadoop_nodes

For example, with illustrative values of 8192 for yarn.nodemanager.resource.memory-mb, 1024 for mapreduce.reduce.memory.mb, and 10 Hadoop nodes, the result is (8192 / 1024) * 10 = 80 reduce tasks.

Chapter 4 Install Platfora Software and Dependencies

This section describes how to provision a Platfora node with the required prerequisites and Platfora software.

If you are installing a new Platfora cluster, the master node needs everything (prerequisites and Platfora software). Worker nodes only need the prerequisite software installed prior to initialization.

Most of the tasks in this section require root permissions. The example commands in the documentation use sudo to denote the commands that require root permissions.
Topics:
• About the Platfora Installer Packages
• Install Using RPM Packages
• Install Using the TAR Package

About the Platfora Installer Packages

Platfora provides RPM or TAR installer packages that are specific to the Hadoop distribution you are using. Platfora Customer Support can provide you with the link to download the installer packages for your environment.

Make sure to download the correct Platfora installer packages for your Hadoop distribution and version. See Supported Hadoop and Hive Versions if you are not sure which Platfora package to use for your chosen Hadoop distribution.

RPM Packages

If you plan to install Platfora on a Linux operating system that supports the RPM package manager, such as RedHat or CentOS, Platfora recommends using the RPM packages to install Platfora and its required dependencies.

The platfora-base RPM package includes all the prerequisite software that Platfora needs, and automates the OS configurations needed by Platfora. This package should be installed on all Platfora nodes (master and workers).

The platfora-server package includes the Platfora software only, which only needs to be installed on the master node. The Platfora software is copied to the worker nodes during initialization or upgrade, so you don't need to install it on the worker nodes ahead of time.

TAR Package

If you plan to install Platfora on a Linux operating system that does not support the RPM package manager, such as Ubuntu, you have to use the TAR package. You may also use the TAR package if you prefer to install and manage the dependent software in your environment yourself.

The TAR package contains the Platfora server software only, which only needs to be installed on the master node. The TAR package does not contain the prerequisite software that Platfora needs.
You must manually install the required prerequisite software and do the required OS configurations on all Platfora nodes prior to installing and initializing Platfora.

Install Using RPM Packages

Follow the instructions in this section to install the Platfora dependencies and server software using the RPM packages. Install the platfora-base RPM package on all Platfora nodes, and the platfora-server RPM package on the master node only.

Install Dependencies RPM Package

The platfora-base RPM package contains all of the dependent software required by Platfora, and also automates several OS configuration tasks. This package is installed on all Platfora nodes.

This task requires root permissions. Commands that begin with sudo denote root commands.

The platfora-base RPM package does the following:
• Creates a /usr/local/platfora/base directory containing Platfora's third-party dependencies.
• Creates the platfora system user. The platfora user has no password set.
• Generates an SSH key for the platfora system user and adds the key to the user's authorized_keys file.
• Adds the platfora system user to the sudoers file. This allows you to execute commands as root while logged in as the platfora user.
• Ensures the OS kernel parameters are appropriate for Platfora and sets them if they are not.
• Creates a .bashrc file for the platfora system user.

The platfora-base package uses the following file naming convention, where version-build is the version and build number of the base package only, and x86_64 is the supported system architecture. The base and Platfora server packages use different versioning schemes.
platfora-base-version-build-x86_64.rpm

The base package is not updated with every Platfora release. It is only updated when the Platfora dependencies change, which happens less often. When upgrading Platfora, check the release notes to see if an upgrade of the base package is required.

1. Log on to the machine on which you are installing Platfora.
2. Using the download link provided by Platfora Customer Support, download the base package. For example:
$ wget http://downloads.platfora.com/release/platfora-base-version-build-x86_64.rpm
3. Install the package using the yum package manager (requires root permission). For example:
$ sudo yum --nogpgcheck localinstall platfora-base-version-build-x86_64.rpm

Confirm that the /usr/local/platfora/base directory was created.
$ sudo ls -a /usr/local/platfora/base

Install Optional Security RPM Package

The platfora-security RPM package contains SSL-enabled PostgreSQL and the OpenSSL package it depends on. This package is only needed if you plan to enable SSL communications between the Platfora worker nodes and the Platfora metadata catalog database.

This task requires root permissions. Commands that begin with sudo denote root commands.

The platfora-security package is installed after the platfora-base package. The platfora-security RPM package does the following:
• Creates a /usr/local/platfora/security directory containing the SSL-enabled version of PostgreSQL.
• Checks if OpenSSL version 1.0.1 or later is installed and, if not, downloads and installs the openssl package dependency from the OpenSSL public repo.
• Edits the .bashrc file for the platfora system user and changes the PATH environment variable so that secure PostgreSQL is listed before the default PostgreSQL installed by the platfora-base package.

The platfora-security package uses the following file naming convention, where version-build is the version and build number of the base package only, and x86_64 is the supported system architecture. The base, security and Platfora server packages use different versioning schemes.
platfora-security-version-build-x86_64.rpm

The security package only needs to be upgraded when the base package is upgraded, which is not every release. When upgrading Platfora, check the release notes to see if an upgrade of the base and security packages is required.

1. Log on to the machine on which you are installing Platfora.
2. Using the download link provided by Platfora Customer Support, download the security package. For example:
$ wget http://downloads.platfora.com/release/platfora-security-version-build-x86_64.rpm
3. Install the package using the yum package manager (requires root permission). For example:
$ sudo yum --nogpgcheck localinstall platfora-security-version-build-x86_64.rpm

Confirm that the /usr/local/platfora/security directory was created.
$ sudo ls -a /usr/local/platfora/security

Install Platfora RPM Package (Master Only)

The platfora-server RPM package contains the Platfora server software. This package is installed on the Platfora master node only.

The platfora-server RPM package creates a /usr/local/platfora/platfora-server directory containing the Platfora software.

The platfora-server package uses the following file naming convention, where hadoop_distro corresponds to the Hadoop distribution you are using, version-build is the version and build number of the Platfora software, and x86_64 is the supported system architecture.
platfora-server-hadoop_distro-version-build-x86_64.rpm

Make sure to download the correct Platfora installer packages for your Hadoop distribution and version. See Supported Hadoop and Hive Versions if you are not sure which Platfora package to use for your chosen Hadoop distribution.

This task requires root permissions. Commands that begin with sudo denote root commands.

1. Log on to the machine on which you are installing the Platfora master.
2. Using the download link provided by Platfora Customer Support, download the Platfora server package.
For example:
$ wget http://downloads.platfora.com/release/platfora-server-hadoop_distro-version-build-x86_64.rpm
3. Install the package using the yum package manager (requires root permission). For example:
$ sudo yum --nogpgcheck localinstall platfora-server-hadoop_distro-version-build-x86_64.rpm

Confirm that the /usr/local/platfora/platfora-server directory was created.
$ sudo ls -a /usr/local/platfora/platfora-server

Install Using the TAR Package

Follow the instructions in this section to install the Platfora dependencies and server software using the TAR package. The TAR package contains the Platfora server software only. You must install all dependencies yourself.

For the Platfora master node, do all the tasks described in this section. For a Platfora worker node, do all the tasks described in this section except for:
• Install PostgreSQL
• Install Platfora TAR Package
• Install PDF Dependencies

Create the Platfora System User

Platfora requires a platfora system user account to own the Platfora installation and run the Platfora server processes. This same system user must be created on all Platfora nodes.

This task requires root permissions. Commands that begin with sudo denote root commands.

(MapR Only) If you are using MapR as your Hadoop distribution with Platfora, make sure to follow the additional steps for MapR. The platfora system user must exist on all Platfora nodes and all MapR nodes. The UID/GID must also be the same on the MapR nodes as on Platfora nodes.

1. Create the platfora system user:
$ sudo useradd -s /bin/bash -m -d /home/platfora platfora
2. Set a password for the platfora user:
$ sudo passwd platfora
3. (MapR Only) Check the /etc/passwd file on your MapR CLDB node, and find the entry for the platfora user. Note the user and group id numbers that are used. For example:
platfora:x:1002:1002::/home/platfora:/bin/bash
4. (MapR Only) Check the /etc/passwd file on your Platfora master node. If the user and group id numbers for the platfora user are different, update them so that they are the same as on the MapR nodes. For example:
$ sudo usermod -u 1002 platfora
$ sudo groupmod -g 1002 platfora

Configure sudo for the platfora User

This is an optional task. Configuring sudo access for the platfora system user is a convenient way to run commands as root while logged in as the platfora user.

If you do not configure sudo access for the platfora user, then you must change to the root user to execute the system commands that require root permissions. This documentation assumes that you have sudo access configured. If you do not, every time you see sudo at the beginning of a command, it means you need to be root to run the command.

1. Edit the /etc/sudoers file using the visudo command.
$ sudo visudo
2. Add a line such as the following in this file:
# User privilege specification
platfora ALL=(ALL:ALL) ALL
3. Save your changes and exit the visudo editor.

Generate and Authorize an SSH Key

The Platfora management utilities require an SSH key to be generated and authorized for the platfora system user on the localhost. This task should be performed on all Platfora nodes.

The Platfora management utilities require a trusted-host environment (the ability to SSH to a remote system in the Platfora cluster without a password prompt). Even in single-node installations, you must exchange SSH keys for the localhost.

1. Make sure that SELinux is disabled using either the sestatus or getenforce command.
$ sestatus
If SELinux is enabled, disable it using the recommended procedure for the node's operating system.
2. Make sure you are logged in to the Platfora server as the platfora system user.
$ su - platfora
3. Go to the ~/.ssh directory (create it if it does not exist):
$ mkdir -p ~/.ssh
$ cd ~/.ssh
4. Generate a public/private key pair that is NOT passphrase-protected. Press the ENTER or RETURN key for each prompt:
$ ssh-keygen -C 'platfora key for node 0' -t rsa
Enter file in which to save the key (/home/platfora/.ssh/id_rsa): ENTER
Enter passphrase (empty for no passphrase): ENTER
Enter same passphrase again: ENTER
5. Append the public key to the ~/.ssh/authorized_keys file (this allows SSH access from the current host to itself):
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
6. Make sure the home directory, .ssh directory, and the files it contains have the correct permissions:
$ chmod 700 $HOME && chmod 700 ~/.ssh && chmod 600 ~/.ssh/*
7. Test that you can SSH to localhost without a password prompt. If prompted to add localhost to the list of known hosts, enter yes:
$ ssh localhost
The authenticity of host 'localhost (127.0.0.1)' can't be established...
Are you sure you want to continue connecting (yes/no)? yes

Set OS Kernel Parameters

This section has the Linux OS kernel settings required for Platfora. You must have root or sudo permissions to change kernel parameter settings. Changing kernel settings requires a system reboot in order for the changes to take effect.

Kernel ulimit Setting

Linux operating systems set limits on the number of open files and connections a process can have. For some applications, such as Platfora and Hadoop, having a lot of open file handles during processing is normal. Having the limit set too low can cause Platfora lens builds to fail.

There are two places file limits are set in the Linux operating system:
• A global limit for the entire system (set in /etc/sysctl.conf)
• A per-user process limit (set in /etc/security/limits.conf)

You must have root or sudo permissions to change OS ulimit settings.
You can check the global limit by running the command:
$ cat /proc/sys/fs/file-nr

This should return a set of three numbers like this:
704 0 294180

The first number is the number of allocated file handles. The second number is the number of allocated but unused file handles. The third number is the maximum number of file handles for the whole system. This limit should be at least 250000.

To increase the global limit, edit /etc/sysctl.conf (as root) and set the property:
fs.file-max = 294180

You can check the per-user process limit by running the command:
$ ulimit -n

This should return the file limit for the currently logged in user, for example:
1024

This limit should be at least 20000 for the platfora user (or whatever user runs the Platfora server).

To increase the limit, edit /etc/security/limits.conf (as root) and add the following lines (the * increases the limit for all system users):
*     hard  nofile  65536
*     soft  nofile  65536
root  hard  nofile  65536
root  soft  nofile  65536

Reboot the server for the changes to take effect.
$ sudo reboot

Kernel Memory Overcommit Setting

Linux operating systems allow memory to be overcommitted, meaning the OS will allow an application to reserve more memory than actually exists within the system. Allowing overcommit prevents the OS from killing processes when a process requests more memory than is available.

If you are using a version 1.6 Java Runtime Environment (JRE), you must configure your OS to allow memory overcommit. If you are using a version 1.7 JRE, overcommit is not necessary.

You must have root or sudo permissions to change kernel memory overcommit settings.

1. Check your version of Java.
$ java -version
If you are running a 1.6 version, proceed to the next steps. If you are running a 1.7 version, you do not need to make any further changes.
2. Edit the /etc/sysctl.conf file.
$ sudo vi /etc/sysctl.conf
3. Set the following value:
vm.overcommit_memory=1
4. Save and close the file.
5. Reboot your system for the change to take effect:
$ sudo reboot

Kernel Shared Memory Settings

Some default OS installations have the system shared memory values set too low for Platfora. You may need to increase the shared memory settings if they are set too low.

You must have root or sudo permissions to set the system shared memory parameters.

1. In /etc/sysctl.conf, make sure the shared memory parameters have the minimum values or higher. If your settings are lower than these minimum values, you will need to change them. If they are higher than the minimum, leave them as is.
kernel.shmmax=17179869184
kernel.shmall=4194304
2. If you made changes to /etc/sysctl.conf, reboot the server for the changes to take effect.
$ sudo reboot

Install Dependent Software

If using the TAR installation package to install Platfora, you must install all of the dependencies yourself. This section provides instructions for manually installing the prerequisite software on a Platfora node.

If you are provisioning a Platfora master node, you must install all dependencies. If you are provisioning a Platfora worker node, you can skip the task for installing PostgreSQL. PostgreSQL is only needed on the Platfora master node.

Confirm Linux OS Utilities

Platfora requires several standard Linux utilities to be installed on your system and in your environment PATH. Check your system for the required utilities before installing Platfora. Most Linux operating systems already have these utilities installed by default.
• rsync
• ssh
• scp
• tail
• tar
• cp
• wget
• ntp
• sysctl (/usr/sbin must be in your PATH)

To verify that a utility is installed and can be found in the PATH, you can check its location using the which command.
For example:
$ which rsync
$ which tar
$ which sysctl

If a utility is not installed, you will need to install it before installing Platfora. Check your OS documentation for instructions on installing these utilities.

Install Java

The Platfora server requires a Java Runtime Environment (JRE) version 1.7 or higher. Platfora recommends installing the full Java Development Kit (JDK) for access to the latest Java features and diagnostic tools.

The instructions in this section are for installing version 1.7 of the Open Java Development Kit (OpenJDK). You must have root or sudo permissions to install Java.

1. Check if Java 1.7 or higher is already installed.
$ java -version
If java is not found, you will need to install it.
2. Install OpenJDK using your OS package manager.
On Ubuntu Systems:
$ sudo apt-get install openjdk-7-jdk
On RedHat/CentOS Systems:
$ su -c "yum install java-1.7.0-openjdk"
3. Set the JAVA_HOME environment variable in the platfora user’s profile file. For example, where java_directory is the versioned directory where Java is installed (the single quotes on the PATH line keep $JAVA_HOME and $PATH from being expanded before the line is written to the file):
$ echo "export JAVA_HOME=/usr/lib/jvm/java_directory/jre" >> /home/platfora/.bashrc
$ echo 'export PATH=$JAVA_HOME/bin:$PATH' >> /home/platfora/.bashrc
$ source /home/platfora/.bashrc
4. Make sure JAVA_HOME is set correctly for the platfora user:
$ su - platfora
$ echo $JAVA_HOME

Confirm Python Installation

The Platfora management utilities require Python version 2.6.8, 2.7.1, or 2.7.3 through 2.7.6. Python version 3.0 is not supported.

Most Linux operating systems already have Python installed by default, but you need to make sure the version is compatible with Platfora. To check if the correct version of Python is installed:
$ python -V

If Python is not installed (or you have an incompatible version of Python) you will need to install or upgrade/downgrade it before installing Platfora.
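The version check can also be scripted so that the result is compared against the supported list automatically. The following is a minimal sketch rather than a Platfora-provided tool: the function name python_version_status is hypothetical, and the accepted versions are simply the ones listed in this section.

```shell
#!/bin/sh
# python_version_status VERSION (hypothetical helper): prints "supported"
# if VERSION is one of the releases listed in this guide
# (2.6.8, 2.7.1, or 2.7.3 through 2.7.6), otherwise "unsupported".
python_version_status() {
    case "$1" in
        2.6.8|2.7.1|2.7.3|2.7.4|2.7.5|2.7.6) echo "supported" ;;
        *) echo "unsupported" ;;
    esac
}

if command -v python >/dev/null 2>&1; then
    # Python 2 prints its version to stderr, so merge stderr into stdout.
    installed=$(python -V 2>&1 | awk '{print $2}')
    echo "python $installed: $(python_version_status "$installed")"
else
    echo "python not found in PATH"
fi
```

An exact-match list is used here instead of a numeric comparison because the supported set has a gap (2.7.2 is not listed).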
Check your OS documentation for instructions on installing or upgrading/downgrading Python to version 2.6.8 or a higher supported 2.x version.

Install PostgreSQL (Master Only)

Platfora stores its metadata catalog in a PostgreSQL relational database. PostgreSQL version 9.2 or 9.3 must be installed (but not running) on the Platfora master server before you start Platfora for the first time.

Platfora worker nodes do not require a PostgreSQL installation. You must have root or sudo permissions to install PostgreSQL.

Install PostgreSQL 9.2 on Ubuntu Systems

These instructions are for installing PostgreSQL 9.2 on Linux Ubuntu operating systems.

1. Install the dependent libraries:
$ sudo apt-get install libpq-dev
2. Add the PostgreSQL repository to your system configuration:
$ sudo add-apt-repository ppa:pitti/postgresql
$ sudo apt-get update
3. Install PostgreSQL 9.2:
$ sudo apt-get install postgresql-9.2
4. Stop the PostgreSQL service.
$ sudo service postgresql stop
5. Remove the PostgreSQL automatic start-up scripts:
$ sudo rm /etc/rc*/*postgresql
6. Create and change the ownership on the directory where PostgreSQL writes its lock files:
$ sudo mkdir /var/run/postgresql
$ sudo chown platfora /var/run/postgresql
7. Update the platfora user’s PATH environment variable to include the PostgreSQL executable directory and /usr/sbin:
$ echo 'export PATH=/usr/lib/postgresql/9.2/bin:/usr/sbin:$PATH' >> /home/platfora/.bashrc
$ source /home/platfora/.bashrc

Install PostgreSQL 9.2 on RedHat/CentOS Systems

These instructions are for installing PostgreSQL 9.2 on RedHat Enterprise Linux (RHEL) or CentOS operating systems.

1. Download the appropriate PostgreSQL 9.2 YUM repository for your operating system. Go to the PostgreSQL yum repository website, copy the URL link for the appropriate YUM repository configuration, and download it using wget.
   For example, to download the YUM repository configuration for PostgreSQL 9.2 on a 64-bit RHEL 6 operating system:

    $ wget http://yum.pgrpms.org/9.2/redhat/rhel-6-x86_64/pgdg-redhat92-9.2-7.noarch.rpm

2. Add the PostgreSQL YUM repository to your system configuration:

    $ sudo rpm -i pgdg-redhat92-9.2-7.noarch.rpm

3. Install PostgreSQL:

    $ sudo yum install postgresql92 postgresql92-server

4. If it is enabled, disable the PostgreSQL automatic start-up. Each operating system has its own technique for auto-starting PostgreSQL. If your system uses chkconfig to manage init scripts, you can remove PostgreSQL from chkconfig control using the following command:

    $ sudo chkconfig --del postgresql

   For some operating systems, the PostgreSQL start.conf file configures the auto-start of a specific PostgreSQL cluster.

5. Create and change the ownership on the directory where PostgreSQL writes its lock files:

    $ sudo mkdir /var/run/postgresql
    $ sudo chown platfora /var/run/postgresql

6. Update the platfora user's PATH environment variable to include the PostgreSQL executable directory and /usr/sbin:

    $ echo 'export PATH=/usr/pgsql-9.2/bin:/usr/sbin:$PATH' >> /home/platfora/.bashrc
    $ source /home/platfora/.bashrc

Confirm OpenSSL Installation

Platfora uses OpenSSL for secure communications between the Platfora worker servers and its metadata catalog database. Enabling SSL for the Platfora catalog is optional. If you decide to enable it, you will need OpenSSL version 1.0.1 or higher on your Platfora nodes, along with the following:

• SSL-enabled PostgreSQL. If using the RPM installation packages, Platfora provides an optional platfora-security package that contains SSL-enabled PostgreSQL.
  If using the TAR installation packages, the packages provided in the PostgreSQL public repo come with SSL enabled.
• OpenSSL. If using the RPM installation packages, Platfora provides an optional platfora-security RPM package that pulls this dependency from the public repo. If using the TAR installation packages, you will have to install the openssl package yourself.

Many Linux operating systems already have OpenSSL installed by default, but you need to make sure the version is compatible with the version that PostgreSQL uses.

1. Check that OpenSSL version 1.0.1 or higher is installed.

    $ openssl version

2. If OpenSSL is not installed (or you have an incompatible version), you will need to install or upgrade it before enabling SSL for the Platfora catalog. Check your OS documentation for instructions on installing or upgrading the openssl package.

Install Platfora TAR Package (Master Only)

The TAR installation package contains the Platfora server software only. You only need to install this package on the Platfora master node. You can skip this task if you are provisioning a Platfora worker node.

The Platfora TAR package uses the following file naming convention, where version-build.num is the version and build number of the Platfora software and hadoop_distro corresponds to the Hadoop distribution you are using:

    platfora-version-build.num-hadoop_distro.tgz

Make sure to download the correct Platfora installer package for your Hadoop distribution and version. See Supported Hadoop and Hive Versions if you are not sure which Platfora package to use for your chosen Hadoop distribution.

This task requires root permissions. Commands that begin with sudo denote root commands.

1. Log on to the machine on which you are installing the Platfora master.

2. Create a Platfora installation directory and ensure that it is owned by the platfora system user.
   For example:

    $ sudo mkdir /usr/local/platfora
    $ sudo chown -R platfora /usr/local/platfora

3. Log in as the platfora user and go to the installation directory that you just created:

    $ su - platfora
    $ cd /usr/local/platfora

4. Download the 4.5.0 release package and checksum file using the URLs provided by Platfora Customer Support. Make sure to download the correct packages for your Hadoop distribution version. For example:

    $ wget http://downloads.platfora.com/release/platfora-version-build.num-hadoop_distro.tgz
    $ wget http://downloads.platfora.com/release/platfora-version-build.num-hadoop_distro.tgz.sha

5. After downloading the package and checksum file, make sure the package is valid using the shasum command. For example:

    $ shasum -c platfora-version-build.num-hadoop_distro.tgz.sha

   If the package is valid, you should see a message such as:

    platfora-version-build.num-hadoop_distro.tgz: OK

6. Unpack the package within the installation directory. For example:

    $ tar -zxvf platfora-version-build.num-hadoop_distro.tgz

7. Create a symbolic link named platfora-server that points to the actual installation directory. For example:

    $ ln -s platfora-version-build.num-hadoop_distro platfora-server

8. Set the PLATFORA_HOME environment variable for the platfora system user.

    $ echo "export PLATFORA_HOME=/usr/local/platfora/platfora-server" >> $HOME/.bashrc

9. Set the PATH environment variable for the platfora system user. The PATH should include /usr/sbin, $PLATFORA_HOME/bin, and the PostgreSQL executable directories. If your system has more than one version of PostgreSQL installed, make sure that 9.2 is listed first in the PATH of the platfora user.
   For example (Ubuntu):

    $ echo 'export PATH=/usr/lib/postgresql/9.2/bin:/usr/sbin:$PLATFORA_HOME/bin:$PATH' >> $HOME/.bashrc
    $ source $HOME/.bashrc

   For example (RedHat/CentOS):

    $ echo 'export PATH=/usr/pgsql-9.2/bin:/usr/sbin:$PLATFORA_HOME/bin:$PATH' >> $HOME/.bashrc
    $ source $HOME/.bashrc

10. Make sure the JAVA_HOME environment variable is set (if it's not, see Install Java).

    $ echo $JAVA_HOME

Install PDF Dependencies (Master Only)

One feature of Platfora is the ability to save a vizboard as a PDF document. In order for the Platfora server to render PDFs, it needs PhantomJS and the OpenSans font to be installed on the Platfora master node. You can skip this task if you are provisioning a Platfora worker node.

The PhantomJS installation relies on several fonts that ship with the Platfora software. For this reason, the PhantomJS installation must be done after installing the Platfora software.

To install PhantomJS, do the following:

1. Log in to the Platfora master node as the platfora user.

2. Install the PhantomJS dependencies.

   On Ubuntu:

    $ sudo apt-get install fontconfig
    $ sudo apt-get install libfreetype6
    $ sudo apt-get install libfontconfig1
    $ sudo apt-get install libstdc++6

   On RedHat/CentOS:

    $ sudo yum install fontconfig
    $ sudo yum install freetype
    $ sudo yum install libfreetype.so.6
    $ sudo yum install libfontconfig.so.1
    $ sudo yum install libstdc++.so.6

3. Download the compiled PhantomJS executable.

    $ sudo wget https://bitbucket.org/ariya/phantomjs/downloads/phantomjs-1.9.7-linux-x86_64.tar.bz2

4. Extract the files.

    $ sudo tar xjf phantomjs-1.9.7-linux-x86_64.tar.bz2

5. Copy the PhantomJS binary to an accessible bin directory. You should choose a bin directory that is common to most user environments.

    $ sudo cp phantomjs-1.9.7-linux-x86_64/bin/phantomjs /usr/local/bin

6. Verify the phantomjs command is accessible to the platfora user.
    $ which phantomjs
    /usr/local/bin/phantomjs

   If the command is not found, add the bin directory to the platfora user's environment:

    $ echo 'export PATH=/usr/local/bin:/usr/sbin:$PATH' >> /home/platfora/.bashrc
    $ source /home/platfora/.bashrc

7. Install the OpenSans font for use by the PDF feature.

   a) Make a directory to contain the typeface.

    $ sudo mkdir -p /usr/share/fonts/truetype

   b) Copy the font to the truetype directory.

    $ sudo cp -r $PLATFORA_HOME/server/webapps/proton/dist/fonts/OpenSans /usr/share/fonts/truetype

   c) Refresh the font cache.

    $ sudo fc-cache -f

After installing, verify that PhantomJS runs correctly. One easy way to do this is to run one of the examples that came with the PhantomJS tarball:

    $ phantomjs phantomjs-1.9.7-linux-x86_64/examples/hello.js
    Hello, world!

You can also output a PDF to verify the fonts were installed correctly. To output to PDF, choose Share > Prepare PDF for Download from an open vizboard. In the example PDF output below, the left side shows the output when the fonts are installed. The right side was rendered without the proper fonts installed:

Chapter 5
Configure Environment on Platfora Nodes

This section describes how to configure a Platfora node's operating system and network environment. You should perform these tasks on every node in the Platfora cluster (master and workers) after you have installed the Platfora dependencies and software, but before you initialize Platfora (or initialize a new worker node).

Topics:
• Install the MapR Client Software (MapR Only)
• Configure Network Environment
• Configure Passwordless SSH
• Synchronize the System Clocks
• Create Local Storage Directories
• Verify Environment Variables

Install the MapR Client Software (MapR Only)

If you are using MapR as your Hadoop distribution, you must install the MapR client software on all Platfora nodes (master and workers).
If you are not using MapR with Platfora, you can skip this task.

Platfora uses the MapR client to submit MapReduce jobs and file system commands directly to the MapR cluster. For more information about the MapR client, see the MapR documentation.

If you use MapR 4.1, Platfora requires that you install the MapR 4.0.2 client software.

You must have root or sudo permissions to install the MapR client.

Installing the MapR Client on Ubuntu

1. Add the following line to the /etc/apt/sources.list file:

    deb http://package.mapr.com/releases/version/ubuntu/ mapr optional

   Platfora supports MapR client versions: v3.0.x, v3.1.1, v4.0.x.

2. Update the repository and install the MapR client:

    $ sudo apt-get update
    $ sudo apt-get install mapr-client

3. Configure the MapR client, where clusterName is the name of your MapR cluster and cldbhost is the hostname of the MapR CLDB node (7222 is the default CLDB port):

    $ sudo /opt/mapr/server/configure.sh -N clusterName -c -C cldbhost:7222

4. Check if the /opt/mapr/hostname file exists on the node.

    $ sudo ls /opt/mapr

   If the file doesn't exist, create it:

    $ hostname -f | sudo tee /opt/mapr/hostname

5. Set the PLATFORA_HADOOP_LIB environment variable. For example (check the path for your version of the MapR client):

    $ echo "export PLATFORA_HADOOP_LIB=/opt/mapr/hadoop/lib" >> $HOME/.bashrc

Installing the MapR Client on RedHat/CentOS

1. Create the file /etc/yum.repos.d/maprtech.repo with the following contents:

    [maprtech]
    name=MapR Technologies
    baseurl=http://package.mapr.com/releases/version/redhat/
    enabled=1
    gpgcheck=0
    protect=1

   Platfora supports MapR client versions: v3.0.x, v3.1.1, v4.0.x.

2. Install the MapR client. For example, on 64-bit operating systems:

    $ sudo yum install mapr-client.x86_64

3. Configure the MapR client, where clusterName is the name of your MapR cluster and cldbhost:port is the hostname and port of the MapR CLDB node:

    $ sudo /opt/mapr/server/configure.sh -N clusterName -c -C cldbhost:port

4. Check if the /opt/mapr/hostname file exists on the node.

    $ sudo ls /opt/mapr

   If the file doesn't exist, create it:

    $ hostname -f | sudo tee /opt/mapr/hostname

5. Set the PLATFORA_HADOOP_LIB environment variable. For example (check the path for your version of the MapR client):

    $ echo "export PLATFORA_HADOOP_LIB=/opt/mapr/hadoop/lib" >> $HOME/.bashrc

Configure Network Environment

A Platfora node needs to be able to connect to other Platfora nodes over the network, and to the Hadoop services it needs. This section describes how to check the network connections between nodes, and make sure the required ports are open to connections from a Platfora node.

Configure /etc/hosts File

The /etc/hosts file is a system file that identifies the hostnames and IP addresses of other machines in the network so that they can find each other.

Platfora uses the /etc/hosts system file to find other nodes over the network. This means that each node in a Platfora cluster must have the same entries. When you add, change, or remove a node, you should update the /etc/hosts file on all Platfora nodes. For on-premise Hadoop installations, you will also need to specify the address of your Hadoop NameNode.

A typical /etc/hosts file on a Platfora node might look something like this:

    # Platfora IP   Hostname            Alias
    127.0.0.1       localhost
    10.202.123.45   ip-10-202-123-45    platfora-master
    10.202.123.46   ip-10-202-123-46    platfora-worker-1
    10.202.123.47   ip-10-202-123-47    platfora-worker-2
    10.202.123.48   ip-10-202-123-48    platfora-worker-3

    # Hadoop IP     Hostname            Alias
    10.202.123.55   ip-10-202-123-55    hadoop-namenode

Platfora relies on the IP address associated with a node's network interface.
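Because every Platfora node must carry the same /etc/hosts entries, one way to spot drift is to compare a node's file against a copy of the master's file. The following is a minimal sketch and not part of the Platfora tooling: the function names normalize_hosts and hosts_match are hypothetical, and it assumes the remote file has already been copied to a local path (for example with scp).

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not shipped with Platfora) for comparing hosts files.

# normalize_hosts FILE
# Print the entries of FILE with comments and blank lines removed,
# runs of whitespace collapsed to single spaces, and lines sorted.
normalize_hosts() {
  sed -e 's/#.*$//' "$1" | awk 'NF { $1 = $1; print }' | sort
}

# hosts_match FILE_A FILE_B
# Succeed (exit status 0) only if both files contain the same entries,
# ignoring comments, spacing, and line order.
hosts_match() {
  [ "$(normalize_hosts "$1")" = "$(normalize_hosts "$2")" ]
}

# Example use (paths are assumptions):
#   scp platfora-master:/etc/hosts /tmp/hosts.master
#   hosts_match /etc/hosts /tmp/hosts.master || echo "/etc/hosts differs from master"
```

The comparison is deliberately strict: any extra or missing entry counts as a mismatch, which matches the requirement that all nodes share the same entries.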
Host File Configuration on Amazon EC2 Instances

If you are installing your Platfora nodes on Amazon EC2 instances, the entries in the /etc/hosts file should use the Amazon internal IP addresses and hostnames.

If you are using standard EC2 instances, the internal IP address is associated with the network interface of the instance. When you stop or restart a standard EC2 instance, its internal IP address and hostname change. This means that whenever you stop and restart an instance, you'll need to update the /etc/hosts files to reflect the new internal IP addresses and hostnames that are assigned to the instance.

Platfora recommends using virtual private cloud (VPC) EC2 instances to run your Platfora nodes. EC2-VPC instances maintain their assigned internal IP address and hostname through restarts.

Amazon Elastic MapReduce (EMR) Hadoop instances are launched on EC2-VPC instances by default. You do not need to put Hadoop node entries in your Platfora node /etc/hosts files if you are using EMR as your Hadoop distribution. You only need Hadoop entries if you are running your own managed Hadoop cluster on EC2.

Verify Connectivity Between Platfora Nodes

The Platfora master and worker nodes must be able to accept incoming network connections from each other on the ports designated for Platfora intra-node communications. This section explains how you can test network connectivity between Platfora nodes and verify that the required ports are open.

In a multi-node Platfora cluster, all of the nodes must be able to connect to each other over the network. Platfora services use certain ports for intra-node communications. Before you initialize Platfora, you should decide what ports to use for these services, and make sure that they are open to connections from other Platfora nodes.
The following table shows the default Platfora ports:

    Platfora Service                          Default Port   Allow connections from…
    Master Web Services Port (HTTP)           8001           External user network, Platfora worker servers, localhost
    Secure Master Web Services Port (HTTPS)   8443           External user network, Platfora worker servers, localhost
    Master Server Management Port             8002           Platfora worker servers, localhost
    Worker Server Management Port             8002           Platfora master server, other Platfora worker servers, localhost
    Master Data Port                          8003           Platfora worker servers, localhost
    Worker Data Port                          8003           Platfora master server, other Platfora worker servers, localhost
    Master PostgreSQL Database Port           5432           Platfora worker servers, localhost

One way to verify that these ports are open to connections from another Platfora node is to use the telnet command. For example, to test if port 8002 was open on a remote node with the IP address 10.10.10.9, you could run the following command to test the connection:

    $ telnet 10.10.10.9 8002

If a connection is not allowed, you will need to configure the firewall on your Platfora nodes to open the appropriate ports and allow incoming connections from the other Platfora nodes. On Amazon EC2 instances, you may need to configure the port firewall rules on the Platfora server instances in addition to the EC2 Security Group Settings.

Verify Connectivity to Hadoop Nodes

The Platfora master and worker nodes must be able to connect to certain Hadoop services. This topic explains how you can test network connectivity between Platfora nodes and an on-premise Hadoop installation to verify that the required ports are open.
The following table shows the default Hadoop service ports that Platfora needs open. Default ports are listed by Hadoop distribution:

    Hadoop Service                    CDH, HDP, Pivotal    Apache Hadoop   MapR    Allow connections from…
    HDFS NameNode                     8020                 9000            N/A     Platfora master and worker servers
    HDFS DataNodes                    50010                50010           N/A     Platfora master and worker servers
    MapRFS CLDB                       N/A                  N/A             7222    Platfora master and worker servers
    MapRFS DataNodes                  N/A                  N/A             5660    Platfora master and worker servers
    MRv1 JobTracker                   8021                 9001            9001    Platfora master server
    MRv1 JobTracker Web UI            50030                50030           50030   External user network (optional)
    YARN ResourceManager              8032                 8032            8032    Platfora master server
    YARN ResourceManager Web UI       8088                 8088            8088    External user network (optional)
    YARN Job History Server           10020                10020           10020   Platfora master server
    YARN Job History Server Web UI    19888                19888           19888   External user network (optional)
    HiveServer Thrift Port            10000                10000           10000   Platfora master server
    Hive Metastore DB Port (4)        9083, 9933 (HDP2)    N/A             9083    Platfora master server

To determine the interfaces and ports your particular Hadoop cluster is using for its file system and data processing services, look at the core-site.xml and mapred-site.xml or yarn-site.xml configuration files on your Hadoop NameNode (typically located in Hadoop's conf directory).

One way to verify that these ports are open to connections from a Platfora node is to use the telnet command. For example, to test if port 8020 was open on the Hadoop NameNode with the IP address 10.10.10.9, you could run the following command to test the connection:

    $ telnet 10.10.10.9 8020

If a connection is not allowed, you will need to configure the firewall on your Hadoop nodes to open the appropriate ports and allow incoming connections from the Platfora nodes. Also, make sure your Hadoop services are actually up and running.
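If telnet is not installed on a node, the same kind of check can be scripted. The sketch below is an alternative under stated assumptions, not a Platfora utility: it relies on bash's /dev/tcp pseudo-device and the coreutils timeout command being available, and the function names port_open and check_ports are hypothetical. The host and ports in the usage comment are the example values used in this section.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not shipped with Platfora) for testing TCP ports.

# port_open HOST PORT [TIMEOUT_SECONDS]
# Succeed if a TCP connection to HOST:PORT can be opened within the timeout
# (default 3 seconds). Uses bash's /dev/tcp redirection in a child shell.
port_open() {
  local host=$1 port=$2 wait=${3:-3}
  timeout "$wait" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}

# check_ports HOST PORT...
# Report each port as open or closed; return non-zero if any port failed.
check_ports() {
  local host=$1
  shift
  local port status=0
  for port in "$@"; do
    if port_open "$host" "$port"; then
      echo "$host:$port open"
    else
      echo "$host:$port CLOSED or unreachable"
      status=1
    fi
  done
  return "$status"
}

# Example use (address and ports taken from the examples in this section):
#   check_ports 10.10.10.9 8020 50010 8032
```

A closed result only says that the connection attempt failed. It does not distinguish a firewall rule from a service that is not running, so check both, as the text above advises.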
Note for Amazon Users: If you are using Amazon Elastic MapReduce (EMR) as your Hadoop cluster, the EC2 Security Group Settings are sufficient to allow connectivity between Platfora instances on EC2 and the EMR instances. If you are running your own Hadoop cluster on designated Amazon EC2 instances, you may need to configure the port firewall rules on the Hadoop server instances in addition to the EC2 Security Group Settings.

4 If connecting to Hive directly using JDBC

Open Firewall Ports

If using firewall software in your network, you must open the required ports in the firewall software to allow incoming connections from the other servers in your Platfora and Hadoop clusters. On Amazon EC2 clusters, this is in addition to configuring your EC2 security group settings.

For a list of the default Platfora and Hadoop ports, see Port Configuration Requirements. The process to open a firewall port depends on your server's operating system.

For RedHat/CentOS Servers:

Add a line to the /etc/sysconfig/iptables file to open the required port. For example (for port 8001):

    -A INPUT -m state --state NEW -m tcp -p tcp --dport 8001 -j ACCEPT

Restart the firewall for your changes to take effect. For example:

    $ sudo /etc/init.d/iptables restart

For Ubuntu Servers:

Open the required port in the firewall. For example (for port 8001):

    $ sudo ufw allow 8001

Configure Passwordless SSH

The Platfora management utilities require a trusted-host environment (the ability to SSH to a remote system in the Platfora cluster without a password prompt). Even in single-node installations, you must exchange SSH keys for the localhost.

Verify Local SSH Access

This task confirms that local SSH access was set up correctly during installation. If it wasn't, then you'll have to configure it before initializing Platfora.
If you installed Platfora using the RPM packages, the package installer creates the platfora user, generates an SSH key, and authorizes it for the localhost. If you installed using the TAR package, you should have done these steps manually as part of installing the dependencies.

Test that you can SSH to localhost without a password prompt.

    $ su - platfora
    $ ssh localhost

If prompted to add localhost to the list of known hosts, enter yes:

    The authenticity of host 'localhost (127.0.0.1)' can't be established...
    Are you sure you want to continue connecting (yes/no)? yes

If you get a password prompt, see Generate and Authorize an SSH Key.

Exchange SSH Keys (Multi-Node Only)

In multi-node installations of Platfora, each Platfora node must have the public SSH key for itself and all other nodes in the Platfora cluster in its list of authorized keys. This task applies only when adding a worker node to a Platfora cluster.

You must exchange SSH keys between all Platfora nodes as the platfora user (master and all worker nodes). This procedure should be executed from each new worker node that you add to the Platfora cluster.

Before you can exchange an SSH key, you have to generate and authorize it. If you installed Platfora using the RPM packages, the installer should have done this for you automatically. See Verify Local SSH Access to confirm this was set up correctly. If you installed using the TAR package, you should have done this prior to installing the Platfora software. See Generate and Authorize an SSH Key.

1. Make sure you are logged in to the Platfora worker node as the platfora system user.

    $ su - platfora

2. Copy the public key of the current worker node to the other Platfora nodes in the cluster (master and other worker nodes).
   If you have password authentication enabled between the Platfora hosts, you can add the public key to each of the remote hosts as follows:

    $ ssh-copy-id platfora@master_hostname
    $ ssh-copy-id platfora@worker1_hostname
    $ ssh-copy-id platfora@worker2_hostname

   If password authentication is not enabled between hosts (such as on Amazon EC2 instances), log in to each remote server in a separate terminal session and copy and paste the public key of the current worker host into each remote server's authorized_keys file.

3. Copy the public keys from all other Platfora nodes (master and other worker nodes) to the current worker node. One way to do this is to copy the entire contents of the master's authorized_keys file (which should have the keys of all nodes in the cluster) into the current node's authorized_keys file.

   If you have password authentication enabled between the Platfora hosts, you can copy the master's authorized_keys file to the current node's authorized_keys file as follows:

    $ scp platfora@master_hostname:/home/platfora/.ssh/authorized_keys ~/.ssh/authorized_keys

   If password authentication is not enabled between hosts (such as on Amazon EC2 instances), log in to the master server in a separate terminal session, copy the contents of its authorized_keys file, and paste them into the ~/.ssh/authorized_keys file of the current node.

4. Test that you can ssh to the other Platfora nodes without a password prompt. For example (if prompted to add the other host to the list of known hosts, enter yes):

    $ ssh worker_hostname
    The authenticity of host 'worker_hostname (110.123.4.5)' can't be established...
    Are you sure you want to continue connecting (yes/no)? yes

Synchronize the System Clocks

Platfora uses NTP (Network Time Protocol) to synchronize the system clocks on the Platfora servers.

Network Time Protocol (NTP) ensures that the system clocks on your Platfora servers stay accurate.
Accurate system clocks are important for consistent timestamps in your Platfora log files and for accurate scheduling of lens builds. See www.ntp.org for more information about using NTP.

Synchronizing the system clock involves installing the NTP software, making sure all Platfora servers are using the same list of NTP time servers (as configured in /etc/ntp.conf), and starting the NTP daemon (ntpd).

1. Install the NTP software.

   On RedHat/CentOS:

    $ sudo yum install ntp

   On Ubuntu:

    $ sudo apt-get install ntp

2. Verify that NTP is configured to use the correct time server for your network in /etc/ntp.conf.

3. Start the NTP daemon service. On Ubuntu, the service is typically named ntp rather than ntpd.

    $ sudo service ntpd start

Create Local Storage Directories

The Platfora server needs local file system locations for its data files and configuration files. These must be the same locations on all Platfora servers.

When you add a worker node, the locations used on the master are created on the worker node for you, provided the platfora system user has write access to these locations. If it does not, you'll have to create these locations on the worker nodes ahead of time.

Create the Platfora Data Directory

Each Platfora server needs a location where it can store its data and work files. This location should have enough disk space to accommodate the Platfora server log files, the metadata catalog database, and materialized lens data.

This directory must be writable by the platfora system user. For example:

    $ mkdir /data/platfora_data

Set the PLATFORA_DATA_DIR environment variable for the platfora system user, for example:

    $ echo "export PLATFORA_DATA_DIR=/data/platfora_data" >> $HOME/.bashrc

Create the Platfora Configuration Directory

Each Platfora server needs a location where it can store its configuration files.

This directory must be writable by the platfora system user.
For example:

    $ mkdir /home/platfora/platfora_conf

Set the PLATFORA_CONF_DIR environment variable for the platfora system user, for example:

    $ echo "export PLATFORA_CONF_DIR=/home/platfora/platfora_conf" >> $HOME/.bashrc

Source the ~/.bashrc file.

    $ source $HOME/.bashrc

Verify Environment Variables

The Platfora installation uses several system environment variables which are typically set during the installation process. These environment variables are used by the Platfora software to determine the location of various directories and files.

Verify the platfora user environment by looking at the .bashrc file in the platfora user's home directory.

    Variable Name                     Description
    PLATFORA_HOME                     Location of the Platfora installation files.
    PLATFORA_DATA_DIR                 Location of the Platfora data directory containing the metadata catalog, lens data, and work files.
    PLATFORA_CONF_DIR                 Local directory where Platfora stores its configuration files.
    HADOOP_CONF_DIR                   Location of the local Hadoop configuration files that Platfora uses to connect to the various Hadoop services.
    JAVA_HOME                         Location of the Java installation on your system.
    PATH                              Locations of system executables.
    LD_LIBRARY_PATH                   Locations of system library files. If you use data compression, make sure that LD_LIBRARY_PATH also contains the paths to the compression libraries you are using.
    PLATFORA_HADOOP_LIB (MapR Only)   Location of the MapR client library files for Hadoop. Only needed if you are using MapR.

Chapter 6
Configure Platfora for Secure Hadoop Access

This section describes how to configure a Platfora node to authenticate to a Hadoop cluster that has been configured to run in secure mode. If you are not using Kerberos-protected secure Hadoop services, or if you are using Amazon EMR, you can skip the tasks in this section.
Topics:
• About Secure Hadoop
• Configure Kerberos Authentication to Hadoop
• Configure Secure Impersonation in Hadoop

About Secure Hadoop

By default, Hadoop runs in non-secure mode, meaning users and clients can connect to Hadoop services without providing authentication credentials. If you have configured your Hadoop cluster to run in secure mode, each client connection needs to be authenticated by Kerberos in order to use Hadoop services.

The Hadoop services leverage Kerberos to perform user authentication on all remote procedure calls (RPCs). Group resolution is performed on the Hadoop NameNode, JobTracker, and ResourceManager. Tasks run under the account of the user who submitted the job.

The Platfora master node accesses Hadoop services when:
• Connecting to Hadoop data sources.
• Defining datasets in the Platfora data catalog.
• Submitting and monitoring lens build jobs.

The Platfora worker nodes access the Hadoop file system when:
• Copying lens data output to Platfora.

Platfora acts as a client of the Hadoop file system and data processing services. It connects to these services using the platfora system user account. In order for Platfora to access a secure Hadoop cluster, this platfora user must be authenticated by Kerberos.

Consult your Hadoop vendor documentation for enabling secure Hadoop. After secure Hadoop is enabled, Platfora is just another Kerberos client that you add to your secure Hadoop environment.

Configure Kerberos Authentication to Hadoop

Platfora supports Kerberos authentication to secure Hadoop services. To enable access to secure Hadoop, you must configure each Platfora server as a Kerberos client in the same realm as your secure Hadoop services.
Obtain Kerberos Tickets for a Platfora Server

To configure Kerberos authentication between Platfora and Hadoop, you will need to request a Kerberos ticket as the system user that runs the Platfora server (i.e. the platfora user).

You can configure the Kerberos client software to request a ticket for this user at login. This will allow Platfora to access Kerberos-protected Hadoop services once the platfora system user has successfully logged in to the operating system.

This guide does not provide instructions for installing and configuring the Kerberos client software. See your Linux operating system documentation for detailed instructions. Guides for CentOS and Ubuntu can be found online.

See your Hadoop vendor's documentation for creating Kerberos principals and keytabs for Hadoop client services. You should follow the same procedure to create a Kerberos service principal name and keytab file for each Platfora node.

Auto-Renew Kerberos Tickets for a Platfora Server

In addition to the standard Kerberos client software, Platfora recommends also installing the kstart package on the Platfora server machines, and using the k5start utility to start a daemon process to maintain the Kerberos ticket cache for the Platfora server principal. Otherwise, the Platfora server will be denied access to Kerberos-enabled Hadoop services whenever its issued Kerberos ticket expires.

To enable automatic renewal of the Kerberos ticket for the Platfora server:

1. Install the kstart package.

2. Before starting the Platfora server, run the k5start utility.

   For example, use the keytab file to obtain a ticket granting ticket (TGT) for the principal platfora/myrealm.com (the principal name as specified in the keytab file).
   The lifetime is 10 hours and the program wakes up every 10 minutes to check if the ticket is about to expire:

    $ sudo k5start -f keytab -K 10 -l 10h platfora/myrealm.com

If a ticket expires and is re-issued in the middle of a lens build job, the Platfora System page may show a Kerberos authentication failure. However, failed authentication attempts are always retried, and Hadoop usually completes the job as expected despite the initial authentication failure.

Configure Secure Impersonation in Hadoop

If your Hadoop cluster runs in secure mode, you can do additional configuration in Hadoop to enable secure impersonation. Secure impersonation allows a given Hadoop superuser to submit jobs or access files on behalf of another user. Secure impersonation is used in conjunction with Platfora's HDFS Delegated Authorization feature.

Secure impersonation is not required to access a secure Hadoop cluster. You can configure Platfora to authenticate to a secure Hadoop cluster without using secure impersonation. All tasks initiated by Platfora are performed as the platfora system user in that case.

Secure impersonation is required to use Platfora's HDFS Delegated Authorization feature. This allows the platfora system user to submit tasks on behalf of another user. The Platfora server uses its Kerberos credentials to authenticate to Hadoop. However, file system accesses and tasks are authorized as the user who is logged in to the Platfora application.

To use Platfora's HDFS Delegated Authorization feature, you must do the following to enable secure impersonation in your Hadoop environment:

• Add the platfora system user to the HDFS supergroup on all Hadoop nodes.
• Create a /user/username directory in HDFS for each proxied user that is owned by that user.
• Grant read access on the appropriate source data files and directories in HDFS to the proxied user groups.
• Enable the secure impersonation properties for the platfora superuser in the core-site.xml file on your Hadoop nodes. For example:

<property>
  <name>hadoop.proxyuser.platfora.groups</name>
  <value>marketing,sales</value>
  <description>Allow the superuser 'platfora' to impersonate any users in the groups named 'marketing' or 'sales'. These groups should map to the LDAP groups registered in Platfora.</description>
</property>
<property>
  <name>hadoop.proxyuser.platfora.hosts</name>
  <value>*</value>
  <description>The superuser 'platfora' can connect from any host to impersonate a user</description>
</property>

Chapter 7
Initialize Platfora Master Node

This section describes how to set up a new Platfora cluster by initializing the master node. Once the Platfora master node is up and running, you will have a fully functioning single-node Platfora cluster. You can then use the master node to add the worker nodes into the cluster.

Topics:
• Connect Platfora to Your Hadoop Services
• Initialize the Platfora Master
• Troubleshoot Setup Issues

Before you initialize the Platfora master, make sure you have done all the tasks described in Install Platfora Software and Dependencies and Configure Environment on Platfora Nodes.

Connect Platfora to Your Hadoop Services

In order to initialize a new Platfora cluster, the master node must be able to connect to the Hadoop services it needs. This section explains how to configure Platfora to connect to your Hadoop file system and data processing services. This process is different depending on the type of Hadoop deployment you have.

Understand How Platfora Connects to Hadoop

The Platfora servers use native Hadoop protocols to connect to Hadoop services using remote procedure calls (RPC). Platfora is a client of Hadoop, and uses the standard Hadoop configuration files to connect to its services.
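To make the idea of a configuration-driven client more concrete, the following is a minimal sketch of how a connection endpoint can be read out of a Hadoop-style configuration file from the shell. The function name hadoop_conf_value is hypothetical and is not part of Platfora or Hadoop. It assumes each property is written as a plain <name> element immediately followed by its <value> element, with nothing in between, and it does not trim whitespace inside the value.

```shell
#!/bin/sh
# hadoop_conf_value FILE PROPERTY_NAME
# Hypothetical helper (not a Platfora or Hadoop command): prints the <value>
# of one property from a Hadoop *-site.xml file, or nothing if it is absent.
# Assumes <name>...</name> is directly followed by <value>...</value>.
hadoop_conf_value() {
    # Join the file into a single line so that <name> and <value>
    # may sit on separate lines in the file, then extract the value.
    { tr '\n' ' ' < "$1"; echo; } |
        sed -n "s|.*<name>$2</name>[[:space:]]*<value>\([^<]*\)</value>.*|\1|p"
}
```

For example, hadoop_conf_value path/to/core-site.xml fs.defaultFS would print the file system URI if that property is set. This is only a quick inspection aid; a real XML parser is the safer choice for anything beyond a spot check.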
Platfora uses the Hadoop configuration files to connect to Hadoop. These files must be in a local directory on the Platfora master node. You can either obtain a copy of these files from your Hadoop environment or recreate these files with the minimum required properties.

If you are using Amazon Elastic MapReduce (EMR) as your primary Hadoop distribution, you only need the core-site.xml file to connect to Amazon S3. You then set Platfora configuration properties to connect to EMR for data processing services.

Hadoop File: core-site.xml
  Description: Platfora uses the core-site.xml configuration file to connect to the distributed file system service for your Hadoop deployment. For example: HDFS for Cloudera and Hortonworks, MapRFS for MapR, or S3 for Amazon EMR.
  Connects to: On-Premise Hadoop: HDFS NameNode. Amazon EMR: S3 Bucket.

Hadoop File: hdfs-site.xml
  Description: Platfora uses the hdfs-site.xml configuration file to configure how Platfora data is stored in the remote Hadoop distributed file system (HDFS).
  Connects to: On-Premise Hadoop: HDFS NameNode. Amazon EMR: not used.

Hadoop File: mapred-site.xml
  Description: Platfora uses the mapred-site.xml configuration file to connect to the MapReduce JobTracker service and to pass in runtime properties for lens build MapReduce jobs. This file is required for Hadoop deployments using MapReduce v1 or YARN.
  Connects to: On-Premise Hadoop: MapReduce JobTracker. Amazon EMR: not used.

Hadoop File: yarn-site.xml
  Description: Platfora uses the yarn-site.xml configuration file to connect to the YARN ResourceManager service and to pass in runtime properties for map and reduce task containers. This file is required for Hadoop deployments using YARN.
  Connects to: On-Premise Hadoop: YARN ResourceManager. Amazon EMR: not used.

Hadoop File: hive-site.xml
  Description: You only need to configure a hive-site.xml file if you plan to use Hive as a data source for Platfora.
  Connects to: Hive Metastore

Obtain Hadoop Configuration Files

The easiest way to supply the configurations that Platfora needs is to obtain a copy of your configuration files from your Hadoop installation and place them in the local Platfora Hadoop configuration directory. You can then change any configurations as needed for Platfora. This task only applies to on-premise Hadoop deployments, not Amazon EMR deployments.

Platfora requires local versions of the core-site.xml, hdfs-site.xml, and mapred-site.xml files. If your Hadoop distribution supports YARN, you must also include a local yarn-site.xml file. Finally, if you choose the option to use the Hive metastore as a Platfora data source, you must also provide a hive-site.xml file.

You can copy the files directly from your Hadoop servers. The location of the Hadoop configuration files varies depending on your Hadoop installation, but they can typically be found in one of the following locations on your Hadoop NameNode:

• /etc/hadoop/conf
• $HADOOP_INSTALL/hadoop/conf
• /opt/mapr/hadoop/hadoop-version/conf

Downloading Configuration Files in Cloudera and Hortonworks

If you are using Cloudera Manager or Hortonworks Ambari Server, you can download a zip file containing your Hadoop configuration files. For example, in the Cloudera Manager Admin Console:

1. Go to Cluster/Services > Actions > Download Client Configuration.
2. Select Service > All Services.
3. Under cluster-level Actions, click Client Configuration URLs.
4. Choose the configuration files for the services needed by Platfora (HDFS, MapReduce, YARN, Hive) and download to your local system.

Create Local Hadoop Configuration Directory

This section describes the minimum Hadoop configuration properties that Platfora needs as a client of Hadoop's services. The Platfora master node machine requires a local directory where it can find copies of the standard Hadoop configuration files.
When you initialize the Platfora master, you must provide the location of a local Hadoop configuration directory.

1. Create a configuration directory location owned by the platfora system user.

$ su - platfora
$ mkdir /home/platfora/hadoop_conf

2. Set the HADOOP_CONF_DIR environment variable for the platfora system user, for example:

$ echo "export HADOOP_CONF_DIR=/home/platfora/hadoop_conf" >> $HOME/.bashrc

3. In this directory, copy or recreate the Hadoop configuration files needed for your Hadoop distribution.

core-site.xml (HDFS / MapRFS)

Platfora uses the core-site.xml configuration file to connect to the distributed file system service for your Hadoop deployment. For example: HDFS for Cloudera and Hortonworks, MapRFS for MapR.

Apache/Cloudera/Hortonworks with MapReduce v1

Platfora requires the following minimum property where namenode_hostname is the DNS hostname of your Hadoop NameNode, and hdfs_port is the HDFS server port.

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://namenode_hostname:hdfs_port</value>
  </property>
</configuration>

Apache/Cloudera/Hortonworks/Pivotal with YARN

Platfora requires the following minimum property where namenode_hostname is the DNS hostname of your Hadoop NameNode, and hdfs_port is the HDFS server port.

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://namenode_hostname:hdfs_port</value>
  </property>
</configuration>

MapR with MapReduce v1

Platfora requires the following minimum properties where cldbhost is the DNS hostname of the MapR CLDB node, and 7222 is the CLDB server port. If you are using file compression, you must also specify the compression libraries you are using.
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>maprfs://cldbhost:7222</value>
  </property>
  <property>
    <name>io.compression.codecs</name>
    <value>org.apache.hadoop.io.compress.DefaultCodec,
      org.apache.hadoop.io.compress.GzipCodec,
      org.apache.hadoop.io.compress.BZip2Codec,
      org.apache.hadoop.io.compress.DeflateCodec,
      org.apache.hadoop.io.compress.SnappyCodec
    </value>
  </property>
</configuration>

MapR with YARN

Platfora requires the following minimum properties where cldbhost is the DNS hostname of the MapR CLDB node, and 7222 is the CLDB server port. If you are using file compression, you must also specify the compression libraries you are using.

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>maprfs://cldbhost:7222</value>
  </property>
  <property>
    <name>io.compression.codecs</name>
    <value>org.apache.hadoop.io.compress.DefaultCodec,
      org.apache.hadoop.io.compress.GzipCodec,
      org.apache.hadoop.io.compress.BZip2Codec,
      org.apache.hadoop.io.compress.DeflateCodec,
      org.apache.hadoop.io.compress.SnappyCodec
    </value>
  </property>
</configuration>

hdfs-site.xml

Platfora uses the hdfs-site.xml configuration file to configure how Platfora data is stored in the remote Hadoop distributed file system (HDFS).

HDFS

This file should have at least the following content. If you want Hadoop replication enabled for Platfora lens data, increase the dfs.replication value from 1 to a higher value.
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <!-- required -->
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <!-- required for Cloudera 5.3 and later with HDFS Encryption enabled -->
  <property>
    <name>dfs.encryption.key.provider.uri</name>
    <value>kms://http@hadoop_name_node:16000/kms</value>
  </property>
</configuration>

mapred-site.xml

Platfora uses the properties in its local mapred-site.xml file to connect to the Hadoop JobTracker service, and pass in client-side configuration options for Platfora-initiated MapReduce jobs.

Any Hadoop MapReduce runtime properties can be passed along by Platfora with a lens build job configuration. See MapReduce Tuning for Platfora for a description of the required and recommended properties that Platfora needs for lens building. Any properties marked as runtime can be set in the local Platfora mapred-site.xml file instead of on the Hadoop cluster.

Apache/Cloudera/Hortonworks/MapR with MapReduce v1

Platfora requires the following minimum properties in its local mapred-site.xml file for MapReduce v1 distributions. If you are using the high-availability (HA) JobTracker feature in your Hadoop cluster, you would use the HA JobTracker properties in Platfora's mapred-site.xml file instead of just the mapred.job.tracker property.
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <!-- required -->
  <property>
    <name>mapred.job.tracker</name>
    <value>jobtracker_hostname:jt_port</value>
  </property>
  <!-- should be at least 1024m, but may be more based on memory on your Hadoop nodes -->
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx1024m</value>
  </property>
  <!-- required -->
  <property>
    <name>mapred.job.shuffle.input.buffer.percent</name>
    <value>0.30</value>
  </property>
  <!-- optional -->
  <property>
    <name>io.sort.record.percent</name>
    <value>0.15</value>
  </property>
  <!-- optional -->
  <property>
    <name>io.sort.factor</name>
    <value>100</value>
  </property>
  <!-- optional -->
  <property>
    <name>io.sort.mb</name>
    <value>256</value>
  </property>
</configuration>

Apache/Cloudera/Hortonworks/Pivotal with YARN

Platfora requires the following minimum properties in its local mapred-site.xml file for YARN distributions.
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>mapreduce.jobhistory.address</name>
    <value>yarn_rm_hostname:port</value>
  </property>
  <property>
    <name>mapreduce.jobhistory.webapp.address</name>
    <value>yarn_rm_hostname:web_port</value>
  </property>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <!-- should be at least 1024m, but may be more based on memory on your Hadoop nodes -->
  <property>
    <name>mapreduce.reduce.java.opts</name>
    <value>-Xmx1024m</value>
  </property>
  <property>
    <name>mapreduce.map.java.opts</name>
    <value>-Xmx1024m</value>
  </property>
  <property>
    <name>mapreduce.task.io.sort.factor</name>
    <value>100</value>
  </property>
  <property>
    <name>mapreduce.job.user.classpath.first</name>
    <value>true</value>
  </property>
  <!-- Needed For Hortonworks 2.2 Only -->
  <property>
    <name>hdp.version</name>
    <value>2.2.0.0-2041</value>
  </property>
  <!-- Needed For Pivotal 3.0 Only -->
  <property>
    <name>stack.version</name>
    <value>3.0.0.0-249</value>
  </property>
  <property>
    <name>stack.name</name>
    <value>phd</value>
  </property>
</configuration>

MapR with YARN

Platfora requires the following minimum properties in its local mapred-site.xml file for MapR distributions using YARN.
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>mapr.host</name>
    <value>yarn_rm_hostname</value>
  </property>
  <property>
    <name>mapreduce.jobhistory.address</name>
    <value>yarn_rm_hostname:port</value>
  </property>
  <property>
    <name>mapreduce.jobhistory.webapp.address</name>
    <value>yarn_rm_hostname:web_port</value>
  </property>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>mapr.centrallog.dir</name>
    <value>${hadoop.tmp.dir}/logs</value>
  </property>
</configuration>

yarn-site.xml

Platfora uses the properties in its local yarn-site.xml file to connect to the Hadoop ResourceManager service, and pass in client-side configuration options for Platfora-initiated YARN jobs.

Any Hadoop YARN runtime properties can be passed along by Platfora with a lens build job configuration. See YARN Tuning for Platfora for a description of the required and recommended properties that Platfora needs for lens building. Any properties marked as runtime can be set in the local Platfora yarn-site.xml file instead of on the Hadoop cluster.

All Hadoop Distributions with YARN

Platfora requires the following minimum properties in its local yarn-site.xml file for Hadoop distributions using YARN.
<configuration>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>yarn_rm_hostname:8032</value>
  </property>
  <property>
    <name>yarn.resourcemanager.webapp.address</name>
    <value>yarn_rm_hostname:8088</value>
  </property>
  <property>
    <name>yarn.resourcemanager.admin.address</name>
    <value>yarn_rm_hostname:8033</value>
  </property>
  <property>
    <name>yarn.resourcemanager.resource-tracker.address</name>
    <value>yarn_rm_hostname:8031</value>
  </property>
  <property>
    <name>yarn.resourcemanager.scheduler.address</name>
    <value>yarn_rm_hostname:8030</value>
  </property>
  <property>
    <name>mapreduce.job.hdfs-servers</name>
    <value>hdfs://yarn_rm_hostname:8020</value>
  </property>
  <!-- Adjust these properties based on available Hadoop memory resources -->
  <property>
    <name>yarn.scheduler.minimum-allocation-mb</name>
    <value>1024</value>
  </property>
  <property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>8192</value>
  </property>
</configuration>

hive-site.xml

Platfora uses a local hive-site.xml configuration file to connect to the Hive metastore service. You only need a local hive-site.xml file if you plan to use Hive as a data source for Platfora.

There are two ways to configure how clients connect to the Hive metastore service in your Hadoop environment. You can set up the HiveServer or HiveServer2 Thrift service, which allows various remote clients to connect to the Hive metastore indirectly. This is called a remote metastore client configuration, and is the configuration recommended by the Hadoop vendors. If you add a Hive data source through the Platfora web application, you can connect to the Hive Thrift service without the need for a Platfora copy of the hive-site.xml file.

Optionally, you can connect directly to the Hive metastore database using a JDBC connection. This requires that you have the login credentials for the Hive metastore database.
This is called a local metastore configuration because you are connecting directly to the metastore database rather than through a service. If you want to connect to the Hive metastore database directly using JDBC, then you must specify the connection information in a hive-site.xml file.

Platfora can only connect to a single Hive instance via a remote or a local metastore configuration.

Remote Metastore (Thrift) Server Configuration

If you are using the Hive Thrift remote metastore, in addition to the URI, you may want to include the following performance properties:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://hostname:hiveserver_thrift_port</value>
  </property>
  <property>
    <name>hive.metastore.client.socket.timeout</name>
    <value>120</value>
    <description>Number of seconds to wait for the client to retrieve all of the objects (tables and partitions) from Hive. For tables with thousands of partitions, you may need to increase this value.</description>
  </property>
  <property>
    <name>hive.metastore.batch.retrieve.max</name>
    <value>100</value>
    <description>Maximum number of objects to get from metastore in one batch. A higher number means fewer round trips to the Hive metastore server, but may also require more memory on the client side.</description>
  </property>
</configuration>

Local JDBC Configuration

To have Platfora connect directly to a local JDBC metastore requires additional configuration on the Platfora servers. Each Platfora server requires a hive-site.xml file with the correct connection information, as well as the appropriate JDBC driver installed.
Here is an example hive-site.xml to connect to a MySQL local metastore:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://hive_hostname:metastore_db_port/metastore</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive_username</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>password</value>
  </property>
  <property>
    <name>hive.metastore.client.socket.timeout</name>
    <value>120</value>
  </property>
  <property>
    <name>hive.metastore.batch.retrieve.max</name>
    <value>100</value>
  </property>
</configuration>

The Platfora server would also need the MySQL JDBC driver installed in order to use this configuration. You can place the JDBC driver .jar files in $PLATFORA_DATA_DIR/extlib to install them (requires a restart of the Platfora server).

Initialize the Platfora Master

The Platfora setup utility (setup.py) verifies your operating system environment, configures the Platfora software, and initializes the Platfora metadata catalog database. You must run this setup utility successfully before starting Platfora for the first time.

To run the setup utility:

$ $PLATFORA_HOME/setup.py

The setup.py utility prompts you for the following information about your environment.

Information Requested - Description

Platfora Configuration Directory - This is the local $PLATFORA_CONF_DIR directory location that you created earlier where Platfora will store its configuration files. For example: /home/platfora/platfora_conf.

Hadoop Distribution and Version - This tells Platfora what distribution of Hadoop you are using.
  Choose the number that corresponds to your Hadoop distribution and version.

Platfora Web Services Port - This sets the port number for the Platfora web application server. This is the port used for HTTP client connections to the Platfora application. Defaults to 8001.

Platfora Server Management Port - This sets the port number for TCP management connections between Platfora servers. This is the port used for server-to-server heartbeat and management utility connections. Defaults to 8002.

Platfora Data Transfer Port - This sets the port number for TCP data connections between Platfora servers. This is the port used for server-to-server data transfers during query processing. Defaults to 8003.

Hadoop Configuration File Directory - This is the local directory containing your Hadoop configuration files that you created earlier. For example: /home/platfora/hadoop_conf.

Platfora Data Directory - This is the local $PLATFORA_DATA_DIR directory location that you created earlier where Platfora will store its metadata catalog database, lens data, and log files. For example: /data/platfora_data.

Platfora Catalog Database Port - This is the port of the PostgreSQL database server instance where the Platfora metadata catalog database will be initialized. Defaults to 5432.

Company Name - Used as an identifier for system diagnostic bundles. If you encounter issues or problems, Platfora Support may request that you generate a system diagnostic bundle. Enter the correct company name to aid possible troubleshooting in the future.

Setup Platfora for Secure Connection - If yes, configures Platfora to use HTTPS for secure communications between the Platfora master server and web browser clients. If no, uses regular HTTP connections. See Configure SSL for Client Connections.

Send Metrics to Platfora - If yes, configures the Platfora server to send anonymous system diagnostic data to Platfora over an HTTPS connection. See About System Diagnostic Data for details.
Remote DFS Data Directory - This is the remote data directory location in the configured Hadoop file system. Setup will make sure that Platfora has write permissions to this location before proceeding.

Maximum Java Virtual Machine (JVM) Memory - The maximum JVM size allocated to the Platfora server process. On a dedicated machine, this should be about 80 percent of total system memory. Setup will use this guideline to suggest a default (M=megabytes, G=gigabytes).

Relative Platfora Disk Cache Size - When a lens is built in Hadoop, lens data files are copied over to Platfora local disk in order to improve the performance of lens queries. This sets the maximum amount of local disk space on the Platfora server to use for storing lens data. The limit is determined by taking a percentage of the total disk space capacity on the Platfora server. The default is 0.8 or 80% of total disk space.

After running the setup.py command, run the hadoop-check command to check your Hadoop settings. The hadoop-check utility verifies that Hadoop is correctly configured for use with Platfora. It also collects system information from the Hadoop cluster environment. If you are using MapR, you should run this utility as it collects important information, but be aware that it can report misleading configuration information.

Configure SSL for Client Connections

When running setup, you have the option to configure SSL connections between the Platfora master server and browser clients. If you do not have your own certificate, you can have the setup utility generate a self-signed certificate for you.

Data sent over an HTTPS connection will be encrypted regardless of whether the server certificate is CA-signed or self-signed.
However, most web browsers will only trust certificates signed by a trusted certificate authority (CA) and will display security warnings when presented with self-signed certificates. For production installations, you may want to replace the self-signed certificate with one signed by a trusted CA.

If you choose to configure SSL during setup, the setup utility will ask the following additional questions to configure SSL:

Information Requested at Setup - Description

Platfora Secure Connection - If yes, configures Platfora to use HTTPS for secure communications between the Platfora master server and web browser clients. If no, uses regular HTTP connections. If you choose to enable secure communications, you can use your own server certificate (if you have one), or you can have Platfora generate one for you.

TCP Port for HTTPS Connections - Enter the TCP port the Platfora master server should use when web browsers connect to the Platfora web application using HTTPS. Note that when telemetry is enabled, the Platfora server uses this port to securely send telemetry data to Platfora using HTTPS. Default is 8443.

KeyStore Location, Password and Type - The keystore contains the master server's private key, and its certificate with the corresponding public key. The keystore is used to provide credentials.
  Location: Accept the default location if you want Platfora to generate a keystore for you, otherwise enter the path to your own keystore.
  Password: If using your own keystore, enter the password to access that keystore, otherwise set a password for the keystore that Platfora will create.
  Type: If using your own keystore, enter the type of keystore format you are using. Allowed types are JKS (Java Keystore) or PKCS12 (Public Key Cryptography Standards #12 Keystore). If you plan to have Platfora generate the keystore for you, use JKS (the default).
Generate Self-Signed SSL Certificate - Enter Y if you do not have a server certificate and want Platfora to generate one for you. Enter N if you already have a certificate that you want to use.

TrustStore Location, Password and Type - A truststore contains certificates to trust. The truststore is used to verify certificate authority (CA) credentials.
  Location: By default, Platfora uses the default truststore that comes with your Java installation. This default truststore is already configured to trust all of the recognized certificate authorities (Verisign, Symantec, Thawte, etc.). If you have your own truststore, you can enter the path to that truststore instead.
  Password: If using the default truststore that ships with Java, the default password is changeit. If you have changed this password or are using your own truststore, enter the correct password for the truststore.
  Type: If using your own truststore, enter the type of truststore format you are using. Allowed types are JKS or PKCS12. If you use the default truststore that comes with your Java installation, use JKS (the default).

When SSL is enabled, ensure that the keystore and truststore passwords entered in Platfora always match the passwords configured at the keystore and truststore locations. Changing the passwords in Platfora does not change the passwords at the keystore or truststore locations. If the passwords entered in Platfora do not match the passwords at the keystore and truststore locations, the Platfora server fails to start.

Configure SSL for Catalog Connections

For added security, you can encrypt the communications between the Platfora worker nodes and the metadata catalog on the Platfora master node.
If you decide to enable SSL for the Platfora catalog, you must have SSL-enabled PostgreSQL installed on your Platfora master node, and OpenSSL version 1.0.1 or higher installed on all Platfora nodes (master and worker nodes).

If you are enabling this optional security feature at installation time, do so after running setup.py but before starting the Platfora servers.

1. On the Platfora master node, log in as the platfora system user.

$ su - platfora

2. Make sure the Platfora servers are not running.

$ platfora-services stop

3. Run the platfora-catalog ssl utility to configure secure connections to the catalog. For example, if using a self-signed server certificate and private key:

$ platfora-catalog ssl --enable --self

About System Diagnostic Data

During setup, you have the option to enable collection of system diagnostic data. This collects anonymous statistics about product usage and performance, and will help Platfora improve the product in future releases.

Sending system diagnostic data to Platfora is optional. A system administrator can choose to enable or disable diagnostic data collection at any time by running setup.py.

What Data is Collected?

Platfora respects your privacy and security. We do not collect any business data, only diagnostic system metrics. The Platfora server sends system metrics data over the configured SSL port.

System diagnostic data is completely anonymous. Platfora does not collect any names (data source, dataset, lens, vizboard, or user names), permissions used, or any personally identifiable information.
Here is a list of some of the diagnostic data that Platfora does collect:

• Actions taken in the UI
• Dataset size
• Lens size (estimated and actual)
• Build duration (how long a lens build took)
• Scheduled lens build times
• Client browser type
• Screen resolution
• Page load times
• REST API call duration (how long an API call took to return)
• Help files viewed
• User metrics (number of users and groups in Platfora)
• Number of logins
• Server startup time (how long it took for the Platfora server to start)
• Permissions performance metrics (how many times the system used cached permissions versus having to look up permissions in the catalog)

To see a sample of the data collected, you can look at the system diagnostic logs in $PLATFORA_DATA_DIR/telemetry.

How to Configure System Diagnostic Collection

If you decide to enable the collection of system diagnostic data, Platfora will log usage information and send the log files to Platfora Customer Support every 15 minutes by default. If an attempt to send the data fails, Platfora will only keep the logs for an hour by default to conserve disk space. The following server configuration properties can be used to configure the system diagnostics feature. Changing any of these properties requires a system restart.

• platfora.support.identifier - The name used to identify a system diagnostic bundle sent to Platfora support.
• platfora.telemetry.aggregate.frequency - The number of attempts per send interval to aggregate and send the logs to Platfora's telemetry server. Default is 1.
• platfora.telemetry.enabled - Whether or not to send collected diagnostic data to Platfora.
• platfora.telemetry.file.lifespan - The number of seconds to keep historical diagnostic data between send attempts. The default is 3600 (one hour).
• platfora.telemetry.send.frequency - The number of seconds between send intervals.
  The default is 900 (15 minutes).
• platfora.telemetry.url - The URL of Platfora's telemetry server where the diagnostic data is sent. Default is https://telemetry.platfora.com.
• platfora.telemetry.logparser.enabled - Turns on the ability to parse the log files using the Platfora application.

Troubleshoot Setup Issues

This section describes typical errors encountered during installation and setup, and how to resolve them.

View the Platfora Log Files

If you encounter errors when initializing, starting or running Platfora, check the Platfora log files. The logs can provide more information about the cause of the error. You can view the logs on the Platfora master or in the Platfora web application.

The Platfora master server log file is located at $PLATFORA_DATA_DIR/logs/platfora-server.log.

If the Platfora server is running, you can also access the Platfora server log file in the browser. Go to:

http://hostname:port/debug/view-log/125

Where hostname:port is the Platfora server hostname and web application port (8001 is the default port) and 125 is the number of bytes to display, in thousands (the default is 50, or 50,000 bytes).

Setup Fails Setting up Catalog Metadata Service

Platfora uses a PostgreSQL database to store its metadata catalog. If PostgreSQL is already running, setup will fail when it tries to start PostgreSQL. You must stop PostgreSQL, clean up the Platfora data directory, and then try again. You may also get an error if the /var/run/postgresql directory is missing or has the wrong file permissions.

When this error occurs, you might see errors such as the following from setup.py:

Command failed due to: Error occurred command: /usr/lib/postgresql/9.2/bin/pg_ctl start -D

In the PostgreSQL log file (located in $PLATFORA_DATA_DIR/logs/pg.log), you may see an error such as:

LOG: could not bind IPv4 socket: Address already in use
HINT: Is another postmaster already running on port 5432?
In the PostgreSQL log file, you may also see an error such as this:

FATAL: could not create lock file "/var/run/postgresql/.s.PGSQL.5432.lock": No such file or directory

This means that the platfora system user does not have permission to write to the location where PostgreSQL writes its lock files. Make sure to create /var/run/postgresql and give ownership to the platfora user. Note that a system reboot sometimes clears /var/run, so you may need to recreate this directory if you have rebooted your server.

1. Check if the PostgreSQL database process is running.
$ ps ax | grep postgres
2. If it is running, kill the process.
$ kill process_id
3. Make sure you have removed the automatic startup scripts for PostgreSQL, otherwise you will probably hit this error again.
On RedHat/CentOS:
$ sudo rm /etc/init.d/postgresql-9.2
On Ubuntu:
$ sudo rm /etc/rc*/*postgresql
4. Clean out the Platfora data directory location before trying setup.py again. For example:
$ rm -rf /data/PLATFORA_DATA/*
5. Make sure the /var/run/postgresql directory exists and has the correct permissions.
$ sudo mkdir /var/run/postgresql
$ sudo chown platfora /var/run/postgresql

TEST FAILED: Checking integrity of binaries

When you run the setup.py utility (which runs the platfora-syscheck utility by default), it computes a checksum of all of the files in the installation package to make sure the package is not corrupt. If you add, remove, or change any files inside the Platfora installation directory, this checksum test will fail.

When this error occurs, you might see an error such as the following when you try to initialize or upgrade Platfora using setup.py (or run a system verification check using platfora-syscheck):

Verifying System Requirements
Checking integrity of binaries......
-=-=-=-=-=-=-=-=-=-=- TEST FAILED -=-=-=-=-=-=-=-=-=-=
Reason: ....
To avoid this error, do not add, remove, or modify any files inside $PLATFORA_HOME after you have downloaded and unpacked the installation package. If you have not made any changes to the Platfora installation files, this error means that the package you downloaded may be corrupt. Contact Platfora Customer Support to obtain a new installation package.

If you have intentionally made changes to your Platfora installation and want to bypass this check when running setup.py (and you have successfully run platfora-syscheck in the past), you can skip the system checks using the --skip_syscheck option. For example:

$ setup.py --skip_syscheck

Chapter 8 Start Platfora

After installing and initializing the Platfora master server, you are ready to start Platfora. After Platfora is started, log in to the Platfora web application, upload your license, and change the default administrator password. Optionally, you may want to load the tutorial data to make sure everything is working as expected.

Topics:
• Start the Platfora Server
• Log in to the Platfora Web Application
• Add a License Key
• Change the Default Admin Password
• Load the Tutorial Data

Start the Platfora Server

After you have successfully completed setup, you are ready to start the Platfora server for the first time. Starting the Platfora server also starts the metadata catalog service (PostgreSQL).

Before you can start Platfora, make sure your Hadoop services are up and running. Platfora will not start if it cannot connect to the Hadoop file system and data processing services you have configured. PostgreSQL must be installed and in your PATH, but not running.
To start the Platfora server:

$ $PLATFORA_HOME/bin/platfora-services start

To confirm the master server has started correctly (it should be Enabled, Available, and Running):

$ $PLATFORA_HOME/bin/platfora-services status
ID  TYPE    HOST               PORT  ENABLED  STATUS     PROCESS
--------------------------------------------------------------------------
0   Master  ip-10-xxx-xxx-xxx  8002  Enabled  Available  Running

Log in to the Platfora Web Application

After Platfora is started, open a web browser and go to the URL of the Platfora master server process.

Enter the following URL in your browser location field, where hostname is the IP address or public DNS hostname of the Platfora master server and port is the HTTP web services port entered during setup (the default port is 8001):

http://hostname:port

If SSL is enabled, the Platfora web server redirects the browser to use the HTTPS port instead (8443 by default).

When prompted for a username and password, use admin and admin to log in for the first time. These are the default credentials for the Platfora System Administrator account.

After logging in for the first time, you will be prompted to accept the Platfora license agreement. You must accept the license agreement to continue.

Add a License Key

When the Platfora software is in an unlicensed state, a system administrator must upload a valid license key to activate the product functionality.

1. Go to the System > License page.
2. Click Upload.
3. Navigate to the license key file stored on your local machine and select the license key file.
4. Click OK in the message window after the license is successfully installed.
Change the Default Admin Password

After logging in to the Platfora web application for the first time, it is a good idea to change the default Platfora System Administrator password from admin to something more secure.

You can change the default System Administrator (admin) user's password and profile picture in the Platfora web application.

1. In the top right corner of the page header, open the System pull-down menu and select User Profile.
2. In the user profile dialog, click Change Password.
3. Enter a new password. Type carefully (there is no password confirmation).
4. Click Update Password.

Load the Tutorial Data

Platfora installs with some sample data that you can load to see examples of how datasets and lenses are created. Loading the sample data is also a good way to test that Platfora is working correctly with your configured Hadoop implementation.

The Platfora server has a client load utility that you can run from the command line to automatically load the sample data. This client utility creates four sample datasets and one sample lens in the Platfora web application.

If you have not received a valid license file from Platfora Customer Support and enabled it within the Platfora web application, you will not be able to load the tutorial data. You must have a valid license in order to create datasets and lenses.

Log in to the Platfora master server in a terminal session, and run the following command:

$PLATFORA_HOME/client/bin/run_python $PLATFORA_HOME/client/examples/flights/load_flights.py -u admin -p admin -s localhost:8001

If you have changed the default Platfora administrator password (admin) or web server port (8001), you will need to alter the load command to supply the correct connection information for your Platfora server.

The command does not return until the lens build job completes, which can take several minutes.
In the meantime, you can access the Platfora application in a web browser using the following URL (replace hostname with the actual IP or hostname of your Platfora master server):

http://hostname:8001

Chapter 9 Initialize a Worker Node

Worker nodes are initialized and added to a Platfora cluster by running a utility on the Platfora master node. Before you can initialize a worker node, make sure you have provisioned and configured the worker node machine.

Before you initialize a Platfora worker, you must do the following tasks on the worker node machine:

1. Install the prerequisite software directly on the worker node.
   • If using the RPM installer packages, Install Dependencies RPM Package.
   • If using the TAR installer packages, you must manually Create the Platfora System User, Set OS Kernel Parameters, and Install Dependent Software.
2. Configure Environment on Platfora Nodes.

After the worker node has been correctly provisioned, you can add it to the Platfora cluster from the master. The platfora-node add utility copies the Platfora software and configurations from the master to the worker node, starts it, and brings the node into the Platfora cluster.

1. On the master node, add the worker node to the cluster:
$ platfora-node add --host worker_hostname
2. After the command completes, check the status of the cluster. When the new child node is Enabled and Available, it is ready to serve viz queries. For example:
$ platfora-services status
ID  TYPE    HOST               MGMT_PORT  WEB_PORT  ENABLED  STATUS     PROCESS
--------------------------------------------------------------------------------------
0   Master  ip-10-xxx-xxx-xxx  8002       8001      Enabled  Available  Running
1   Child   ip-10-xxx-xxx-xxx  8002       8001      Enabled  Available  Running

A newly added node may have a status of Not Ready until it is finished copying the lens data blocks it needs from the Hadoop file system.
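The status check in step 2 could be automated along the following lines. This is a minimal sketch and not part of Platfora: the function name nodes_not_ready and the sample text are hypothetical, and the parser assumes the eight-column layout shown in the example above (ID, TYPE, HOST, MGMT_PORT, WEB_PORT, ENABLED, STATUS, PROCESS), with STATUS possibly spanning two words such as Not Ready.

```python
# Hypothetical sample of `platfora-services status` output (made-up hosts).
SAMPLE_STATUS = """\
ID TYPE HOST MGMT_PORT WEB_PORT ENABLED STATUS PROCESS
------------------------------------------------------
0 Master ip-10-0-0-1 8002 8001 Enabled Available Running
1 Child ip-10-0-0-2 8002 8001 Enabled Not Ready Running
"""


def nodes_not_ready(status_output):
    """Return (host, status) pairs for nodes that are not Enabled/Available/Running."""
    problems = []
    for line in status_output.splitlines():
        fields = line.split()
        # Data rows start with a numeric node ID; skip header, separator, and blank rows.
        if not fields or not fields[0].isdigit():
            continue
        # Assumption: eight columns, where STATUS may be more than one word.
        if len(fields) < 8:
            continue
        host = fields[2]
        enabled = fields[5]
        status = " ".join(fields[6:-1])
        process = fields[-1]
        if (enabled, status, process) != ("Enabled", "Available", "Running"):
            problems.append((host, status))
    return problems


if __name__ == "__main__":
    for host, status in nodes_not_ready(SAMPLE_STATUS):
        print(f"{host} is not ready yet (status: {status})")
```

In practice the text passed to the function would be the captured output of platfora-services status on the master node. Column positions may differ between releases, so treat the field indexes as something to verify against your own output.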
Appendix A Platfora Utilities Reference

The Platfora command-line management utilities are located in $PLATFORA_HOME/bin of your Platfora server installation. All utility commands should be executed from the Platfora master node.

Topics:
• setup.py
• hadoop-check
• hadoopcp
• hadoopfs
• install-node
• platfora-catalog
• platfora-config
• platfora-export
• platfora-import
• platfora-license
• platfora-node
• platfora-services
• platfora-syscapture
• platfora-syscheck

setup.py

Initializes a new Platfora instance or upgrades an existing one. Can also be used to reset bootstrap system configuration properties.

Synopsis

setup.py [-h] [-q] [-v] [-V]
setup.py [--hadoop_conf path] [--platfora_conf path] [--datadir path] [--dfs_dir dfs_path] [--port admin_port] [--data_port data_port] [--websvc_port http_port] [--ssl_port https_port] [--jvmsize jvm_size] [--hadoop_version string] [--extraclasspath path] [--extrajavalib path] [--skip_checks] [--skip_syscheck] [--skip_sync] [--skip_setup_ssl] [--skip_setup_dfscachesize] [--skip_setup_telemetry] [--upgrade_catalog] [--nochanges] [--verbose]

Description

The setup.py utility is run on the Platfora master node after installing the Platfora software, but before starting the Platfora server for the first time.

For new installations, setup.py:
• Runs platfora-syscheck to verify that all system prerequisites have been met.
• Confirms that you have installed the correct Platfora software package for your intended Hadoop distribution.
• Prompts for bootstrap configuration information, such as port numbers, directory locations, memory resources, secure connections, and diagnostic data collection.
• Verifies that the supplied ports are open and that permissions and disk space are sufficient on both the local and remote DFS file systems.
• Initializes the Platfora metadata catalog database in PostgreSQL.
• Creates the default System Administrator user account.
• Copies setup files to the Platfora storage location in the configured Hadoop DFS.

For upgrade installations, setup.py:
• Runs platfora-syscheck to verify that all system prerequisites have been met.
• Confirms that you have installed the correct Platfora software package for your intended Hadoop distribution.
• Displays your current bootstrap configuration settings and prompts if you want to make changes.
• Upgrades the Platfora metadata catalog database in PostgreSQL if necessary.
• Copies any updated library files to the Platfora storage location in the configured Hadoop DFS.
• Synchronizes the Platfora software and configuration files on the worker nodes in a multi-node installation.

Required Arguments

No required arguments.

Optional Arguments

-c | --hadoop_conf path
This is the local directory containing your Hadoop configuration files (such as core-site.xml and mapred-site.xml). Platfora uses the information in these files to connect to your Hadoop cluster.

-C | --platfora_conf path
This is the local directory where Platfora will store its configuration files. Defaults to $PLATFORA_CONF_DIR if set.

-d | --datadir path
This is the local directory where Platfora will store its metadata catalog database, lens data, and log files. Defaults to $PLATFORA_DATA_DIR if set.

--data_port
This is the data transfer port used during query processing on multi-node Platfora clusters. By default, uses the same port number as the master node.

--db_port port
This is the port of the PostgreSQL database instance where the Platfora metadata catalog database resides. The default PostgreSQL port is 5432.

--db_dump_path path
This is the path where the backup SQL file of the Platfora metadata catalog database will be created prior to upgrading the catalog. Defaults to the current directory.
-g | --dfs_dir dfs_path
This is the remote directory in the configured Hadoop distributed file system (DFS) where Platfora will store its library files and MapReduce output (lens data).

-j | --extraclasspath path
This is the path where the Platfora server will look for additional custom Java classes (.jar files), such as those for Hive JDBC connectors, custom Hive SerDes, or user-defined functions. These are not included in lens building in Hadoop. This option is deprecated; use $PLATFORA_DATA_DIR/extlib instead.

-l | --extrajavalib path
This is the path where the Platfora server should look for native Java libraries. These are not included in lens building in Hadoop. This option is deprecated; use $PLATFORA_DATA_DIR/extlib instead.

-n | --nochanges
On upgrade, do not prompt the user if they want to make changes to their current Platfora bootstrap configuration settings.

-p | --port admin_port
This is the server administration port used for management utility and API calls to the Platfora server. This is also the port that multi-node Platfora servers use to connect to each other. The default is 8002.

-s | --jvmsize jvm_size
The maximum amount of Java virtual machine (JVM) memory allocated to a Platfora server process. On a dedicated machine, this should be about 80 percent of total system memory. You can specify size using M for megabytes or G for gigabytes.

--skip_checks
Do not do safety checks, such as verifying ports, disk space, and file permissions.

--skip_setup_dfscachesize
Do not prompt to configure the maximum local disk space utilization for storing lens data. If this question is skipped, Platfora will set the maximum to 80 percent of the available space in $PLATFORA_DATA_DIR. When this limit is reached, lens builds will fail during the pre-fetch stage.

--skip_setup_ssl
Do not prompt to configure secure connections (SSL) between browser clients and the Platfora server.
If these questions are skipped, the default is no (do not use SSL).

--skip_sync
Do not sync the installation directory to the worker nodes.

--skip_syscheck
Do not run the platfora-syscheck utility prior to setup.

--skip_setup_telemetry
Do not prompt to disable/enable diagnostic data collection. If these questions are skipped, the default is yes (enable diagnostic data collection), and the company name is set to default (anonymous).

-t | --hadoop_version version_string
The version string corresponding to the Hadoop distribution you are using with Platfora. Valid values are cdh5 (Cloudera 5.0.x and 5.1.x), cdh52 (Cloudera 5.2.x and 5.3.x), cdh54 (Cloudera 5.4.x), mapr4 (MapR 4.0.1), mapr402 (MapR 4.0.2 and 4.1.x), emr3 (Amazon Elastic Map Reduce), HDP_2.1 (Hortonworks 2.1.x), HDP_2.2 (Hortonworks 2.2.x), pivotal_3 (PivotalHD 3.0).

--upgrade_catalog
Automatically upgrade the metadata catalog schema if necessary. The catalog update check is run by default.

-v | --verbose
Runs in verbose mode. Show all output messages.

-w | --websvc_port http_port
This is the HTTP listener port for the Platfora web application server. This is the port that browser clients use to connect to Platfora. The default is 8001.

-W | --ssl_port https_port
This is the HTTPS listener port for the Platfora web application server. This is the SSL port that browser clients use to connect to Platfora. The default is 8443.
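The --jvmsize guidance above (about 80 percent of total system memory on a dedicated machine, expressed with an M or G suffix) can be turned into a quick calculation. The following is a minimal sketch under stated assumptions: the function name suggested_jvmsize is hypothetical and not part of Platfora, the input is total system memory in megabytes, and the result is only a starting point to review before passing it to setup.py.

```python
def suggested_jvmsize(total_memory_mb, percent=80):
    """Return a candidate --jvmsize string such as '12G' or '13107M'.

    Uses integer arithmetic so the result is not affected by
    floating-point rounding.
    """
    jvm_mb = total_memory_mb * percent // 100
    # Prefer the G suffix when the value is a whole number of gigabytes.
    if jvm_mb % 1024 == 0:
        return f"{jvm_mb // 1024}G"
    return f"{jvm_mb}M"


if __name__ == "__main__":
    # 15360 MB (15 GB) of total memory gives a 12 GB JVM size.
    print(suggested_jvmsize(15360))
```

The 80 percent figure comes from the option description. Machines that also run other services would need a smaller share, which is why percent is left as a parameter.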
Examples

Run setup without doing the prerequisite checks first:

$ setup.py --skip_syscheck

Run initial setup without any prompts using the specified bootstrap configuration settings (or use the default settings when not specified):

$ setup.py --hadoop_conf /home/platfora/hadoop_conf --platfora_conf /home/platfora/platfora_conf \
  --datadir /data/platfora --dfs_dir /user/platfora --jvmsize 12G --hadoop_version cdh5 \
  --skip_setup_ssl --skip_setup_dfscachesize --skip_setup_telemetry

Run upgrade setup without any prompts and keep all previous configuration settings:

$ setup.py --upgrade_catalog --nochanges

hadoop-check

Checks the Hadoop cluster connected to Platfora to make sure it is not misconfigured. Collects information about the Hadoop environment for troubleshooting purposes.

Synopsis

hadoop-check [-h] [-v] [-vv] [-V]

Description

The hadoop-check utility verifies that Hadoop is correctly configured for use with Platfora. It also collects system information from the Hadoop cluster environment. You must complete setup.py before running this utility. Output from this utility is logged in $PLATFORA_DATA_DIR/logs/hadoop-check.log.

It performs the following checks:
• Root DFS Test. This test makes sure that Platfora can connect to the configured Hadoop file system, and that file permissions are correct on the directories that Platfora needs to write to. It also makes sure that any jar files that have been placed in $PLATFORA_DATA_DIR/extlib have the correct file permissions.
• File Codec Test. This test makes sure that Platfora has the codecs (file compression libraries) it needs to recognize and read the compression types supported in Hadoop. If Hadoop is configured to support a compression type that Platfora does not recognize, then this test will fail. You can put the jar files for any additional codecs in $PLATFORA_DATA_DIR/extlib of the Platfora server (requires a restart).
• Hadoop Host Configuration Test. This test runs a small MapReduce job on the Hadoop cluster and reports back information from the Hadoop environment. It makes sure that memory is not oversubscribed on the Hadoop MapReduce cluster.

These tests assume that all nodes in the Hadoop cluster have the same resource configuration (same amount of memory, CPU cores, etc.).

The check returns a return code (RC). A return code of 0 means all tests passed. A return code of 1 means one or more tests failed.

Root DFS Test

This test is skipped if Platfora is configured to use Amazon S3. It tests the DFS file system and returns the following information:

Total
The total disk space in the Platfora storage directory on the Hadoop file system.

Used
The used disk space in the Platfora storage directory on the Hadoop file system.

Available
The available disk space in the Platfora storage directory on the Hadoop file system.

Permissions on the Platfora DFS Directory
Whether the platfora system user has write permissions to the Platfora storage directory on the Hadoop file system (PASSED or FAILED).

File Codec Test

Codecs Installed
The file compression libraries that are installed in Hadoop.

Output compression in Hadoop Conf
Checks if the mapred-site.xml property mapred.output.compress is enabled and, if it is, makes sure the compression library specified in mapred.output.compression.codec is also installed in Platfora.

Hadoop Host Configuration Test

JobTracker Status (ResourceManager for YARN)
Ensures the server is up and running.

Black Listed Tasktrackers (NodeManagers for YARN)
Lists the number of servers marked unavailable in the Hadoop cluster.

Total Cluster Map Tasks
Total number of map task slots available. This is the value of mapred.tasktracker.map.tasks.maximum in the JobTracker for pre-YARN distributions.
This is the value of mapreduce.tasktracker.map.tasks.maximum in the ResourceManager for YARN distributions.

Map Tasks Occupied
The number of map task slots that were occupied at the time of the test.

Total Cluster Reduce Tasks
Total number of reduce task slots available. This is the value of mapred.tasktracker.reduce.tasks.maximum in the JobTracker for pre-YARN distributions. This is the value of mapreduce.tasktracker.reduce.tasks.maximum in the ResourceManager for YARN distributions.

Reduce Tasks Occupied
The number of reduce task slots that were occupied at the time of the test.

Job Submission Took
How long it took for Platfora to submit the test MapReduce job.

Hadoop Host
The host name of the JobTracker, or the host name of the ResourceManager node for YARN distributions.

Hadoop Version
The version of Hadoop that is running.

CPUs
Number of CPUs per TaskTracker node, or number of CPUs for the NodeManager in YARN distributions.

RAM
The available memory per TaskTracker, or the available memory per NodeManager in YARN distributions.

Map Slots
Maximum map task slots available.

Reduce Slots
Maximum reduce task slots available.

Hadoop Configured Memory
The configured amount of memory available to MapReduce processes. This is the maximum JVM size per task (mapred.child.java.opts) multiplied by the total number of task slots. The total number of task slots is equal to mapred.tasktracker.map.tasks.maximum plus mapred.tasktracker.reduce.tasks.maximum for pre-YARN distributions. The total number of task slots is equal to mapreduce.tasktracker.map.tasks.maximum plus mapreduce.tasktracker.reduce.tasks.maximum on YARN distributions.
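The Hadoop Configured Memory calculation described above can be sketched as a short worked example. This is an illustration only, not the hadoop-check implementation: the function names are hypothetical, the inputs are assumed to be already extracted from the Hadoop configuration (per-task JVM size in megabytes, map and reduce slot counts, and available RAM in megabytes), and the comparison reflects the oversubscription check described for the Hadoop Host Configuration Test.

```python
def configured_memory_mb(task_jvm_mb, map_slots, reduce_slots):
    """Maximum JVM size per task multiplied by the total number of task slots."""
    return task_jvm_mb * (map_slots + reduce_slots)


def is_oversubscribed(task_jvm_mb, map_slots, reduce_slots, available_ram_mb):
    """True when the configured MapReduce memory exceeds the available RAM."""
    return configured_memory_mb(task_jvm_mb, map_slots, reduce_slots) > available_ram_mb


if __name__ == "__main__":
    # Hypothetical node: 1024 MB per task, 8 map slots, 4 reduce slots.
    total = configured_memory_mb(1024, 8, 4)
    print(f"Configured memory: {total} MB")
    print("Oversubscribed on a 16384 MB node:", is_oversubscribed(1024, 8, 4, 16384))
    print("Oversubscribed on an 8192 MB node:", is_oversubscribed(1024, 8, 4, 8192))
```

With these made-up numbers, 12 slots at 1024 MB each require 12288 MB, which fits on a node with 16384 MB of RAM but not on one with 8192 MB.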
This test will fail if the Hadoop configured memory exceeds available RAM.

Required Arguments

No required arguments.

Optional Arguments

-h | --help
Shows the command-line syntax help and then exits.

-v | --verbose
Runs in verbose mode. Show all output messages.

-V | --version
Shows the software version information and then exits.

-vv
Runs in extra verbose mode.

Examples

Test and collect information from the Hadoop cluster that Platfora is configured to use:

$ hadoop-check

hadoopcp

Copies a file from one location in the configured DFS to another location in the configured DFS with the ability to transcode files.

Synopsis

hadoopcp source_dfs_uri destination_dfs_uri

Description

The hadoopcp utility allows you to copy a file residing in the remote Hadoop DFS from one location to another and optionally transcode the file. File paths must be specified in URI format using the appropriate DFS file system protocol. For example, hdfs:// for Cloudera, Apache, or Hortonworks Hadoop, maprfs:// for MapR, s3n:// for Amazon S3.

This command executes as the currently logged in system user (the platfora user, for example). The target directory location must exist, and this user must have write permissions to the directory.

Required Arguments

source_dfs_uri
The source location in a remote Hadoop file system in URI format. For example:
hdfs://hostname:[port]/dfs_path

destination_dfs_uri
The target location in a remote Hadoop file system in URI format. For example:
hdfs://hostname:[port]/dfs_path

Optional Arguments

-h
Shows the command-line syntax help and then exits.

Examples

Copy the file /mydata/foo.csv residing in HDFS to the same location in HDFS but transcode it to a gzip compressed file:

$ hadoopcp hdfs://localhost/mydata/foo.csv hdfs://localhost/mydata/foo.csv.gz

hadoopfs

Executes the specified hadoop fs command on the remote Hadoop file system.
Synopsis

hadoopfs -command

Description

The hadoopfs utility allows you to run Hadoop file system commands from the Platfora server. This is analogous to running the specified hadoop fs command on the Hadoop NameNode server.

The command executes as the currently logged in system user (the platfora user, for example). This user must have sufficient Hadoop file system permissions to perform the command.

Required Arguments

-command
A Hadoop file system shell command. See the Hadoop Shell Command Documentation for the list of possible commands.

Optional Arguments

No optional arguments.

Examples

List the contents of the /platfora/uploads directory in the configured Hadoop file system:

$ hadoopfs -ls /platfora/uploads

Remove the file /platfora/uploads/test.csv in the configured Hadoop file system:

$ hadoopfs -rm /platfora/uploads/test.csv

install-node

Copies the Platfora software and configuration directories from the current node to the specified remote node(s).

Synopsis

install-node --host hostname | --hostsfile filename [-h] [-q] [-v] [-V]

Description

The install-node utility copies the $PLATFORA_HOME directory from the current node to the specified remote nodes. It also synchronizes the configuration files in the $PLATFORA_CONF_DIR directory.

You can use the install-node utility to copy a Platfora software installation to a remote node that has not yet been added to your Platfora cluster configuration.

This utility is also called indirectly by the platfora-services sync, platfora-node add, platfora-node sync, and setup.py upgrade utilities. Platfora recommends using these utilities when adding new nodes or upgrading existing nodes in your Platfora cluster configuration.

Files are copied to the remote node as the currently logged in system user.
The $PLATFORA_HOME and $PLATFORA_CONF_DIR directory locations must exist on the remote node, and the current system user must have sufficient file system permissions to write to these locations.

Required Arguments

One of either --host or --hostsfile is required.

--host hostname
Copies the Platfora software and configuration directories to the specified host name or IP address.

--hostsfile filename
Copies the Platfora software and configuration directories to the host names or IP addresses specified in the named file, one host per line.

Optional Arguments

-h | --help
Shows the command-line syntax help and then exits.

-q | --quiet
Runs in quiet mode. Do not send output messages to STDOUT.

-v, -vv, -vvv | --verbose
Runs in verbose mode. Show all output messages.

-V | --version
Shows the software version information and then exits.

Examples

Install the Platfora software on the remote host named myremotehost by copying over the Platfora installation installed on the local host:

$ install-node --host myremotehost

platfora-catalog

Manages the Platfora metadata catalog database in PostgreSQL.

Synopsis

platfora-catalog [-h] [-q] [-v] [-V] init | start | stop | status | backup | restore | upgrade | pswd | keygen | ssl [sub-command options]

Description

Use the platfora-catalog utility to manage the Platfora metadata catalog database in PostgreSQL. When you first install and initialize Platfora using setup.py, it initializes a PostgreSQL database instance using the default PostgreSQL port (5432) and creates a platfora database in the $PLATFORA_DATA_DIR location.

You run this utility by passing one of its subcommands either directly or indirectly through the setup.py and platfora-services utilities. You can call the following subcommands directly.

Subcommand    Description
backup        Dumps the contents of the platfora catalog database to a backup file.
restore       Restores the platfora catalog database using a backup file.
pswd          Creates a new encrypted superuser password for the platfora metadata catalog database. Platfora encrypts the stored password using 128-bit AES encryption. This command is called by setup.py during new installations (in 4.1.3 and later releases). You must run platfora-services stop before running this command.
keygen        Generates a new key that is used to encrypt the password used to access the platfora metadata catalog database and re-encrypts the password using the new key. You must run platfora-services stop before running this command.
ssl           Controls whether or not worker nodes use an SSL connection to communicate with the metadata catalog database. You must run platfora-services stop before running this command.

These subcommands are called indirectly, but you can also call them directly:

Subcommand    Description
init          Initializes a new Platfora metadata catalog database. This command is called by setup.py during new installations.
start         Starts the PostgreSQL database server. This command is called by platfora-services start.
stop          Stops the PostgreSQL database server. This command is called by platfora-services stop.
status        Shows the status of the PostgreSQL database server process. This command is called by platfora-services status.
migrate       Migrates individual elements in the platfora metadata catalog database from one DFS location to another.
upgrade       Upgrades the schema in the platfora catalog to the latest installed Platfora version. This command is called by setup.py during upgrade.

Required Arguments

Requires one of the following sub-commands: init, start, stop, status, backup, restore, pswd, keygen, ssl, or upgrade. To see the arguments available with a sub-command, enter the following command-line string:

platfora-catalog sub-command --help

Optional Arguments

-h | --help
Shows the command-line syntax help and then exits.

-q | --quiet
Runs in quiet mode.
Do not send output messages to STDOUT.

-v, -vv, -vvv | --verbose
Runs in verbose mode. Show all output messages.

-V | --version
Shows the software version information and then exits.

platfora-catalog ssl

Controls whether or not worker nodes use an SSL connection to communicate with the metadata catalog database in PostgreSQL.

Synopsis

platfora-catalog ssl [-h] [--enable] [--disable] [--self] [--manual] [--cert_file certificate_file] [--key_file private_key_file]

Description

The platfora-catalog ssl command controls whether or not worker nodes use an SSL connection to communicate with the metadata catalog database in PostgreSQL. By default, SSL connections are not enabled. Note that the Platfora server must be stopped to run this command.

To enable SSL connections between worker nodes and the metadata database on the master node, the PostgreSQL database that Platfora uses must support and enable SSL.

Required Arguments

No required arguments.

Optional Arguments

-h | --help
Shows the command-line syntax help and then exits.

--enable
Specifies that worker nodes should use an SSL connection when communicating with the metadata database on the master node. When enabled, Platfora distributes the server certificate to the worker nodes every time the server starts. Enabling SSL may increase lens build times. Platfora only recommends enabling this feature if your organization's security requirements deem it necessary.

--disable
Disables SSL connections between worker nodes and the metadata database on the master node.

--self
Specifies to use a self-signed server certificate and key when enabling SSL connections. When you use this argument, Platfora generates and signs its own server certificate and private key.
--manual
Specifies to use a server certificate and private key uploaded to Platfora, typically generated by a certificate authority (CA). You must specify the certificate and private key using the --cert_file and --key_file arguments.

--cert_file certificate_file
The path and file name of the server certificate to use.

--key_file private_key_file
The path and file name of the server private key to use.

Examples

Use SSL connections between worker nodes and the PostgreSQL database using a self-signed server certificate and private key:

$ platfora-catalog ssl --enable --self

Use SSL connections between worker nodes and the PostgreSQL database using a server certificate and private key generated by a certificate authority (CA):

$ platfora-catalog ssl --enable --manual --cert_file file.crt --key_file file.key

Disable SSL connections between the worker node and the PostgreSQL database:

$ platfora-catalog ssl --disable

platfora-config

Displays the current settings of Platfora configuration properties, and allows you to update property settings. Requires one of the following sub-commands: get, set, load, server.

Synopsis

platfora-config [-h] [-q] [-v] [-V] get | reset | list | set | load | server | get_dfs_path | set_dfs_path [sub-command options]

Description

The platfora-config command is used to manage Platfora server configuration properties. The Platfora server does not need to be running to use this utility. After resetting a property, you must restart Platfora for your changes to take effect.

platfora-config must be run with one of the following sub-commands:
• get - Display all configuration properties and their current settings on the Platfora master or on the specified worker node.
• reset - Reset a configuration property to its default value on the Platfora master or on the specified worker node.
• list - Display all configuration properties and their current settings on the Platfora master or on the specified worker node. Same functionality as get.
• set - Change the value of the specified configuration property.
• load - Set the properties specified in a configuration file on the specified Platfora worker node.
• server - List the client-side Hadoop configuration property settings.
• get_dfs_path - Get the current URI path of the given datasource in the remote file system.
• set_dfs_path - Update the URI path of the given datasource in the remote file system.

Required Arguments

Requires either --help or one of the following sub-commands: get, reset, list, set, load, server, get_dfs_path, or set_dfs_path.

Optional Arguments

-h | --help
Shows the command-line syntax help and then exits.

-q | --quiet
Runs in quiet mode. Do not send output messages to STDOUT.

-v, -vv, -vvv | --verbose
Runs in verbose mode. Show all output messages.

-V | --version
Shows the software version information and then exits.

Examples

Show all configuration properties and their currently set values:
$ platfora-config get

Set a configuration property:
$ platfora-config set --key platfora.license.expirationwarningdays --value 30

Update the datasource path of the Uploads and System data sources when you are migrating Platfora to a new Hadoop NameNode:
# To get the old paths
$ platfora-config get_dfs_path --datasource System
$ platfora-config get_dfs_path --datasource Uploads
# To set the new paths
$ platfora-config set_dfs_path --datasource System \
  --old_path 'protocol://old_namenode_host:port/platfora/system' \
  --new_path 'protocol://new_namenode_host:port/platfora/system'
$ platfora-config set_dfs_path --datasource Uploads \
  --old_path 'protocol://old_namenode_host:port/platfora/uploads' \
  --new_path 'protocol://new_namenode_host:port/platfora/uploads'

platfora-export

Exports Platfora object metadata from the catalog database to one JSON file per object.

Synopsis

platfora-export [-h] [-q] [-v] [-V] --username username --password password [--server server_name] [--port port] [--protocol http|https] [--all] [--namespace namespace_name] [--export-datasources data_source_name [...]] [--export-datasets dataset_name [...]] [--export-lenses lens_name [...]] [--export-vizboards vizboard_title [...]] [--export-users user_name [...]] [--export-groups group_name [...]] [--include-referenced-datasources] [--include-referenced-datasets] [--include-referenced-lenses] [--include-referenced-segments] [--include-permissions] [--lazy-fetch] [--skip-objects-by-name object_name [...]]

Description

The platfora-export command exports Platfora object metadata from the catalog database. You can export one or more object types. When specifying an object, use its name or, for vizboards, its title. For names or titles with spaces, enclose the name in quotes. You can also export multiple objects of each type.
Separate each object with a space or use an * (asterisk) to export everything of that type.

The command exports objects to .json files in a subdirectory of the current directory. The command labels the subdirectory with the object type. Exported file names are URL-encoded along with the exported object's current version. For example, if you export the Web Logs data source, the command creates this file: datasources/Web%20Logs%20.json

If a particular filename already exists, the command silently overwrites it. Vizboards are the exception. Vizboard names need not be unique. For this reason, the export utility appends a unique identifier to the exported vizboard filename.

When using one of the --include arguments to export referenced objects of a particular type, you must include all object types in between. For example, if you export vizboards and want to include data sources (--include-referenced-datasources), you must also include lenses and datasets. If you forget to provide the proper includes, the command produces the exported object(s) you requested but none of the objects referenced by them.

Required Arguments

--username username
Username of a Platfora user account that has the appropriate object permissions on the objects to export. For example, to export an object, the user must be able to view the object in the web application.

--password password
Password for the specified user account.

Optional Arguments

-h | --help
Shows the command-line syntax help and then exits.

-q | --quiet
Runs in quiet mode. Do not send output messages to STDOUT.

-v, -vv, -vvv | --verbose
Runs in verbose mode. Show all output messages.

-V | --version
Shows the software version information and then exits.

--server server_name
Hostname or IP address for the Platfora master node. Defaults to localhost.

--port port
Port for the Platfora master node. Defaults to 8001.

--protocol http|https
Specify which protocol to use to access the Platfora server, either http or https. Defaults to https when the port ends with 443, otherwise defaults to http.

--namespace namespace_name
Export objects from the specified namespace. You can only export objects from one namespace in a single call. Defaults to default.

--export-datasources data_source_name [...]
Export the specified data source. You can list multiple names to export multiple objects. Enclose names in double quotes if they contain spaces or other special characters.

--export-datasets dataset_name [...]
Export the specified dataset. Use this flag to export segments, which are a special kind of dataset. Segments have two supporting lenses: segment members and segment refresh prerequisites. Include these using the --include-referenced-lenses flag. Enclose names in double quotes if they contain spaces or other special characters.

--export-lenses lens_name [...]
Export the specified lens. You can list multiple names to export multiple objects. Enclose names in double quotes if they contain spaces or other special characters.

--export-vizboards vizboard_title [...]
Export the specified vizboard by title. A vizboard title is the name users assign the vizboard in the Platfora web application. You can list multiple titles to export multiple objects. Enclose titles in double quotes if they contain spaces or other special characters. Vizboard titles may not be unique. If multiple vizboards use the same title, all vizboards with that title are exported, and each one is assigned a unique identifier.

--export-users user_name [...]
Export one or more users. Only administrators can export users and groups.

--export-groups group_name [...]
Export one or more groups. Only administrators can export users and groups.

--include-permissions
Export all permissions for all exported Platfora objects such as lenses or datasets.
Users and groups do not have permissions.

--include-referenced-datasources
Use this argument to export all data sources referenced by a specified object.

--include-referenced-datasets
Use this argument to export all datasets referenced by a specified object.

--include-referenced-lenses
Use this argument to export all lenses referenced by a specified object. This option applies to lens references from segments and vizboards. This option does not support following references from datasets to the lenses that use them.

--include-referenced-segments
Use this argument to export all segment datasets and segment lenses referenced by a specified vizboard object. This argument only works when exporting vizboards.

--lazy-fetch
When exporting a number of objects that is significantly less than the total number of objects in the catalog, use this argument to improve export performance. Defaults to false.

--skip-objects-by-name object_name [...]
When exporting multiple objects, use this argument to skip exporting objects with the specified names. This applies to exporting all objects with the * wildcard as well as referenced objects when using one of the --include-referenced-* arguments. By default, this command does not export system-created objects. These objects are:

Object Type     Excluded by Default
group           Everyone
user            system; admin
data sources    System; Uploads
datasets        Date; Time; Latitude, Longitude with Name; Latitude, Longitude

To override the defaults, provide an * (asterisk) or specify an object name to skip.

--all
Export the entire catalog. This flag's behavior is equivalent to:
• --export-datasources "*"
• --export-datasets "*"
• --export-lenses "*"
You must explicitly export permissions, users, and groups.
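
The rule that every object type "in between" must be included can be sketched as a small shell helper. This is an illustration only: include_flags is a hypothetical function, not part of Platfora, and the reference chain it walks (vizboards, lenses, datasets, data sources) is an assumption inferred from the vizboard example in the Description. Segments are not handled.

```shell
#!/bin/sh
# Hypothetical helper (not a Platfora utility): print the
# --include-referenced-* flags needed to reach a target object type
# from the object type being exported.
# Assumed reference chain, inferred from the guide's vizboard example:
#   vizboards -> lenses -> datasets -> datasources
include_flags() {
    start=$1
    target=$2
    flags=""
    seen_start=0
    for type in vizboards lenses datasets datasources; do
        if [ "$seen_start" -eq 1 ]; then
            # Every type after the starting type must be included.
            flags="$flags --include-referenced-$type"
            if [ "$type" = "$target" ]; then
                # Strip the leading space and print the flag list.
                echo "${flags# }"
                return 0
            fi
        fi
        if [ "$type" = "$start" ]; then
            seen_start=1
        fi
    done
    echo "no path from $start to $target" >&2
    return 1
}
```

For example, include_flags vizboards datasources prints the three flags for lenses, datasets, and data sources, which could then be appended to a platfora-export --export-vizboards call.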

Examples

Export the "event log" vizboard and the lenses, data sources, and datasets that are used by that vizboard:
$ platfora-export --username admin --password password --export-vizboards "event log" --include-referenced-lenses --include-referenced-datasets --include-referenced-datasources

Export vizboards together with their permissions:
$ platfora-export -vvv --username admin --password admin --export-vizboards "o_viz" --include-permissions

Export all data sources and the datasets used by those data sources:
$ platfora-export --username admin --password password --export-datasources "*" --include-referenced-datasets

Export three datasets together with the data sources they reference, shown with sample output:
$ platfora-export --username admin --password admin --server localhost --export-datasets "airports" "batting" "Carriers" --include-referenced-datasources
About to export datasets: [airports, batting, Carriers]
Exporting Dataset: "airports" to file: "datasets/airports.json"
Exporting Dataset: "batting" to file: "datasets/batting.json"
Exporting datasource: "hive on cdh1" to file: "datasources/hive%20on%20cdh1.json"
Exporting Dataset: "Carriers" to file: "datasets/Carriers.json"

platfora-import

Imports Platfora object metadata from one or more JSON files into the catalog database.

Synopsis

platfora-import [-h] [-q] [-v] [-V] --username username --password password [--server server_name] [--port port] [--protocol http|https] [--import-files file_name [...]] [--handle-conflicts reuse|fail] [-s] [-m]

Description

The platfora-import command is used to import Platfora object metadata from one or more JSON formatted files into the catalog database. You can obtain these files using the platfora-export command.

Only import objects from files that were exported from the same minor release. Platfora does not support importing objects exported from a different minor release.

If your system uses HDFS Delegated Authorization, the importing user must have READ permission on the underlying DFS data. If the user does not have this permission, the catalog import succeeds but the Platfora instance is unable to access the underlying data.

Each JSON file should contain a single object definition. After importing an object, the object owner is assigned the username given in the --username argument. If an object exists in both the catalog and in one of the imported JSON files, then the --handle-conflicts argument determines whether the import fails or uses the object in the catalog instead of importing the object from the JSON file.

When importing an object that references another object, the referenced object must exist either in one of the imported JSON files or in the Platfora catalog. If any referenced object doesn't exist in either location, the entire import fails.

Vizboards are a special case. They have both a title visible through the user interface (UI) and a unique name which is only used internally and is not visible in the UI. When you import a vizboard, the system assigns the vizboard a unique name and keeps the visible title unchanged. Vizboard permissions are tied to the unique name Platfora uses internally. Therefore, if you want to ensure that imported vizboards inherit the same object permissions as they did in the original Platfora catalog, you must export both vizboards and their permissions. Then you must import both the vizboards and their permissions in a single call using platfora-import.

Required Arguments

--username username
Username of a Platfora user account that has the appropriate object permissions on the objects to import. For example, to import an object the user must have Own or Edit permission on the object type.

--password password
Password for the specified user account.
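
The --protocol default that platfora-export and platfora-import share (https when the port ends with 443, otherwise http) can be sketched as a one-line pattern match. The function name default_protocol is hypothetical and only mirrors the documented rule; it is not how Platfora itself is implemented.

```shell
#!/bin/sh
# Hypothetical helper: mirror the documented --protocol default.
# https when the port ends with 443, otherwise http.
default_protocol() {
    case "$1" in
        *443) echo https ;;
        *)    echo http ;;
    esac
}
```

Under this rule, ports such as 443 and 8443 select https, while the default port 8001 selects http. Passing --protocol explicitly avoids relying on the default.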

Optional Arguments

-h | --help
Shows the command-line syntax help and then exits.

-q | --quiet
Runs in quiet mode. Do not send output messages to STDOUT.

-v, -vv, -vvv | --verbose
Runs in verbose mode. Show all output messages.

-V | --version
Shows the software version information and then exits.

--server server_name
Hostname or IP address for the Platfora master node. Defaults to localhost.

--port port
Port for the Platfora master node. Defaults to 8001.

--protocol http|https
Specify which protocol to use to access the Platfora server, either http or https. Defaults to https when the port ends with 443, otherwise defaults to http.

--handle-conflicts reuse|fail
Specifies how to handle objects that already exist in the catalog with the same name. Choose reuse to keep the existing object in the catalog and ignore the imported object with the same name. Choose fail to stop the import process without importing any object. Defaults to fail.

--import-files file_name [...]
Import the object in the file. You can list multiple names to import multiple objects. When listing multiple objects, the order does not matter. Always import users and groups together, because the two object types are interdependent. Importing a group fails unless all of its members also exist. Similarly, users are not imported unless their corresponding group exists.

--skip_checks
Skips version checks between imported data and the Platfora instance. Set this when importing JSON without metadata fields.

-s | --run-as-super-admin
Run the import job in Super Administrator mode. The specified --username must be eligible to switch to Super Administrator mode.

-m | --skip-objects-with-missing-references
Skips importing any objects that reference other objects that cannot be found. Platfora lists which objects were not imported because they reference objects that can't be found.
Search for "Warning: Removing" in the command response to find the objects that were not imported. This does not apply to users and groups.

Examples

Import the lens in the flights_lens.json file. If a lens with the same name already exists, then keep the existing lens:
$ platfora-import --username admin --password password --import-files flights_lens.json --handle-conflicts reuse

Import vizboards and the permissions associated with them:
$ platfora-import -vvv --username admin --password admin --import-files vizboards/* permissions/vizboards/*

platfora-license

Installs, uninstalls, or views a Platfora license. Requires one of the following sub-commands: install, uninstall, or view.

Synopsis

platfora-license [-h] [-q] [-v] [-V] install | uninstall | view [sub-command options]

Description

The platfora-license command is used to manage the license on Platfora. The Platfora server must be running to use this utility.

Required Arguments

Requires one of the following sub-commands: install, uninstall, or view.

Optional Arguments

-h | --help
Shows the command-line syntax help and then exits.

-q | --quiet
Runs in quiet mode. Do not send output messages to STDOUT.

-v, -vv, -vvv | --verbose
Runs in verbose mode. Show all output messages.

-V | --version
Shows the software version information and then exits.

platfora-license install

Installs a Platfora license by uploading a license key file.

Synopsis

platfora-license install --license license_file [-h]

Description

The platfora-license install command is used to upload a license key file to Platfora. The Platfora server must be running to use this utility.

Required Arguments

--license license_file
The path and license key file name to upload to the Platfora server. If no directory is specified, Platfora looks for the file in the current directory.

Optional Arguments

-h | --help
Shows the command-line syntax help and then exits.

Examples

Upload the license key file named licensekey.license to the Platfora server:
$ platfora-license install --license licensekey.license

platfora-license uninstall

Uninstalls the current Platfora license.

Synopsis

platfora-license uninstall [-h]

Description

The platfora-license uninstall command is used to uninstall the license currently installed on Platfora. The Platfora server enters the unlicensed state after running this command. The Platfora server must be running to use this utility.

Required Arguments

No required arguments.

Optional Arguments

-h | --help
Shows the command-line syntax help and then exits.

Examples

Uninstall the current license from the Platfora server:
$ platfora-license uninstall

platfora-license view

Displays the details of the currently installed license.

Synopsis

platfora-license view [-h]

Description

The platfora-license view command is used to view the details of the currently installed Platfora license. The Platfora server must be running to use this utility.

Required Arguments

No required arguments.

Optional Arguments

-h | --help
Shows the command-line syntax help and then exits.

Examples

View the current Platfora license:
$ platfora-license view

platfora-node

Starts, stops, restarts, checks, updates, disables, enables, or removes a worker node in a multi-node Platfora cluster. Requires one of the following sub-commands: add, remove, start, stop, restart, status, sync, config, enable, or disable.

Synopsis

platfora-node [-h] [-q] [-v] [-V] status | enable | stop | sync | remove | start | add | disable | config | restart [sub-command options]

Description

The platfora-node utility is used to manage worker nodes in a multi-node Platfora cluster, and is always executed from the Platfora master.
It must be run with one of the following sub-commands:
• add - Adds and initializes a new worker node to a Platfora cluster.
• remove - Removes an existing worker node from a Platfora cluster.
• start - Starts the Platfora server process on the designated worker node(s).
• stop - Stops the Platfora server process on the designated worker node(s).
• restart - Issues a stop immediately followed by a start.
• status - Shows the status of the Platfora server process on the designated worker node(s).
• disable - Takes a worker node out of operation. Disabled nodes remain in the cluster configuration, but are not available to process queries. Typically you would disable a node to do server maintenance, and then enable it again after maintenance is complete. When a node is disabled, other nodes in the cluster will take over the lens data and processing work it was responsible for serving.
• enable - Brings a disabled worker node back into operation. When a node comes up, it must retrieve the latest lens data it is responsible for serving before it will be fully available to work on queries.
• sync - Copies the Platfora software binaries from the master to the designated worker node(s).
• config - Configures the management port and host name of an existing node.

You can also use the platfora-services utility to run the start, stop, restart, status, sync, and config commands on all nodes at once. This utility is mainly used for adding new worker nodes, or taking nodes in and out of the cluster for server maintenance.

A node is identified by its unique node ID. This corresponds to the order that the node was added to the Platfora cluster configuration. Usually the master node is 0, the first worker node added is 1, the second worker node added is 2, and so on. You can run platfora-services status to see the IDs of all nodes in the Platfora cluster.
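
Node IDs can also be pulled out of the status output with a short filter. The sketch below is illustrative only: list_node_ids is a hypothetical helper, and it assumes the column layout shown in the platfora-services status example in this guide (data rows begin with a numeric ID, followed by type and host). The sample text is copied from that example rather than from a live cluster.

```shell
#!/bin/sh
# Hypothetical helper: read `platfora-services status` output on stdin
# and print "ID TYPE HOST" for each node row. Assumes rows start with a
# numeric node ID, as in the status example in this guide.
list_node_ids() {
    awk '$1 ~ /^[0-9]+$/ { print $1, $2, $3 }'
}

# Sample text copied from the platfora-services status example.
sample_status='ID TYPE HOST PORT ENABLED STATUS PROCESS
--------------------------------------------------
0 Master ip-10-212-123-456 8002 Enabled Available Running
1 Child ip-10-212-123-567 8002 Enabled Available Running
2 Child ip-10-212-123-678 8002 Enabled Not Ready Running'

# Header and separator lines are skipped; only node rows are printed.
printf '%s\n' "$sample_status" | list_node_ids
```

On a master node, the same filter could be applied to live output by piping platfora-services status into list_node_ids. Only the first three columns are used because values such as Not Ready contain spaces, which makes the later columns unsafe to split on whitespace.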

Finally, this utility ensures that the clocks on remote worker nodes are within the acceptable tolerance of the master node clock. The tolerance is 60 seconds. If a node is not within the acceptable tolerance, the utility logs an error and, depending on the context, the node is not started, added, or enabled.

Required Arguments

Requires one of the following sub-commands: add, remove, start, stop, restart, status, sync, config, enable, or disable.

Optional Arguments

-h | --help
Shows the command-line syntax help and then exits.

-q | --quiet
Runs in quiet mode. Do not send output messages to STDOUT.

-v, -vv, -vvv | --verbose
Runs in verbose mode. Show all output messages.

-V | --version
Shows the software version information and then exits.

platfora-node add

Adds a new worker node to a Platfora cluster configuration.

Synopsis

platfora-node add --host hostname [--port admin_port] [--data_port data_port] [--websvc_port web_service_port] [--datadir path] [--disabled] [--skip_syscheck] | [-h]

Description

The platfora-node add command checks the remote node for the required software, registers a new worker node in the Platfora metadata catalog, copies the Platfora installation files to the remote node, starts the Platfora server on the new node, and enables the node to begin serving query requests. This command is run from the Platfora master.

Before you can add a node to the Platfora cluster, the remote server has to be correctly provisioned with the required prerequisite software and OS configuration settings. See the Provisioning a Platfora Server section of the Platfora Installation Guide for more information.

Required Arguments

--host hostname
The host name or IP address of the new worker node to add to the Platfora cluster.

Optional Arguments

-h | --help
Shows the command-line syntax help and then exits.

--datadir path
The local directory where Platfora will store lens data and log files on the worker node. By default, uses the same $PLATFORA_DATA_DIR location as the master node.

--data_port data_port
The data transfer port used during query processing on multi-node Platfora clusters. By default, uses the same port number as the master node.

--disabled
Adds the node to the cluster configuration but in a disabled state. The node will not participate in query processing until it is enabled.

--port admin_port
The server administration port used for management utility and API calls to the Platfora server. This is also the port that multi-node Platfora servers use to connect to each other. By default, uses the same port number as the master node.

--skip_syscheck
Do not run the platfora-syscheck utility prior to adding the node.

--websvc_port web_service_port
The web service port of the Platfora application server. By default, uses the same port number as the master node.

Examples

Add a new worker node with the host name of platfora-worker-1 to the Platfora cluster:
$ platfora-node add --host platfora-worker-1

platfora-node config

Changes the configured host name and/or server administration port for a Platfora worker node.

Synopsis

platfora-node config --id number [--host hostname] [--port admin_port] | [-h]

Description

The platfora-node config command changes the configured management port and/or host name of an existing Platfora worker node.

Required Arguments

--id number
The node ID number in the Platfora catalog database. Usually the master node is 0, the first worker node added is 1, the second worker node added is 2, and so on. You can run platfora-services status to see the IDs of all nodes in the Platfora cluster.

Optional Arguments

-h | --help
Shows the command-line syntax help and then exits.

--host hostname
The updated host name or IP address of the worker node.

-p | --port admin_port
The updated server administration port.

Examples

Update the port of worker node 2:
$ platfora-node config --id 2 --port 8004

platfora-services

Starts, stops, restarts, or checks the status of Platfora server processes. Can also be used to synchronize Platfora software and configuration files in multi-node installations. Requires one of the following sub-commands: start, stop, restart, status, or sync.

Synopsis

platfora-services [-h] [-q] [-v] [-V] start | stop | restart | status | sync [sub-command options]

Description

The platfora-services utility is used to manage Platfora server processes. It must be run with one of the following sub-commands:
• start - Starts the Platfora server processes. In multi-node installations, starts the master server first and then the worker servers in sequential order.
• stop - Stops the Platfora server processes. In multi-node installations, sequentially stops the worker servers first and then the master server.
• restart - Issues a stop immediately followed by a start.
• status - Shows the status of the Platfora server processes.
• sync - Copies the Platfora software binaries and global configuration settings from the master to the worker nodes.

The following sub-commands are issued internally by the platfora-services utility. DO NOT USE without explicit directions from Platfora customer support.
• watchdog - Starts the watch dog daemon for the Platfora server process.
• launch - Includes the specified Java class in the Platfora environment.

Finally, this utility ensures the clock on the master node is not skewed. The tolerance is 60 seconds. If the master node is not within the acceptable tolerance, the utility logs an error and, depending on the context, the node process is not started, added, or enabled.

Required Arguments

Requires one of the following sub-commands: start, stop, restart, status, or sync.
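
The 60-second clock tolerance that platfora-node and platfora-services enforce can be approximated with a small comparison of epoch timestamps. This is a sketch under stated assumptions: within_clock_tolerance is a hypothetical helper, the comparison treats exactly 60 seconds as acceptable, and the guide does not document how Platfora itself measures skew.

```shell
#!/bin/sh
# Hypothetical check (not a Platfora utility): succeed when two clocks,
# given as epoch seconds, differ by no more than the tolerance.
# The 60-second default comes from this guide; treating a difference of
# exactly 60 seconds as acceptable is an assumption.
within_clock_tolerance() {
    master_epoch=$1
    node_epoch=$2
    tolerance=${3:-60}
    diff=$((master_epoch - node_epoch))
    # Take the absolute value of the difference.
    if [ "$diff" -lt 0 ]; then
        diff=$((0 - diff))
    fi
    [ "$diff" -le "$tolerance" ]
}
```

A pre-flight check before starting or adding a node might compare the output of date +%s on the master with the same command run on the worker. How the worker's time is collected (for example over SSH) is left to the administrator.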

Optional Arguments

-h | --help
Shows the command-line syntax help and then exits.

-q | --quiet
Runs in quiet mode. Do not send output messages to STDOUT.

-v, -vv, -vvv | --verbose
Runs in verbose mode. Show all output messages.

-V | --version
Shows the software version information and then exits.

platfora-services start

Starts the Platfora server processes. In multi-node installations, starts the master server first and then the worker servers in sequential order.

Synopsis

platfora-services start [-h] [-d [DEBUG]] [--hadoop_conf path] [--platfora_conf path] [--logdir path] [--profile] [-n node_id] [-p management_port] [-w web_port] [--datadir path] [-P pid_path] [-s jvm_size] [--no_watchdog] [--nowait] [--heapdump] [--gc G1|SmallHeap] [--gclogging] [--jvmopts options]

Description

The platfora-services start command starts the Platfora server processes. If you do not specify any arguments, the command uses the configuration information specified during setup. This configuration is stored in the Platfora metadata catalog. To view your current configuration, see your Global Settings in Platfora or the platfora.properties configuration file located in the $PLATFORA_CONF_DIR.

Required Arguments

No required arguments.

Optional Arguments

-h | --help
Shows the command-line syntax help and then exits.

-d | --debug
Starts the Platfora server with the Java debugger listener enabled.

--datadir path
The path of the Platfora data directory where the catalog database, lens data, and logs reside. Defaults to $PLATFORA_DATA_DIR or what was specified during setup.

--hadoop_conf path
Local directory path where the Hadoop configuration files reside. Defaults to what is specified for the env.platfora.hadoopconf property.

--heapdump
Enables the JVM to provide a heap dump to $PLATFORA_DATA/log/platfora-heapdump.hprof when an out of memory error occurs.

--gc G1|SmallHeap
Sets the garbage collection algorithm.

--gclogging
Enables JVM garbage collection logging.

--jvmopts options
Adds additional JVM options to the server process.

--logdir path
The directory of the Platfora server log files. Defaults to $PLATFORA_DATA_DIR/logs.

-n | --nodeid node_id
Starts the Platfora server process on the given node. The master node id is usually 0. Worker node ids can be determined by running platfora-services status.

--no_watchdog
Do not start a watch dog daemon process to monitor and restart the server process if needed.

--nowait
Do not wait for the server startup tasks to complete before returning the command prompt.

-P | --piddir pid_path
The path of the server PID file. Defaults to $PLATFORA_DATA_DIR/platfora.pid.

--platfora_conf path
The directory that contains the Platfora server configuration files. Defaults to $PLATFORA_CONF_DIR or what was specified during setup.

-p | --port management_port
The API port of the Platfora server used by the management utilities. Defaults to what is specified for the platfora.server.management.port property.

--profile
Starts the server with the Java profiler listener enabled.

-s | --jvmsize jvm_size
The amount of memory to allocate to the Java Virtual Machine (JVM) that runs the Platfora server process (M=megabytes, G=gigabytes). Defaults to what is set for the env.platfora.jvm.maxsize property.

-w | --websvc_port web_service_port
The web service port of the Platfora application server. Defaults to what is specified for the platfora.webservice.port property.

Examples

Start the Platfora server on all nodes in the cluster (master and workers) using the default settings:
$ platfora-services start

Start the Platfora server on worker node 3 only with an 8 GB JVM:
$ platfora-services start -n 3 -s 8G

platfora-services stop

Stops the Platfora server processes.

Synopsis

platfora-services stop [-h] [--datadir path] [--logdir path] [--master_only] [-n node_id] [-P pid_path] [--no_watchdog] [--force]

Description

The platfora-services stop command is used to stop the Platfora server processes. If no arguments are given, it uses the configuration information specified during startup.

Required Arguments

No required arguments.

Optional Arguments

-h | --help
Shows the command-line syntax help and then exits.

--datadir path
The path of the Platfora data directory where the catalog database, lens data, and logs reside. Defaults to $PLATFORA_DATA_DIR or what was specified during setup.

--logdir path
The directory of the Platfora server log files. Defaults to $PLATFORA_DATA_DIR/logs.

--master-only
Stop the master server process only.

-n | --node node_id
Stops the Platfora server process on the given node. The master node id is usually 0. Worker node ids can be determined by running platfora-services status.

-P | --piddir pid_path
The path of the server PID file. Defaults to $PLATFORA_DATA_DIR/platfora.pid.

--no_watchdog
Do not stop the watch dog daemon process that monitors and restarts the server process if needed.

--force
Stop all nodes in the cluster immediately without waiting for processes to finish gracefully. This is similar to the kill -9 UNIX command.

Examples

Stop the Platfora server on all nodes in the cluster (master and workers) using the default settings:
$ platfora-services stop

Stop the Platfora server on worker node 3 only:
$ platfora-services stop -n 3

platfora-services restart

Stops the Platfora server processes and then immediately starts them again.

Synopsis

platfora-services restart [-h] [-d [DEBUG]] [--hadoop_conf path] [--profile] [-n node_id] [-p management_port] [-w web_port] [-P pid_path] [-s jvm_size] [--no_watchdog]

Description

The platfora-services restart command restarts the Platfora server processes. If you do not specify any arguments, the command uses the configuration information specified during setup. This configuration is stored in the Platfora metadata catalog. To view your current configuration, see your Global Settings in Platfora or the platfora.properties configuration file located in the $PLATFORA_CONF_DIR.

Required Arguments

No required arguments.

Optional Arguments

-h | --help
Shows the command-line syntax help and then exits.

-d | --debug
Starts the Platfora server with the Java debugger listener enabled.

--hadoop_conf path
Local directory path where the Hadoop configuration files reside. Defaults to what is specified for the env.platfora.hadoopconf property.

-n | --node node_id
Restarts the Platfora server process on the given node. The master node id is usually 0. Worker node ids can be determined by running platfora-services status.

--no_watchdog
Do not start a watch dog daemon process to monitor and restart the server process if needed.

-P | --piddir pid_path
The path of the server PID file. Defaults to $PLATFORA_DATA_DIR/platfora.pid.

-p | --port management_port
The API port of the Platfora server used by the management utilities. Defaults to what is specified for the platfora.server.management.port property.

--profile
Starts the server with the Java profiler listener enabled.

-s | --jvmsize jvm_size
The amount of memory to allocate to the Java Virtual Machine (JVM) that runs the Platfora server process (M=megabytes, G=gigabytes). Defaults to what is set for the env.platfora.jvm.maxsize property.

-w | --websvc_port web_service_port
The web service port of the Platfora application server.
Defaults to what is specified for the platfora.webservice.port property.

Examples
Restart the Platfora server on all nodes in the cluster (master and workers) using the default settings:
$ platfora-services restart
Restart the Platfora server on worker node 3 only:
$ platfora-services restart -n 3

platfora-services status
Shows the status of the Platfora server processes.

Synopsis
platfora-services status [-h] [-P pid_path] [-n node_id] [-p management_port] [-w web_port] [--logdir path] [--datadir path]

Description
The platfora-services status command queries the status and availability of the Platfora server processes. If no arguments are given, it uses the configuration information specified at startup. It reports the following information about the servers in a Platfora cluster:

Information   Description
ID            The system-assigned node ID.
Type          The type of node: Master or Child (worker).
Host          The host name of the node.
Port          The management port of the node.
Enabled       The cluster status of the node: Enabled, Disabled, or Not Ready. A node is disabled when an administrator takes it offline from query processing.
Status        The network status of the node: Available, Unavailable, or Not Ready. A node is unavailable when it cannot be reached by the master or is not responding. A node is not ready when it has been newly added or re-enabled, but has not yet finished copying the data blocks it needs to answer queries.
Process       The status of the Platfora server process on a node: Running, Stopped, or Unhealthy. A node is Unhealthy if Platfora cannot determine the process status. For example, a node is Unhealthy if the server is Running but not processing ping messages.

Required Arguments
No required arguments.

Optional Arguments
-h | --help
Shows the command-line syntax help and then exits.
--datadir path
The path of the Platfora data directory where the catalog database, lens data, and logs reside. Defaults to $PLATFORA_DATA_DIR or what is specified for the platfora.data.dir in Platfora's Global Settings.
--logdir path
The directory of the Platfora server log files. Defaults to $PLATFORA_DATA_DIR/logs.
-n | --node node_id
Shows the status of the Platfora server process on the given node. The master node id is usually 0. Worker node ids can be determined by running platfora-services status with no arguments.
-P | --piddir pid_path
The path of the server PID file. Defaults to $PLATFORA_DATA_DIR/platfora.pid.
-p | --port management_port
The API port of the Platfora server used by the management utilities. Defaults to what is specified for the platfora.server.management.port property in $PLATFORA_CONF_DIR/platfora.properties.
-w | --websvc_port web_service_port
The web service port of the Platfora application server. Defaults to what is specified for the platfora.webservice.port property in $PLATFORA_CONF_DIR/platfora.properties.

Examples
Check the status of all nodes in a Platfora cluster:
$ platfora-services status
ID  TYPE    HOST               PORT  ENABLED  STATUS     PROCESS
------------------------------------------------------------------
0   Master  ip-10-212-123-456  8002  Enabled  Available  Running
1   Child   ip-10-212-123-567  8002  Enabled  Available  Running
2   Child   ip-10-212-123-678  8002  Enabled  Not Ready  Running

platfora-services sync
Synchronizes the Platfora software binaries and global configuration settings of the master to the worker nodes.

Synopsis
platfora-services sync [-h]

Description
The platfora-services sync command pushes software and configuration file settings from the master node to the worker nodes.

Required Arguments
No required arguments.

Optional Arguments
-h | --help
Shows the command-line syntax help and then exits.
Examples
Push configuration file changes and software binaries from the master to the workers:
$ platfora-services sync

platfora-syscapture
Captures the Platfora log files, configuration files, metadata catalog, and system environment information needed by Platfora Customer Support to troubleshoot issues.

Synopsis
platfora-syscapture [--all | --last "number time_units"] [ --hostsfile filename [--child] ] [--tempdir] [--with-catalog] [-h] [-q] [-v] [-V]

Description
The platfora-syscapture utility captures files needed by Platfora Customer Support, and creates a compressed tar file in the current directory. It captures the following information from your Platfora installation:
• The Platfora server log files. By default, only master log files from the past 7 days are captured.
• The Platfora configuration files.
• The Hadoop configuration files used by Platfora.
• The OS settings on the master host (provided that /sbin/sysctl is in your PATH).
• System resource information such as memory and CPU.
• The version of Java you are using.
• Optionally, a database dump of the Platfora metadata catalog database.
• Optionally, the list of files in the Platfora directory of DFS.
• Optionally, your Platfora data directory.

Required Arguments
No required arguments.

Optional Arguments
--all
Captures all log files. By default, only log files that have changed within the past 7 days are captured.
--last number time_units
Captures only the log files within the specified time period (relative to now). By default, only log files that have changed within the past 7 days are captured. Allowed time units are weeks, days, hours, or minutes.
--outfile filename
The file where you want to store the syscapture data.
--child
If --hostsfile is used, also captures worker node configuration files in addition to the log files.
--tempdir
Specifies a temporary directory for writing interim results.
Defaults to the $PLATFORA_DATA_DIR directory. The utility automatically cleans up the temporary directory on both success and failure.
--with-catalog
Captures the contents of the Platfora metadata catalog database.
--with-telemetry
Captures the telemetry data for your Platfora instance.
--with-dfs-ls
Includes a DFS directory listing with the capture.
--datadir
Captures the contents of the Platfora data directory.
-h | --help
Shows the command-line syntax help and then exits.
-q | --quiet
Runs in quiet mode. Does not send output messages to STDOUT.
-v, -vv, -vvv | --verbose
Runs in verbose mode. Shows all output messages.
-V | --version
Shows the software version information and then exits.

Examples
Capture log files on the Platfora master for the last 2 days:
$ platfora-syscapture --last "2 days"
Capture log files for the last 36 hours on the Platfora master and on the worker nodes (as named in the hosts file):
$ platfora-syscapture --last "36 hours" --hostsfile /home/platfora/worker_nodes.txt

platfora-syscheck
Checks the operating system on the master and worker nodes.

Synopsis
platfora-syscheck [--skipdb] [-h] [-q] [-v] [-V]

Description
The platfora-syscheck utility verifies that the operating system environment on each Platfora node (master and workers) meets the requirements needed to run the Platfora server software. It performs the following checks:
• Verifies that the installation package is not corrupt by doing a checksum of the files in $PLATFORAHOME.
• Verifies that the required Unix OS utilities are installed and can be found in the $PATH.
• Verifies that ulimit is sized appropriately.
• Verifies that ssh keys were correctly configured. This checks against the local host, the fully qualified domain name, and the hostname.
• Verifies that a compatible Java Runtime Environment (JRE) is installed.
• Verifies that a compatible version of PostgreSQL is installed, and the system shared memory settings are sized appropriately for PostgreSQL.
• Reports the amount of free disk space in the configured environment variable $PLATFORA_DATA_DIR. If $PLATFORA_DATA_DIR is not set, checks the disk space of the current user's home directory.
The utility does not check disabled nodes.

Required Arguments
No required arguments.

Optional Arguments
--skipdb
Skips the database-related checks. This option can be used when verifying the operating system environment of a Platfora worker node, since the PostgreSQL database software is only required on the Platfora master.
-h | --help
Shows the command-line syntax help and then exits.
-q | --quiet
Runs in quiet mode. Does not send output messages to STDOUT.
-v, -vv, -vvv | --verbose
Runs in verbose mode. Shows all output messages.
-V | --version
Shows the software version information and then exits.

Examples
Run a system check on the Platfora master:
$ export PLATFORA_DATA_DIR=/home/platfora/PLATFORA_DATA
$ platfora-syscheck
cmd line: /usr/local/platfora/current/bin/platfora-syscheck
Verifying System Requirements
Checking integrity of binaries......[SUCCESS]
Checking unix utilities......[SUCCESS]
Checking file and directory permissions......[SUCCESS]
Checking ssh to localhost......[SUCCESS]
Checking java version......[SUCCESS]
Checking postgres version......[SUCCESS]
Checking shared memory settings......[SUCCESS]
System Resources:
Platfora Data Directory fs has xx GB free space.
System Memory total: xxMB used: xxMB free: xxMB

Appendix B
Glossary
The glossary defines Platfora product terminology and concepts.
Topics:
• aggregate lens
• aggregation
• Amazon EMR
• Amazon S3
• categorical data
• column
• computed field
• CSV
• data catalog
• dataset
• data source
• derived dataset
• dimension dataset
• dimension
• distributed file system
• drill down
• elastic dataset
• entity-centric data model
• event
• event series lens
• expression
• fact dataset
• fact-centric data model
• field
• filter
• focus
• funnel
• geographic analysis
• geo map
• geo reference
• granularity
• Hadoop
• HDFS
• Hive
• key
• location field
• lens
• MapReduce
• measure
• quantitative data
• reference
• regular expressions
• ROLLUP measure
• row
• segment
• visualization (viz)
• vizboard

aggregate lens
An aggregate lens contains a selection of measure and dimension fields chosen from the focal point of a single transactional (or fact) dataset. A completed or built lens can be thought of as a table that contains aggregated measure data values grouped by the selected dimension values. An aggregate lens can be built from any dataset. There are no special data modeling requirements to build an aggregate lens.

aggregation
An aggregation is the result of a function that takes all values of a numeric column, and returns a single value of more significant meaning or measurement. An aggregate function groups the values of multiple rows together based on some defined input expression. Examples of aggregate functions include SUM, COUNT, DISTINCT, MIN, MAX, and STDDEV. In Platfora, measure fields are always the result of an aggregation.

Amazon EMR
Amazon Elastic MapReduce (Amazon EMR) is a Hadoop framework hosted by Amazon Web Services (AWS). It utilizes Amazon Elastic Compute Cloud (Amazon EC2) for compute resources and Amazon Simple Storage Service (Amazon S3) for data storage. Platfora can be configured to use Amazon EMR as its backend Hadoop processing framework, and Amazon S3 as its primary data source and storage system.
Amazon S3
Amazon Simple Storage Service (Amazon S3) is a data storage service provided by Amazon Web Services (AWS). It is a distributed file system hosted by Amazon where you pay a monthly fee for storage space and data transfer bandwidth. Data transfer is free between S3 and Amazon Elastic Compute Cloud (EC2) clusters, making S3 an attractive choice for users who run Hadoop clusters on AWS or utilize the Amazon EMR service. Hadoop supports two S3 file system protocols as an alternative to HDFS: S3 Native File System (s3n) and S3 Block File System (s3). Platfora supports the S3 Native File System (s3n) only.

categorical data
Categorical data is data with unconnected data points that can be represented in a visualization as a categorical grouping or individual data point. Categorical data is countable and often finite (for example, the number of products sold or the number of people in a city). In Platfora, categorical values in a visualization are evenly spaced by sort order. By default, dimension fields in a visualization are categorical, but numeric or datetime dimensions can be changed to quantitative. Categorical data is sometimes referred to as discrete data.

column
A column is a set of data values of a particular data type, with one value for each row in the dataset. Columns provide the structure for composing a row. The terms column and field are often used interchangeably, although many consider it more correct to use field to refer specifically to the single item that exists at the intersection of one row and one column.

computed field
A computed field generates its values based on a calculation or condition, and returns a value for each input row. Values are computed based on expressions that can contain values from other fields, constants, mathematical operators, comparison operators, or built-in row functions.
Computed fields are useful for deriving meaningful values from base fields (such as calculating someone's age based on their birthday), doing data cleansing and pre-processing (such as grouping similar values together or substituting one value for another), or for computing new data values based on a number of input variables (such as calculating a profit margin value based on revenue and costs). A computed field that does an aggregate calculation is called a measure, which is a special kind of computed field in Platfora.

CSV
Comma-separated values (CSV) is a plain text file format for describing tabular data. CSV, in general, refers to any file that is plain text (typically ASCII or Unicode characters), has one record per line, has records divided into fields separated by delimiters (typically a comma), and has the same sequence of fields for every record. Within these general constraints, there are many variations of CSV in use. For example, some CSV formats use quotation marks around field values, some use delimiters other than a comma (such as a tab or a semi-colon), and some reserve the very first line of the file as a header of field names. Platfora supports the typical CSV formatting conventions, and allows for some configuration to support different variations.

data catalog
The data catalog is a collection of data items available and visible to Platfora users. Data administrators build the data catalog by defining and modeling datasets in Platfora that point to source data in Hadoop. When users request data from a dataset, that request is materialized in Platfora as a lens. The data catalog shows all of the datasets (data available for request) and lenses (data that is ready for analysis) that have been created by Platfora users.

dataset
A dataset is a collection of external data files residing in a data source that can be described in table form (rows and columns).
Source data is mapped into Platfora by creating a dataset definition. A dataset definition describes the rows and columns, the base fields and their associated data types, computed fields, measure aggregations, and references (or joins) to other related datasets. The collection of dataset definitions makes up the data catalog (the data items available to Platfora users).

data source
A data source is a connection to a mount point or directory on an external data server, such as a file system or database server. Platfora currently provides data source adapters for Hive, HDFS, Amazon S3, and MapR FS. Platfora has one default data source named Uploads (for data files that you upload from your local file system). This default data source resides in the distributed file system (DFS) that the Platfora server is configured to use as its primary data source.

derived dataset
A derived dataset is a dataset whose underlying data is produced from the results of a Platfora lens query or visualization. There are two types of derived datasets: static (lens query results are saved to a static file) and dynamic (lens query results are refreshed each time the lens is rebuilt). A derived dataset allows you to save the query results from a lens as a new dataset in Platfora. Once a derived dataset is saved, you can use it as you would any other dataset in Platfora: you can edit it, add additional computed fields, and join it by reference to other datasets in the Platfora data catalog.

dimension dataset
A dimension dataset is a type of dataset that has a primary key and contains attributes (additional dimension fields) that describe some aspect of a fact or event record (such as a person, item, date, etc.). Dimension datasets are referenced by a fact dataset.

dimension
A dimension is a type of field (or a collection of fields) that allows you to analyze a measure from different perspectives to derive meaning from the data.
Dimensions are used to summarize, filter, categorize, and group quantitative measure data in order to answer business questions. For example, a product dimension can help you understand which products generate the most revenue for your business. A date dimension can show you the breakdown of sales by year, quarter, month, or day. Dimension fields can be character-type data (such as product categories), datetime-type data (such as months, days, or hours), or categorical numeric-type data (such as customer ratings on a scale of 1-10).

distributed file system
A distributed file system (DFS) is any file system that allows access to files from multiple hosts over a computer network. It makes it possible for multiple machines and users to share files and storage resources. HDFS is the primary distributed file system for Hadoop; however, Hadoop supports other distributed file systems as well, such as Amazon S3.

drill down
Drill down (or drill up) is a data analysis technique for navigating from the most summarized to the most detailed categorization of a particular dimension. Drill down allows exploration of multi-dimensional data by moving from one level of detail to the next. A drill-down path is defined by specifying a hierarchy of categories for a dimension or between related dimensions. For example, a date dimension might have categories defined for year, quarter, month, week, day, and so on. A product dimension might have categories defined for division, type, and model. Drill-down levels depend on the granularity of the fields available in the source data.

elastic dataset
Elastic datasets are a special kind of dataset used for entity-centric data modeling in Platfora. They are used to consolidate unique key values from other datasets into one place for the purpose of defining segments or event series lenses. They are elastic because the data they contain is dynamically generated at lens build time.
Elastic datasets are not backed by source files like regular datasets. Instead, they consolidate the unique foreign keys from any dataset that points to them via a reference. Because they do not contain any records of their own, elastic datasets cannot be used as the focus for an aggregate lens or event.

entity-centric data model
An entity-centric data model 'pivots' a fact-centric data model to focus an analysis around a particular dimension (or entity). Modeling the data in this way allows you to do event series analysis and segment analysis in Platfora. For example, modeling different fact datasets around a central customer dataset allows you to analyze different aspects of a customer's behavior. Instead of asking "how many customers visited my web site?" (fact-centric), you could ask questions like "which customers visit my site more than once a day?" (entity-centric).

event
An event is similar to a reference, but the direction of the join is reversed. An event joins the primary key field(s) of a dimension dataset to the corresponding foreign key field(s) in a fact dataset, and designates a timestamp field for ordering the event records.

event series lens
An event series lens contains a selection of dimension fields chosen from the focal point of a single entity dataset, including any fields from event datasets associated with that entity. A completed or built lens can be thought of as a table that contains individual event records partitioned by the primary key of the entity dataset, and ordered by time. An event series lens can only be built from datasets that have at least one event reference defined in them. It contains non-aggregated event records of various types, partitioned by some common entity (typically a user id), and sorted by the time the events occurred. Choose this lens type if you specifically want to do funnel analysis.
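As a rough illustration of the layout described for an event series lens (event records partitioned by an entity key and ordered by time), the following sketch arranges a few made-up event records with standard Unix tools. The three-field layout (user id, timestamp, event type), the sample rows, and the function names are assumptions for illustration only; this is not how Platfora builds or stores a lens.

```shell
#!/bin/sh
# Hypothetical sample events in the form user_id,timestamp,event_type.
# The data is made up for illustration.
sample_events() {
cat <<'EOF'
u2,2015-06-01T10:05:00,purchase
u1,2015-06-01T09:00:00,page_view
u2,2015-06-01T10:00:00,page_view
u1,2015-06-01T09:30:00,registration
EOF
}

# Partition records by the entity key (field 1), then order each
# partition by time (field 2). ISO-8601 timestamps sort correctly
# as plain strings, so a byte-wise sort is enough here.
event_series() {
  LC_ALL=C sort -t, -k1,1 -k2,2
}

sample_events | event_series
```

With the sample data, all u1 records come first in time order, followed by all u2 records in time order.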
expression
An expression computes or produces a value by combining fields (or columns), constant values, operators, and functions. An expression's result can be any data type, such as numeric, string, datetime, or Boolean (true/false) values. Simple expressions can be a single constant value, field (or column), or a function call. You can use operators to join two or more simple expressions into a complex expression.

fact dataset
In multi-dimensional data models, a fact dataset (or table) contains records (or rows) that each represent a single real-world event that has occurred (such as a sales transaction, a page view, a user registration, an airline flight, and so on). A fact record contains the quantitative measure data (such as the dollar amount of a sale), and several descriptive attributes (or dimensions) that give the measure context (such as the date, the customer, the product, and so on). Facts are stored at a uniform level of detail (or grain) within a fact dataset.

fact-centric data model
A fact-centric data model is centered around a particular real-world event that has happened, such as web page views or sales transactions. Datasets are modeled so that a central fact dataset is the focus of an analysis, and dimension datasets are referenced to provide more information about the fact. In data warehousing and business intelligence (BI) applications, this type of data model is often referred to as a star schema. For example, you may have web server logs that serve as the source of your central fact data about pages viewed on your web site. Additional dimension datasets can then be related (or joined) to the central fact to provide more in-depth analysis opportunities.

field
A field is an atomic unit of data that has a name, a value, a data type, and a role of either dimension or measure. When working with visualizations, fields are the same thing as the dimensions and measures used to analyze the data.
Fields describe a single aspect of a record (or row) in a dataset. An order record, for example, might contain an order date field, a product name field, a quantity field, and so on. All records in a dataset have exactly the same fields, although the values in each field vary from record to record.

filter
A filter is a field value or expression used as a condition for limiting the data that is selected from a lens and shown in a visualization. A filter can be applied to a visualization to exclude (or include) data that meets the filter criteria. For example, if you wanted to show only the sales for the US west coast, you could use the state field as a filter and just include the values for California, Oregon and Washington. All of the other values for state would then be filtered out (not shown in the visualization).

focus
A focus sets the central topic for a data exploration and analysis. You set a focus by choosing a single dataset from the data browser. For example, if you wanted to explore the characteristics of users who registered on your web site in the past month, you might choose the user dataset or the registration dataset as the focus of your analysis. Choosing a focus allows you to find or build a lens of optimized data to work with in a visualization.

funnel
A funnel is a visual analysis type that tracks users' (entities') behavior across a sequence of events, with each step in the sequence defined as a stage. Each funnel stage shows progressively decreasing proportions of the original set of users. The first stage has 100% of the original group of users by definition. A funnel is always based on an event series lens. The users in the funnel are from the focus dimension dataset in the lens, and their behaviors are from one or more fact datasets in the lens. The funnel analyzes their behaviors performed sequentially and counts the number of users that meet the criteria defined in each stage.
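The stage-by-stage counting described for a funnel can be sketched with a short awk script. This is a simplified illustration under stated assumptions, not Platfora's implementation: the input is assumed to be event records already partitioned by user and ordered by time, each stage is assumed to match on a single event type, and the sample data, stage names, and function names are hypothetical.

```shell
#!/bin/sh
# Hypothetical event records, already grouped by user and ordered by time,
# in the form user_id,timestamp,event_type. The data is made up.
sample_series() {
cat <<'EOF'
u1,2015-06-01T09:00:00,page_view
u1,2015-06-01T09:30:00,registration
u1,2015-06-01T09:45:00,purchase
u2,2015-06-01T10:00:00,page_view
u2,2015-06-01T10:05:00,purchase
u3,2015-06-01T11:00:00,page_view
u3,2015-06-01T11:10:00,registration
EOF
}

# Count the users that reach each stage, in sequence.
# Usage: funnel_counts "stage1 stage2 ..." < ordered_events
funnel_counts() {
  awk -F, -v stages="$1" '
    BEGIN { n = split(stages, stage, " ") }
    {
      # progress[user] holds the number of stages the user has completed.
      # A record only counts if it matches the next stage for that user.
      next_stage = progress[$1] + 1
      if (next_stage <= n && $3 == stage[next_stage]) {
        progress[$1] = next_stage
        reached[next_stage]++
      }
    }
    END { for (i = 1; i <= n; i++) printf "%s %d\n", stage[i], reached[i] }
  '
}

sample_series | funnel_counts "page_view registration purchase"
```

For the sample data this prints page_view 3, registration 2, and purchase 1, one stage per line. User u2 is not counted in the purchase stage because that user never completed the registration stage first.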
geographic analysis
Geographic analysis is a type of data analysis that involves understanding the role that location plays in the occurrence of other factors. By looking at the geo-spatial distribution of data on a map, analysts can see how location impacts different variables. In Platfora, geographic analysis is performed in a geo map viz type.

geo map
A geo map is a viz type that allows analysts to perform geographic analysis on a lens that contains location data. It includes the Geography drop zone that places positions (using a location field) on a map background. Geo map visualizations can be made from an aggregate lens that has at least one location field included.

geo reference
A geo reference is a special type of reference to a dataset that contains a location field. In addition to at least one location field, the dataset referenced in a geo reference typically contains primarily location data. For example, this might include population, voting district information, or government data.

granularity
The granularity of data refers to the fineness with which data fields are sub-divided, and the level of detail at which data is stored within a dataset or lens. For example, a postal address can be recorded with low granularity, with the entire address in one field (address=123 Main St. San Mateo, CA 94403), or with higher granularity, with the fields broken out (address=123 Main St., city=San Mateo, state=CA, zipcode=94403).

Hadoop
Hadoop is an open-source software framework designed for storing and processing large amounts of complex, structured, and semi-structured data. It is a distributed system, meaning it runs on a collection of commodity, shared-nothing servers. Hadoop consists of two key services: a distributed file system for data storage and MapReduce for parallel data processing.

HDFS
Hadoop Distributed File System (HDFS) is the primary storage system for Hadoop applications.
It is a distributed file system, meaning it runs on a collection of commodity servers. An HDFS cluster usually consists of a NameNode (the metadata management node that manages access to files and directories) and multiple DataNodes (the storage nodes where file data resides). HDFS creates multiple replicas of a file's data storage blocks and distributes them throughout the cluster to enable extremely fast data processing. Platfora can be configured to use HDFS as its primary data source.

Hive
Hive is an execution engine for Hadoop that lets you write data queries in an SQL-like language called Hive Query Language (HQL). Hive allows you to create tables by describing the structure of files residing in HDFS. Platfora can use a Hive metastore server as a data source, and map a Hive table definition to a Platfora dataset definition. Platfora uses the Hive table definition to obtain metadata about the source data, such as which files to process, the parsing logic for rows and columns, and the field names and data types contained in the source data. It is important to note that Platfora does not execute queries through Hive; it only uses Hive tables to obtain the metadata needed for defining datasets. Platfora generates and runs its own MapReduce jobs directly in Hadoop.

key
A key is a single field (or combination of fields) that uniquely identifies a row in a dataset, similar to a primary key in a relational database. A dataset must have a key defined to be the target of a reference.

location field
A location field is a dataset field encoded with a complex datatype that includes geo coordinate information (latitude and longitude) and a label that associates a location name with the coordinates. Location fields are defined in the dataset. When defining a location field in a dataset, you can optionally use the values in an existing dataset field as the location field label.
If no label is defined, Platfora creates a unique string from the coordinates as the label name (for example @(122.33063°W, 37.541886°N)). Use a location field in a geo map viz to place positions on a map.

lens
A lens is a type of data storage that is specific to Platfora. Platfora uses Hadoop as its data source and processing engine to build and store its lenses. Once a lens is built, this prepared data is copied to Platfora, where it is available for analysis. A lens can be thought of as a dynamic, on-demand data mart purpose-built for a specific analysis project. Platfora generates MapReduce jobs to pull the requested data from the Hadoop source system, and prepares the data for fast, ad-hoc visual analysis. As users build visualizations, lens data is loaded into memory on a column-by-column basis as it is needed. Platfora has two types of lenses you can build: an aggregate lens or an event series lens. The type of lens you build determines what kinds of visualizations you can create and what kinds of analyses you can perform when using the lens in a vizboard.

MapReduce
MapReduce is a data-flow programming model for processing large amounts of data on a cluster of commodity servers. It passes data items from one stage of processing to the next using user-defined criteria (or jobs). The MapReduce engine acts as an abstraction, allowing programmers to focus on their desired data computations. The details of parallelism, distribution, load balancing and fault tolerance are all handled by the MapReduce framework. Platfora defines and runs MapReduce jobs on the source data in Hadoop based on the dataset and lens definitions created by Platfora users. The output of the MapReduce jobs executed by Platfora is stored both in HDFS and in Platfora. MapReduce jobs typically start with a large data file that is broken down into smaller pieces called splits, which are similar to database rows.
Each split is parsed into key/value pairs (similar to fields) and processed by the user-defined map criteria. The output of the map processing stage is then passed to the reduce processing stage, which does final grouping and aggregation. Each stage of processing uses parallelism to enable many map and reduce tasks to run at the same time on multiple machines.

measure
A measure is a numeric value representing an aggregation of some dataset metric (such as total dollars sold, average number of users, and so on). To create measures, you add computed fields to a dataset or a lens. When a lens is built, the build calculates any measures and stores them in the lens. In a visualization, measures provide the basis for quantitative analysis. Measures represent a set of real-world events (or facts) and typically answer "how" questions about data, such as how many or how long. If you are familiar with SQL, measure values come from aggregate functions such as SUM(), COUNT(), MAX(), and MIN(). Measure fields are typically derived from numeric fields in a dataset, and their values are always the result of an aggregation (average, count, sum, min, max, and so on).

quantitative data
Quantitative data can be characterized as a sequence or progression of values with connected data points that can be represented as an unbroken line in a visualization. Quantitative fields usually have values that can be shown in ordered progression, such as height, speed, or duration measurements. Quantitative values are placed on a continuous axis, always displayed from low to high. In Platfora, measure data is always quantitative, but numeric or datetime dimensions can be either quantitative or categorical. Quantitative data is sometimes referred to as continuous data.

reference
A reference allows two datasets to be joined together on one or more fields that they share in common. A reference creates a link from a field in one dataset to the primary key of another dataset.
Reference fields are typically created in a fact dataset, and point to a dimension dataset. Creating a reference allows the datasets to be joined when building lenses or segments, similar to a foreign key to primary key relationship in a relational database.

regular expressions
Regular expressions, also referred to as regex or regexp, are a standardized collection of special characters and constructs used for matching strings of text. They provide a flexible and precise language for matching particular characters, words, or patterns of characters.

ROLLUP measure
ROLLUP is a modifier to an aggregate expression that allows you to define complex measure expressions, such as windowed, partitioned, or adaptive measure expressions. This is useful when you want to compute an aggregation for a subset of rows within the overall result of a viz query. It allows you to compute things such as running totals, moving averages, benchmark comparisons, rank ordering, percentiles, and so on.

row
A row represents a single object or record in a dataset. A dataset or lens consists of rows of columns (or fields). Each row represents a set of related data, and every row has the same structure. For example, in a dataset that represents customers, each row would represent a single customer. Columns might represent things like customer name, email address, gender, age, and so on.

segment
A segment is a special type of dimension field that you can create to group together members of a population that meet some defined common criteria. A segment is based on members of a dimension dataset (such as customers) that have some behavior in common (such as purchasing a particular product). In Platfora, a segment is always based on a dimension (or referenced) dataset, and must include at least one condition from a fact or event dataset.
For example, customers who are female would not be considered a valid segment; however, customers who are female and made a purchase would be. A segment is not just people or things that share common attributes; its members also share a common behavior or action. Behind the scenes, segments are saved as a special type of lens that can be used and updated independently of the lens that they were created from. For example, you can create a segment from a customer purchases lens but then use that segment in a different customer support calls lens. As long as the lenses have a conforming dimension in common (such as customer), segments can be used to compare behaviors of a group of individuals across multiple fact or event datasets.

visualization (viz)
A visualization (or viz for short) is a graphical representation of certain data fields chosen from the perspective of a single Platfora lens. It is a query of lens data that is visually rendered based on the types of fields chosen (measure or dimension), their order and placement in the Builder drop zones, and the various appearance encodings applied to the data (color, size, shape, and so on). A viz shows aggregated measure data grouped and filtered by the chosen dimensions. A chart in Platfora can best be described as a recipe of dimension and measure fields, plus axis placement (X-axis and Y-axis), plus appearance encodings (Color, Size, Shape, Opacity, Labels), plus mark type (Point, Line, Bar, Area, and so on).

vizboard
A vizboard is the starting point for data analysis, and can be thought of as a dashboard or project workspace. The vizboard is the canvas for discovering and sharing data insights. A vizboard contains one or more pages of visualizations that together are meant to tell a data story. The individual visualizations on a vizboard page can be related (use the same underlying data), or unrelated (use completely different data).
A vizboard can be saved, versioned, and shared with others.
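
As an illustration of the MapReduce and measure entries in this glossary, the following is a minimal single-process sketch of the map, group, and reduce stages. It is not how Platfora or Hadoop implement their jobs; the function names (map_phase, reduce_phase, run_job) and the sample "product, amount" input format are hypothetical and chosen only for this example. Each input string stands in for one record of a split, the map stage emits key/value pairs, and the reduce stage computes a SUM-style aggregation per key.

```python
from collections import defaultdict


def map_phase(record):
    """Parse one input record into (key, value) pairs.

    Assumes a hypothetical "product, amount" text format.
    """
    product, amount = record.split(",")
    yield product.strip(), float(amount)


def reduce_phase(key, values):
    """Aggregate all values that share a key (a SUM-style measure)."""
    return key, sum(values)


def run_job(records):
    """Run map, group-by-key, and reduce in a single process.

    A real MapReduce framework would distribute these stages across
    many machines; this sketch only shows the data flow.
    """
    grouped = defaultdict(list)
    for record in records:
        for key, value in map_phase(record):
            grouped[key].append(value)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    sales = ["widget, 10.0", "gadget, 4.5", "widget, 2.5"]
    print(run_job(sales))
```

The grouping step between the two stages is what a framework normally handles on the programmer's behalf, which is why only the map and reduce criteria are described as user-defined.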