
Data Ingest Guide
Version 4.5
Copyright Platfora 2015
Last Updated: 10:15 p.m. June 28, 2015
Contents
Document Conventions
Contact Platfora Support
Copyright Notices
Chapter 1: About the Platfora Data Pipeline
    FAQs - Platfora Data Pipeline
    About the Data Workflow
Chapter 2: Manage Data Sources
    Supported Data Sources
    Add a Data Source
        Connect to a Hive Data Source
        Connect to an HDFS Data Source
        Connect to an S3 Data Source
        Connect to a MapR Data Source
        Connect to Other Data Sources
        About the Uploads Data Source
    Configure Data Source Security
    Delete a Data Source
    Edit a Data Source
Chapter 3: Define Datasets to Describe Data
    FAQs - Dataset Basics
    Understand the Dataset Workspace
    Understand the Dataset Creation Process
    Understand Dataset Permissions
    Select Source Data
        Supported Source File Formats
        Select a Hive Source Table
        Select DFS Source Files
        Edit the Dataset Source Location
    Parse the Data
        View Raw Source Data Rows
        Update the Dataset Sample Rows
        Update the Dataset Source Schema
        Parse Delimited Data
        Parse Hive Tables
        Parse JSON Files
        Parse XML Files
        Parse Avro Files
        Parse Web Access Logs
        Parse Other File Types
    Prepare Base Dataset Fields
        Confirm Data Types
        Change Field Names
        Add Field Descriptions
        Hide Columns from Data Catalog View
        Default Values and NULL Processing
        Bulk Upload Field Header Information
    Transform Data with Computed Fields
        FAQs - Dataset Computed Fields
        Add a Dataset Computed Field
        Add Binned Fields
    Add Measures for Quantitative Analysis
        FAQs - Dataset Measures
        The Default 'Total Records' Measure
        Add Quick Measures
        Add Computed Measures
    Prepare Date/Time Data for Analysis
        FAQs - Date and Timestamp Processing
        Cast DATETIME Data Types
        About Date and Time References
        About the Default 'Date' and 'Time' Datasets
    Prepare Location Data for Analysis
        FAQs - Location Data and Geographic Analysis
        Understand Geo Location Fields
        Add a Location Field to a Dataset
        Understand Geo References
        Prepare Geo Datasets to Reference
        Add a Geo Reference
    Prepare Drill Paths for Analysis
        FAQs - Drill Paths
        Add a Drill Path
    Model Relationships Between Datasets
        Understand Data Modeling in Platfora
        Add a Reference
        Add an Event Reference
        Add an Elastic Dataset
        Delete or Hide a Reference
        Update a Reference
    Define the Dataset Key
Chapter 4: Use the Data Catalog to Find What's Available
    FAQs - Data Catalog Basics
    Find Available Datasets
    Find Available Lenses
    Find Available Segments
    Organize Datasets, Lenses and Vizboards with Labels
Chapter 5: Define Lenses to Load Data
    FAQs - Lens Basics
    Lens Best Practices
    About the Lens Builder Panel
    Understand the Lens Build Process
        Understand Lens MapReduce Jobs
        Understand Source Data Input to a Lens Build
        Understand How Datasets are Joined
    Create a Lens
        Name a Lens
        Choose the Lens Type
        Choose Lens Fields
        Define Lens Filters
        Allow Ad-Hoc Segments
    Estimate Lens Size
        About Dataset Profiles
        About Lens Size Estimates
    Manage Lenses
        Edit a Lens Definition
        Update Lens Data
        Delete or Unbuild a Lens
        Check the Status of a Lens Build
        Manage Lens Notifications
        Schedule Lens Builds
    Manage Segments—FAQs
Chapter 6: Export Lens Data
    Export an Entire Lens as CSV
    Export a Partial Lens as CSV
    Query a Lens Using the REST API
    FAQs - Lens Export Basics
Chapter 7: Platfora Expressions
    Expression Building Blocks
        Functions in an Expression
        Operators in an Expression
        Fields in an Expression
        Literal Values in an Expression
    PARTITION Expressions and Event Series Processing (ESP)
        How Event Series Processing Works
        Best Practices for Event Series Processing (ESP)
    ROLLUP Measures and Window Expressions
        Understand ROLLUP Measures
        Understand ROLLUP Window Expressions
    Computed Field Examples
    Troubleshoot Computed Field Errors
    Write a Lens Query
    FAQs - Expression Basics
    Expression Language Reference
        Expression Quick Reference
        Comparison Operators
        Logical Operators
        Arithmetic Operators
        Conditional and NULL Processing
        Event Series Processing
        String Functions
        URL Functions
        IP Address Functions
        Date and Time Functions
        Math Functions
        Data Type Conversion Functions
        Aggregate Functions
        ROLLUP and Window Functions
        User Defined Functions (UDFs)
        Regular Expression Reference
Appendix A: Expression Language Reference
    Expression Quick Reference
    Comparison Operators
    Logical Operators
    Arithmetic Operators
    Conditional and NULL Processing
        CASE
        COALESCE
        IS_VALID
    Event Series Processing
        PARTITION
        PACK_VALUES
    String Functions
        CONCAT
        ARRAY_CONTAINS
        FILE_NAME
        FILE_PATH
        EXTRACT_COOKIE
        EXTRACT_VALUE
        INSTR
        JAVA_STRING
        JOIN_STRINGS
        JSON_ARRAY_CONTAINS
        JSON_DOUBLE
        JSON_FIXED
        JSON_INTEGER
        JSON_LONG
        JSON_STRING
        LENGTH
        REGEX
        REGEX_REPLACE
        SPLIT
        SUBSTRING
        TO_LOWER
        TO_UPPER
        TRIM
        XPATH_STRING
        XPATH_STRINGS
        XPATH_XML
    URL Functions
        URL_AUTHORITY
        URL_FRAGMENT
        URL_HOST
        URL_PATH
        URL_PORT
        URL_PROTOCOL
        URL_QUERY
        URLDECODE
    IP Address Functions
        CIDR_MATCH
        HEX_TO_IP
    Date and Time Functions
        DAYS_BETWEEN
        DATE_ADD
        HOURS_BETWEEN
        EXTRACT
        MILLISECONDS_BETWEEN
        MINUTES_BETWEEN
        NOW
        SECONDS_BETWEEN
        TRUNC
        YEAR_DIFF
    Math Functions
        DIV
        EXP
        FLOOR
        HASH
        LN
        MOD
        POW
        ROUND
    Data Type Conversion Functions
        EPOCH_MS_TO_DATE
        TO_CURRENCY
        TO_DATE
        TO_DOUBLE
        TO_FIXED
        TO_INT
        TO_LONG
        TO_STRING
    Aggregate Functions
        AVG
        COUNT
        COUNT_VALID
        DISTINCT
        MAX
        MIN
        SUM
        STDDEV
        VARIANCE
    ROLLUP and Window Functions
        ROLLUP
        DENSE_RANK
        NTILE
        RANK
        ROW_NUMBER
    User Defined Functions (UDFs)
        Writing a Platfora UDF Java Program
        Adding a UDF to the Platfora Expression Builder
    Regular Expression Reference
        Regex Literal and Special Characters
        Regex Character Classes
        Regex Line and Word Boundaries
        Regex Quantifiers
        Regex Capturing Groups
Appendix B: Lens Query Language Reference
    SELECT Statement
        DEFINE Clause
        WHERE Clause
        GROUP BY Clause
        HAVING Clause
        Example of Lens Queries
Preface
This guide provides information and instructions for ingesting and loading data into a Platfora®
cluster. This guide is intended for data administrators who are responsible for making Hadoop data
accessible to business users and data analysts. Knowledge of Hadoop, data processing, and data storage
is recommended.
Document Conventions
This documentation uses certain text conventions for language syntax and code examples.
Convention: $
Usage: Command-line prompt that precedes a command to be entered in a command-line terminal session.
Example: $ ls

Convention: $ sudo
Usage: Command-line prompt for a command that requires root permissions (commands will be prefixed with sudo).
Example: $ sudo yum install open-jdk-1.7

Convention: UPPERCASE
Usage: Function names and keywords are shown in all uppercase for readability, but keywords are case-insensitive (can be written in upper or lower case).
Example: SUM(page_views)

Convention: italics
Usage: Italics indicate a user-supplied argument or variable.
Example: SUM(field_name)

Convention: [ ] (square brackets)
Usage: Square brackets denote optional syntax items.
Example: CONCAT(string_expression[,...])

Convention: ... (ellipsis)
Usage: An ellipsis denotes a syntax item that can be repeated any number of times.
Example: CONCAT(string_expression[,...])
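Putting these conventions together: in CONCAT(string_expression[,...]), the italicized string_expression is a user-supplied argument, the square brackets mark the trailing portion as optional, and the ellipsis means it may be repeated. A concrete call might therefore look like the following (the field names here are illustrative):

CONCAT(first_name, " ", last_name)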
Contact Platfora Support
For technical support, you can send an email to:
support@platfora.com
Or visit the Platfora support site for the most up-to-date product news, knowledge base articles, and
product tips.
http://support.platfora.com
To access the support portal, you must have a valid support agreement with Platfora. Please contact
your Platfora sales representative for details about obtaining a valid support agreement or with questions
about your account.
Copyright Notices
Copyright © 2012-15 Platfora Corporation. All rights reserved.
Platfora believes the information in this publication is accurate as of its publication date. The
information is subject to change without notice.
THE INFORMATION IN THIS PUBLICATION IS PROVIDED “AS IS.” PLATFORA
CORPORATION MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY KIND WITH
RESPECT TO THE INFORMATION IN THIS PUBLICATION, AND SPECIFICALLY DISCLAIMS
IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR
PURPOSE.
Use, copying, and distribution of any Platfora software described in this publication requires an
applicable software license. Platfora®, You Should Know™, Interest Driven Pipeline™, Fractal Cache™,
and Adaptive Job Synthesis™ are trademarks of the Platfora Corporation. Apache Hadoop™ and Apache
Hive™ are trademarks of the Apache Software Foundation. All other trademarks used herein are the
property of their respective owners.
Embedded Software Copyrights and License Agreements
Platfora contains the following open source and third-party proprietary software subject to their
respective copyrights and license agreements:
• Apache Hive PDK
• dom4j
• freemarker
• GeoNames
• Google Maps API
• javassist
• javax.servlet
• Mortbay Jetty 6.1.26
• OWASP CSRFGuard 3
• PostgreSQL JDBC 9.1-901
• Scala
• sjsxp : 1.0.1
• Unboundid
Chapter 1: About the Platfora Data Pipeline
Got questions about how Platfora enables self-service access to raw data in Hadoop? Want to know what
happens to the data on the way to those stunning, interactive visualizations? This section explains how data
flows from Hadoop to Platfora, and what happens to the data at each step in the workflow.
Topics:
• FAQs - Platfora Data Pipeline
• About the Data Workflow
FAQs - Platfora Data Pipeline
This section answers the most frequently asked questions (FAQs) about Platfora's Interest Driven
Pipeline™ and the data workflow.
What does 'Interest Driven Pipeline' mean?
The traditional data pipeline is mainly an 'operations driven pipeline' -- IT pushes the data to the
consumers, rather than the consumers pulling the data that interests them. In a traditional data pipeline
data is pre-processed, modeled into a relational schema, and loaded into a data warehouse. Then it is
optimized (or pre-aggregated) to make it possible for BI and reporting tools to access it. All of this work
to move and prepare the data happens regardless of the immediate needs of the business users.
The idea behind an 'interest driven pipeline' is to not move or pre-process the data until somebody wants
it. Platfora's approach is to catalog all of the data that's available, then allow business users to discover
and request data of interest to them. Once a request is made, then Platfora pulls the data from Hadoop,
cleanses and processes it, and optimizes it for analysis. Having the entire data pipeline managed in a
single application allows for more agile data projects.
How does Platfora access data in Hadoop?
When you first install Platfora, you provide connection information to your Hadoop cluster services.
Then you define a Data Source in the Platfora application to point to a particular location in the Hadoop
file system. Once a data source is registered in Platfora, the data files in that location are then visible to
Platfora users.
Can I control who sees what data?
Yes. Platfora provides role-based security so you can control who can see data coming from a particular
data source. You can also control access at a more granular per-dataset level if necessary. You can
either control data access within the Platfora application, or configure Platfora to inherit the file system
permissions from HDFS.
Does all the source data have to be in Hadoop?
Yes (for the most part). Platfora primarily works with data stored in a single distributed file system -- typically HDFS for on-premise Hadoop deployments or Amazon S3 for cloud deployments.
However, it is also possible to develop custom data connectors to access smaller datasets outside of
Hadoop. For example, you may have customer data in a relational database that you want to use in
conjunction with log data stored in HDFS. Data connectors can be used to pull relatively small amounts
of external data over to Hadoop on demand.
How does Platfora know the structure of the data and how to process it?
You have to tell Platfora how your data is structured by defining a Dataset. A dataset points to a set of
files in a data source and describes the structure of the data, as well as any processing logic needed to
prepare the data for consumption. A dataset is just a metadata description of the data -- it contains all of
the data about the data -- plus a small sampling of raw rows to facilitate data discovery.
How does Platfora handle messy, complicated data?
Platfora's dataset workspace has a number of tools to help you cleanse and transform data into a
structured format. There are a number of built-in data parsers for common file formats, such as
Delimited Text, CSV, JSON, XML or Avro. For unstructured or semi-structured data, Platfora has an
extensive library of built-in functions that you can use to define data processing tasks.
When Platfora processes the data during a lens build, it also logs any problem rows that it could not
process according to the logic defined in the dataset. These 'dirty rows' are shown as lens build warnings.
Platfora administrators can investigate these warnings to determine the extent of the problem.
How does Platfora deal with multi-structured data?
Defining a dataset in Platfora overlays the structure on the data as a light-weight metadata layer. The
actual data remains in Hadoop in its raw form until it is requested by a Platfora user. This allows you
to have datasets with very different characteristics exist together in Platfora, described in the unifying
language of the Platfora dataset.
If two datasets have fields that can be used to join them together, then the logic of that join can also be
described in the dataset as a Reference. Modeling references between datasets within Platfora allows
you to quickly combine multi-structured data without having to move or pre-process the data up front.
How do I find available data?
Every dataset that is defined in Platfora is added to the Data Catalog. You can search or browse the data
catalog to find datasets of interest. The Platfora data catalog is the one place where you capture all of
your organizational knowledge about your data. It is where non-technical users can discover and request
the data they need.
How do I request data?
Once you find the dataset you want in the data catalog, you create a Lens in Platfora to request data
from that dataset. A lens is a selection of fields from the focal point of a single dataset. A dataset points
to data in Hadoop.
How does data get from Hadoop into Platfora?
Users bring data into Platfora by kicking off a Lens Build. A lens build runs a series of processing jobs
in Hadoop to pull, process, and optimize the requested data. The output of these jobs is the Lens. Once
the lens build jobs have completed successfully in the Hadoop cluster, the prepared lens data is then
copied over to the Platfora nodes. At this point the data is in Platfora and available for analysis.
Where does the prepared data (the lenses) reside?
Lens data is distributed across the Platfora worker nodes. This allows Platfora to use the resources of
multiple servers to process lens queries in parallel, and scale up as your data grows. Lenses are stored on
disk on the Platfora nodes, and are also loaded into memory whenever Platfora users interact with them.
Having the lenses in memory makes the queries run faster.
A copy of each lens is also stored in the primary Hadoop file system as a backup.
How do I explore the data in a lens?
Once a lens is built, the data is in Platfora and ready to explore in a Vizboard. The main way to interact
with the data in a lens is to create a Visualization (or Viz for short). A viz is just a lens query that is
represented visually as a chart, graph, or table.
Is building a viz the only way to look at the data in a lens?
No, but we think it is the best way! Platfora also has a REST API that you can use to programmatically
query a lens, or you can export lens data in CSV format for use in other applications or data workflows.
How is Platfora different from Hadoop tools like Hive?
First of all, users do not need to have any special technical knowledge to build a lens. Platfora enables
all levels of users to request data from Hadoop -- no programming or SQL skills required.
Secondly, with a query tool like Hive, each query is its own MapReduce job in Hadoop. You have
to wait for each query to run in order to see the results. If you want to change the query, you have to
rewrite it and run it again (and wait). It is not very responsive or interactive.
A lens is more like an on-demand data mart rather than a single query result. It contains optimized data
that is loaded into memory so the query experience is fast and interactive. The data contained in a lens
can support many combinations of queries, and the results are rendered visually so that insights are
easier to find.
What if a lens doesn't have the data I need?
If a lens doesn't quite meet your data requirements, there are a couple of things you can do:
• You can edit an existing lens definition to add additional fields or expand the scope of the data
requested.
• You can add computed fields directly in the vizboard to further manipulate the data you already have.
• You can go back to the data catalog and create an entirely new lens. You can even upload new data
from your desktop and combine it with datasets already in Platfora.
How can I know if the data is correct?
One of the advantages to having the entire data pipeline in one application is complete visibility at each
stage of the workflow. Platfora allows you to see the data lineage of every field in a lens, all the way
back to the source file that the data originated from.
How do I share my insights with others?
Platfora's vizboards were purpose built for sharing and collaboration. You can invite others to join you
in a vizboard, and use comment threads to collaborate. You can prepare view-only dashboards and
email them to your colleagues. You can also export data and images from the vizboard for use in other
business applications, like R, Excel, or PowerPoint.
About the Data Workflow
What are the steps involved in going from raw data in Hadoop to visualizations in Platfora? What skills
do you need to perform each step? This section explains each stage of the data workflow from data
ingest, to analysis, to collaboration.
Step 1: Define Data Sources to Connect to Raw Data in Hadoop
The first step in the data pipeline is to make the raw data accessible to Platfora. This is done by defining
a Data Source. A data source uses a data connector to point to some location in the Hadoop file system
or other external data server. Platfora has out-of-the-box data connectors for:
• HDFS
• MapR FS
• Amazon S3
• Hive Metastore
Platfora also provides APIs for defining your own custom data connectors.
Who does this step? System Administrators (someone who knows where the raw data resides and how
to provide access to it). System administrators also define the security permissions for the data sources.
Platfora users can only interact with data that they are authorized to see.
Step 2: Create Datasets to Describe the Structure of the Data
After you have connected to a data source, the next step is to describe and model the data by creating
Datasets in Platfora. A dataset is a pointer to a collection of raw data files along with a metadata
description of how those files are structured. Platfora provides a number of built-in file parsers for
common file formats, such as:
• Delimited Text
• Comma-Separated Values (CSV)
• JSON
• XML
• Avro
• Web Access Logs
• Hive Table Definitions
In addition to describing the structure of the data, the datasets also contain information on how to
process the data, plus how to join different datasets together. If you are familiar with ETL workflows
(extract, transform, and load), the dataset encompasses the extract and transform logic.
Who does this step? Data Administrators (someone who understands the data and how to make the data
ready for consumption).
Step 3: Build a Lens to Pull Data from Hadoop into Platfora
All datasets that have been defined in Platfora are available in Platfora's Data Catalog. The data catalog
is where Platfora users can see what data is available, and make requests for the data they need. The way
you request data is by choosing a dataset, then building a Lens from that dataset. A lens can be thought
of as an on-demand data mart, a summary table, or a materialized view.
A Lens Build automates a number of Hadoop processing tasks -- it submits a series of MapReduce jobs
to Hadoop, collects the results, and brings the results back into Platfora. The data populated to a lens is
pre-aggregated, compressed, and columnar. From the perspective of an ETL workflow, the lens build is
the load part of the process.
Who does this step? Data Analysts or Data Administrators (someone who understands the business
need for the data or has an analysis use case they want to achieve). Lenses provide self-service
access to the data in Hadoop -- users do not need any specialized technical skills to build a lens. Data
administrators may want to set up a schedule of production lenses that are built on a regular basis.
However, data analysts can also build their own lenses as needed.
Step 4: Create Vizboards to Analyze and Visualize the Data
Once a lens is built, the data is available in Platfora for analysis. Platfora users create Vizboards to
manage their data analysis projects. The vizboard can be thought of as a project workspace where you
can explore the data in a lens by creating visualizations.
A Visualization (or Viz for short) is the result of a lens query, but the data is represented in a visual
way. Visualizations can take various forms such as charts, graphs, maps, or cross-tabs. As users build
vizzes using the data in a lens, the data is loaded into memory so the experience is fast and interactive.
Within a vizboard, analysts can build dashboards (or pages) of visualizations that reveal particular
business insights or tell a data story. For example, a vizboard may show two or three charts that support
a future business direction or confirm the results of a past business campaign or decision.
Who does this step? Data Analysts (anyone who has access to the data and has a question or hunch they
want to investigate).
Step 5: Share Your Insights with Others
The Platfora Vizboard is a place where you can collaborate with your fellow analysts or share
prepared insights with business users. You can invite other Platfora users to view and comment on your
vizboards, or you can export images from a vizboard to send to others via email or PDF. You can also
export query results (the viz data) for use in other applications, such as Excel or R.
Who does this step? Data Analysts (anyone who has an insight they want to share).
Chapter 2: Manage Data Sources
The first step in making Hadoop data available in Platfora is identifying what source data you want to expose
to your business users, and making sure the data is in a format that Platfora can work with. Although the source
data may be coming into Hadoop from a variety of source systems, and in a variety of different file formats,
Platfora needs to be able to parse it into rows and columns in order to create a dataset in Platfora. Platfora
supports a number of data sources and source file formats.
Topics:
• Supported Data Sources
• Add a Data Source
• Configure Data Source Security
• Delete a Data Source
• Edit a Data Source
Only System Administrators can manage data sources.
Supported Data Sources
Hadoop supports many different distributed file systems, of which HDFS is the primary implementation.
Platfora provides data adapters for a subset of the file systems that Hadoop supports. Hadoop also has
various database and data warehouse implementations, some of which can be used as data sources for
Platfora. This section describes the data sources supported by Platfora.
Hive
Platfora can use a Hive metastore server as a data source, and map a Hive table definition to a Platfora dataset definition. Platfora uses the Hive table definition to obtain metadata about the source data, such as which files to process, the parsing logic for rows and columns, and the field names and data types contained in the source data.
It is important to note that Platfora does not execute queries through Hive; it only uses Hive tables to obtain the metadata needed for defining datasets. Platfora generates and runs its own MapReduce jobs directly in Hadoop.

HDFS
Hadoop Distributed File System (HDFS) is the primary storage system for Hadoop. Platfora can be configured to connect to the HDFS NameNode server and use the HDFS file system as its primary data source.

Amazon S3
Amazon Simple Storage Service (Amazon S3) is a distributed file system hosted by Amazon where you pay a monthly fee for storage space and data transfer bandwidth. It can be used as a data source for users who run their Hadoop clusters on Amazon EC2 or who utilize the Amazon EMR service.
Hadoop supports two S3 file systems as an alternative to HDFS: the S3 Native File System (s3n) and the S3 Block File System (s3). Platfora supports the S3 Native File System (s3n) only.

MapR FS
MapR FS is the proprietary Hadoop distributed file system of MapR. Platfora can be configured to connect to a MapR Container Location Database (CLDB) server and use the MapR file system as its primary data source.

Uploaded Files
Platfora allows you to upload files from your local file system into Platfora. These files are added to a special Uploads data source, which resides in the distributed file system (DFS) that the Platfora server is configured to use when it first starts up.

Custom Data Connector Plugins
Platfora provides Java APIs that allow developers to create custom data connector plugins. For example, you can create a plugin that connects to a relational database such as MySQL or PostgreSQL. Datasets created from a custom data source should be relatively small (less than 100,000 rows). External data is pulled over to Hadoop at lens build time via the Platfora master (which is not a parallel operation).
Add a Data Source
A data source is a connection to a mount point or directory on an external data server, such as a file
system or database server. Platfora currently provides data source adapters for Hive, HDFS, Amazon S3,
and MapR FS.
1. Go to the Data Catalog > Datasets page.
2. Click Add Dataset to open the dataset workspace.
3. Click New Source.
4. Enter the connection information for the data source server. The required connection information
depends on the Source Type you choose.
5. Click Connect.
6. Click Cancel to exit the dataset workspace.
Connect to a Hive Data Source
When Platfora uses Hive as a data source, it connects to the Hive metastore to query information about
the source data. There are multiple ways to configure the Hive metastore service in your Hadoop
environment. If you are using the Hive Thrift Metastore (known as the remote metastore client
configuration), you can add a Hive data source directly in the Platfora application. If you connect
directly to the Hive metastore relational database management system (RDBMS) (known as a local
metastore client configuration), this requires additional configuration on the Platfora master server. You
cannot define this type of Hive data source in the Platfora application.
See the Hive wiki for more information about the different Hive metastore client configurations.
Connect to a Hive Thrift Metastore
By default, the Platfora application allows you to connect to the Hive Thrift Metastore service. To use
the Thrift server as a data source for Platfora, you must start the Hive Thrift Metastore server in your
Hadoop environment and know the URI to connect to this server.
In a remote Hive metastore setup, Hive clients (such as Platfora) make a connection to the Hive Thrift
Metastore server which then queries the metastore database (typically a MySQL database) for the Hive
metadata. The client and metastore server communicate using the Thrift protocol.
You can add a Hive Thrift Metastore data source in the Platfora application. You will need to supply the
URI to the Hive Thrift metastore service in the format of:
thrift://hive_host:thrift_port
Where hive_host is the DNS host name or IP address of the Hive server, and thrift_port is the
port that the Hive Thrift metastore service is listening on. For Cloudera 4, Hortonworks 1.2, and MapR
installations the default Thrift port is 9083. For Hortonworks 2 installations, it is 9983.
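For example, a connection URI for a Cloudera installation might look like the following (the host name is a placeholder for your own Hive server):

thrift://hive-metastore.example.com:9083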
If the connection to Hive is successful, you will see a list of available Hive databases in that data source.
Click on a database name to show the Hive tables within that database. The default database in Hive is
named default. If you have not created your own databases in Hive, this is where all of your tables will
reside.
If you are using Hive views, they will also be listed. However, Hive views cannot be used as the basis of a Platfora dataset; you can only create a dataset from Hive tables.
If you have trouble connecting to Hive, make sure that the Hive Thrift metastore server process is
running, and that the Platfora server machine has access over the network to the designated Hive Thrift
server port. Also, make sure that the system user that the Platfora server runs as has read permissions to
the underlying data files in HDFS.
The Hive Thrift metastore is an optional service and is not usually started by default when you install
Hive, so it is possible that the service is not started. To check if Platfora can connect to the Hive Thrift
Metastore, run the following command from the Platfora master server:
$ hive --hiveconf hive.metastore.uris="thrift://your_hive_server:9083" \
  --hiveconf hive.metastore.local=false
Make sure the Hive server host name or IP address and Thrift port is correct for your Hadoop
installation. For Cloudera 4, Hortonworks 1.2, and MapR installations the default Thrift port is 9083. For
Hortonworks 2 installations, it is 9983.
If the Platfora server can connect, you should see the Hive console command prompt and be able to
query the Hive metastore. For example:
hive> SHOW DATABASES;
hive> exit;
If you cannot connect, it is possible that your Hive Thrift Metastore service is not running. Depending
on the Hadoop distribution you are using and the version of Hive server you are running, there are
different ways to start the Hive Thrift metastore. For example, run the following command on the server
where Hive is installed:
$ sudo hive --service metastore
or
$ sudo hive --service hiveserver2
Check on your Hive server to make sure it is started, and view your Hive server logs for any issues with
starting the metastore.
Connect to a Hive RDBMS Metastore
If you are not using the Hive Thrift Metastore server in your Hadoop environment, you can configure
Platfora to connect directly to a Hive metastore relational database management system (RDBMS), such
as MySQL. This requires additional configuration on the Platfora master server that must be done before
you can create the data source in the Platfora application.
The Platfora master server needs a hive-site.xml file with the correct
RDBMS connection information. You also need to install the appropriate JDBC driver on the Platfora
master server, and make sure that Platfora can find the Java libraries and class files for the JDBC driver.
Here is an example hive-site.xml to connect to a MySQL metastore. A hive-site.xml containing
these properties must reside in the local Hadoop configuration directory of the Platfora master server.
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://hive_hostname:port/metastore</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>database_username</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>database_password</value>
  </property>
  <property>
    <name>hive.metastore.client.socket.timeout</name>
    <value>120</value>
  </property>
  <property>
    <name>hive.metastore.batch.retrieve.max</name>
    <value>100</value>
  </property>
</configuration>
The Platfora server would also need the MySQL JDBC driver installed in order to use this configuration.
You can place the JDBC driver .jar files in $PLATFORA_DATA_DIR/extlib to install them
(requires a Platfora restart).
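For example, assuming you have downloaded the MySQL Connector/J driver (the exact .jar file name depends on the version you obtained), you could copy it into place and then restart Platfora:

$ cp mysql-connector-java-5.1.36-bin.jar $PLATFORA_DATA_DIR/extlib/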
You can add a Hive RDBMS Metastore data source in the Platfora application after you have done
the appropriate configuration on the Platfora master server. When you leave the Thrift Metastore
URI blank, the Platfora application will look for the metastore connection information in the hive-site.xml file on the Platfora master server.
If the connection to Hive is successful, you will see a list of available Hive databases in that data source.
Click on a database name to show the Hive tables within that database. The default database in Hive is
named default. If you have not created your own databases in Hive, this is where all of your tables will
reside.
If you are using Hive views, they will also be listed. However, Hive views cannot be used as the basis
of a Platfora dataset. You can only create a dataset from Hive tables.
If you have trouble connecting to the Hive RDBMS metastore, make sure that:
• The Hive RDBMS metastore server process is running (i.e. the MySQL database server is running).
• The Platfora server machine has access over the network to the designated database server host and
port.
• The system user that the Platfora server runs as has database permissions granted on the appropriate
database objects in the RDBMS. For example, if using a MySQL metastore you could run a
command such as the following in MySQL:
GRANT ALL ON *.* TO 'platfora'@'%';
• The system user that the Platfora server runs as has read permissions to the underlying data files in
HDFS.
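The GRANT ALL statement shown above is deliberately broad. If your security policy calls for something narrower, a read-only grant scoped to just the metastore database may suffice. This sketch assumes the metastore database is named metastore, matching the ConnectionURL in the earlier hive-site.xml example:
GRANT SELECT ON metastore.* TO 'platfora'@'%';
FLUSH PRIVILEGES;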
Connect to an HDFS Data Source
Creating an HDFS data source involves specifying the connection information to the HDFS NameNode
server. Once you have successfully connected, you will be able to browse the files and directories in
HDFS, and choose the files that you want to add to Platfora as datasets.
When you add a new data source that connects to an HDFS NameNode server, you will need to supply
the following connection information:
Source Type: HDFS

Name: A name for the data source location. This can be any name you choose, such as HDFS User
Data or HDFS Root Directory.

Host: The external DNS hostname or IP address of the HDFS NameNode server.

Port: The port that the HDFS NameNode server listens on for connections. For Cloudera installations,
the default port is 8020. For Apache installations, the default port is 9000.

Root Path: The HDFS directory that Platfora should access. For example, to access the entire HDFS
file system, use / (root directory). To access a particular directory only, enter the qualified path (for
example, /user/data or /data/weblogs).
If the connection to HDFS is successful, you will see a list of the files and directories that reside in the
specified location of the HDFS file system when defining a dataset from the data source.
If you have trouble connecting to HDFS, make sure that the HDFS NameNode server process is running,
and that the Platfora server machine has access over the network to the designated NameNode port.
Also, make sure that the system user that the Platfora server runs as has read permissions to the HDFS
directory location you specified.
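With a Hadoop client installed, you can verify connectivity and read permissions in one step by listing the root path from the Platfora master server. The hostname, port, and path below are placeholders; running the command as the system user that Platfora runs as also confirms read permissions:
$ hadoop fs -ls hdfs://namenode.example.com:8020/data/weblogs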
Connect to an S3 Data Source
Amazon Simple Storage Service (Amazon S3) is a distributed file system hosted on Amazon Web
Services (AWS). Data transfer is free between S3 and Amazon cloud servers, making S3 an attractive
choice for users who run their Hadoop clusters on EC2 or utilize the Amazon EMR service.
If you are not using Amazon EC2 or EMR as your primary Hadoop implementation for Platfora, you can
still use S3 as a data source, but keep in mind that the source data will be copied to the Platfora primary
Hadoop implementation during the lens build process. If you are transferring a lot of data between S3
and another network outside of Amazon, it could be slow.
Hadoop supports two S3 file systems as an alternative to HDFS: S3 Native File System (s3n) and S3
Block File System (s3). Platfora supports the S3 Native File System (s3n) only.
When you add a new data source that connects to an S3 data source, you will need to supply the
following connection information:
Source Type: Amazon S3

Name: A name for the data source location. This can be any name you choose, such as S3 Sample
Data or Marketing Bucket on S3.

Bucket Name: A bucket is a named container for objects stored in Amazon S3. If you go to your AWS
Management Console S3 Home Page, you can see the list of buckets you have created for your account.

Path: The directory in the specified bucket that Platfora should access. For example, to access the
entire bucket, use / (root directory). To access a particular directory only, enter the qualified path (for
example, /user/data or /data/weblogs).
If the connection to Amazon S3 is successful, you will see a list of the files and directories that reside in
the specified location of the S3 file system when defining a dataset from the data source.
If you have trouble connecting to Amazon S3, make sure that the Platfora server machine has access
over the network to Amazon Web Services, and that your S3 connection information and AWS security
credentials are specified in the core-site.xml configuration file of the Platfora master server. If you
are using Amazon EMR as the default Hadoop implementation for Platfora, your Platfora administrator
should have configured the S3 connection information and AWS security credentials during installation.
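For reference, the S3 credentials typically appear in core-site.xml as the standard Hadoop s3n properties. This is an illustrative sketch with placeholder values; property names can vary by Hadoop version:
<property>
  <name>fs.s3n.awsAccessKeyId</name>
  <value>YOUR_AWS_ACCESS_KEY</value>
</property>
<property>
  <name>fs.s3n.awsSecretAccessKey</name>
  <value>YOUR_AWS_SECRET_KEY</value>
</property>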
Connect to a MapR Data Source
Creating a MapR data source involves specifying the connection information to the MapR Container
Location Database (CLDB) server. Once you have successfully connected, you will be able to browse
the files and directories in the MapR file system, and choose the files that you want to add to Platfora as
datasets.
When you add a new data source that connects to a MapR cluster, you will need to supply the following
connection information:
Source Type: MapR

Name: A name for the data source location. This can be any name you choose, such as MapR File
System or MapRFS Marketing Directory.

Host: The external DNS hostname or IP address of the MapR Container Location Database (CLDB)
server.

Port: The port that the MapR CLDB server listens on for client connections. The default port is 7222.

Root Path: The MapR file system (MapRFS) directory that Platfora should access. For example, to
access the entire file system, use / (root directory). To access a particular directory only, enter the
qualified path (for example, /user/data or /data/weblogs).
If the connection to MapR is successful, you will see a list of the files and directories that reside in the
specified location of the MapR file system when defining a dataset from the data source.
If you have trouble connecting to MapR, make sure that the CLDB server process is running, and that
the Platfora server machine has access over the network to the designated CLDB port. Also, make
sure that the system user that the Platfora server runs as has read permissions to the MapRFS directory
location you specified.
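If a MapR client is installed on the Platfora master server, you can sanity-check CLDB connectivity and read access with a directory listing. The path below is a placeholder:
$ hadoop fs -ls maprfs:///data/weblogs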
Connect to Other Data Sources
The Other data source type allows you to specify a connection URL to an external data source server.
You can use this to create a data source when you already know the protocol and URL to connect to a
supported data source type.
When you add a data source using Other, you will need to supply the following connection
information:
Source Type: Other

Name: A name for the data source location. This can be any name you choose, such as My File
System or Marketing Data.

URL: A connection URL for the data source using one of the supported data source protocols (hdfs,
maprfs, thrift, or s3n). You can also use the file protocol to access a directory or file on the local
Platfora master server file system. For example:
file://localhost:8001/file_path_on_platfora_master
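For reference, here are illustrative URLs for the other supported protocols. The hostnames, ports, and paths are placeholders; use the default ports noted earlier in this chapter:
hdfs://namenode.example.com:8020/data/weblogs
maprfs://cldb.example.com:7222/user/data
s3n://my-bucket/data
thrift://hive-metastore.example.com:9083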
If the connection to the data source is successful, you will see a list of the files and directories that reside
in the specified location of the file system when defining a dataset from the data source.
If you have trouble connecting, make sure that the Platfora server machine has access over the network
to the designated server. Also, make sure that the system user that the Platfora server runs as has read
permissions to the directory location specified.
About the Uploads Data Source
When you first start the Platfora server, it connects to the configured distributed file system (DFS) for
Hadoop, and creates a default data source named Uploads. This data source cannot be deleted. You
can upload single files residing on your local file system, and they will be copied to the Uploads data
source in Hadoop.
For large files, it may take a few minutes for the file to be uploaded. The largest file that you can upload
through the Platfora web application is 50 MB. If you have large files, consider adding them directly in
the Hadoop file system rather than uploading them through the browser.
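For example, to place a large file into HDFS directly instead of uploading it through the browser, you might use the Hadoop client from a machine with access to the cluster. The file name and target directory here are placeholders:
$ hadoop fs -put clickstream_2015.log /data/uploads/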
Other than adding new files, you cannot manage the files in the Uploads data source through the
Platfora application. You cannot remove files from the Uploads data source once they have been
uploaded, or create sub-directories to organize uploaded files. If you want to remove a file, you must
delete it in the DFS source system. Re-uploading a file with the same file name will overwrite the
previously uploaded copy of the file.
Upload a Local File
You can upload a file through the Platfora application and it will be copied to the default Uploads data
source in Hadoop. Once a file is uploaded, you can then select it as the basis for a dataset.
1. Go to the Data Catalog page.
2. Click Add Dataset to open the dataset workspace.
3. Click Upload File.
4. Browse your local file system and select the file you want to upload.
5. Click Upload.
After the file is uploaded, you can either Cancel to exit the dataset workspace, or Continue to
define a dataset from the uploaded file.
About Security on Uploaded Files
By default, data access permissions on the Uploads data source are granted to the Everyone group.
Object permissions allow the Everyone group to define datasets from the data source (and thereby
upload files to this data source).
Keep in mind that only users with a system role of System Administrator or Data
Administrator are allowed to create datasets, so only these roles can upload files.
Configure Data Source Security
Only system administrators can create data sources in Platfora. Access to the files in a data source
location is controlled by granting data access permissions to the data source. The ability to manage or
define datasets from a data source is controlled by its object permissions.
1. Go to the Data Catalog page.
2. Click Add Dataset to open the dataset workspace.
3. Select the data source in the Source List.
4. Click the data source information icon (to the right of the data source name).
5. Click Permission Settings.
6. The Data Access section lists the users and groups allowed to see the data coming from this data
source location. If a user does not have data access, they will not be able to see any data values in
Platfora that originate from this data source. Data access permissions apply to any Platfora object
created from this source (dataset, lens, or viz).
Data access defaults to the Everyone group (click the X to remove it).
Click Add Data Access to grant data access to other users and groups.
7. The Collaborators section lists the users and groups allowed to access the data source object.
Click Add Collaborators to grant object access to users or groups.
The following data source object permissions can be granted:
• Define Datasets on Data Source. The ability to define datasets from files and directories
in the data source.
• Manage Permissions on Data Source. Includes the ability to define datasets plus the
ability to grant data access and object access permissions to other Platfora users.
Delete a Data Source
Deleting a data source from Platfora removes the data source connection as well as any Platfora dataset
definitions you have created from that data source. It does not remove source files or directories from the
source file system, only the Platfora definitions. The default Uploads data source cannot be deleted.
1. Go to the Data Catalog page.
2. Click Add Dataset to open the dataset workspace.
3. Select the data source you want to delete from the Source List.
4. Click the data source information icon (to the right of the data source name).
5. Click Delete.
6. Click Confirm to delete the data source and all of its dataset definitions.
7. Click Cancel to exit the dataset workspace.
Edit a Data Source
You typically do not need to edit a data source once you have successfully established a connection.
If the connection information changes, however, you can edit an existing data source to update its
connection information, such as the server name or port of the data source. You cannot, however, change
the name of a data source after it has been created.
1. Go to the Data Catalog page.
2. Click Add Dataset to open the dataset workspace.
3. Select the data source you want to edit from the Source List.
4. Click the data source information icon (to the right of the data source name).
5. Click Edit.
6. Change the connection information for the data source. You cannot change the name of a data source
after it has been saved.
7. Click Save.
8. Click Cancel to exit the dataset workspace.
Chapter 3: Define Datasets to Describe Data
Data in Hadoop is added to Platfora by defining a Dataset. A dataset describes the characteristics of the source
data, such as its file locations, the structure of individual rows or records, the fields and data types, and the
processing logic to cleanse, transform, and aggregate the data when it is loaded into Platfora. The collection of
modeled datasets makes up the Data Catalog (the data items available to Platfora users). This section explains
how to create and manage datasets in Platfora.
Topics:
• FAQs - Dataset Basics
• Understand the Dataset Workspace
• Understand the Dataset Creation Process
• Understand Dataset Permissions
• Select Source Data
• Parse the Data
• Prepare Base Dataset Fields
• Transform Data with Computed Fields
• Add Measures for Quantitative Analysis
• Prepare Date/Time Data for Analysis
• Prepare Location Data for Analysis
• Prepare Drill Paths for Analysis
• Model Relationships Between Datasets
• Define the Dataset Key
FAQs - Dataset Basics
This section answers the most frequently asked questions (FAQs) about creating and managing Platfora
datasets.
What is a dataset?
A dataset points to a set of files in a data source and describes the structure of the data, as well as any
processing logic needed to prepare the data for consumption. A dataset is just a metadata description of
the source data (data about the data), plus a small sampling of raw rows to facilitate data
discovery.
What are the prerequisites for creating a dataset?
You need access to the source data. Before you add a new dataset to Platfora, the source data files on
which the dataset is based must be in the Hadoop file system and accessible to Platfora via a Data
Source. You can also upload files from your desktop to the default Uploads data source.
Who can create a dataset?
Only System Administrators or Data Administrators can create and edit datasets in Platfora.
You must also have data access permissions to the source data in order to define a dataset from data files
in Hadoop. The person who creates the dataset becomes the dataset owner. The dataset owner can grant
permissions to other Platfora users.
How do I create a dataset?
Go to the Data Catalog and click Add Dataset.
The dataset workspace guides you through a series of steps to define the structure and processing rules
for the data. See Understand the Dataset Creation Process.
How do I edit an existing dataset?
Open the dataset detail page, and click Edit...
or find the dataset in the Data Catalog and choose Edit from the dataset action menu.
If the edit option is not available, it means you don't have the appropriate permissions. Ask the dataset
owner to grant you edit permission.
How do I rename a dataset?
You cannot rename a dataset after it has been saved for the first time.
You can, however, make a duplicate copy of a dataset and save it under a new name. You can then
delete the old dataset and keep the renamed copy. Note that any references to the renamed dataset will be
broken in other datasets, so you will have to update those manually.
Can I make a copy of a dataset?
Yes, you can make a copy of an existing dataset. Edit the dataset you want to copy, and choose Save
As from the dataset workspace Save menu.
Platfora makes a copy of the current version of the dataset using the new name. Any dataset changes that
were made since saving the previous dataset are applied to the new dataset only.
You might want to copy an existing dataset to:
• Experiment with changes to the dataset computed fields without affecting the original dataset.
• Create another dataset that accesses different source files for users that only have access to source
files in a different path.
• Change the name of the dataset (then delete the original dataset).
Since duplicating a dataset changes its name, references to the previous dataset will not be automatically
updated to point to the duplicated dataset. You must manually edit the other datasets and update their
references to point to the new dataset name instead.
How do I delete a dataset?
Open the dataset detail page, and click Delete...
or find the dataset in the Data Catalog and choose Delete from the dataset action menu.
If the delete option is not available, it means you don't have the appropriate permissions. Only a dataset
owner can delete a dataset.
Deleting a dataset does not remove files or directories in the source file system. It does not remove
lenses built from the dataset. Any lenses that have been built from the dataset will remain in Platfora,
however future lens builds that use a deleted dataset will fail. Also, any references to the deleted dataset
will be broken in other datasets.
What kinds of data can be used to define a dataset?
You can define a dataset from data files that reside in Hadoop. Platfora supports a number of file formats
out-of-the-box. See Supported Source File Formats.
How do I join datasets together?
The logic of a join is described within the dataset definition as a Reference. A reference joins two
datasets together using fields they share in common. A reference creates a link in one dataset to the
primary key of another dataset. The actual joining of the datasets happens during lens build time, not
when the reference is created. See Model Relationships Between Datasets.
What are the different kinds of columns or fields that a dataset can have?
A field is an atomic unit of data that has a name, a value, and a data type. A column is a set of data
values of a particular data type, with one value for each row in the dataset. Columns provide the
structure for composing a dataset row. The terms column and field are often used interchangeably.
Within a dataset, there are three basic classes of fields or columns:
• Base fields are the raw fields parsed directly from the source data.
• Computed fields are fields that you add to the dataset to perform some kind of extraction, cleansing,
or transformation on the base data fields.
• Measure fields are a special type of computed field that specifies how the data should be aggregated
when it is analyzed. For example, suppose you had a Dollars Sold field in your dataset. At analysis
time, you may want to know the Total Dollars Sold per day (a SUM aggregation). Measures serve as
the quantitative data in an analysis, and every dataset, lens, and viz must have at least one measure.
Every dataset column or field also has a data type, which describes the kind of values allowed in that
column. See About Platfora Data Types.
You can change the data types of base fields. Computed field data types are set by the output type of
their computed expression.
How do I transform or manipulate the data?
To transform or manipulate the data, add computed fields to the dataset. Platfora's expression language
has an extensive library of built-in functions and operators that you can use to define computed fields.
Think of a computed field as a single step in an ETL (extract, transform, load) workflow. Sometimes
several steps, or computed fields, are needed to achieve the result you want. You can hide the computed
fields that do interim data processing steps.
How do I request data from a dataset?
You request data from a dataset by choosing one dataset in the Data Catalog, and creating a lens from
that dataset. When you create a lens, you can choose any fields you want from the focus dataset, plus
dimension fields from any dataset that it references. When you build the lens, Platfora will go fetch the
data from Hadoop and prepare it for analysis.
See Define Lenses to Load Data.
Understand the Dataset Workspace
When you add a new dataset or edit an existing one, you are brought to the dataset workspace. This is
where you describe the structure and characteristics of your source data in the form of a Platfora dataset.
1. The dataset workspace is divided into six areas to guide you through the dataset definition process.
You can go back and forth between the areas as you work on the dataset. You do not have to do the
steps in order.
2. The dataset is horizontally divided into columns (or fields). Columns are listed in the order that they
occur in the source data (for original base fields), then in the order that they were added to the dataset
(for computed fields).
3. When you select a column, the Field Info panel shows the field detail information. This is where
you can edit things like the field name, description, data type, or quick measure aggregations.
4. Platfora shows twenty rows to help with data discovery. These are records taken directly from the
source data files, and shown with the parsing or expression logic applied.
Some computed columns do not show sample values because the values are computed at lens build
time, such as measures (aggregate computed fields), event series processing (ESP) computed fields,
and any computed field that operates on fields not in the current dataset (via references).
5. At the bottom of the dataset workspace is where you can navigate between areas (Back or
Continue), save your changes (Save, Save As, or Save and Exit), or exit the dataset without
saving (Cancel).
Understand the Dataset Creation Process
There are several steps involved in creating a Platfora dataset. This section helps data administrators
understand all of the tasks to consider when defining a dataset in Platfora. The goal of the dataset is to
make the data consumable for data analysts and business users.
The dataset workspace is divided into six areas to guide you through the dataset definition process. You
can go back and forth between the areas as you work on the dataset. You do not have to do the steps in
order.
Step 1 - Select Data
The Select Data step is where you point Platfora to a specific location in a data source. You can only
browse the data sources that have been added to Platfora by a system administrator.
Once the dataset has been saved, the Select Data step becomes disabled. You can change the Source
Data location within the same data source, but you cannot switch data sources for an existing dataset.
Step 2 - Parse Data
The Parse Data step is where you specify the parsing logic used to extract rows and columns from the
source data. Platfora comes with several built-in parsers for the most common file formats. After you
have done the initial parsing, you usually don't need to revisit this step unless the underlying structure of
the data changes.
The Wrangled tab shows the data with the parsing logic applied. The Raw tab shows the original raw
data records.
Step 3 - Manage Fields
The Manage Fields step is where you prepare the actual fields that users can see and request from the
Platfora data catalog. This is where the majority of the dataset definition work is performed.
To make sure that the data is in consumable format for analysis, you may need to:
1. Verify the base field data types
2. Give fields meaningful names
3. Add field descriptions to help users understand the data
4. Add computed fields to further transform and process the data
5. Identify the dataset measures
6. Hide fields you don't want users to see
7. Specify how NULL values are handled
8. Prepare geo-location data for analysis
9. Prepare datetime data for analysis
10. Define drill path hierarchies
Step 4 - Create References
The Create References step is where you create joins to other datasets.
You may need to come back to this step later once all of the dependent datasets have been added to
Platfora. When adding the dependent datasets, you must make sure that a) they have a primary key, and
b) the data types of the primary key and foreign key fields are the same in both datasets.
Step 5 - Define Key
The Define Key step is where you choose the column(s) that uniquely identify each row in the dataset,
also known as the primary key of the dataset.
A dataset only needs a primary key if:
• You plan to join to it from another dataset (it is the target of a reference)
• You want to use it as the focus of an event series lens
• You want to use it to define segments
Step 6 - Finish & Save
The Finish & Save step is where you can add a description of the dataset and verify the dataset name.
Dataset names cannot be changed after the dataset has been saved for the first time.
Understand Dataset Permissions
Only system and data administrators can create datasets in Platfora. The ability to edit or create a
lens from a dataset is controlled by the dataset's object permissions. In addition to the dataset object
permissions, users must also have access to the source data itself in order to see and work with the data
in Platfora.
Platfora controls access to a dataset at two levels:
• Source Data Access Permission - Source data access permission determines who is authorized
to view the raw source data. By default, data access permission is controlled at the data source
level only, and is inherited by the datasets coming from that data source. Your Platfora system
administrator may also configure Platfora to authorize data access using the permissions set in HDFS.
In these two cases, data access permission is disabled at the dataset level. If Platfora is configured
for more granular per-dataset access control, then data access can be set independently of the data
source, but this is not the default behavior.
• Dataset Object Permissions - Dataset object permissions control who can edit, delete, or create a
lens from a dataset within the Platfora application.
Users must have dataset permissions at both levels in order to work with a dataset.
To manage permissions for a dataset, find the dataset in the data catalog and select Permissions.
Click Add Collaborators to choose new users or groups to add.
By default, the user who created the dataset is the owner, and the Everyone group is granted Define
Lens from Dataset access.
The following dataset object permissions can be granted:
• Define Lens from Dataset. The ability to define a lens from the visible fields of a dataset.
The fields of referenced datasets are not included in this permission by default. A user must have
appropriate permissions on each individual dataset in order to choose dataset fields for a lens. By
default, all datasets have this permission granted to Everyone.
• Edit. Define lens plus the ability to edit the dataset definition. Editing the dataset definition means a
user can see the raw data, including hidden fields.
• Own. Edit plus the ability to delete a dataset or manage its permissions.
Select Source Data
After you have created a data source, the first step in creating a dataset is selecting some Hadoop source
data to expose in Platfora. This is accomplished by choosing files from the source file system.
For Hive data sources, a single Hive table definition maps to a single Platfora dataset.
For file system data sources such as HDFS or S3, a dataset can map to either a single file or to multiple
files residing in the same parent directory location.
For the default Uploads data source, a dataset usually maps to a single uploaded file, although you can
select multiple uploaded files if they use a similar file naming convention.
Supported Source File Formats
To ingest source data, Platfora uses its parsing facilities to parse the data into records (rows) and fields
(columns). Platfora supports the following source file formats and file compression formats.
Hive Tables: When creating a dataset from a Hive table, there is no need to define parsing controls
in Platfora. Platfora uses the Hive table definition to obtain metadata about the source data, such as
which files to process, the parsing logic for rows and columns, and the field names and data types
contained in the source data. Since Platfora relies on Hive to do the file parsing, you must make sure
that Hive is able to correctly handle the source file format of the underlying table data files. Platfora is
able to parse Hive tables that refer to data in the following file formats:
• Delimited Text file format
• SequenceFile format
• Record Columnar File (RCFile) format
• Optimized Row Columnar (ORC) file format
• Custom Input Format (provided that the SerDe used to define the row format is also installed in
Platfora)

Delimited Text: A delimited file is a plain text file format for describing tabular data. It refers to any
file that is plain text (typically ASCII or Unicode characters), has one record per line, has records
divided into fields, and has the same sequence of fields for every record. Records (or rows) are
separated by line breaks, and fields (or columns) within a line are separated by a special character
called the delimiter (usually a comma or tab character). If the delimiter also appears in the field
values, it must be escaped. The Platfora delimited parser supports single character escapes (such as a
backslash), as well as enclosing field values in double quotes (as is common with CSV files).

CSV: Comma-separated value (CSV) files are a type of delimited text file. The Platfora delimited
file parser also supports typical CSV formatting conventions, such as enclosing field values in double
quotes, using double quotes to escape literal quotes, and the use of header rows.

JSON: JavaScript Object Notation (JSON) is a data-interchange format based on a subset of the
JavaScript Programming Language. JSON is a text format comprised of two basic data structures:
objects and arrays. The Platfora JSON parser supports the selection of a top-level JSON object to
signify a record or row, and selection of name:value pairs within an object to signify columns or fields
(including nested objects and arrays).

XML: Extensible Markup Language (XML) is a markup language that defines a set of rules for
encoding documents in a format that is both human-readable and machine-readable. XML is a text
format comprised of two basic data structures: elements and attributes. The Platfora XML parser
supports the selection of a top-level XML element to signify a record or row, and selection of
attribute:value or element:value pairs within a parent element to signify columns or fields (including
nested elements).

Avro: Apache Avro is a remote procedure call (RPC) and data serialization framework. Its primary
use is to provide a data serialization format for persistent data stored in Hadoop. It uses JSON for
defining schema, and JSON or binary format for data encoding. When Avro data is stored in a
persistent file (called a container file), its schema is stored along with it. This allows any program to
read the serialized data in the file.

Hadoop Sequence Files: Sequence files are a flat file format containing binary records, generated by
Hadoop MapReduce tasks and commonly used for storing data in Hadoop. Platfora can import records
contained within a sequence file as long as the format of the records is delimited text, CSV, JSON,
XML, or Avro.

Web Access Logs: A web access log contains records about incoming requests made to a web server.
Platfora has a built-in parser that automatically recognizes web access logs that adhere to the NCSA
common or combined log formats used by many popular web servers (such as Apache HTTP Server).

Other File Types: For semi-structured file formats, you can still define parsing logic using regular
expressions or Platfora's built-in expression language. Platfora provides a Regex or Line parser to
allow you to define your own parsing logic to extract data columns from the records in your source
files (as long as your source files have one record per line).

Custom Data Sources: For source data coming in from a custom data connector, the logic of the data
connector dictates the format of the data. For example, if using a JDBC data connector to access data
in a relational database, the data is returned in delimited format.
For Platfora to read a compressed source file, both your Platfora and your Hadoop configuration must
support the compression format. By default, Platfora supports the following formats:
Deflate (zlib), Gzip, Bzip: Platfora and Hadoop support these formats out-of-the-box.

Snappy: Platfora includes support for Snappy in its distribution. Hadoop does not. Your administrator
must configure Hadoop to support Snappy. Refer to your Hadoop distribution documentation for
information on configuring Snappy.

LZO (Hadoop-LZO), LZ4: Due to licensing restrictions, Platfora does not bundle support for these
with the product. Your administrator must configure these compression formats both in Platfora and
Hadoop. Although neither compression format is explicitly qualified with each new release, Platfora
will fix issues and release patches if a problem is discovered.
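As an illustration, enabling Snappy on the Hadoop side typically involves registering the codec in core-site.xml. The exact steps vary by distribution, so treat this as a sketch and defer to your Hadoop distribution's documentation:
<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.SnappyCodec</value>
</property>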
Select a Hive Source Table
For Hive data sources, Platfora points to a Hive metastore server. From that point, you can browse the
available databases and tables, and select a single Hive table on which to base your dataset. Since Hive
tables are already in tabular format, the parsing step is skipped for Hive data sources. Platfora does not
execute any queries through Hive; it only uses the table definition to obtain the metadata needed to
define the dataset.
1. On the Select Data step of the dataset workspace, select a Hive data source from the Source
List.
2. Select the Hive database that contains the table you want to use.
The default Hive database is named default.
3. Select a single Hive table. Only tables can be used to define datasets, not views.
Platfora will use the Hive table definition to determine the source files, columns, data types,
partitioning, and so on.
4. Click Continue.
Platfora skips the Parse Data step for Hive data sources, and goes directly to the Manage Fields
step.
Select DFS Source Files
For distributed file system data sources, such as HDFS and S3, a data source points to a particular
directory in the file system. From that point, you can browse and select the source files to include in
your dataset. You can enter a wildcard pattern to select multiple files, including files from multiple
directory locations; however, all of the files selected must be of the same file format.
1. On the Select Data step of the dataset workspace, select an HDFS or S3 data source from
the Source List.
2. Browse the file system to choose a directory or file you want to use as the basis for your dataset.
3. To select multiple files within the selected directory, use a wildcard pattern in the Source
Location path, where ? represents a single character and * represents any number of characters.
For example, suppose you wanted to base a dataset on log files that are partitioned into monthly
directories. To select all log files for 2014, you could use a wildcard path such as:
hdfs://myhdfs.mycompany.com/data/*2014/*.log
4. In the Selected Source Files list, confirm that the files you want are selected. If a large number
of source files are selected, Platfora will only display the first 200 file names.
5. Click Continue.
Edit the Dataset Source Location
Once the dataset has been saved, the Select Data step of the dataset workspace becomes disabled.
You can edit the dataset to point to a different source location as long as it is in the same data source.
You cannot switch data sources for a dataset after it has been saved. For example, you cannot change a
dataset that points to the Uploads data source to then use another HDFS data source instead.
1. Open the dataset workspace, and click Source Data in the dataset header.
2. Edit the Source Location to point to the new directory path or file name within the same data
source.
3. Click Update.
4. Click Save.
Parse the Data
The Parse Data step of the dataset workspace is where you specify the parsing options for a dataset.
This section describes how to use Platfora's built-in file parsers to describe your source data in tabular
format (rows and columns). The built-in parsers assume that each record has a similar data structure.
View Raw Source Data Rows
On the Parse Data step of the dataset workspace, Platfora shows a sample of raw lines or records
from a source data file. This allows you to compare the data in its original format (the Raw data) to the
data with the parsing logic applied (the Wrangled data).
Viewing the raw data is helpful in determining the parsing logic, and when writing computed field
expressions that do transformations on base fields.
For delimited data, Platfora shows a sampling of 20 lines taken from one source file.
For structured file formats, such as JSON and XML, Platfora shows a sampling of the first 20 top-level
objects taken from one source file. If your data is one record per file, only one file is shown (one sample
record).
The Raw tab shows a sample of records from one source data file. The Wrangled tab shows the data
values after the parsing logic has been applied.
1. Open the dataset and go to the Parse Data step of the dataset workspace.
2. Select the Raw tab.
3. To see where the sample records are coming from, click Source Data.
4. To make sure you are seeing the latest source data, click the refresh button. The sample data rows are
cached, and this will ensure that the cache is refreshed from the source.
Update the Dataset Sample Rows
Platfora displays a sample of dataset rows to facilitate the data ingest process. The sample consists of 20
records taken from the first file in the source location. If a dataset is comprised of multiple source files,
you can change which file the sample rows are taken from. You can also refresh the sample rows to read
from the latest source data.
You can only change the sample file for an existing dataset. When the dataset is first created, Platfora
takes a sample of rows and stores it in the dataset cache. You may want to take the sampling from a
different source file, or refresh the data if the original source data has changed.
1. Open the dataset and go to the Parse Data step of the dataset workspace.
2. Click Source Data in the dataset header.
3. Choose another file from the Display sample data using drop-down.
This option is only available for source locations that point to multiple files.
4. Click Update.
5. (Optional) Click the refresh button to resample rows from the original source data file.
Refreshing the sample rows is particularly useful when you replace a file in the Uploads data
source. The cached sample rows are not updated automatically when the source data changes.
Update the Dataset Source Schema
Over time a dataset's source schema may evolve and change. You may need to periodically re-parse the
source data to pick up schema changes, such as when new columns are added in the source data.
Updating the dataset source schema in this way only applies to Hive and
Delimited source data types.
Update Schema for Hive Datasets
Datasets based on Hive tables have the Parse Data step disabled in the Platfora dataset workspace.
This is because the Hive table definition is used to determine the dataset columns and their respective
column order, column names, and data type information.
If the source data schema changes for a Hive-based data source, you would first update the table
definition in Hive. Then in Platfora you can refresh the dataset schema to get the latest dataset columns
from the Hive table definition.
1. Update the table in the Hive source system.
2. Edit the dataset in Platfora.
3. Click Source Data at the top of the dataset workspace.
4. Click Refresh Hive.
5. Click Update.
Platfora re-reads the table definition from Hive and displays the updated column order, names, and
data types.
6. Save your changes.
Update Schema for Delimited Datasets
For datasets based on delimited text or comma-separated value (CSV) files, the only schema change that
is supported is appending new columns to the end of a row. If new columns are added in the source
data files, you can refresh the schema to pick up the new columns. Changing the column order (adding
new columns in the middle of the row) is not supported for delimited datasets. For delimited datasets
that have a header row, the base column names in the Platfora dataset definition must match the header
column names in the source data file in order to use this feature.
Older source data rows that do not have the new appended columns will just have NULL (empty) values
for those columns.
1. Edit the dataset in Platfora.
2. Click Source Data at the top of the dataset workspace.
3. Choose a source data file that has the most recent schema containing the new columns.
4. Click Refresh Schema.
5. Click Update.
Platfora re-reads the schema from the sample file and displays the new base columns (as long as the
new columns are appended at the end of the rows).
6. Save your changes.
Parse Delimited Data
To use Platfora's delimited file parser, your data must be in plain text file format, have one record per
line, and have the same sequence of fields for every record separated by a common delimiter (such as a
comma or tab).
Delimited records (or rows) are separated by line breaks, and fields (or columns) within a line are
separated by a special character called the delimiter (usually a comma or tab character). If the delimiter
also appears in the field values, it must be escaped. The Platfora delimited parser supports single
character escapes (such as a backslash), as well as enclosing field values in double quotes (as is common
with CSV files).
On the Parse Data step of the dataset workspace, the Parsing Controls for the Delimited parser
are as follows:
File Type: Choose the Delimited parser for delimited text and CSV files. The Wrangled view shows
the data with the parsing logic applied.

Row Delimiter: Specifies a single character used to separate rows (or records) in your source data
files. In most delimited files, rows are separated by a new line, such as the line feed character, carriage
return character, or carriage return plus line feed. Line feed is the standard new line representation on
UNIX-like operating systems. Other operating systems (such as Windows) may use carriage return
individually, or carriage return plus line feed. Selecting Any New Line will recognize any of these
representations of a new line as the row delimiter.

Ignore Top Rows: Specifies the number of lines at the beginning of the file to ignore when reading
the source file during data ingest and lens builds. Enter the number of lines to ignore and click
Update. To use this with the Raw File Contains Header option, ensure that the line containing the
column names is visible and is the first remaining line.

Column Delimiter: Specifies the single character used to separate the columns (or fields) of a row in
your source data files. Comma and tab are the most commonly used column delimiters.

Escape Character: Specifies the single character used to escape delimiter characters that occur within
your data values. If your data values contain delimiter characters, those characters must be escaped;
otherwise the parser will assume the special character denotes a new row or column. For comma-separated
values (CSV) files, it is common practice to escape delimiters by enclosing the entire field
value within double quotes. If your source data uses this convention, then you should specify a Quote
Character instead of an Escape Character.

Quote Character: The quote character is used to enclose individual data values in CSV-formatted
files. The quote character is usually the double quote character ("). If a data value contains a delimiter,
then enclosing the value in double quotes treats every character within the quotes as data, including
the delimiters. If the data also contains the quote character, the quote character can also be used to
escape itself. For example, suppose you have a row with these three data values:
weekly special
wine, beer, and soda
"2 for 1" or 9.99 each
If the column delimiter is a comma, and the quote character is a double quote, a correctly formatted
row in the source data would look like this:
"weekly special","wine, beer, and soda","""2 for 1"" or 9.99 each"

Raw File Contains Header: A header is a special row containing column names at the beginning of
a data source file. If your source data files have a header row as the first line in the file, select this
check-box. This will treat the first line in each source file as a header row instead of as a row of data.

Upload Field Names: Allows you to upload a comma or tab delimited text file containing the field
information you want to set. When a dataset has a lot of fields to manage, it may be easier to update
several field names, descriptions, data types, and visibility settings all at once rather than editing each
field one-by-one. For more information, see Bulk Upload Field Header Information.
Specify a Single-Character Custom Delimiter
If your delimited data uses a special delimiter character that is not available in the default choices, you
can define a custom delimiter as either a single-character string or decimal-encoded ASCII value.
1. Go to the Parsing Controls panel. Make sure you are using the Delimited parser.
2. Choose Add Custom from the Column Delimiter or Row Delimiter menu.
3. Choose the encoding to use: String or ASCII Decimal Code.
4. Enter the delimiter value.
For String, you can enter any single character that you can type on your keyboard.
For ASCII Decimal Code, enter the decimal-encoded representation of the ASCII character. For
example, 29 is the ASCII code for group separator, 30 is for record separator, 65 is for the letter A.
5. Click OK to add the custom delimiter to the selected parser delimiter menu.
Specify a Multi-Character Column Delimiter
In some cases, your source data may use a multi-character column delimiter. The delimited parser does
not support multi-character delimiters, but you can work around it by using the Regex parser instead.
1. Go to the Parse Data step of the dataset workspace.
2. In the Parsing Controls panel, choose the Regex parser.
3. Enter a Regular Expression that matches the structure of your data lines.
For example, if your multi-character column delimiter were two colons (::) and your data had 6
fields, then you could use a regular expression such as:
(.*)::(.*)::(.*)::(.*)::(.*)::(.*)
4. Click Continue.
Parse Hive Tables
When creating a dataset from a Hive table, there is no need to define parsing controls in Platfora.
Platfora uses the Hive table definition to obtain metadata about the source data, such as which files to
process, the parsing logic for rows and columns, and the field names and data types contained in the
source data. You can only create a dataset based on Hive tables, not Hive views.
The following example shows how to define a table in Hive based on comma-delimited files that reside
in a directory of HDFS. The EXTERNAL keyword lets you provide a LOCATION so that Hive accesses
the files at their current location in HDFS. Without the EXTERNAL clause, Hive moves the files into its
own area of HDFS. When dropping an EXTERNAL table in Hive, data in the table is not deleted from the
file system.
CREATE EXTERNAL TABLE users(user_id INT, name STRING, gender STRING,
birthdate STRING)
COMMENT 'This table stores user data'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/users';
For Platfora to be able to access the Hive table defined above, you would need to make sure the system
user that the Platfora server runs as has read access to the /data/users directory of HDFS.
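For example, one way to grant that read access, assuming POSIX-style HDFS permissions and that world-readable data is acceptable in your environment:
$ hadoop fs -chmod -R o+rx /data/users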
See the Apache Hive Wiki for more information about using Hive.
Hive to Platfora Data Type Mapping
When you create a dataset based on a Hive table, Platfora maps the data types of the Hive columns to
one of the Platfora internal data types.
Platfora has a number of built-in data types that can be used to classify the fields in a dataset. Hive also
has a set of primitive and complex data types it supports for Hive table columns.
Platfora does not currently support the Hive BINARY primitive data type.
The Hive DECIMAL data type is mapped to DOUBLE by default. This may result in a loss of precision
due to round-off errors. You can choose to map DECIMAL columns to FIXED instead. This retains
precision for numbers that have four or fewer digits after the decimal point, and loses precision for more
precise numbers.
Hive complex data types (MAP, ARRAY, STRUCT and UNIONTYPE) are imported into Platfora as a single
JSON-formatted STRING. You can then use the Platfora expression language to define new computed
columns in the dataset that extract a particular key:value pair from the imported JSON structure.
Hive Data Type -> Platfora Data Type
TINYINT -> INTEGER
SMALLINT -> INTEGER
INT -> INTEGER
BIGINT -> LONG
DECIMAL -> DOUBLE
FLOAT -> DOUBLE
DOUBLE -> DOUBLE
STRING -> STRING
MAP -> STRING (JSON-formatted)
ARRAY -> STRING (JSON-formatted)
STRUCT -> STRING (JSON-formatted)
UNIONTYPE -> STRING (JSON-formatted)
TIMESTAMP (must be in Hive timestamp format yyyy-MM-dd HH:mm:ss[:SSS]) -> DATETIME
Enable Hive SerDes in Platfora
If you are using Hive as a data source, Platfora must be able to parse the underlying source data files
that a Hive table definition refers to. For Hive to support custom file formats, you implement
Serialization/Deserialization (SerDe) libraries in Hive that describe how to read (or parse) the data.
Any custom SerDe library (.jar file) that you use in your Hive table definitions must also be installed in
Platfora so that Platfora can read and process the data files the table refers to.
To install a Hive SerDe in Platfora, copy the SerDe .jar file to the following location on the Platfora
master server (create the extlib directory in the Platfora data directory if it doesn't exist):
$PLATFORA_DATA_DIR/extlib
Restart Platfora after installing all of your Hive SerDe jars:
platfora-services restart
How Platfora Uses Hive Partitions and Buckets
Hive source tables can be partitioned, bucketed, neither, or both. In Platfora, datasets defined from Hive
table sources take advantage of the partitioning defined in Hive. However, Platfora does not exploit the
clustering or sorting of bucketed tables at this time.
Defining a partitioning field on a Hive table organizes the data into separate files in the source file
system. The goal of partitioning is to improve query performance by keeping records together in the way
that they are accessed. When a Hive query uses a WHERE clause to filter data on a partitioning field, the
filter effectively describes which data files are relevant. If a Platfora lens includes a filter on any of the
partitioning columns defined in Hive, Platfora will only read the partitions that match the filter.
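For example, a Hive table partitioned by date might be declared as follows; the table and column names are illustrative. A Platfora lens filter on log_date would then prune any partitions that do not match the filter:
CREATE EXTERNAL TABLE weblogs (ip STRING, url STRING, referrer STRING)
PARTITIONED BY (log_date STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/weblogs';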
A bucketed table is created using the CLUSTERED BY (field) [SORTED BY (field)] INTO n
BUCKETS clause of the Hive table definition. Bucketing defines a hash partitioning of data based on
values in the table. A bucketed table may also be sorted within each bucket. When the table is bucketed,
each partition must be reorganized during the load phase to enforce the clustering and sorting. Platfora
does not exploit the clustering or sorting of bucketed tables at this time.
Platfora doesn't support Hive partitions with spaces in the partition name. Use the
underscore character (_) instead of white spaces.
Parse JSON Files
This section explains how to use the Platfora JSON parser to create datasets based on JSON files. JSON
is a plain-text file format comprised of two basic data structures: objects and arrays. The Platfora JSON
parser allows you to choose a top-level object to signify a record or row, and name:value pairs within an
object to signify columns or fields (including nested objects and arrays).
What is JSON?
JavaScript Object Notation (JSON) is a data-interchange format based on a subset of the JavaScript
Programming Language. JSON is a plain-text file format comprised of two basic data structures: objects
and arrays.
An object is an unordered collection of name:value pairs enclosed in braces {}. A name is just a string
identifier, also sometimes called a key. A value can be a string, a number, true, false, null, an
object, or an array. An array is an ordered, comma-separated collection of values enclosed in brackets
[]. Objects and arrays can be nested in a tree-like structure within a JSON record or document.
For example, here is a user record in JSON format:
{
  "userid" : "joro99",
  "firstname" : "Joelle",
  "lastname" : "Rose",
  "email" : "[email protected]",
  "phone" : [
    { "type" : "home", "number" : "415 123-4567" },
    { "type" : "mobile", "number" : "650 456-7890" },
    { "type" : "work", "number" : null }
  ]
}
And the same user record in XML format:
<user>
<userid>joro99</userid>
<firstname>Joelle</firstname>
<lastname>Rose</lastname>
<email>[email protected]</email>
<phone>
<number type="home">415 123-4567</number>
<number type="mobile">650 456-7890</number>
<number type="work"></number>
</phone>
</user>
Supported JSON File Formats
This section describes how the Platfora JSON parser expects JSON files to be formatted, and how to
specify what makes a record or row in a JSON file. There are two general JSON file formats supported
by Platfora: JSON Object per line and JSON Object.
The JSON Object per line format supports files containing top-level JSON objects, with one object
per line. For example, here is a JSON file where each top-level object represents a user record with one
user object per line.
{"name": "John Smith","email": "[email protected]", "phone":
[{"type":"mobile","number":"123-456-7890"}]}
{"name": "Sally Jones", "email: "[email protected]", "phone":
[{"type":"home","number":"456-789-1007"}]}
{"name": "Jeff Hamm","email": "[email protected]", "phone":
[{"type":"mobile","number":"789-123-3456"}]}
The JSON Object format supports files containing a top-level array of JSON objects:
[
{"name": "John Smith","email": "[email protected]"},
{"name": "Sally Jones", "email": "[email protected]"},
{"name": "Jeff Hamm","email": "[email protected]"}
]
or one large JSON object with the records to import contained within a sub-array:
{ "registration-date": "Sept 24, 2014",
"users": [
{"name": "John Smith","email": "[email protected]"},
{"name": "Sally Jones", "email: "[email protected]"},
{"name": "Jeff Hamm","email": "[email protected]"} ]
}
In some cases, the structure of your JSON file might be more complicated. You must always specify
one level from the JSON object tree to use as the basis for rows. You can, however, still extract columns
from a top-level object as well.
As an example, suppose you had the following JSON file containing movie review records. You want
a row to be created for each reviews record, but still want to retain the value of movie_title and year for
each row:
[{"movie_title":"Friday the 13th",
"year":1980,
"reviews":[{"user":"Pam","stars":3,"comment":"a bit predictable"},
{"user":"Alice","stars":4,"comment":"classic slasher
flick"}]},
{"movie_title":"The Exorcist",
"year":1984,
"reviews":[{"user":"Jo","stars":5,"comment":"best horror movie ever"},
{"user":"Bryn","stars":4,"comment":"I will never eat pea
soup again"},
{"user":"Sam","stars":4,"comment":"loved it"}]},
{"movie_title":"Rosemary's Baby",
"year":1969,
"reviews":[{"user":"Fred","stars":4,"comment":"Mia Farrow is great"},
{"user":"Lou","stars":5,"comment":"the one that started it
all"}]}
]
Using the JSON Object parser, you would choose the reviews array as the record filter. You could
then add the movie_title and year columns by their path as follows:
$.movie_title
$.year
The $. notation starts the path from the base of the object tree hierarchy.
Use the JSON Parser
The Platfora JSON parser takes a sample of the source data to determine the format of your JSON files,
and then shows the object hierarchy so you can choose the rows and columns to include in the dataset.
1. When you select data that is in valid JSON format, Platfora recognizes the file format and chooses a
JSON parser.
2. The basis of a record or row depends on the format of your JSON files. You can either choose to use
each line in the file as a record (JSON Object per line), or choose a sub-array in the file to use as
the basis of a record (JSON Object).
3. If your object hierarchy is nested, you can add a Filter to a specific object in the hierarchy. This
allows you to use objects nested within a sub-array as the basis for rows.
4. Use the Record object tree to select the columns to include in the dataset. You can browse up to 20
JSON records when choosing columns.
5. You can add additional columns based on objects above what was used as the row Filter. Use the
Data Field Path to add a column by its path in the top-level object hierarchy. The $. notation is
used to specify a path from the root of the file.
6. Sometimes you can't delete columns that are added by mistake. For example, the parser may
incorrectly guess the row filter, or you might make a mistake adding columns using Data Field
Path. If this happens, you can always hide these columns on the Manage Fields step.
For the JSON Object per line format, each line in the file represents a row.
For the JSON Object format, the top-level object is used by default to signify rows. If the objects
you want to use as rows are contained within a sub-array, you can specify a Filter to the array name
containing the objects you want to use.
For example, in this JSON structure, the Filter value would be users (use the objects in the users array
as the basis for rows):
{ "registration_date": "September 24, 2014",
"users": [
{"name": "John Smith","email": "[email protected]"},
{"name": "Sally Jones", "email": "[email protected]"},
{"name": "Jeff Hamm","email": "[email protected]"} ]
}
Or in the example below, you could use the filter users.address to select the contents of the address
array as the basis for rows.
{ "registration_date": "September 24, 2014",
"users": [
{"name": "John Smith","email": "[email protected]",
"address": [ "street":"111 Main St.", "city":"Madison",
"state":"IL", "zip":"35460" ] },
{"name": "Sally Jones", "email": "[email protected]",
"address": [ "street":"32 Elm St.", "city":"Dallas",
"state":"TX", "zip":"23456" ] },
{"name": "Jeff Hamm","email": "[email protected]"},
"address": [ "street":"101 2nd St.", "city":"San Mateo",
"state":"CA", "zip":"94403" ] }
]
}
Once the parser knows the root object to use as rows, the JSON object tree is displayed in the Parsing
Controls panel. You can add fields contained within nested objects and arrays by selecting the field
name in the JSON tree. The field is then added as a dataset column. You can browse through a sample of
20 records to check for fields to add.
If you unselect a field containing a nested object or array (remove it from the dataset), and later decide
to select it again (add it back to the dataset), make sure that Format as JSON string is selected. This
will format the contents of the field as a JSON string rather than as a regular string. This is important if
you plan to do additional processing on the values using the JSON string functions.
In some cases, you may want to extract columns from an object one or more levels above the record
filter in the JSON structure.
For example, in this JSON structure above, the Filter value would be users (use the objects in the users
array as the basis for rows), but you may also want to include the registration_date object as a column.
To capture upstream objects as columns, you can add the field by its path in the object tree.
• The $. notation starts the path from the base of the object tree hierarchy.
• To access fields within a nested object, specify a dot-separated path of field names (for example
top_level_field_name.nested_field_name).
• To extract a value from an array, specify the dot-separated path of field names and the array
position starting at 0 for the first value in an array, 1 for the second value, and so on (for example
field_name.0).
• If the name identifier contains dot or period characters within the name itself, escape the name by
enclosing it in brackets (for example, [field.name.with.dot].[another.dot.field.name]).
• If the field name is null (empty), use brackets with nothing in between as the identifier, for example
[].
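For example, with users as the row filter in the registration record shown above, the path
$.registration_date adds the top-level registration date to each row. And if the record also carried a
top-level array such as "admins": ["alice","bob"] (a hypothetical field, shown here only to
illustrate array indexing), the path $.admins.0 would add its first value, alice, and $.admins.1
would add bob.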
Parse XML Files
This section explains how to use the Platfora XML parser to create datasets based on XML files.
XML is a plain-text file format that encodes data using different components, such as elements and
attributes, in a document hierarchy. The Platfora XML parser allows you to choose a top-level element
to signify the starting point of a record or row, and attributes or elements to signify columns or fields.
What is XML?
Extensible Markup Language (XML) is a markup language for encoding documents. XML is a textual
file format that can contain different components, including elements and attributes, in a document
hierarchy.
A valid XML document starts with a declaration that states the XML version and document encoding.
For example:
<?xml version= "1.0" encoding= "UTF-8"?>
An element is a logical component in the document. Elements always begin with an opening tag and
end with a matching closing tag. Element content can contain text, markup, attributes, or other nested
elements, called child elements. For example, here is a parent users element that contains individual
child elements for each user:
<users>
<user name="John Smith" email="[email protected]"/>
<user name="Sally Jones" email="[email protected]"/>
<user name="Jeff Hamm" email="[email protected]"/>
</users>
Elements can be empty. For example, this image element has no content and is written as a single
self-closing tag:
<image href="mypicture.jpg"/>
Elements can also have attributes in their opening tag. Attributes are name=value pairs that contain
useful data about the element. For example, here is how you might list attributes of an element called
address:
<address street="45 Pine St." city="Atlanta" state="GA" zip="53291"/>
Elements can also have both attributes and content. For example, this address element has the actual
address components as attributes, and the address type as its content:
<address street="45 Pine St." city="Atlanta" state="GA" zip="53291">home
address</address>
For details on the XML standard, go to http://www.w3.org/XML/.
Supported XML File Formats
This section describes how the Platfora XML parser expects XML files to be formatted, and how to
specify what makes a record or row in an XML file. There are two general XML file formats supported
by Platfora: XML Element per line and XML Document.
The XML Element per line format supports files containing one XML record per line, each record
having the same top-level element and structure. For example, here is an XML file where each top-level
element represents a user record with one record per line.
<user name="John Smith" email="[email protected]"><phone type="mobile"
number="123-456-7890"/></user>
<user name="Sally Jones" email="[email protected]"><phone type="home"
number="456-789-1007"/></user>
<user name="Jeff Hamm" email="[email protected]"><phone type="mobile"
number="789-123-3456"/></user>
The XML Document format supports valid XML document files (one document per file).
In the following example, the top-level XML element contains nested XML element records:
<?xml version= "1.0" encoding= "UTF-8"?>
<registration date="Aug 21, 2012">
<users>
<user name="John Smith" email="[email protected]"/>
<user name="Sally Jones" email="[email protected]"/>
<user name="Jeff Hamm" email="[email protected]"/>
</users>
</registration>
In the following example, the top-level XML element contains a sub-tree of nested XML element
records:
<?xml version= "1.0" encoding= "UTF-8"?>
<registration date="Sept 24, 2014">
<region name="us-east">
<user name="Georgia" age="42" gender="F">
<address street="111 Main St." city="Madison" state="IL"
zip="35460"/>
<statusupdate type="registered"/>
</user>
<user name="Bobby" age="30" gender="M">
<address street="45 Pine St." city="Atlanta" state="GA"
zip="53291"/>
<statusupdate type="unsubscribed"/>
</user>
</region>
</registration>
Use the XML Parser
The Platfora XML parser takes a sample of the source data to determine the format of your XML files,
and then shows the element and attribute hierarchy so you can choose the rows and columns to include
in the dataset.
1. When you select data that is in valid XML format, Platfora recognizes the file format and chooses an
XML parser.
2. The basis of a record or row depends on the format of your XML files. You can either choose to use
each line in the file as a record (XML Element per line), or choose a child element in the XML
document to use as the basis of a record (XML Document).
3. (XML Document formats only) You can add a Filter that determines which rows to include in the
dataset. For more details, see Parsing Rows from XML Documents.
4. Use the Record element tree to select the elements and attributes to include in the dataset as
columns. You can browse up to 20 sample records when choosing columns. For more details, see
Extracting Columns from XML Using the Element Tree.
5. The Data Field Path field allows you to add a column represented by its path in the element
hierarchy. You can use any XPath 1.0 expression that is relative to the result of the row Filter. For
more details, see Extracting Columns from XML Using an XPath Expression.
6. Sometimes you can't delete columns that are added by mistake. For example, the parser may
incorrectly guess the row filter, or you might make a mistake adding columns using Data Field
Path. If this happens, you can always hide these columns on the Manage Fields step.
For the XML Element per line format, each line in the file represents a row.
For the XML Document format, by default, the top-level element below the root of the XML
document is used as the basis of rows. If you want to use different elements as the basis for rows, you
can enter a Filter to specify the element name you want to use as the basis of rows.
The Platfora XML parser supports an XPath-like notation for specifying which XML element to use
as rows. As an example of how to use the Platfora XML parser filter notation, suppose you had the
following XML document containing movie review records:
<?xml version= "1.0" encoding= "UTF-8"?>
<records>
<movie title="Friday the 13th" year="1980">
<reviews>
<review user="Pam" stars="3">a bit predictable</review>
<review user="Alice" stars="4">classic slasher flick</review>
</reviews>
</movie>
<movie title="The Exorcist" year="1984">
<reviews>
<review user="Jo" stars="5">best horror movie ever</review>
<review user="Bryn" stars="4">I will never eat pea soup again</
review>
<review user="Sam" stars="4">loved it</review>
</reviews>
</movie>
</records>
The document hierarchy is assumed to start one level below the root element. The root element would
be the records element in this example. From this point in the document, you can use the following
XPath-like notation to specify row filters:
Row Filter Notation: //
Description: Specifies all elements with the given name located within the previous element, no
matter where they exist within the previous element. When used at the beginning of the row filter,
this specifies all elements in the document with the given name.
Example: Use any review element as the basis for rows:
//review

Row Filter Notation: /
Description: Specifies an element with the given name one level down in the document hierarchy
within the element listed before it. When used as the first character in the row filter, it specifies one
level below the root element of the document.
Example: Use the review element as the basis for rows:
/movie/reviews/review

Row Filter Notation: $
Description: Specifies an element in the row filter as an extraction point. An extraction point is an
element in an XML row filter that allows you to define a variable that can be used to define a column
definition expression relative to that element in the filter. The last element in a row filter is always
considered an extraction point, so it is unnecessary to use the $ notation for the last element. You can
specify zero or more extraction points in a row filter.
Extraction points give you more flexibility when extracting columns. Use an extraction point element
at the beginning of a column definition to signify an expression relative to the extraction point
element. You might want to use an extraction point to extract a column or attribute from a parent
element one or more levels above the last element defined in the row filter. For example, for the row
filter /a/$b/c/d you could write the following column definition: $b/xpath_expression
Use caution when adding an extraction point to the row filter. Platfora buffers all XML source data
in an extraction point element during data ingest and when it builds a lens in order to extract column
data. Depending on the source data, this may impact performance during data ingest and may increase
lens build times.
Example: Use the review element as the basis for rows while allowing the ability to extract reviews
data for that row as column data:
/movie/$reviews/review
Note that for the XML structure above, the following row filter expressions are equivalent:
• movie
• /movie
• $/records/movie
• $//movie
For example, in this XML structure, the Filter value would be $users (use the collection of child
elements contained in the users element as the basis for rows):
<?xml version="1.0" encoding="UTF-8"?>
<registration date="Sept 24, 2014">
<users>
<user name="John Smith" email="[email protected]"/>
<user name="Sally Jones" email="[email protected]"/>
<user name="Jeff Hamm" email="[email protected]"/>
</users>
</registration>
Once the parser knows the element to use as rows, the XML element tree is displayed in the Record
panel. You can add fields based on XML attributes or nested XML elements by selecting the element or
attribute name in the XML element tree. The field is then added as a dataset column. You can browse
through a sample of 20 records in a single file to check for fields to add.
If you unselect a field containing nested XML elements (remove it from the dataset), and later decide
to select it again (add it back to the dataset), make sure that Format as XML string is selected. This
will format the contents of the field as XML rather than a regular string. This is important if you plan to
do additional processing on the values using the XPATH string functions. For more details, see Parsing
of Nested Elements and Content.
Another way to add columns is to enter an XPath expression in Data Field Path that represents a path
in the element hierarchy. You might want to do this to extract columns from a parent element one or
more levels above the row filter in the XML document hierarchy.
Note the following rules and guidelines when using an XPath expression to extract columns:
• The Platfora XML parser only supports XPath 1.0.
• The expression must be relative to the last element or any extraction point element in the row Filter.
• Platfora recommends starting the expression with a variable using the $element/ syntax. The
element must be the last element or an extraction point element in the row Filter.
• XML namespaces are not supported. The XML parser strips all XML namespaces from the XML
file.
• Variables are only allowed at the beginning of the expression.
For example, assume you have the following row filter: /movie/$reviews/review
You could create a column definition expression for any element or attribute in the document
hierarchy that comes after the review element. Additionally, because the row filter includes an
extraction point for $reviews, you could also create a column definition relative to that node:
$reviews/xpath_expression.
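For instance, against the movie review document shown earlier, these column definition expressions
are illustrative sketches using standard XPath 1.0 relative syntax:
@user
The user attribute of each review element, relative to the last element in the row filter.
$reviews/../@title
The title attribute of the parent movie element, written relative to the $reviews extraction point.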
For more information about XPath, see http://www.w3.org/TR/xpath/.
If the element you are parsing contains nested XML elements and content, and you want to preserve
the XML structure and hierarchy, select Format as XML string. This will allow you to do further
processing on this data with the XPATH_STRING, XPATH_STRINGS and XPATH_XML functions.
If the column contains nested elements and Format as XML string is not enabled, Platfora returns
NULL.
Repeated elements are wrapped inside a <list> ... </list> parent element to maintain valid
XML structure.
Parse Avro Files
The Platfora Avro parser supports Avro container files where the top-level object is an Avro record
data type. The file must have a JSON-formatted schema declared at the beginning of the file, and the
serialized data must be in the Avro binary-encoded format.
1. On the Parse Data step of the dataset workspace, select Avro as the File Type in the Parsing
Controls panel.
2. The Avro parser uses the JSON schema of the source file to extract the name:value pairs from each
record object in the Avro file.
What is Avro?
Apache Avro is a remote procedure call (RPC) and data serialization framework. Its primary use is to
provide a data serialization format for persistent data stored in Hadoop.
Avro uses JSON for defining schema, and JSON or binary format for data encoding. When Avro data
is stored in a persistent file (called a container file), its schema is stored along with it. This allows any
program to be able to read the serialized data in the file. For more information about the Avro schema
and encoding formats, see the Apache Avro Specification documentation.
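For reference, a minimal Avro record schema (JSON-formatted, with illustrative field names) might
look like this:
{ "type": "record",
  "name": "User",
  "fields": [
    { "name": "userid", "type": "string" },
    { "name": "age", "type": "int" },
    { "name": "phone", "type": { "type": "array", "items": "string" } }
  ]
}
In a container file, this schema is stored at the beginning of the file, followed by the binary-encoded
User records.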
Avro to Platfora Data Type Mapping
Avro has a set of primitive and complex data types it supports. These are mapped to Platfora's internal
data types.
Complex data types are imported into Platfora as a single JSON-formatted STRING. You can then use
the JSON String Functions in the Platfora expression language to define new computed columns in the
dataset that extract a particular name:value pair from the imported JSON structure.
Avro Data Type      Platfora Data Type
BOOLEAN             INTEGER
INT                 INTEGER
LONG                LONG
FLOAT               DOUBLE
DOUBLE              DOUBLE
STRING              STRING
BYTES               STRING (Hex-encoded)
RECORD              STRING (JSON-formatted)
ENUM                STRING (JSON-formatted)
ARRAY               STRING (JSON-formatted)
MAP                 STRING (JSON-formatted)
UNION               STRING (JSON-formatted)
FIXED               FIXED
Parse Web Access Logs
A web access log contains records about incoming requests made to a web server. Platfora has a
built-in Web Access Log parser that automatically recognizes web access logs that adhere to the
NCSA common or combined log formats.
1. On the Parse Data step of the dataset workspace, select Web Access Log as the File Type in
the Parsing Controls panel.
2. The Web Access Log parser extracts fields according to the supported NCSA log formats.
Supported Web Access Log Formats
Platfora supports web access logs that comply with the NCSA common or combined log formats. This is
the log format used by many popular web servers (such as Apache HTTP Server).
An example log line for the common format looks something like this:
123.1.1.456 - - [16/Aug/2012:15:01:52 -0700] "GET /home/index.html HTTP/1.1" 200 1043
The NCSA common log format contains the following fields for each HTTP access record:
• Host - The IP address or hostname of the HTTP client that made the request.
• Logname - Identifies the client making the HTTP request. If no value is present, a dash (-) is
substituted.
• User - The user name used by the client for authentication. If no value is present, a dash (-) is
substituted.
• Time - The timestamp of the request in the format of dd/MMM/yyyy:hh:mm:ss +-hhmm.
• Request - The HTTP request. The request field contains three pieces of information: the requested
resource (/home/index.html), the HTTP method (GET) and the HTTP protocol version
(HTTP/1.1).
• Status - The HTTP status code indicating the success or failure of the request.
• Response Size - The number of bytes of data transferred as part of the HTTP request, not including
the HTTP header.
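Mapping the example line above to these fields: Host is 123.1.1.456, Logname and User are both a
dash (no value), Time is 16/Aug/2012:15:01:52 -0700, Request is "GET /home/index.html
HTTP/1.1", Status is 200, and Response Size is 1043.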
The NCSA combined log format contains the same fields as the common log format with the addition of
the following optional fields:
• Referrer - The URL that linked the requestor to your site. For example,
http://www.platfora.com.
• User-Agent - The web browser and platform used by the requestor. For example, Mozilla/4.05
[en] (WinNT; I).
• Cookie - Cookies are pieces of information that the HTTP server can send back to a client along
the with the requested resource. A client browser may store this information and send it back to the
HTTP server upon making additional resource requests. The HTTP server can establish multiple
cookies per HTTP request. Cookie values take the form KEY=VALUE. Multiple cookie key/value
pairs are delineated by semicolons (;). For example, USERID=jsmith;IMPID=01234.
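Putting the optional fields together, a combined-format log line might look like this (an illustrative
line assembled from the field examples above):
123.1.1.456 - - [16/Aug/2012:15:01:52 -0700] "GET /home/index.html HTTP/1.1" 200 1043 "http://www.platfora.com" "Mozilla/4.05 [en] (WinNT; I)" "USERID=jsmith;IMPID=01234"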
For web access logs that do not conform to the default expected ordering of fields and data types,
Platfora will make a best guess at parsing the rows and columns found in the web log files, and use
generic column headers (for example column1, column2, etc.). You can then rename the columns to
match your web log format.
Parse Other File Types
For other file types that cannot be parsed using the built-in parsing controls, Platfora provides two
generic parsers: Regex and Line. As long as your source data has one record per line, you can use one
of these generic parsers to extract columns from semi-structured source data.
Parse Raw Lines with a Regular Expression
The Regex parser allows you to search lines in the source data and extract columns using a regular
expression. It evaluates each line in the source data against a regular expression to determine if there is a
match, and returns each capturing group of the regular expression as a column. Regular expressions
are a way to describe a set of strings based on characteristics they share in common.
1. On the Parse Data step of the dataset workspace, select Regex as the File Type in the Parsing
Controls panel.
2. Enter a regular expression that matches the entire line with parenthesis around each column matching
pattern you want to return.
3. Confirm the regular expression is correct by comparing the raw data to the wrangled data.
Platfora uses capturing groups to determine what parts of the regular expression to return as columns.
The Regex line parser applies the user-supplied regular expression against each line in the source file,
and returns each capturing group in the regular expression as a column value.
For example, suppose you had user records in a file, and the lines were formatted like this (no common
delimiter is used between fields):
Name: John Smith Address: 123 Main St. Age: 25 Comment: Active
Name: Sally R. Jones Address: 2 E. El Camino Real Age: 32
Name: Rod Rogers Address: 55 Elm Street Comment: Suspended
You could use the following regular expression to extract the Full Name, Last Name Only, Address, Age,
and Comment column values:
Name: (.*\s(\p{Alpha}+)) Address:\s+(.*) Age:\s+([0-9]+)(?:\s+Comment:\s+(.*))?
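Applied to the first example line, the capturing groups return: Full Name = John Smith, Last Name
Only = Smith, Address = 123 Main St., Age = 25, and Comment = Active. On the second line, the
optional Comment group finds no match, so the Comment column is empty for that row.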
Parse Raw Lines with Platfora Expressions
The Line parser simply returns each line in the source file as one column value, essentially not parsing
the source data at all. This allows you to bypass the parsing step and instead define a series of computed
fields to extract the desired column values out of each line.
1. On the Parse Data step of the dataset workspace, select Line as the File Type in the
Parsing Controls panel.
This creates a single column where each row contains an entire record.
2. Go to the Manage Fields step.
3. Define computed fields that extract columns from the raw line.
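For example (a sketch using the user records from the Regex example above; the column name line
and the exact pattern are illustrative), a computed field could pull the name out of each raw line with
an expression such as:
REGEX(line, "Name:\s+(.*?) Address:.*")
Here line is the single column produced by the Line parser, and the capturing group determines the
portion of the line returned as the column value.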
Prepare Base Dataset Fields
When you first add a dataset, it only has its Base fields. These are the fields parsed directly from the raw
source data. This section describes the tasks involved in making sure the base data is correct and ready
for Platfora's analyst users. In some cases, the data values contained in the base fields may be ready for
consumption. Most likely, however, the raw data values will need some additional processing. It is best
practice to confirm and edit all of the base fields in a dataset before you begin defining computed field
expressions to do any additional processing on the dataset.
Confirm Data Types
The dataset parser will guess the data type of a field based on the sampled source data, but you may need
to change the data type depending on the additional processing you plan to do.
The expression language functions require input values to be of a certain data type. It is best practice to
confirm and change the data types of your base fields before defining computed fields. Changing them
later may introduce errors to your computed field expressions.
Note that you can only change the data type of a Base field. Computed field data types are determined
by the return type of the computed expression.
1. On the Manage Fields step of the dataset workspace, verify the data types that Platfora has
assigned to the base fields.
2. Select a column and change the data type in the column header or the Field Info panel.
Note that you cannot accurately convert the data type of a field to DATETIME from the drop-down
data type menus. See Cast DATETIME Data Types.
About Platfora Data Types
Each dataset field, whether a base or a computed field, has a data type attribute. The data type defines
what kind of values the field can hold. Platfora has a number of built-in data types you can assign to
dataset fields.
The dataset parser attempts to guess a field's data type by sampling the data. A base field's data type
restricts the expressions you can apply to that field. For example, you can only calculate a sum with
numeric fields. For computed fields, the expression's result determines the field's data type.
You may want to change a base field's data type to accommodate the computed field processing you
plan to do. For example, many value manipulation functions require input values to be strings.
Platfora supports the following data types:
Table 1: Platfora Data Types
Type: STRING
Description: variable-length non-Unicode string data
Range of Values: maximum string length of 2,147,483,647

Type: DATETIME
Description: date combined with a time of day with fractional seconds, based on a 24-hour clock
Range of Values: date range of January 1, 1753, through December 31, 9999; time range of 00:00:00
through 23:59:59.997

Type: FIXED
Description: fixed decimal values with accuracy to a ten-thousandth of a numeric unit
Range of Values: -922,337,203,685,477.5808 through +922,337,203,685,477.5807 (2^63 - 1), with
accuracy to a ten-thousandth of a numeric unit

Type: INTEGER
Description: 32-bit integer (whole number)
Range of Values: -2,147,483,648 to 2,147,483,647

Type: LONG
Description: 64-bit long integer (whole number)
Range of Values: -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807

Type: DOUBLE
Description: double-precision 64-bit floating point number
Range of Values: 4.94065645841246544e-324d to 1.79769313486231570e+308d (positive or negative)
Change Field Names
The field name defined in the dataset is what users see when they browse the Platfora data catalog. In
some cases, the field names imported from the source data may be fine. In other cases, you may want to
change the field names to something more understandable for your users.
It is important to decide on base field names before you begin defining computed
fields and references (joins), as changing a field name later on will break
computed field expressions and references that rely on that field name.
1. On the Manage Fields step of the dataset workspace, select the column you want to rename.
2. Enter the new name in the column header or the Field Info panel.
If a name change breaks other computed field expressions or reference links in the dataset, the
error panel will show all of the affected computed fields and references. You can either change the
dependent field back to the original name, or edit the affected fields to use the new name.
Add Field Descriptions
Field descriptions are displayed in the data catalog view of a dataset or lens, and can help users decide if
a field is relevant for their needs. Data administrators should add helpful field descriptions that explain
the meaning and data value characteristics of a field.
1. On the Manage Fields step of the dataset workspace, select the column you want to update.
2. In the Field Info panel, click inside the Description text box and enter a description.
Hide Columns from Data Catalog View
Hiding a column or field in a dataset definition removes it from the data catalog view of the dataset.
Users cannot see hidden columns when browsing datasets in the data catalog, or select them when they
build a lens.
1. On the Manage Fields step of the dataset workspace, select the column you want to hide.
2. Check Hide Column in the column header or in the Field Info panel.
Why Hide Dataset Columns?
A data administrator can control what fields of a dataset are visible to Platfora users. Hidden fields are
not visible in the data catalog view of the dataset and cannot be selected for a lens.
You might choose to hide a field for the following reasons:
• Protect Sensitive Data. In some cases, you may want to hide fields to protect sensitive information.
In Platfora, you can hide detail fields, but still allow access to summary information. For example,
in a dataset containing employee salary information, you may want to hide sensitive identifying
information such as names, job titles, and individual salaries, but still allow analysts to view average
salary by department or job level. In database applications, this is often referred to as column-level
security or column access control.
• Hide Unpopulated or Sparse Data Columns. You may have columns in your raw data that did
not have any data collected, or the data collected is too sparse to be valid for analysis. For example,
a web application may have a placeholder column for comments, but it was never implemented on
the website so the comments column is empty. Hiding the column prevents analysts from choosing a
field with mostly null values when they go to build a lens.
• Control Lens Size. High cardinality dimension fields can significantly increase the size of a lens.
Hiding such fields prevents analysts from creating large lenses unintentionally. For example, you
may have a User ID field with millions of unique values. If you do not want analysts to be able to
create a lens at that level of granularity, you can hide User ID, but still keep other dimension fields
about users available, such as age or gender.
• Use Computed Values Instead of Base Values. You may add a computed field to transform the
values of the raw data. You want your users to choose the transformed values, not the raw values. For
example, you may have a return reason code column where the reason codes are numbers (1,2,3, and
so on). You want to transform the numbers to the actual reason information (Did not Fit, Changed
Mind, Poor Quality, and so on) so the data is more usable during analysis.
• Hide Computed Fields that do Interim Processing. As you work on your dataset to cleanse and
transform the data, you may need to add interim computed fields to achieve a final result. These are
fields that are necessary to do a processing step, but are not intended for final consumption. These
working fields can be hidden so they do not clutter the data catalog view of the dataset.
Default Values and NULL Processing
If a field or column value in a dataset is empty, it is considered a NULL value. During lens processing,
Platfora replaces all NULL values with a default value instead. Platfora lenses and vizboards have no
concept of NULL values. NULLs are always substituted with the default field values specified in the
dataset definition.
How Platfora Processes NULL Values
A value can be NULL for the following reasons:
• The raw data is missing values for a particular field.
• A computed field expression returns an empty or invalid result.
• A record in the focus (or fact) dataset does not have a corresponding record in a referenced (or
dimension) dataset. During lens processing, any rows that do not join will use the default values
in place of the unjoined dimension fields. For lenses that include fields from referenced datasets,
Platfora performs an outer join between the focus dataset and any referenced datasets included in the
lens. This means that rows in the fact dataset are compared to related rows in the referenced datasets.
Any row that does not have a corresponding row in the referenced dataset is considered an unjoined
foreign key. The dimension columns for unjoined foreign keys are treated as NULL and replaced
with the default values.
A Platfora aggregate lens is analogous to a summary or roll-up table in a data warehouse. During lens
processing, the measure values are pre-aggregated and grouped by each dimension field value included
in the lens. For dimension fields, NULL values are replaced with the default values before the measure
aggregations are calculated.
For measure fields, 0 is used in place of NULL to compute the measure value. Average (AVG)
calculations exclude NULL values from the row count.
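For example, given three rows whose measure field values are 10, NULL, and 20: SUM substitutes 0 for
the NULL and returns 30, while AVG excludes the NULL row from the row count and returns 15 (30
divided by the 2 non-NULL rows).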
Default Values by Data Type
If you do not specify your own default values in the dataset, the following default values are used in
place of any NULL value. The default value depends on the data type of the field or column.
Data Type: LONG, INTEGER, DOUBLE, FIXED
Default Value: 0

Data Type: STRING
Default Value: NULL (as a string)

Data Type: DATETIME
Default Value: January 1, 1970 12:00:00:000 GMT

Data Type: LOCATION (latitude,longitude coordinate position)
Default Value: 0,0
Change the Default Value for a Column
You can specify different default values on a per-column basis. These values will replace any NULL
values in that column during lens build processing. Analysts will see the default values instead of NULL
(empty) values when they are working with the data in a vizboard.
To change the default value for a column:
1. Go to the Manage Fields step of the dataset workspace and select a column.
2. Click the Default Value text box in the Field Info panel to edit it.
Bulk Upload Field Header Information
When a dataset has a lot of fields to manage, it may be easier to update several field names, data types,
descriptions, and visibility settings all at once rather than editing each field one-by-one in the Platfora
application. To do this, you can upload a comma or tab delimited text file containing the field header
information you want to set.
1. Create an update file on your local machine containing the field information you want to update.
This file must meet the following specifications:
• It must be a comma-delimited or tab-delimited text file.
• It can contain up to four lines (separated by a new line). Any additional lines in the file will be
ignored.
• Field names are specified on the first line of the file.
• Field data types are specified on the second line of the file ( DOUBLE, FIXED, INTEGER, LONG,
STRING, or DATETIME).
• Field descriptions are specified on the third line of the file.
• Field visibility settings are specified on the fourth line of the file (Hidden or Not Hidden).
• On a line, values must be specified in the column order of the dataset.
2. On the Parse Data step of the dataset workspace, click Upload Field Names. Find and open
your update file.
3. After uploading the file, advance to the Manage Fields step to confirm the results.
Example Update Files
Here is an update file that updates the field names, data types, descriptions, and visibility settings for
the first four columns of a dataset.
UserID,Name,Email,Address
INTEGER,STRING,STRING,STRING
The unique user ID,The user's name,The email linked to this user's account,The user's mailing address
Hidden,Not Hidden,Not Hidden,Not Hidden
The double-quote character can be used to quote field names or descriptions. This is useful if a field
name or description contains the delimiter character (comma or tab). For example:
UserID,Name,Email

,"The user's name, both first and last.",,The user's mailing address
HIDDEN
Notice how lines can be left blank, and values can be skipped over by leaving them out or by
specifying nothing between the two delimiters. Missing values will not be updated in the Platfora
dataset definition (the dataset will use the previously set values).
Transform Data with Computed Fields
The way you transform data in Platfora is by adding computed fields to your dataset definition. A dataset
computed field contains an expression that describes a single data processing step.
Sometimes several steps are needed to achieve the result that you want. The result of a dataset computed
field can be used in the expressions of other dataset computed fields, allowing you to define a chain of
processing steps.
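For example (a sketch using hypothetical field names), a first computed field might combine two base
fields, and a second computed field might then operate on the result of the first:
full_name: CONCAT(firstname, lastname)
name_initial: SUBSTRING(full_name, 0, 1)
The second expression takes the first computed field as its input, forming a two-step processing chain.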
FAQs - Dataset Computed Fields
This section answers the most frequently asked questions (FAQs) about creating and editing dataset
computed fields.
What kinds of things can I do with dataset computed fields?
Computed fields are useful for deriving meaningful values from base fields (such as calculating
someone's age based on their birthday), doing data cleansing and pre-processing (such as grouping
similar values together or substituting one value for another), or for computing new data values based on
a number of input variables (such as calculating a profit margin value based on revenue and costs).
Platfora has an extensive library of built-in functions that you can use to define data processing tasks.
These functions are organized by the type of data they operate on, or the kind of processing they do. See
the Expression Quick Reference for a list of what's available.
Can I do further processing on the results of a computed field?
Yes. A computed field is treated just like any other field in the dataset. You can refer to it in other
computed field expressions or aggregate the results to create a measure.
To analyst users, computed fields are just like any other dataset field. Users can include them in a lens
and analyze their results in a vizboard.
One exception is a computed field that uses an aggregate function in its expression (measure). You
cannot combine row functions and aggregate functions in the same expression. A row function cannot
take a measure field as input. Per-row processing on aggregated data is not allowed.
How do I edit a computed field expression in a dataset?
Go to the Manage Fields step of the dataset workspace, and find the computed column you want
to edit. With the column selected, click the expression in the Field Info panel. This will open the
expression builder.
How do I remove a computed field from a dataset?
Go to the Manage Fields step of the dataset workspace, find the computed column you want to edit,
and click the X in the field header. Note that this might cause errors if other computed fields refer to the
deleted field.
If you need the computed field for an interim processing step, but want to remove it from the selection of
fields that the users see, you can hide it. Hiding a field keeps it in the dataset definition and allows it to
be referred to by other computed field expressions. However, users cannot see hidden fields in the data
catalog, or select them in a lens or vizboard. See Hide Columns from Data Catalog View.
Where can I find examples of useful computed field expressions?
Platfora's expression reference documentation has lots of examples of useful expressions. See
Expression Language Reference.
Why isn't my computed field showing any sample values?
Certain types of computed field expressions can only be computed during lens build processing. Because
of the complicated processing involved, the dataset workspace can't show sample results for:
• Measures (computed fields containing aggregate functions)
• Event Series Processing (computed fields containing PARTITION expressions)
• Computed field expressions that reference fields in other datasets
Why can't I change the data type of a computed field?
A computed field's data type is set by the output type of its expression. For example, a CONCAT function
always outputs a STRING. If you want the output data type to be something else, you can nest the
expression inside the appropriate data type conversion function. For example:
TO_INT(CONCAT(field1, field2))
Can analyst users add computed fields if they want?
Analyst users can't add computed fields to a dataset. You must be a data administrator and have the
appropriate dataset permissions to edit a dataset.
In a vizboard, analyst users can create computed fields to manipulate the data they already have in their
lens. With some exceptions, analyst users can add a vizboard computed field that can do almost anything
that a dataset computed field can do.
However, event series processing (ESP) computed fields and most aggregate functions (measure
expressions) cannot be used to create vizboard computed fields.
Add a Dataset Computed Field
You can add a new dataset computed field on the Manage Fields step of the dataset workspace. A
computed field has a name, a description, and an expression. The computed field expression describes
some processing task you want to perform on other fields in the dataset.
Computed fields contain expressions that can take other fields as input. These fields can be base fields or
they can be other computed fields. When you save a computed field, it appears as a new column in the
dataset definition.
1. Go to the Manage Fields step of the dataset workspace.
2. Choose Computed Field from the dataset workspace Add menu.
This opens the Add Field dialog containing the expression builder controls.
3. Enter a name for your field and a description.
The description is optional but very useful for others that will use the field later.
4. Choose a function from the Functions list.
Use the drop-down to restrict the type of functions you see. Functions are organized by the type of
data they operate on, or the type of processing they do.
5. Double-click a function in the Functions list to add it to the Expression area.
The Expression panel updates with the function's template. Also, the Fields list refreshes with
the fields that can be used as input to the selected function. For example, the CONCAT function only
accepts STRING type fields.
6. Double-click a field in the Fields list to add it into the Expression area.
7. Continue adding functions and fields into your expression until it is complete.
8. Make sure your expression is correct.
The system checks your syntax as you build the expression. The yellow text box below the
Expression area displays any error messages. You can save expressions that contain errors, but
will not be able to save the dataset until all expressions evaluate successfully.
9. Click Save to add the new computed field to the dataset.
Your new computed field appears as a new column in the dataset.
10. Check the computed column values to make sure the expression logic is working as expected.
The dataset workspace can't show sample results for measures, event series processing computed
fields, or computed fields that operate on fields of a referenced dataset.
Expressions are an advanced topic. For information on working with the Platfora expression syntax, see
Expressions Guide.
Add Binned Fields
A binned field is a special kind of computed field that groups ranges of values together to create new
categories. The dataset workspace has tools for quickly creating bins on numeric type fields. Binned
fields are a way to reduce the number of values in a high-cardinality column, or to group data in a
way that makes it easier to analyze. For example, you might want to bin the values in an age field into
categories such as under 18, 19 to 29, 30 to 39, and 40 and over.
Bin Numeric Values
You can bin numeric values by adding a binned quick field.
1. On the Manage Fields step of the dataset workspace, select a column that is a numeric data type.
2. In the Field Info panel, click Add Bins.
3. Choose a Bin Method and enter your bin intervals.
• Even Intervals will group numeric values into even numbered bins. The bin name that is
returned is determined by rounding the value down to the starting value of its containing bin. For
example, if the interval is 10, then a value of 5 would return 0 (it is in the bin 0-9), and a value of
11 would return 10 (it is in the bin 10-19).
• Custom Intervals groups values into user-defined ranges and assigns a text label to each
range. For example, suppose you had a Flight Duration field that was in minutes and you wanted
to bin the values into one hour intervals (60 minutes). Each value you enter creates a range
between the current value and the previous one. So if you entered a starting value of 60 (one
hour), the starting range would be less than one hour. If the last value you entered was 600 (10
hours), the ending range would be over 10 hours.
With custom intervals, the values you enter should correspond to the data type. For example,
an integer would have values such as 60, 120, etc. A double would have values such as 60.00,
120.00, etc. (See the worked sketch after these steps.)
The output of a custom interval is always a string (text label).
4. Edit the name of the new binned field.
5. Edit the description of the new binned field.
6. Click Add.
The binned column is added to the dataset.
7. Verify that the bin values are calculated as expected.
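As a concrete sketch of the Custom Intervals method described in step 3: for a Flight Duration field in
minutes, you might enter the values 60, 120, 180, 240, 300, 360, 420, 480, 540, and 600 to create
one-hour ranges, and assign each range a text label (for example, under 1 hour, 1 to 2 hours, and so
on). The starting range would cover durations of less than one hour, and the ending range would cover
durations of over 10 hours.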
Bin Text Values
If you want to bin text or STRING values, you can define a computed field that groups values together
using a CASE expression.
For example, here is a CASE expression to bucket values of a name field together by their first letter:
CASE
 WHEN SUBSTRING(name,0,1)=="A" THEN "A"
 WHEN SUBSTRING(name,0,1)=="B" THEN "B"
 WHEN SUBSTRING(name,0,1)=="C" THEN "C"
 WHEN SUBSTRING(name,0,1)=="D" THEN "D"
 WHEN SUBSTRING(name,0,1)=="E" THEN "E"
 WHEN SUBSTRING(name,0,1)=="F" THEN "F"
 WHEN SUBSTRING(name,0,1)=="G" THEN "G"
 WHEN SUBSTRING(name,0,1)=="H" THEN "H"
 WHEN SUBSTRING(name,0,1)=="I" THEN "I"
 WHEN SUBSTRING(name,0,1)=="J" THEN "J"
 WHEN SUBSTRING(name,0,1)=="K" THEN "K"
 WHEN SUBSTRING(name,0,1)=="L" THEN "L"
 WHEN SUBSTRING(name,0,1)=="M" THEN "M"
 WHEN SUBSTRING(name,0,1)=="N" THEN "N"
 WHEN SUBSTRING(name,0,1)=="O" THEN "O"
 WHEN SUBSTRING(name,0,1)=="P" THEN "P"
 WHEN SUBSTRING(name,0,1)=="Q" THEN "Q"
 WHEN SUBSTRING(name,0,1)=="R" THEN "R"
 WHEN SUBSTRING(name,0,1)=="S" THEN "S"
 WHEN SUBSTRING(name,0,1)=="T" THEN "T"
 WHEN SUBSTRING(name,0,1)=="U" THEN "U"
 WHEN SUBSTRING(name,0,1)=="V" THEN "V"
 WHEN SUBSTRING(name,0,1)=="W" THEN "W"
 WHEN SUBSTRING(name,0,1)=="X" THEN "X"
 WHEN SUBSTRING(name,0,1)=="Y" THEN "Y"
 WHEN SUBSTRING(name,0,1)=="Z" THEN "Z"
 ELSE "unknown" END
Expressions are an advanced topic. For information on working with Platfora expressions and their
component parts, see Expressions Guide.
Add Measures for Quantitative Analysis
A measure is a special type of computed field that returns an aggregated value for a group of records.
Measures provide the basis for quantitative analysis when you build a lens or visualization in Platfora.
Every dataset, lens, or visualization must have at least one measure. There are a couple of ways to add
measures to a dataset.
FAQs - Dataset Measures
This section describes the basic concept of measures, and why they are needed in a Platfora dataset.
Measures are necessary if you plan to build aggregate lenses from a dataset, and use the data for
quantitative analysis.
What is a measure?
Measures provide the basis for quantitative analysis in a visualization or lens query. A measure is a
numeric value representing an aggregation of values from multiple rows. For example, measures contain
data such as total dollar amounts, average number of users, count distinct of users, and so on.
Measure values always result from a computed field that uses an aggregate function in its expression.
Examples of aggregate functions include COUNT, DISTINCT, AVG, SUM, MIN, MAX, VARIANCE,
and so on.
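For example (using an illustrative field name), a computed field with the expression AVG(salary)
defines a measure that returns the average salary across whatever group of records is being analyzed.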
Why do I need to add measures to a dataset?
In some data analysis tools, measures (or metrics as they are sometimes called) can be aggregated at the
time of analysis because the amount of data to aggregate is relatively small. In Platfora, however, the
data in a lens is pre-aggregated to optimize performance of big data queries. Therefore, you must decide
how to aggregate the metrics of your dataset up front. You do this by defining measures either in the
dataset or at lens build time. When you go to analyze the data in a vizboard, you can only do quantitative
analysis on the measures you have available in the lens.
How do I add measures to a dataset?
There are a couple of ways to add measures to a dataset:
• Add a computed field to the dataset that uses an aggregate function in its expression. Measures
computed in this way allow data administrators more control over how the data is aggregated, and
what level of detail is available to users. For example, you may want to prevent users from seeing
the original values of salary field, but allow users to see averages or percentiles of salary data. Also,
more complex aggregate calculations, such as standard deviation or ranking, can only be done with
computed field expressions.
• Choose quick field aggregations on certain columns of the dataset. This way, if a user chooses the
field in the lens builder, they will automatically get the measure aggregations you have selected.
Users can always override quick field selections if they want.
• Use the default measure. Every dataset has one default measure, which is a simple count of dataset
records.
Can analyst users add their own measures if they want?
Analyst users can always choose quick measure aggregations when they go to build a lens, but they can't
add computed measures to a dataset. You must be a data administrator and have the appropriate dataset
permissions to add computed fields to a dataset.
In a vizboard, users can manipulate the measure data they already have in their lens. They can use
ROLLUP and window functions to compute measure results over different time frames or categories.
Most aggregate calculations must be computed during lens build processing. However, a few aggregate
expressions are allowed without having to rebuild the lens. DISTINCT, MIN, and MAX can be used to
define new measures in the vizboard without having to rebuild the lens.
What does the Original Value quick field do?
If Original Value is selected on a field, then all possible values of the field are included in a lens (if
the field is selected in the lens). It also means the field can be used as a dimension (for grouping measure
data) in an analysis. If a field only makes sense as a measure, you should deselect Original Value. This
will only include the aggregate results in the lens and keep the lens size down.
The Default 'Total Records' Measure
Platfora automatically adds a default measure to every dataset you create. This measure is called
Total Records, and it counts the number of records (or rows) in the dataset. You can change the name,
description, or visibility of this default measure, but you cannot delete it. When you build a lens from a
dataset, this measure is always selected by default.
Add Quick Measures
If you have a field in your dataset that you want to use for quantitative analysis, you can select that field
and quickly add measures to the dataset. A quick measure sets the default aggregation(s) to use when a
user builds a lens.
Quick measures are an easy way to add measures to a dataset without having to define new computed
fields or write complicated expressions. Quick measures are added to a field in a dataset, and they set the
default measures to create if a user chooses that field for their lens. Users can always decide to override
the default measure selections when they define a lens.
1. On the Manage Fields step of the dataset workspace, select the field you want to use as a measure.
2. In the Field Info panel, choose how the values should be aggregated by default.
DISTINCT (count of distinct values) is available for all field types. MIN (lowest value) and
MAX (highest value) are available for numeric-type or datetime-type fields. SUM (total) and AVG
(average) are available for numeric-type fields only.
Leaving Original Value selected will also add the field as a dimension
(grouping column) if it is selected for a lens. In most cases, fields that are
intended to be used as measures (aggregated data only) should not have
Original Value selected, as this can cause the lens to be larger than intended.
Add Computed Measures
In addition to quick measures, you can create more sophisticated measures using computed field
expressions. A computed field expression containing an aggregate function is considered a measure.
1. Go to the Manage Fields step of the dataset workspace.
Review the sample data values before writing your measure expression.
2. Choose Computed Field from the dataset workspace Add menu.
This opens the Add Field dialog containing the expression builder controls.
3. Enter a name for your field and a description.
The description is optional but very useful for others who will use the field later.
4. Choose Aggregate from the Functions dropdown.
The list shows available Aggregate functions.
5. Double-click a function from the list to add it to the Expression area.
The Expression panel updates with the function's template. Also, the Fields list refreshes with
those fields you can use with the function. For example, MIN and MAX functions can aggregate
numeric or datetime data types.
6. Double-click a field to add it into the Expression area.
7. Continue adding functions and fields into your expression until it is complete.
Aggregate functions can only take fields or literal values as input.
8. Make sure your expression is correct.
The system checks your syntax as you build the expression. The yellow text box below the
Expression area displays any error messages. You can save expressions that contain errors, but you
will not be able to save the dataset until all expressions evaluate successfully.
9. Click Save to add the new computed measure field to the dataset.
Your new field appears in the dataset. At this point, the field has no sample values. This is expected
for measure fields. As an aggregate field, it depends on a defined group of input rows to calculate a
value.
10. (Optional) Hide the field you used as input to the aggregate function.
Hiding the input field is useful when only the aggregated result is needed for future analysis.
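For example, a minimal computed measure expression might look like this (a hedged sketch; order_total is a hypothetical numeric field in the dataset):

AVG(order_total)

Because aggregate functions only take fields or literal values as input, any row-level arithmetic (such as quantity multiplied by unit price) must first be defined in a separate non-aggregate computed field, which can then serve as the input field here.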
Expressions are an advanced topic. For information on working with Platfora's expression syntax, see
Expressions Guide.
Prepare Date/Time Data for Analysis
Working with time series data is an important part of data analysis. To prepare time-based data for
analysis, you must tell Platfora which fields of your dataset contain DATETIME type data, and how your
timestamp fields are formatted. This allows users to analyze data chronologically and see trends in the
data over time.
FAQs - Date and Timestamp Processing
This section answers the common questions about how Platfora handles date and time data in a dataset.
Date and time data should be assigned to the DATETIME data type for Platfora to recognize it as a date
or timestamp.
In what format does Platfora store timestamp data?
Internally, Platfora stores all DATETIME type data in UTC (Coordinated Universal Time). If your
timestamp data does not have a time zone component, Platfora assumes the local time zone of the
Platfora server.
When time-based data is in DATETIME format it can be ordered chronologically. You can also use
the DATETIME processing functions to calculate time intervals between two DATETIME fields. For
example, you can calculate the time difference between an order date and a ship date field.
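For example, a hedged sketch of such a computed field (the field names are hypothetical, and the actual interval function names are listed in the Expressions Guide):

DAYS_BETWEEN(ship_date, order_date)

This assumes a DAYS_BETWEEN-style function that returns the number of days elapsed between two DATETIME fields.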
How does Platfora parse timestamp data?
There are a handful of timestamp formats that Platfora can recognize automatically. On the Parse
Data step of the dataset workspace, pay attention to the data type assigned to your timestamp columns.
If the data type is DATETIME, then Platfora was able to parse the timestamp correctly.
If the data type is STRING, then Platfora was not able to parse the timestamp correctly. You will have to
create a computed field to tell Platfora how your date/time data is formatted. See Cast DATETIME Data
Types.
Why are all my dates/times 1970-01-01T00:00:00.000Z (January 1, 1970 at
12:00 AM)?
This is the default value for the DATETIME data type in Platfora. If you see this value in your date or
timestamp columns, it could mean:
• Platfora does not recognize the format of your timestamp string, and was not able to parse it
correctly.
• Your data values are NULL (empty). Check the raw source data to confirm.
• Your data does not have a time component (or a date component) to it. Platfora only has one data
type for dates and times: DATETIME. It does not have just DATE or just TIME. If one of these
components is missing in your timestamp data, the defaults will be substituted for the missing
information. For example, if you had a date value that looks like this: 04/30/2014, then Platfora will
convert it to this: 2014-04-30T00:00:00.000Z (the time is set to midnight).
What are the Date and Time datasets for?
Slicing and dicing data by date and time is a very common reporting requirement. Platfora's built-in
Date and Time datasets allow users to explore time-based data at different granularities in a vizboard.
For example, you can explore date-based data by day, week, or month or time-based data by hour,
minute, or second.
Why does my dataset have all these date and time references when I didn't add
them?
Every DATETIME type field in a dataset automatically generates two references: one to the built-in Date
dataset and one to the built-in Time dataset. These datasets have a built-in hierarchy that allows users to
explore dates at different granularities.
How do I remove the automatic references to Date and Time?
You cannot remove the automatic references to Date and Time, but you can rename them or hide them.
Cast DATETIME Data Types
If Platfora can recognize the format of a date field, it will automatically cast it to the DATETIME data
type. However, some date formats are not automatically recognized by Platfora and need to be converted
to DATETIME using a computed field expression.
1. On the Manage Fields step of the dataset workspace, find your base date field.
If the data type is STRING and not DATETIME, that means that Platfora could not automatically
parse the date format.
2. Choose Computed Field from the dataset Add menu.
3. Enter a name for the new computed field.
4. Write an Expression using the TO_DATE function. This function converts values to DATETIME
using the date format you specify (see the example following these steps).
5. Click Save to add the computed field to the dataset.
6. Verify that the new DATETIME field is formatting the date values correctly. Also check that the
automatic references to the Date and Time datasets are created.
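For example, if a STRING field named order_date contains values such as 04/30/2014, a hedged sketch of the computed field expression would be (order_date is a hypothetical field name, and the format pattern syntax is described in the Expressions Guide):

TO_DATE(order_date, "MM/dd/yyyy")

The resulting field is typed as DATETIME, and the value above would appear as 2014-04-30T00:00:00.000Z (the time defaults to midnight).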
About Date and Time References
Every DATETIME type field in a dataset automatically generates two references: one to the built-in Date
dataset and one to the built-in Time dataset. These datasets have a built-in hierarchy that allows users to
explore dates at different granularities. You cannot remove these auto-generated references, but you can
rename or hide them.
1. Go to the Create References step of the dataset workspace.
For each DATETIME type field in the dataset, you will see two references: one to Date and one to
Time.
2. Click a reference to select it.
3. In the Define References panel, you can edit the reference name or description. This is the
reference name as it will appear to users in the data catalog.
4. If you make changes, you must click Update for the changes to take effect.
5. If you don't want a reference to appear in the data catalog at all, you can hide it.
About the Default 'Date' and 'Time' Datasets
Slicing and dicing data by date and time is a very common reporting requirement. Platfora allows you to
analyze date and time-based data at different granularities by automatically linking DATETIME fields to
Platfora's built-in Date and Time dimension datasets.
The source data for these datasets is added to the Hadoop file system when Platfora first starts up (in
/platfora/system by default). If you have a different fiscal calendar for your business, you can
either replace the built-in datasets or add additional ones and link your datasets to those instead. You
cannot delete the default Date and Time references, however you can hide them if you do not need them.
The Date dataset has Gregorian calendar dates ranging from January 1, 1800 to December 31, 2300.
Each date is broken down into the following columns:
Date Dimension Column  Data Type  Description
Date                   DATETIME   A single date in the format yyyy-MM-dd, for example 2014-10-31. This is the key of the Date dataset.
Day_of_Month           INTEGER    The day of the month from 1-31
Day_of_Year            INTEGER    The day of the year from 1-366
Month                  INTEGER    Calendar month, for example January 2014
Month_Name             STRING     The month name (January, February, etc.)
Month_Number           INTEGER    The month number where January=1 and December=12
Quarter                STRING     The quarter number with year (Q1 2014) where quarters start on January 1, April 1, July 1, or October 1
Quarter_Name           STRING     The quarter number without year (Q1) where quarters start on January 1, April 1, July 1, or October 1
Week                   INTEGER    The week number within the year where week 1 starts on the first Monday of the calendar year
Weekday                STRING     The weekday name (Monday, Tuesday, etc.)
Weekday_Number         INTEGER    The day of the week where Sunday is 1 and Saturday is 7
Work_Day               STRING     One of two values: Weekend (Saturday, Sunday) or Weekday (Monday - Friday)
Year                   INTEGER    Calendar year, for example 2014
The Time dataset has each time of day divided into different levels of granularity, from the most general
(AM/PM) to the most detailed (Time in Seconds).
Prepare Location Data for Analysis
Adding geographic location information to a dataset allows vizboard users to use maps and geospatial analytics to discover new insights in the data. To prepare location data for analysis, you must
tell Platfora which fields of your dataset contain geographic coordinates (latitude and longitude), and
optionally a place name to associate with those coordinates (such as the name of a business).
FAQs - Location Data and Geographic Analysis
This section answers the common questions about how Platfora handles location data in a dataset.
Location information can be added to a dataset by geo-encoding certain fields of the dataset, or by
creating a geo location reference to another dataset that contains geo-encoded location data.
What is location data?
Location data represents a geographic point on a map of the Earth's surface. It is comprised of latitude /
longitude coordinates, plus an optional label that associates a place name with the set of coordinates.
What is geographic analysis?
Geographic analysis is a type of data analysis that involves understanding the role that location plays in
the occurrence of other factors. By looking at the geo-spatial distribution of data on a map, analysts can
see how location impacts different variables.
In a vizboard, analysts can use the geo map viz type to do geographic analysis.
How does Platfora do geographic analysis?
Platfora enables geographic analysis by allowing data administrators to encode their datasets with
location information. This geo-encoded data then appears as special location fields in the dataset, lens,
and vizboard.
These special location fields can then be used to create map visualizations in a Platfora vizboard.
Platfora uses Google Maps to render map visualizations.
What are the prerequisites to doing geographic analysis in Platfora?
In order to do geographic analysis in Platfora, you must have:
• Access to the Google Maps web service from your Platfora master server. Your Platfora System
Administrator must configure this for you.
• Datasets with latitude / longitude coordinates in them.
Platfora provides some curated datasets for US states, counties, cities and zip codes. You can import
these datasets and use them to create geo references if needed (assuming your datasets have a column
that can be used to link to these datasets).
What are the high-level steps to prepare data for geographic analysis?
1. Geo-encode the location data in your datasets by creating geo location fields or geo location
references.
2. Make sure to include location fields in your lens when you build it.
3. In the vizboard, choose the map viz type.
What is a location field?
A location field is a new type of field you create in a dataset. It has a field name, latitude / longitude
coordinates, plus an optional label that associates a place name with a set of coordinates. To create a geo
location field in a dataset, you must tell Platfora which columns of the dataset contain this information.
What is a geo reference?
A geo location reference is a reference to another dataset that contains location fields. Geo references
should be used when the dataset you are referencing is primarily used for location purposes.
How are geo references different from regular references?
References and geo references are basically the same -- they both create a link to another dataset. A geo
reference, however, can only point to datasets that have geo location fields in it. The purpose of a geo
reference is to link to datasets that primarily contain location information.
Geo references and regular references are also displayed differently in the data catalog view of the
dataset, the lens, and the vizboard. Notice that either type of reference can contain location fields. Geo
references just use a different icon. This visual cue helps users find location data more easily.
When should I create geo references versus regular references?
The purpose of a geo reference is to link to datasets that primarily contain just location fields. This helps
users identify location data more easily when they want to do geographic analysis. Geo location type
fields are featured more prominently via a geo reference.
If you have a dataset that has lots of other fields besides just location fields, you may want to use a
regular reference instead. Users will still be able to use the location fields in the referenced dataset to
create map vizzes if they want. The location data is just featured less prominently in the data catalog and
lens.
Can I change a regular reference into a geo reference (or vice versa)?
No. If you want to change the type of a reference, you will have to delete the regular reference and
recreate it as a geo reference (or the other way around). You should use the same reference name so that
lenses and vizboards that are using the old reference name do not break.
You cannot have two references with the same name, even though they are different types of references.
You will either have to delete or rename the old reference before you create the new one.
Understand Geo Location Fields
A geo location field is a new type of field you create in a dataset. It has a field name, latitude / longitude
coordinates, plus a label that associates a place name value with a set of coordinates. To create a geo
location field in a dataset, you must tell Platfora which columns of the dataset contain this location
information.
You can think of a geo location field as a complex data type comprised of multiple dataset columns. In
order to create a geo location field, your dataset must have:
• A latitude column with numeric data type values
• A longitude column with numeric data type values
• A place name column containing STRING data type values. Place name values must be unique for
each latitude, longitude coordinate pair.
The values of this column will be used to:
• label tooltips for points on a map viz
• provide the filter values when creating a filter on a location field
• label marks and axes when using a location field in non-map visualizations
The reason for creating geo location fields is so analyst users can plot location data in a map
visualization. Location fields are shown with a special pin icon in the data catalog, lens, and
vizboard. This location icon lets users know that the field can be used on a map.
Add a Location Field to a Dataset
The process of adding a geo location field to a dataset involves mapping information from other dataset
columns. To create a location field, a dataset needs a latitude column, a longitude column, and a label
column containing place names.
1. Make sure that your dataset has the necessary columns.
Latitude and longitude columns are required to create a geo location field. Each coordinate must be in
its own column, and the columns must be a numeric data type.
A location name column is optional, but highly recommended. If you do use location names, the
values must be unique for each latitude, longitude coordinate pair. For example, a column containing
just city names may not be unique (there may be a city named Paris in multiple states and countries).
You may need to create a unique place name column by combining the values of multiple fields in a
computed field expression (see the sketch following these steps).
2. Choose Add > Geo Location.
3. Under Geo Location Type > Geo Location Fields choose either Latitude, Longitude (if
you only have the coordinates) or Latitude, Longitude with Name (if you also have a column
to use as place name labels).
4. Give the location field a Name.
Consider using a standard naming convention for all location type fields. For example, always use
Location or Geo in the field name. This will make it easier for users to find location fields using
search.
5. (optional) Enter a Description for the location field.
6. Choose the fields in the current dataset that map to Latitude, Longitude and Location Name.
If you don't see the expected dataset columns as choices, make sure the dataset columns are the
correct data type -- DOUBLE, FIXED, LONG or INTEGER for Latitude and Longitude, STRING
for Location Name.
7. Click Add.
8. Make sure the geo location field was added to the dataset as expected. Location fields and geo
references are both added in the references section of the dataset on the Geo Locations tab.
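As mentioned in step 1, you may need a computed field to build a unique place name column. Here is a hedged sketch (assuming a CONCAT-style string function; the field names are hypothetical):

CONCAT(city, ", ", state)

This produces values such as Paris, TX and Paris, ME, which remain unique per latitude/longitude pair even when city names alone repeat.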
Understand Geo References
If your datasets do not have geographic coordinates in them, you can reference special geo datasets that
do have coordinate information in them. For example, if your dataset has US zip code information in it,
you can reference a special geo dataset that contains latitude/longitude coordinates for each US zip code.
A geo reference is similar to a regular reference in Platfora. The difference is that geo references are
used specifically for the purpose of linking to datasets containing geo location fields.
A regular reference links to datasets that have other dimension information besides just location
information. Although a regular referenced dataset may have location information in it as well, location
information is not the primary reason the dataset exists.
Prepare Geo Datasets to Reference
A geo dataset is a dataset that contains mainly location information. The main purpose of a geo dataset
is to be the target of a geo location reference from other datasets in the data catalog. Linking another
dataset to a geo dataset allows users to do geographic analysis in a vizboard.
Platfora comes with some built-in geo datasets that you can install and use for US States, US Counties,
US Cities, and US Zipcodes.
Optionally, you may have your own data that you want to use to create your own geo datasets. For
example, you may have location information about sites that are relevant to your business, such as store
locations or office locations.
Load the US Geo Datasets
Platfora installs with some curated geo-location datasets for US States, US Counties, US Cities, and US
Zipcodes. You can load these datasets into the Platfora data catalog, and then use them to create geo
references from datasets that do not have location information in them.
The geo location datasets contain United States location data only. If you have International location
data, or custom locations you want to create (such as custom business locations) you can look at these
datasets as examples for creating your own geo-location datasets.
In order to reference these geo datasets, your own datasets must have a column that can be used to join
to the key of the appropriate geo dataset. For example, to join to the US States dataset, your dataset must
have a column that has two-digit state codes (CA, NY, TX, etc.).
1. Log in to the Platfora master server in a terminal session as the platfora system user.
2. Run the geo dataset install script:
$ $PLATFORA_HOME/client/examples/geo/US/install_geo_us.sh
You should see output such as:
Importing dataset: "US States"
Importing dataset: "US Counties"
Importing dataset: "US Cities"
Importing dataset: "US Zipcodes"
Importing permissions for SourceTable: 'US Cities'
Importing permissions for SourceTable: 'US Cities'
Importing permissions for SourceTable: 'US Cities'
Importing permissions for SourceTable: 'US Counties'
Importing permissions for SourceTable: 'US Counties'
Importing permissions for SourceTable: 'US Counties'
Importing permissions for SourceTable: 'US Zipcodes'
Importing permissions for SourceTable: 'US Zipcodes'
Importing permissions for SourceTable: 'US Zipcodes'
Importing permissions for SourceTable: 'US States'
Importing permissions for SourceTable: 'US States'
Importing permissions for SourceTable: 'US States'
3. Go to the Platfora web application in your browser.
4. Go to the Datasets tab in the Data Catalog. Look for the US States, US Counties, US Cities,
and US Zipcodes datasets.
The default permissions on these datasets allow Everyone to view them and build lenses, but only
the default System Administrator (admin) account to edit or delete them. You may want to grant edit
permissions to other system or data administrator users.
Create Your Own Geo Datasets
You may have location information about sites that are relevant to your business, such as store or office
locations. If you have location data that you want to reference from other datasets, you can create a
special geo dataset. Geo datasets are datasets that are intended to be the target of a geo reference.
Creating a geo dataset is basically the same as creating any other dataset. However, you prepare the
fields and references within a geo dataset so that only (or mostly) location fields are visible in the data
catalog. Hide all fields that are not location fields.
Prepare the dataset so that only (or mostly) location fields appear as top-level columns in the dataset. For
example, in the Airports dataset, there are 3 possible locations for an airport (from most granular to most
general): Airport Location, Airport City Location, and Airport State Location.
If the dataset references other datasets, hide the internal references so users don't see a complicated tree
of references in the data catalog. The goal is to flatten and simplify the reference structure for users. For
example, in the Airports dataset, there is an internal reference to US Cities. That reference is hidden so
users don't see it in the data catalog.
Use interim computed fields to 'pull up' the required latitude and longitude columns from the referenced
dataset into the current dataset. For example, in the Airports dataset, the latitude and longitude columns
for Airport Location are already in the current dataset. The latitude and longitude columns for Airport
City Location, however, are in the referenced US Cities dataset.
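For example, a hedged sketch of an interim computed field named City_Latitude (the dot notation shown here for reaching fields through a reference is an assumption; check the Expressions Guide for the exact syntax):

[US Cities].[Latitude]

Defining interim fields like this brings the referenced latitude and longitude values into the current dataset, where they can then be selected when creating the geo location field.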
Then create geo location fields in the current dataset. The computed fields add the required columns
needed to create a location field in the current dataset. The goal is to create all possible geo location
fields in the current dataset, so users don't have to navigate through multiple references to find them.
Consider using a common naming convention for location fields, such as always having Location in the
name. This will help users easily find location fields using search.
After all of the location fields have been added to your geo dataset, consider adding a drill path from
the most general location field (for example, Airport State Location) to the most specific (for example,
Airport Location). This will allow users to drill-down on points in a map visualization.
Don't forget to designate a key for your geo dataset. A dataset must have a key to be the target of a geo
reference from another dataset.
This approach takes a bit of work, but the end result makes it clear to users what fields they can use in
map visualizations. Here is what a geo reference from the Flights dataset to a specially prepared Airport
Locations geo dataset might look like.
Add a Geo Reference
A geo reference is basically the same as a regular reference. It creates a link from a field in the current
dataset (the focus dataset) to the primary key field of another dataset (the target dataset). You should
use a geo reference when the dataset you are linking to is mostly used for location purposes. The target
dataset of a geo reference must have a primary key defined and also contain at least one geo location
type field. Also, the fields used to join two datasets must be of the same data type.
1. Make sure that your dataset has the necessary foreign key columns to join to the target geo dataset.
For example, to join to the US Cities dataset provided by Platfora, your dataset must have a state
column containing two digit, capitalized state values (CA, TX, NY, and so on), and a city column
with city names that have initial capital letters, proper spacing, and no abbreviated names (for
example, San Francisco, Los Angeles, Mountain View -- not san francisco, LA, or Mt. View).
2. Choose Add > Geo Location.
3. Under Geo Location Type > Geo References choose the dataset you want to link to.
Only datasets that have keys defined and geo location fields in them will appear in the list.
4. Give the geo reference a Name.
Consider using a standard naming convention for all geo location references. For example, always
use Location or Geo in the name. This will make it easier for users to find geo references and
location fields using search.
5. (optional) Enter a Description for the geo reference.
6. Choose the Foreign Key field(s) in the current dataset to link to the key field(s) of the target
dataset.
The foreign key field must be of the same data type as the target dataset key field.
If the target dataset has a compound key, you must choose a corresponding foreign key for each field
in the key.
7. Click Add.
8. Make sure the geo location reference was added to the dataset as expected. Location fields and geo
references are both added in the references section of the dataset on the Geo Locations tab.
This is how geo references appear in the data catalog view of the dataset, the lens, and the vizboard. The
location fields under a geo reference are listed before other dimension fields.
Prepare Drill Paths for Analysis
Adding a drill path to a dataset allows vizboard users to drill down to more granular levels of detail in a
viz. A drill path is defined in a dataset by specifying a hierarchy of dimension fields.
For example, a Product drill path might have categories for Division, Type, and Model. Drill path levels
depend on the granularity of the dimension fields available in the dataset.
FAQs - Drill Paths
This topic answers some frequently asked questions about defining and using drill paths in Platfora.
What is a drill path?
A drill path is a hierarchy of dimension fields, where each level in the hierarchy is a sub-division of the
level above. For example, the default drill path on Date starts with a top-level category of Year, subdivided by Quarter, then Month, then Date. Drill paths allow vizboard users to interact with data in a
viz. Users can double-click on a mark in a viz (or a cell in a cross-tab) to navigate from summarized to
detailed levels of categorization.
Who can define a drill path?
You must have the Data Administrator system role (or above), Edit permission on the dataset, and
data access permissions on the datasets included in the drill path hierarchy in order to define a drill
path.
Any user can navigate a drill path in a viz or cross-tab (provided they have sufficient data access
permissions).
Where are drill paths defined?
Drill paths are defined in a dataset. You can define drill paths when adding a new dataset or when
editing an existing one. Choose Add > Drill Path in the dataset workspace.
Can a field be included in more than one drill path?
Yes. A dataset can have multiple drill paths, and the same fields can be used in more than one drill path.
However, there is currently no way for a user to choose which drill path they want in a vizboard if a field
has multiple paths. The effective drill path will always be the path that comes first alphabetically (by
drill path name).
For example, the Date dataset has two pre-defined drill paths: YQMD (Year > Quarter > Month > Date)
and YWD (Year > Week > Date). If a user adds the Year field to a viz, they should be able to choose
between Quarter or Week as the next drill level. However, since there is no way to choose between
multiple drill paths, the current behavior is to pick the first drill path (YQMD in this case). The ability to
choose between multiple drill paths will be added in a future release.
Can I define a drill path on a single column of a dataset?
No. A drill path is a hierarchy of more than one dimension field.
If you want to drill on different granularities of data contained in a single column, you can create
computed fields to bin or bucket the values at different granularities. See Add Binned Fields.
For example, suppose you had an age field, and wanted to be able to drill from age in 10-year
increments, to age in 5-year increments, to actual age. To accomplish this, you'd first need to define two
additional computed fields: age-by-10 (10-year buckets) and age-by-5 (5-year buckets). Then you could
create a drill path hierarchy of age-by-10 to age-by-5 to age.
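Platfora's binning feature can create such fields for you (see Add Binned Fields). As a manual alternative, here is a hedged sketch of the two binning expressions (assuming a FLOOR function and arithmetic operators; age is a hypothetical numeric field):

FLOOR(age / 10) * 10
FLOOR(age / 5) * 5

The first expression produces 10-year buckets (20, 30, 40, and so on), and the second produces 5-year buckets.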
Can a drill path include fields from more than one dataset?
Yes. A drill path can include fields from the focus dataset, as well as from any datasets that it references.
For example, you can define one drill path that includes fields from both the Date and Time datasets via
their associated references.
Are there any default drill paths defined?
Yes. The built-in datasets for Date and Time have default drill paths defined. Any DATETIME type fields
that reference these datasets will automatically include these default drill paths.
Platfora recommends leaving the default Date and Time drill paths as is. You can always override the
default Date and Time drill paths by defining your own drill paths in the datasets that you create.
Why do the Date and Time datasets have multiple drill paths defined?
The built-in datasets for Date and Time are automatically referenced by any dataset that contains a
DATETIME type field. These datasets include some built-in drill paths to facilitate navigation between
different granularities of dates and times. You may notice that the Date dataset has two pre-defined drill
paths, and the Time dataset has four.
The multiple drill paths accommodate different ways of dividing date and time. In each drill path
hierarchy, each level is evenly divisible by the next level down. This ensures consistent drill behavior for
whatever field is used in a viz.
What things should I consider when defining a drill path?
A couple things to consider when defining drill paths:
• Consistent Drill Levels. Levels in the hierarchy should ideally be evenly divisible subsets of each
other. For example, in the Time dataset, the drill increments go from AM/PM to Hour by 6 to
Hour by 3 to Hour. Each level in the hierarchy is evenly divisible by levels below it. This ensures
consistent drill-down navigation in a viz.
• Alphabetical Drill Path Names. When a field participates in multiple drill-paths, the effective drill
path is the one that comes first alphabetically. Plan your drill path names accordingly.
• The Lens Decides the Drill Behavior. Ultimately, the fields that are included in the lens will dictate
the drill path levels available in a vizboard. If a level in the drill path hierarchy is not included in
the lens, it is simply skipped by the drill-down navigation. Consider defining one large drill path
hierarchy with all possible levels, and then use the lens field selections to control the levels of
granularity applicable to your analysis.
• Aggregate Lenses Only. Viz users can only navigate through a drill path in a viz that uses an
aggregate lens. Drill paths are not applicable to event series lenses.
How do I include a drill path in my lens or vizboard?
To include a drill path in a lens or vizboard, simply choose the fields that you are interested in analyzing.
As long as there is more than one field from a given drill path in the lens, then drill-down capabilities are
automatically included. The lens builder does not currently indicate if a field is a member of a drill path
or not.
You do not have to include every level of the drill path hierarchy in a lens -- the vizboard drill-down
behavior can skip levels that are not present. For example, if you have defined a drill path that goes from
year to month to day, but you only have year and day in your lens, the effective drill path for that lens
then becomes year to day (month is skipped).
Add a Drill Path
Drill paths are defined in a dataset. You can define drill paths when adding a new dataset or when
editing an existing one.
1. Edit the dataset.
2. On the Manage Fields step of the dataset workspace, choose Add > Drill Path.
3. Enter a name for the drill path.
Keep in mind that drill path precedence is determined alphabetically by name whenever a field is part
of multiple drill paths.
4. Add the fields that you want to include in the drill path. You can include fields from a referenced
dataset as well.
5. Use the up and down arrows to set the drill hierarchy order. The most general categorization should
be on top, and the most detailed categorization should be on the bottom.
6. Save the drill path.
Model Relationships Between Datasets
This section explains the relationships between datasets, and how to model dataset references, events
and elastic datasets in Platfora to support the type of analysis you want to do on the data.
Understand Data Modeling in Platfora
This section explains the different kind of relationships you model between datasets to support
quantitative analysis, event series analysis, and/or behavioral segment analysis.
The Fact-Centric Data Model
A fact-centric data model is centered around a particular real-world event that has happened, such as
web page views or sales transactions. Datasets are modeled so that a central fact dataset is the focus of
an analysis, and dimension datasets are referenced to provide more information about the fact. In data
warehousing and business intelligence (BI) applications, this type of data model is often referred to as a
star schema.
For example, you may have web server logs that serve as the source of your central fact data about pages
viewed on your web site. Additional dimension datasets can then be related (or joined) to the central fact
to provide more in-depth analysis opportunities.
In Platfora, you would model dataset relationships in this way to support the building of aggregate lenses
for quantitative data analysis. Fact-centric data modeling involves the following high-level steps in
Platfora:
1. Define a key in the dimension dataset. A key is one or more dataset columns that uniquely identify a
record.
2. Create a reference in your fact dataset that points to the key of the dimension dataset.
How References Work in Platfora
Creating a reference allows the datasets to be joined when building aggregate lenses and executing
aggregate lens queries, similar to a foreign key to primary key relationship between tables in a relational
database.
Once you have added your datasets, you can model the relationships between them by adding references
in your dataset definitions. A reference is a special kind of field in Platfora that points to the key of
another dataset. A reference is created in a fact dataset and points to the key of a dimension dataset.
For example, you may have web server logs that serve as the source of your central fact data about pages
viewed on your web site. Additional dimension datasets can then be related (or joined) to the central fact
to provide more in-depth analysis opportunities.
Upstream datasets point to other datasets. Downstream datasets are the datasets being pointed to.
For example, the Page Views dataset is upstream of the Visitors dataset, and the Visitors dataset is
downstream of Page Views.
Once a reference is created, the fields of all downstream datasets are available through the dataset where
the reference was created. Data administrators can define computed expressions using downstream
dimension fields, and analyst users can choose downstream dimension fields when they build a lens.
Measure fields, however, are not available through a reference.
The Entity-Centric Data Model
An entity-centric data model 'pivots' a fact-centric data model to focus an analysis around a particular
dimension (or entity). Modeling the data in this way allows you to do event series analysis, behavioral
analysis, or segment analysis in Platfora.
For example, suppose you had a common dimension that spanned multiple facts. In a relational database,
this is sometimes referred to as a conforming dimension. In this example, our conforming dimension is
customer.
Modeling the fact datasets around a central customer dataset allows you to analyze different aspects of a
customer's behavior. For example, instead of asking "how many customers visited my web site?" (fact-centric), you could ask questions like "which customers visit my site more than once a day?" or "which
customers are most likely to respond to a direct marketing campaign?" (entity-centric).
In Platfora, you would model dataset relationships in this way to support the building of event series
lenses and/or segments for behavioral data analysis. Entity-centric data modeling involves the following
high-level steps in Platfora:
1. Identify or create a dimension dataset to serve as the common entity you want to analyze. If your
existing data is only comprised of fact datasets, you can create an elastic dataset (a virtual dimension
used to model entity-centric relationships).
2. Define a key for the dimension dataset. A key is one or more dataset columns that uniquely identify a
record.
3. Create references in your fact datasets that point to the key of the common entity dimension dataset.
4. Model events in your common entity dimension dataset.
How Events Work in Platfora
An event is similar to a reference, but the direction of the join is reversed. An event joins the primary
key field(s) of a dimension dataset to the corresponding foreign key field(s) in a fact dataset, plus
designates a timestamp field for ordering the event records.
Adding event references to a dataset allows you to define an event series lens from that dataset. An event
series lens can contain records from multiple fact datasets, as long as the event references have been
modeled in the dimension dataset.
For example, suppose you had a common dimension dataset (customer) that was referenced by multiple
fact datasets (clicks, emails, calls). By creating different events within the customer dataset, you can
build an event series lens from customer that allows you to analyze different aspects of a customer's
behavior.
By looking at different customer events together in a single lens, you can discover additional insights
about your customers. For example, you could analyze customers who were the target of an email or
direct marketing campaign who then visited your website or made a call to your call center.
How Elastic Datasets Work in Platfora
Elastic datasets are a special kind of dataset used for entity-centric data modeling in Platfora. They are
used to consolidate unique key values from other datasets into one place for the purpose of defining
segments, event series lenses, references, or computed fields. They are elastic because the data they
contain is dynamically generated at lens build time.
Elastic datasets can be created when you have a flat data model with the majority of your data in a single
dataset. Platfora requires you to have separate dimension datasets in order to create segments and event
series lenses. Elastic datasets allow you to create 'virtual dimensions' so you can do the entity-centric
data modeling required to use these features of Platfora. Elastic datasets can be used to work with data
that is not backed by a single data source, but instead is embedded in various other datasets.
For example, suppose you wanted to do an analysis of the IP addresses that accessed your network.
You had various server logs that contained IP addresses, but did not have a separate IP Address dataset
modeled out in Platfora. In order to consolidate the unique IP addresses that occurred in your other
server log datasets, you could create an elastic dataset called IP Address. You could then model
references and events that pointed to this elastic, virtual dataset of IP addresses.
There are two ways to create an elastic dataset:
1. From one or more columns in an existing dataset. This will generate the elastic dataset, a static
example data file, and the corresponding reference at the same time.
2. From a file containing static example data. You can also use a dummy file of key examples to define
an elastic dataset. The file is then used for example purposes only.
After the elastic dataset has been created, you then need to model references (if you want to create
segments) and events (if you want to create event series lenses).
Elastic datasets are virtual - all of their data values are consolidated from the other datasets that
reference them. They are not file-based like other datasets. The actual key values that comprise the elastic
dataset are computed at lens build time. The example data shown in the Platfora data catalog is file-based, but it is only used for example purposes.
Elastic datasets inherit the data access permissions of the datasets that reference them. So for example,
if a user has access to the Web Logs and Network Logs datasets, they will have access to the IP address
values consolidated from those datasets via the IP Address elastic dataset.
One thing to keep in mind is the sample file used to show example values in the dataset workspace and
the Platfora data catalog. The values in this sample data file are viewable by all Platfora users by default.
If this is a concern, don't use real data values to create the sample data file.
Since elastic datasets contain no actual data of their own, they cannot be used as the focus of an
aggregate lens. They can be included by reference in an aggregate lens, or be used as the focus when
building an event series lens.
Also, since they are used to consolidate key values from other datasets, every base field in the dataset
must be included in the elastic dataset key. Additional base fields that are not part of the key are not
allowed in an elastic dataset (additional computed fields are OK though).
Add a Reference
A reference creates a link from a field in the current dataset (the focus dataset) to the primary key field
of another dataset (the target dataset). The target dataset must have a primary key defined. Also, the
fields used to join two datasets must be of the same data type.
1. Go to the Create References step of the dataset workspace.
2. Choose Add > Reference.
3. In the Define References panel, select the Referenced Dataset to link to. Only datasets
that have keys defined will appear in the list.
If you do not see the dataset you want to reference in the target list, make sure
that it has a key defined and that the data type of the key field(s) is the same
as the foreign key field(s) in the focus dataset. For example, if the key of the
target dataset is an INTEGER data type, but the focus dataset only has STRING
fields, you will not see the dataset in the target list because the data types are
not compatible.
4. Choose the Foreign Key field(s) in the current dataset to link to the Key field(s) of the target
dataset.
The foreign key field must be of the same data type as the target dataset key field.
If the target dataset has a compound key, you must choose a corresponding foreign key for each field
in the key.
5. Enter a Name for the reference.
6. (optional) Enter a Description for the reference.
7. Click Add.
The new reference is added to the dataset in the References section.
You may also want to hide the foreign key field(s) in the current dataset so that users only see the
reference fields in the data catalog.
From this point on, refer to the referenced dataset by its reference name (not the original dataset
name).
Add an Event Reference
An event is a special reverse-reference that is created in a dimension dataset. Before you can model
event references, you must define regular references first. Event references allow you to define an event
series lens from a dataset.
In order to create an event in a dataset, the current dataset and the event dataset you are linking to must
meet the following requirements:
• The current dataset must have a key defined.
• The current dataset must be the target of a reference. See Add a Reference.
• The event dataset that you are linking to must have a timestamp field in it (a DATETIME type field).
If the dataset does not meet these requirements, you will not see the Add > Event option.
1. Edit the dimension dataset in which you want to model event references.
2. Choose Add > Event.
3. Provide the event information and click Add Event.
Event Name               This is a logical name for the event. This is the name users will see in the data catalog or a lens.
Event Dataset            This is the fact dataset that you are linking to. You will only see datasets that have references to the current dataset.
Event Dataset Reference  This is the name of the reference in the event dataset. If the event dataset has multiple references to the current dataset, then choose the appropriate one for your event.
Ordering Field           This is a timestamp field in the event dataset. When an event series lens is built, this is the field used to order event records. Only DATETIME type fields in the event dataset are shown.
4. The event is added to a separate Events tab in the References section. Click on an event to edit
the event details.
Add an Elastic Dataset
Elastic datasets are a special kind of dataset used for entity-centric data modeling in Platfora. There are
two ways to create an elastic dataset - from a column in an existing dataset or from a file containing
sample values.
Elastic datasets are used to consolidate unique key values from other datasets into one place for the
purpose of defining segments or event series lenses. They are elastic because the data they contain is
dynamically generated at lens build time.
The data values used when adding the elastic dataset are for example purposes only. They are visible to
users as example values when they view or edit the elastic dataset in the data catalog. The actual data
values of an elastic dataset come from the datasets that reference it.
Create an Elastic Dataset from an Existing Dataset
As a convenience, you can create an elastic dataset while working in another dataset. This creates the
elastic dataset, a static example data file, and the corresponding fact-to-dimension reference at the same
time.
1. Edit the dataset that contains the key values you want to consolidate.
2. Choose Add > Elastic Dataset.
3. Choose the column in the current dataset that you want to base the elastic dataset on. If the key
values are comprised from multiple columns (a compound key), click Add Additional Fields to
choose the additional columns.
4. Enter a name for the new elastic dataset that will be created.
5. Enter a name for the new reference that will be created in the current dataset.
6. Enter a description for the new elastic dataset that will be created.
7. Click Add.
8. Notice that the reference to the elastic dataset is created in the current dataset.
9. Save the current dataset.
10. You are notified that the new elastic dataset is about to be created using sample values from the
column(s) you selected in the current dataset. Click Confirm.
The sample values are written to Platfora's system directory in the Hadoop file system. For
example, in HDFS at:
/platfora/system/current+dataset+name/sample.csv
This file is only used as sample data when viewing or editing the elastic dataset.
This sample file is not removed from HDFS if you delete the elastic dataset in
Platfora. You'll have to remove this file in HDFS directly.
Create an Elastic Dataset Using a Sample File
If you are creating an elastic dataset based on sensitive values, such as social security numbers or
email addresses, you may want to use a sample file of fake data to create the elastic dataset. This
way unauthorized users will not be able to see any real data via the sample values. This is especially
important for Platfora systems using HDFS delegated authorization.
1. Upload a file to Platfora as a basis for the elastic dataset. This file should contain a newline separated
list of sample values.
2. Go to the Define Key step of the dataset workspace.
3. Select Include in Key for all Base fields of the dataset.
4. Select Elastic Dataset to change the dataset's type from file-based to elastic.
5. Click OK to confirm the dataset type change.
6. Change the Dataset Name for the elastic dataset. For example, you may want to use a special
naming convention for elastic datasets to help you find them in the Platfora data catalog.
7. Save and Exit the dataset.
After you create the elastic dataset, you have to add references in your fact datasets to point to it. This is
how the elastic dataset gets populated with real data values at lens build time. It consolidates the foreign
key values from the datasets that reference it.
Delete or Hide a Reference
Deleting a reference removes the link between two datasets. If you want to keep the reference link, but
do not want the reference to appear in the data catalog, you can always hide it instead. The automatic
references to Date and Time cannot be deleted, but they can be hidden.
Before deleting a reference, make sure that you do not have computed fields,
lenses, or vizboards that are using referenced fields. A missing reference can
cause errors the next time someone updates a lens or vizboard that is using fields
downstream of the reference.
1. Go to the Create References step of the dataset workspace.
2. Find the reference on the References tab, event references on the Events tab, or geo reference on
the Geo Locations tab.
3. Delete or hide the reference.
• To delete the reference, click the delete icon.
• To hide the reference select Hidden. This will keep the reference in the dataset definition, but
hide it in the data catalog view of the dataset.
Update a Reference
You can edit an existing reference, event, or geo reference to change its name or description. Make sure
to click Update to apply any changes you make.
Before changing the name of a reference, make sure that you do not have
computed fields, lenses, or visualizations that are using it. Changing the name can
cause errors the next time someone updates a lens or vizboard that is using the
old reference name.
1. Go to the Create References step of the dataset workspace.
2. Find the reference on the References tab, event references on the Events tab, or geo reference on
the Geo Locations tab.
3. Click the reference you want to update.
4. In the Define References panel, update the name or description.
5. Click Update to apply your changes.
Define the Dataset Key
A key is a single field (or combination of fields) that uniquely identifies a row in a dataset, similar to a
primary key in a relational database. A dataset must have a key defined in order to be the target of a
reference, the base dataset of a segment, or the focus of an event series lens. Also, the key field(s) used
to join two datasets must be of the same data type.
1. Go to the Define Key step of the dataset workspace.
2. Select Include in Key in the column header of the key field(s).
A dataset may have a compound key (a combination of fields that are the
unique identifier). Select each field that comprises the key.
3. Click Save.
Chapter 4: Use the Data Catalog to Find What's Available
The data catalog is a collection of data items available and visible to Platfora users. Data administrators build
the data catalog by defining and modeling datasets in Platfora that point to source data in Hadoop. When users
request data from a dataset, that request is materialized in Platfora as a lens. The data catalog shows all of the
datasets (data available for request) and lenses (data that is ready for analysis) that have been created by Platfora
users.
Topics:
• FAQs - Data Catalog Basics
• Find Available Datasets
• Find Available Lenses
• Find Available Segments
• Organize Datasets, Lenses and Vizboards with Labels
FAQs - Data Catalog Basics
The data catalog is where users can find datasets, lenses, and segments that have been created in
Platfora. This topic answers the basic questions about the data catalog.
How can I see the relationships between datasets?
There isn't one place in the data catalog where you can see how all of the datasets are related to each
other. You can however, open a particular dataset to see how it relates to other datasets in the data
catalog.
The dataset detail page shows the Referenced Datasets that the current dataset is pointing to.
If a dataset is the target of an incoming reference, it is considered a dimension dataset. Dimension
datasets show both upstream and downstream relationships on their dataset detail page, whereas fact
datasets only show downstream relationships (or no relationships at all).
If a dimension dataset has an event or segment associated with it, then it is also considered an entity
dataset. Entity datasets serve as a conforming dimension to join multiple fact datasets together. Entity
datasets can be used as the focus of an event series lens.
What does it mean when a dataset or lens has a lock on it?
If you are browsing the data catalog and see datasets or lenses that are grayed-out and locked, this means
that you do not have sufficient data access permissions to see the data in that dataset or lens. Contact
your Platfora system administrator to ask if you can have access to the data.
What does it mean when a dataset has (static) or (dynamic) after its name?
This means that the dataset is a derived dataset. A derived dataset is defined from a viz (or lens
query) in Platfora, whereas a regular dataset is defined from a data source outside of Platfora.
A static derived dataset takes a snapshot of the viz data at a point in time - the data does not change if
the parent lens is updated. A dynamic derived dataset does not save the actual viz data, but instead saves
Page 138
Data Ingest Guide - Use the Data Catalog to Find What's Available
the lens query used to produce the data - the data is dynamically updated whenever the parent lens is
updated.
Why doesn't 'My Datasets' or 'My Lenses' have anything listed?
Even though you may work with certain datasets and lenses on a regular basis, they won't show in the
My Datasets or My Lenses panels unless you were the original user who created them.
Find Available Datasets
Datasets represent a collection of source data in Hadoop that has been modeled for use in Platfora. You
can browse or search the data catalog to find datasets that are available to you and that meet your data
requirements. Once you find a dataset of interest, you can request that data by building a lens (or check
if a lens already exists that has the data you need).
Search within Datasets
Using the Quick Find search, you can find datasets by name. Quick Find also searches the field names
within the datasets.
1. Go to the Datasets tab in the Data Catalog.
2. Search by dataset name or by a field name within the dataset using the search.
Dataset List View
List view allows you to sort the available datasets by different criteria to find the dataset you want.
1. Go to the Datasets tab in the Data Catalog.
2. Select List view.
3. Click a column header to sort by that column.
4. Once you find the dataset you want, use the dataset action menu to access it.
5. While in list view, you can select multiple datasets to delete at once.
Find Available Lenses
Lenses contain data that is already loaded into Platfora and immediately available for analysis. Lenses
are always built from the focus of a single dataset in the data catalog. Before you build a new lens, you
should check if there is already a lens that has the data you need. You can browse the available lenses in
the Platfora data catalog.
1. Go to the Lenses tab in the Data Catalog.
2. Choose the List view to easily sort and search for lenses.
3. Search by lens name or by a field name within the lens using the search.
4. Click a column header to sort by that column, such as finding lenses by their focus dataset.
5. Once you find a lens you want, use the lens action menu to access it.
Find Available Segments
The data catalog does not have a centralized tab or view that shows all of the segments that have been
created in Platfora. You can find segments by looking in a particular dataset to see if any segments have
been created from that dataset.
1. Go to the Data Catalog.
2. Find the dataset you want to segment and open it.
3. Choose the Segments tab.
If you do not see any segments listed, this means the dataset has not been used to define segments. A
dataset must be the target of an incoming reference in order to define segments from it.
4. Click a segment to see its details.
5. The segment details show:
Segment Name - The name given to the segment when it was defined in the vizboard.
Built On - The last time the segment was updated.
Segment of - The focus dataset (and its associated reference to the current dataset) that was used to
define the segment.
Occurring in Dataset - This is always the same as the currently selected dataset name.
Origin Lens - The lens that was queried to define this segment.
Segment Conditions - The criteria that a record must meet to be counted in the segment.
IN and NOT IN Value Labels - The value labels given to records that are in the segment, and those
that are not.
Segment Count - The number of rows of this dataset that met the segment conditions.
6. From the segment action menu you can Delete the segment or edit its Permissions.
Segments cannot be created or edited from the data catalog. To edit the segment conditions, you must
go to a vizboard where the segment is in use. Choose Show Vizboards to find vizboards that are
using the segment.
Organize Datasets, Lenses and Vizboards with Labels
If you have datasets, lenses, and vizboards that you use all of the time, you can tag them with a label
so you can easily find them. Labels allow you to organize and categorize Platfora objects for easier
search and collaboration. For example, you can label all datasets, lenses and vizboards associated with a
particular department or project.
Anyone in Platfora can create a label and apply it to any object to which they have view data access.
Labels are just an organizational tool. They do not have any security or privacy settings associated with
them. Labels can be created, viewed, applied, or deleted by any Platfora user, even labels created by
other users. There is no ownership associated with labels.
Create a Label
Before you create new labels, first decide how you want to categorize and organize your data objects
in Platfora. For example, do you want to tag objects by user names? by project? by department? by use
case? a combination of these?
Labels can be created for each category you want to search by, and within a category, you can create up
to 10 levels of nested sub-label categories. By default, there is one parent label category called All which
cannot be renamed or deleted. Any label you add will be a sub-label of All.
1. You can manage labels from the Data Catalog or the Vizboards area of Platfora.
2. Select Manage Labels from the Labels menu.
3. Select Create Sublabel from the desired parent label in the hierarchy. The default parent label
category is All.
4. Enter a name for the label.
5. Click Create.
6. Click OK.
Apply a Label to a Dataset, Lens, or Vizboard
You can apply as many labels as you like to a dataset, lens, or vizboard. Applying a label to an object
allows you to search for that object by that label name.
1. You can apply labels from the Data Catalog or the Vizboards area of Platfora.
2. Select Labels from the dataset, lens or vizboard action menu.
3. Click the plus sign (+) to apply a label. Click the minus sign (-) to remove a label that has been
previously applied.
4. Click OK.
Delete or Rename a Label
When you delete a label, the label is removed from all objects to which it was applied. The objects
themselves are not affected.
When you rename a label, the label will be updated to the new name wherever it is applied. You do not
need to re-apply it to the objects after renaming.
If you are getting errors using a label after it has been renamed, try reloading
the browser page. Sometimes old label names are cached by the browser and can
cause unexpected results.
Search by Label Name
Once you have applied labels to your objects, you can use the label breadcrumbs and search to find
objects by their assigned labels. You can search by label in the Data Catalog or Vizboards areas of
the Platfora application.
1. Click any level in the breadcrumb hierarchy to filter by that label category.
2. Select an existing label to filter on.
Chapter 5: Define Lenses to Load Data
To request data from Hadoop and load it into Platfora, you must define and build a lens. A lens can be
thought of as a dynamic, on-demand data mart purpose-built for a specific analysis project.
Topics:
• FAQs - Lens Basics
• Lens Best Practices
• About the Lens Builder Panel
• Understand the Lens Build Process
• Create a Lens
• Estimate Lens Size
• Manage Lenses
• Manage Segments—FAQs
FAQs - Lens Basics
A lens is a type of data storage that is specific to Platfora. This topic answers some frequently asked
questions about lenses.
What is a lens?
A lens is a type of data storage that is specific to Platfora. Platfora uses Hadoop as its data source and
processing engine to build and store its lenses. Once a lens is built, this prepared data is copied to
Platfora, where it is then available for analysis. A lens can be thought of as a dynamic, on-demand data
mart purpose-built for a specific analysis project.
Who can create a lens?
Lenses can be created by any Platfora user with the Analyst system role (or above), provided that user
also has the appropriate security permissions to the underlying source data and the dataset.
How do I create a lens?
You create a lens by first choosing a Dataset in the Platfora data catalog, then choose Create Lens
from the dataset detail page or the dataset action menu.
If the Create Lens option is grayed-out, you don't have the appropriate security
permissions on the dataset. Ask your system administrator or the dataset owner
to grant you access.
How big can a lens be?
It depends on how much disk space and memory you have available in Platfora, and if your system
administrator has set a limit on how much data you can request at once. As a general rule, a lens should
not be bigger than the amount of memory you have available in your entire Platfora cluster.
For most Platfora users, your system administrator sets a lens quota which limits how big of a lens
you can build. The default lens quota depends on your system role: 1 GB for Analysts, 1 TB for Data
Administrators, and Unlimited for System Administrators. You can see your lens quota when you go to
build a lens.
Likely, your organization uses Hadoop because you are collecting and storing a lot of data. It probably
doesn't make sense to request all of that data at once. You can limit the amount of data you request by
using lens filters and only choosing the fields you need for your analysis.
How long does it take to build a lens?
It really depends - a lens build can take a few minutes or several hours.
There are a lot of factors that determine how long a lens build will take, and a lot of those factors depend
on your Hadoop cluster, not necessarily on Platfora. Since the lens build jobs happen in Hadoop, the
biggest factor is the resources that are available in your Hadoop cluster to run Platfora's MapReduce
jobs. If the Hadoop cluster is busy with other workloads, or if there is not enough memory on the Hadoop
task nodes, then Platfora's lens builds will take longer.
The time it takes to build a lens also depends on the size of the input data, the number and cardinality of
the dimension fields you choose, and the complexity of the processing logic you have defined in your
dataset definitions.
What are the different kinds of lenses?
Platfora has two types of lenses you can build: an Aggregate Lens or an Event Series Lens. The type of
lens you build determines what kinds of visualizations you can create and what kinds of analyses you
can perform when using the lens in a vizboard.
An aggregate lens can be built from any dataset. It contains aggregated measure data grouped by the
various dimension fields you select from the dataset. Choose this lens type if you want to do ad hoc data
analysis.
An event series lens can only be built from dimension datasets that have an Event reference defined
in them. It contains non-aggregated events (fact dataset records), partitioned by the primary key of the
selected dimension dataset, and sorted by the time the event occurred. Choose this lens type if you want
to do time series analysis, such as funnel paths.
How does Platfora handle rows that can't be loaded?
When Platfora processes the data during a lens build, it logs any problem rows that it could not process
according to the logic defined in the dataset. These 'dirty rows' are shown as lens build warnings.
Platfora administrators can investigate these warnings to determine the extent of the problem.
Lens Best Practices
When you define a lens, you want the selection of fields to be broad enough to support all of the
business questions you want to answer. A lens can be used by many visualizations and many users at the
same time. On the other hand, you want to constrain the overall size of the lens so that it will fit into the
available memory and so queries against the lens are fast.
Check for existing lenses before you build a new one.
Once you find a dataset that contains the data you want, first check for any existing lenses that have been
built from that dataset. There may already be a lens that you can use for your analysis. Also, if there is
an existing lens that contains some but not all of the fields you want, you can always modify the lens
definition to add additional fields. This is more efficient than building a whole new lens from scratch.
Define lens filters to reduce the amount of data you request.
You can add a lens filter on any dimension field of a dataset. Lens filters constrain the number of records
pulled into the lens from the data source. For example, if you store 10 years' worth of data in Hadoop,
but only need to access the past year's worth of data, you can set a date-based filter to limit the lens to
get only the data you need.
Keep in mind that you can also create filters within visualizations. Lens filters
should be used to limit the number of records (and overall lens size). You don't
want a lens so narrow in scope that it limits its analysis opportunities.
Don't include high cardinality fields that are not essential to your analysis.
The size of an aggregate lens depends mostly on the cardinality (number of unique values) of the
dimension fields selected. The more granular the dimension data, the bigger the aggregate lens will be.
For example, aggregating time-based data to the second granularity will make the lens significantly
bigger than aggregating it to the hour granularity: one year of data has only about 8,760 distinct hour
values, but over 31 million distinct second values.
For fields that you intend to use as measures only (you only need the aggregated values), make sure to
deselect Original Value. When Original Value is selected, the field is also included in your lens as
a dimension field.
Don't include DISTINCT measures unless they are essential to your analysis.
Measures that calculate DISTINCT counts must also include the original field values that they are
counting. If you add a DISTINCT measure on a high-cardinality field, this can make your aggregate lens
larger than expected. Only include DISTINCT measures in your lens when you are sure you need them
for your analysis.
For any dimension field you have in your lens, you can also calculate a DISTINCT
count in the vizboard using a vizboard computed field. DISTINCT is the one
measure aggregation that doesn't have to be calculated at lens build time.
About the Lens Builder Panel
When you create a new lens or edit an existing one, it opens the lens builder panel. The lens builder is
where you choose and confirm the dataset fields that you want in your lens. The lens builder panel looks
slightly different depending on the type of lens you are building (aggregate or event series lens). You
can click any field to see its definition and description.
1. Lens Name
2. Focus Dataset Name
3. Focus Dataset Size
4. Lens Type (Aggregate or Event Series)
5. Lens Size and Quota Information
6. Field Selection Controls
7. Field Information and Descriptions
8. Quick Measure Controls
9. Lens Filter Controls
10. Lens Management Controls
11. Segment Controls
12. Lens Actions (save lens definition and/or initiate a build job to get data from Hadoop)
Understand the Lens Build Process
The act of building a lens in Platfora generates a series of MapReduce jobs in Hadoop to select,
process, aggregate, and prepare the data for use by Platfora's visual query engine, the vizboard. This
section explains how source data is selected for processing, what happens to the data during lens build
processing, and what resulting data to expect in the lens. By understanding the lens build process,
administrators can make decisions to improve lens build performance and ensure the resulting data
meets the expectations of business users.
Understand Lens MapReduce Jobs
When you build or update a lens in Platfora, it generates a series of MapReduce jobs in the Hadoop
MapReduce cluster. The number of jobs and time to complete each job depends on the number of
datasets involved, the number and size of the fields selected, and if that lens definition has been built
before (incremental vs non-incremental lens builds).
This topic explains the MapReduce jobs or steps that you might see in a lens build, and what is
happening in each step. These steps are listed in the order that they occur in the overall lens build
process.
These MapReduce jobs appear on the Platfora System page as distinct steps of a lens build. Depending
on the lens build, you might see all of these steps or just a few of them. Depending on the number of
datasets involved in the lens build, you may see some steps more than once:
1. Inspect Source Data - This step scans the data source to determine the number and size of the files
to be processed. If a lens was built previously using the same dataset and field selections, then the
inspection checks for any new or changed files since the last build. If you have defined lens filters on
an input partitioning field, these filters are applied at this time, before any other processing occurs.

2. Waiting for lens build slot to become available - To prevent Platfora from overwhelming the Hadoop
cluster with too many concurrent lens build jobs, Platfora limits the number of concurrent jobs it runs.
Any lens build submitted after that limit is reached waits for existing lens builds to finish before
starting. The limit is 3 by default. This limit is controlled by the
platfora.builder.lens.build.concurrency property (an example setting is shown after this table).

3. Event series processing for computed_field_name - This step only occurs in lenses that include
event series processing computed fields (computed fields defined using a PARTITION statement). This
job does the value partitioning and multi-row pattern match processing of event series computed
fields.

4. Build Data Dictionaries - This step scans the source files and determines the distinct values for each
dimension (grouping column) in the lens. For example, a gender field might have two distinct values
(Male, Female) and a state field might have 50 distinct values (CA, NY, WA, TX, etc.). For
high-cardinality fields, you may see an additional Build Partitioned Data Dictionaries step preceding
this one. It splits up the distinct values so that the dictionary can be distributed across multiple
nodes. This job is run for each dataset included in the lens.

5. Encoding Attribute - This step encodes the dimension values (or attributes) using the data
dictionaries. When data dictionaries are small, this step does not require its own job (it is performed
as part of dictionary building). When a data dictionary is large, encoding attributes is a separate
MapReduce job.

6. Encoding Reference - This step joins datasets that are connected by references. When data
dictionaries are small, this step does not require its own job (it is performed as part of dictionary
building). When a data dictionary is large, joining datasets is a separate MapReduce job.

7. Aggregate Datasets - For aggregate lenses, this step calculates the aggregated measure values for
each dimension value and each unique combination of dimension values. For example, if the lens
included a measure for SUM(sales), and the dimension fields gender and state, then the sum of sales
would be calculated for each gender, each state, and each state/gender combination. For event series
lenses, this step partitions the individual event records by the focus dataset key and orders the
event records in each partition by time. This job is run for each dataset included in the lens.

8. Load Datasets - This step creates a columnar representation of the data and writes out the lens data
structures to disk in the Hadoop file system. This job is run for each dataset included in the lens.

9. Index Datasets - For lenses that include fields from multiple datasets, this step creates indexes on
key fields to allow joins between the datasets when they are queried. This job is run for each
referenced dataset included in the lens.

10. Transfer to Final Location - This step is specific to Amazon Elastic MapReduce (EMR) lens builds. It
copies lens output files from the intermediate directory in the EMR job flow to the final destination
in S3.

11. Preload Built Data Files to Local Disk - This step copies the lens data structures from the Hadoop
file system to the data directory locations on the Platfora servers. Pre-fetching the lens data from
Hadoop reduces the initial query time when a lens is first accessed in a vizboard.
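For example, an administrator who wants to allow five concurrent lens builds could raise the limit
described in step 2. The property name comes from this guide; the value of 5 and the properties-file
form shown here are illustrative assumptions - consult your Platfora configuration documentation for the
exact mechanism:

platfora.builder.lens.build.concurrency=5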
Understand Source Data Input to a Lens Build
This section describes how Platfora determines what source data files to process for a given lens build.
Source data input refers to the raw data files in Hadoop that are considered for a particular lens build. A
Platfora dataset points to a location in a data source (a directory in HDFS, an S3 bucket, a Hive table,
etc.). By choosing a focus dataset for your lens, you set the scope of source data to be considered for that
lens. Which source data files actually get processed by a lens build depends on other characteristics
of the lens, such as whether the lens has been built before and whether any lens filters exclude source
data files.
Understand Incremental vs Full Lens Builds
Whenever possible, Platfora tries to conserve processing resources on the Hadoop cluster by only
processing the source data files it needs for the lens. If a source data file has already been processed once
for a particular lens definition, Platfora can reuse that work from past lens builds and not process that
file again. However, if the underlying data has changed in some way, Platfora must re-process all of the
source data files in order to ensure data accuracy.
This section describes how a lens build determines if it needs to process all of the source data (full lens
build) or just the new source data that was added since the last time the lens was built (incremental lens
build). Incremental lens builds are more desirable because they are faster and use fewer resources.
When you first create a lens, Platfora builds the lens data using a full build. During the build, Platfora
stores a record of the build inputs. Then, as it manages that lens, Platfora can determine if any build
inputs changed. Platfora rebuilds a lens whenever a user manually fires a build by pressing a lens' Build
button or a scheduled build is fired by the system.
Whenever a build is fired, Platfora first compares the last build inputs to the new build inputs. If nothing
changed between the two builds, Platfora reuses the results of the last build. If there are changes and
those changes fall within certain conditions, Platfora does an incremental lens build. If it cannot do an
incremental build, Platfora does a full rebuild.
Platfora defaults to incremental builds because they are faster than full rebuilds. You can optimize lens
build performance in your environment by understanding the conditions that determine if a lens build is
full or incremental.
An incremental build appends new data to an existing lens without changing any previously built data.
So Platfora can only incrementally build changes that add new build inputs but do not modify or delete
old ones. For this reason, Platfora can only incrementally build lenses that rely solely on HDFS or HIVE
data sources.
HDFS directories and HIVE partitions permit incremental builds because they support wildcard
configurations. Wildcard configurations typically acquire new data by pattern matching incoming data;
they do not modify or delete existing data. An incremental lens build retrieves the newly added
data, processes it, and appends it to the old data in Platfora. The old data is not changed.
Even when a data source is HIVE or HDFS, there is no guarantee that a lens will always build
incrementally. Under certain conditions, Platfora always builds the full lens. When any of the following
happens between the last build and a new build, Platfora does a full lens build:
• The lens has a LAST X DAYS filter and the last build occurred outside the filter's parameters.
• The lens is modified. For example, a user changes the description or adds a field.
• The dataset is modified. For example, a user adds a field to the dataset.
• A referenced dimension dataset changes in any way.
• A data source is modified. For example, a file is modified or a file is deleted from an HDFS
directory.
Additionally, Platfora builds the full lens under the following conditions:
• The lens includes event series processing fields. Due to the nature of pattern matching logic, lenses
with ESP fields require full lens builds that scan all of a dataset's input data.
• The HDFS Delegated Authorization feature is enabled.
A full lens build can be resource intensive and can take a long time, which is why Platfora always
tries to do an incremental build when it can. You can increase the chances that Platfora does an
incremental build by relaxing the build behavior for dimension data.
Understand Input Partitioning Fields
An input partitioning field is a field in a dataset that contains information about how to locate the source
files in the remote file system. Defining a filter on these special fields eliminates files from lens build
processing as the very first step of the lens build process, as compared to other lens filters which are
evaluated later in the process. Defining a lens filter on an input partitioning field is a way to reduce the
amount of source data that is scanned by a lens build.
For Hive data sources, partitioning fields are defined on the data source by the Hive administrator.
Hive partitioning fields appear in the PARTITIONED BY clause of the Hive table definition. Hive
administrators use partitioning fields to organize the Hive data into separate files in the Hadoop file
system. The goal of Hive table partitioning is to improve query performance by keeping records together
for faster access.
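For example, a Hive administrator might partition a web log table by date, so that each day's records
land in their own directory. This is a sketch with hypothetical table and column names:

CREATE TABLE web_logs (
  user_id STRING,
  url     STRING,
  status  INT
)
PARTITIONED BY (log_date STRING);

A dataset built on this table would expose log_date as a partitioning field, and a lens filter on
log_date would eliminate whole partitions from the build.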
For HDFS or S3 data sources, Platfora administrators can define a partitioning field when they create
a dataset. A partitioning field for HDFS or S3 is any computed field that uses a FILE_NAME()
or FILE_PATH() function. File or directory path partitioning is useful when the source data that
comprises a dataset comes from multiple files, and there is useful information in the directory or file
names themselves. For example, useful path information includes dates or server names.
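For example, if daily source files are named by date (such as 2015-06-28.log), you could define a
computed field whose expression is simply FILE_NAME() and then add a lens filter on that field. The
field name and file naming scheme here are hypothetical:

source_file (computed field expression): FILE_NAME()
lens filter on source_file: LIKE("2015-06*")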
Not all datasets will have partitioning fields. If there are partitioning fields available, the lens page
displays a special icon next to them.
Platfora applies filters on input partitioning fields as the first step of a lens build. Then, Platfora
computes any event series processing computed fields. Any other lens field filters are then applied later
in the build process.
Event series processing computed fields are those that are defined using a PARTITION statement. The
interaction of input partitioning fields and event series processing is important to understand if you are
using event series processing computed fields.
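The PARTITION statement itself is Platfora expression syntax, but the partitioning idea is roughly
analogous to a SQL window clause that groups event rows by an entity and orders them by time. The
following is only an analogy with hypothetical table and column names, not Platfora syntax:

SELECT user_id,
       event_time,
       page,
       ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY event_time) AS step_number
FROM web_events;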
Understand How Datasets are Joined
This topic explains how datasets are joined together during the lens build process, and what to expect
in the resulting lens data. Joins only occur for datasets that have references to other datasets, and fields
from the referenced datasets are also included in the lens definition.
About Focus Datasets and Referenced Datasets
When you build a lens, you must choose one dataset as the starting point. This is called the focus dataset
for the lens. The focus dataset may have references to other datasets allowing you to choose dimension
fields from both the focus dataset and the referenced datasets as well. If a lens includes fields from
multiple datasets, then all of the selected fields are combined into one consolidated row in the lens
output. This consolidation of fields is done by joining together the rows of the various datasets on the
fields that they share in common.
The Default Join Behavior: (Left) Outer Joins
Consider a lens that includes fields from both the focus dataset and a referenced dataset. When Platfora
builds this lens, it does an OUTER JOIN between the focus dataset and any referenced datasets. The
OUTER JOIN operation compares rows in the focus dataset to related rows in the referenced datasets.
If a row in the focus dataset cannot join to a row in the referenced dataset, then Platfora still includes
these unjoined focus rows in the lens results. However, the values for the referenced fields that did not
join are treated as NULL values. These NULL values are then replaced with default values and joined to
the consolidated focus row. Platfora notifies you with an 'unjoined foreign keys' warning whenever there
is a focus row that did not join.
How Lens Filters can Change Join Behavior to (Right) Inner Joins
If a lens filter is on a field from the focus dataset, then the default join behavior is still an OUTER JOIN.
The focus dataset rows are used as the basis of the join.
However, if the lens filter is on a field from a referenced dataset, the lens build process uses an INNER
JOIN instead. The referenced dataset rows are used as the basis for comparison. This means that focus
rows can potentially be excluded from the lens entirely.
Before doing the join, the lens build first filters the rows of the referenced dataset and discards any
rows that don't match the filter criteria. Then, the build joins the filtered referenced dataset to the focus
dataset.
When it uses an INNER JOIN, Platfora entirely excludes all unjoined rows from the lens results.
Because the lens build applies the filter first and excludes unjoined rows, an INNER JOIN can return
fewer focus rows than you might expect.
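In SQL terms, the two join behaviors are roughly analogous to the following (hypothetical sales fact
table and customers dimension table):

-- Default behavior: unjoined focus rows are kept, and unmatched
-- referenced fields fall back to default values.
SELECT s.*, c.region
FROM sales s
LEFT OUTER JOIN customers c ON s.customer_id = c.customer_id;

-- With a lens filter on a referenced dataset field: the filter is
-- applied first, and unjoined focus rows are dropped entirely.
SELECT s.*, c.region
FROM sales s
INNER JOIN customers c ON s.customer_id = c.customer_id
WHERE c.region = 'WEST';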
Create a Lens
A lens is always defined from the focal point of a single dataset in the data catalog. Once you have
located a dataset that has the data you need, first check and see if there are any existing lenses that you
can use. If not, click Create Lens on the dataset details page to define and build a new lens from that
dataset.
To create a new lens, you must have the Analyst system role or above. You must have data access
permissions on the source data and at least Define Lens from Dataset object permissions on the
focus dataset, as well as any datasets that are included in the lens by reference.
1. Go to the Platfora Data Catalog.
2. Find the dataset you want, and open it.
3. Go to the Lenses section and click Add.
4. In the lens builder panel, define your lens and choose the data fields you want to analyze.
a) Name your lens. Choose the name carefully - lenses cannot be renamed.
b) Choose the lens type. An aggregate lens is the default lens type, but you can choose to build an
event series lens if your datasets are modeled in a certain way.
c) Choose lens fields. The types of fields you choose depend on the type of lens you are building.
d) (Optional) Define lens filters. Filters limit the scope of data being requested.
e) (Optional) Allow ad-hoc segments. Choose whether or not to allow vizboard users to create
segments based on members in a particular referenced dataset.
5. Save and build the lens.
Name a Lens
The first step of defining a lens is to give it a meaningful name. The lens name should help users
understand what kind of data they can find in the lens, so they can decide if it will meet their analysis
needs. Choose the lens name carefully - you cannot rename a lens after it has been saved or built for the
first time.
You won't be able to save or build a lens until you give it a name. The lens name must be unique - the
name can't be the same as any existing lens, dataset, or segment in Platfora. You can't change the lens
name after you save the lens.
It is also a good idea to give the lens a description to help users understand what data is in the lens. You
can always edit the description later.
Choose the Lens Type
There are two types of lenses you can create in Platfora: an Aggregate Lens or an Event Series Lens.
The type of lens you can choose depends on the underlying characteristics of the dataset you pick as
the focus of your lens. The type of lens you build also determines what kinds of visualizations you can
create and what kinds of analyses you can perform when using the lens in a vizboard.
1. Aggregate lenses can be built from any dataset. Event series lenses can only be built from datasets
that meet certain data modeling requirements. If your dataset does not meet the requirements for an
event series lens, you will not see it as a choice.
2. When building an aggregate lens, you can choose any measure or dimension field from the current
dataset. You can also choose additional dimension fields from any datasets that are referenced from
the current dataset.
3. To build an event series lens, the dataset must have one or more event references created in it. Events
are a special kind of reverse-reference that includes timestamp information. Events do not apply to
aggregate lenses, only to event series lenses.
When building an event series lens, you can choose dimension fields from the focus dataset or any
related event dataset. Measure fields are not always applicable to event series analysis, since the data
in an event series lens is not aggregated.
About Aggregate Lenses
An aggregate lens can be built from any dataset. There are no special data modeling requirements to
build an aggregate lens. Aggregate lenses contain aggregated measure data grouped by the various
dimension fields you select from the dataset. Choose this lens type when you want to do ad hoc data
analysis.
An aggregate lens contains a selection of measure and dimension fields chosen from the focal point
of a single fact dataset. A completed or built lens can be thought of as a table that contains aggregated
measure data values grouped by the selected dimension values.
For example, suppose you had the following simple dataset containing 6 rows:

id  date        customer    product  quantity  unit price  total amount
1   Jan 1 2013  smith       tea      2         1.00        2.00
2   Jan 1 2013  hutchinson  coffee   1         1.00        1.00
3   Jan 2 2013  smith       coffee   1         1.00        1.00
4   Jan 2 2013  smith       coffee   3         1.00        3.00
5   Jan 2 2013  smith       tea      1         1.00        1.00
6   Jan 3 2013  hutchinson  tea      1         1.00        1.00
In Platfora, a measure is always aggregated data. So in the example above, the field total amount would
only be considered a measure if an aggregate function, such as SUM, were applied to that field.
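For example, a pre-defined measure on this dataset might be declared with an aggregate expression such
as the following (a sketch; the exact expression syntax for your dataset may differ):

SUM(quantity)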
A dimension is always used to group the aggregated measure data. Suppose we chose the product field
as a dimension in our lens. There would be two groups in this case: coffee and tea.
If our lens contained only that one measure (sum of total amount) and that one dimension (product), then
the data in the lens would look something like this:

dimension = product    measure = total amount (Sum)
tea                    4.00
coffee                 5.00
Suppose we added one more measure (sum of quantity) and one more dimension (customer) to our lens.
The measure values are then calculated for each combination of dimension values. In this case, the data
in the lens would look something like this:

dimensions = product, customer, product+customer

value                 total amount (Sum)   quantity (Sum)
tea                   4.00                 4
coffee                5.00                 5
smith                 7.00                 7
hutchinson            2.00                 2
smith, tea            3.00                 3
smith, coffee         4.00                 4
hutchinson, tea       1.00                 1
hutchinson, coffee    1.00                 1
About Event Series Lenses
An event series lens can only be built from dimension datasets that have at least one event reference
defined in them. It contains non-aggregated fact records, partitioned by the key of the focus dataset and
sorted by the time an event occurred. Choose this lens type if you want to do time series analysis, such
as funnel paths.
To build an event series lens, the dataset you choose as the focus of your lens must meet the following
data model requirements:
• The dataset must have a primary key.
• The dataset must have at least one event reference modeled in it. Events are a special kind of
reverse-reference that associates a dimension dataset with a fact dataset and designates a timestamp
field for ordering of the fact records.
An event series lens contains a selection of dimension fields chosen from the focal point of a single
dimension dataset, and from any event datasets associated to that dimension dataset.
Measure fields are not always applicable to event series lenses, since the data is not aggregated at lens
build time. If you do decide to add measure fields, you can only choose measures from the event datasets
(not from the focus dataset). These measures will be added to the lens, but will not always be visible
in the vizboard depending on the type of analysis you choose. For example, measures are hidden in the
vizboard if you choose to do funnel analysis.
A completed or built lens can be thought of as a table that contains individual event records partitioned
by the primary key of the dimension dataset, and ordered by a timestamp field.
An event series lens can contain records from multiple event datasets, as long as the event references
have been modeled in the dimension dataset.
For example, suppose you had a dimension dataset that contained these 2 user records. This dataset has a
primary key (a user_id field that is unique for each user record in the dataset):

user_id  name
A        smith
B        hutchinson
This user dataset contains a purchase event reference that points to a dataset containing these 6 purchase
event records:

transaction  date        user_id  product  quantity  unit price  total amount
1            Jan 1 2014  A        tea      2         1.00        2.00
2            Jan 1 2014  B        coffee   1         1.00        1.00
3            Jan 2 2014  A        coffee   1         1.00        1.00
4            Jan 3 2014  A        coffee   3         1.00        3.00
5            Jan 4 2014  A        tea      1         1.00        1.00
6            Jan 3 2014  B        tea      1         1.00        1.00
In an event series lens, individual event records are partitioned by the primary key of the dimension
dataset and sorted by time. If our event series lens contained one measure (sum of total amount) and one
dimension (product), then the data in the lens would look something like this:

user_id  date        product  total amount
A        Jan 1 2014  tea      2.00
A        Jan 2 2014  coffee   1.00
A        Jan 3 2014  coffee   3.00
A        Jan 4 2014  tea      1.00
B        Jan 1 2014  coffee   1.00
B        Jan 3 2014  tea      1.00
Notice that there are a couple of differences between event series lens data and aggregate lens data:
• The key field (user_id) and timestamp field (date) of the event are automatically included in the lens.
• Measure data is not pre-aggregated. Instead, individual event records are partitioned by the key field
and ordered by time.
Having the lens data structured in this way allows analysts to create special event series viz types in
the vizboard. Event series lenses allow you to analyze sequences of events, including finding patterns
between multiple types of events (purchases and returns, for example).
Choose Lens Fields
Choosing fields for a lens depends on the lens type you pick (Aggregate Lens or Event Series
Lens) and the type of analysis you plan to do. Aggregate lenses need both measure fields (aggregated
variable data) and dimension fields (categorical data). Event series lenses only need dimension fields -
measures are optional and not always applicable to event series analysis.
About Lens Field Types
Fields are categorized into two basic roles: measures and dimensions. Measure fields are the quantitative
data. Dimension fields are the categorical data. A field also has an associated data type, which describes
the types of values the field contains (STRING, DATETIME, INTEGER, LONG, FIXED, or DOUBLE).
Fields are grouped by the dataset they originate from. As you choose fields for your lens, you will notice
that each field has an icon to denote what kind of field it is and where it originates from.

Measure (Numeric) - Measure fields are quantitative data that have an aggregation applied, such as SUM
or AVG. Measures always produce aggregated values in Platfora. Measure values are always a numeric data
type (INTEGER, LONG, FIXED, or DOUBLE) and are always the result of an aggregation. Every aggregate
lens must have at least one measure. The default measure is Total Records (a count of the records in
the dataset). Measures are not applicable to event series lenses and funnel analysis visualizations.

Datetime Measure - Datetime measure fields are a special variety of measure fields. They are datetime
data that have either the MIN or MAX aggregate functions applied to them. Datetime measure values are
always the DATETIME data type. Datetime measures are not applicable to event series lenses and funnel
analysis visualizations.

Categorical Dimension - Dimension fields are used to filter dataset records, group measure data (in an
aggregate lens), or define set conditions (in an event series lens). Categorical dimension fields
contain STRING type data.

Numeric Dimension - Dimension fields are used to filter dataset records, group measure data (in an
aggregate lens), or define set conditions (in an event series lens). Numeric dimension fields contain
INTEGER, LONG, FIXED, or DOUBLE type data. You can apply an aggregate function to a numeric dimension
field to turn it into a measure.

Date Dimension - Dimension fields are used to filter dataset records and group measure data. Date
dimension fields contain DATETIME type data. Every datetime field also auto-creates a reference to
Platfora's built-in Date and Time datasets. These date and time references allow you to analyze the
time-based data at different granularities (week, day, hour, and so on).

Location Field - Location fields are a special kind of dimension field used only in geo map
visualizations. They consist of a set of geo coordinates (latitude, longitude) and optionally a label
name.

Current Dataset Fields - Fields that are within the currently selected dataset are grouped together at
the top of the lens field list.

References - A reference groups together fields that come from another dataset. A reference joins two
datasets together on a common key. You can select dimension fields from any dataset; however, you can
only choose measure fields from the current dataset (if building an aggregate lens).

Geo References - A geo reference is similar to a regular reference. The difference is that geo
references are used specifically for the purpose of linking to datasets containing location fields.

Events - An event is like a reference, except that the direction of the join is reversed. An event
groups together fields that come from another dataset containing fact records that are associated with
a point in time. Event fields are only applicable to event series lenses.

Segments - Platfora groups together all segment fields that are based on members of a referenced
dimension dataset. You can select segment fields originally defined in any lens as long as the segment
is based on members in the referenced dimension dataset.

Segment Field - A segment is a special type of dimension field that groups together members of a
population that meet some defined common criteria. A segment is based on members of a dimension dataset
(such as customers) that have some behavior in common (such as purchasing a particular product). Any
segment defined on a particular dimension dataset is available as a segment field in any lens that
references that dataset. Segments are created in a viz based on the lens used in the viz. After creating
a segment, Platfora creates a special lens build to populate the segment members. After segments are
defined, you can optionally choose to include a segment field in any lens that references that dimension
dataset. For more information, see Allow Ad-Hoc Segments.
Choose Fields for Aggregate Lenses
Every aggregate lens must have at least one measure field and one dimension field to be a valid lens.
Choose only the fields you need to do your analysis. You can always come back and modify the lens
later if you decide you need other fields. You can choose fields from the currently selected dataset, as
well as from any datasets it references.
1. Click Add+ or Add- to add or remove all of the fields grouped under a dataset or reference. Note
that this does not apply to nested references and events. You must select fields from each referenced
dataset independently.
2. Click the plus icon to add a field to your lens. The plus sign means the field is not in the lens.
3. Click the minus icon to remove the field from your lens. The minus sign means the field is in the
lens.
4. Open the quick field selector to confirm the measure aggregations you have chosen on a field.
Original Value (the default), means the field will be included in the lens as a dimension.
5. Expand references to find additional dimension fields.
6. Use the Fields added to lens tab to confirm the field selections you have made.
Choose Measure Fields (Aggregate Lens)
Every aggregate lens needs at least one measure. In Platfora, measure fields are always the result of
an aggregate calculation. If you have metric fields in your dataset that you want to use as the basis for
quantitative analysis, you must decide how to aggregate those metrics before you build a lens.
1. Some measures are pre-defined in the dataset. Pre-defined measures are always at the top of the
dataset field list.
2. Other non-measure fields can be converted into a measure by choosing additional aggregation types
in the lens definition.
Define Quick Measures
A quick measure is an aggregation applied to a dimension field to turn it into a measure. You can add
quick measures to your lens based on any dimension field in the current focus dataset (for an aggregate
lens) or event dataset (for an event series lens).
1. First check if the dataset has pre-defined measures that meet your needs.
Pre-defined measure fields are always listed at the top of the dataset. These measures are aggregated
computed fields that have already been defined in the dataset. Clicking on a predefined measure will
show the aggregate expression used to define the measure.
2. Find the field that you want to use as a measure and add it to your lens definition.
3. Click the gear icon to open the quick field selector.
4. Choose the measure aggregations you want to apply to that field:
• Sum (total) is available for numeric type dimension fields only.
• Avg (average) is available for numeric type dimension fields only.
• Distinct (a count of the distinct values in a column) is available for all field types.
• Max (highest value) is available for numeric type dimension fields only.
• Min (lowest value) is available for numeric type dimension fields only.
Each selection will create a new measure in the lens when it is built. Quick measure fields appear in
the built lens with a name such as field_name(Avg).
5. Original Value also keeps the field in the lens as a dimension (grouping column) as well as
aggregating its values for use as a measure. For fields that have lots of unique values, it is probably
best to deselect this option if building an aggregate lens.
Choose the Default Measure
The default lens measure is automatically added to new visualizations created from this lens. This allows
a default chart to be shown in the vizboard immediately after the data analyst chooses a lens for their viz.
If a lens does not have a default measure, the record count of the lens is used as the default measure.
1. Select the measure that you want to designate as the default lens measure. Only pre-defined measures
can be used. You cannot designate quick measures as the default lens measure.
2. Make sure the measure field is added to the lens definition.
3. Click Default to This Measure.
Choose Dimension Fields (Aggregate Lens)
Every aggregate lens needs at least one dimension field. Dimension fields are used to group and filter
measure data in an aggregate lens. You can add dimension fields from the currently selected focus
dataset or any of its referenced datasets.
1. Dimension fields are denoted by a cube icon.
2. Click Add+ or Add- to add or remove all of the dimension fields grouped under a particular dataset
or reference. Add+ and Add- do not apply to nested references. You must select fields from each
referenced dataset independently.
3. Expand references to see the dimension fields available in referenced datasets.
4. Click the plus icon to add a dimension field to your lens.
5. Click a dimension field to see the details about it. The following information is available about a
field, depending on its field type and whether or not the dataset has been profiled. Data heuristic
information is only applicable to aggregate lenses.
Field Type - Either Base, Computed, or Measure. Base field values come directly from the source data.
Computed fields and measure values have been transformed or processed in some way.

Field Name - The name of the field as defined in the dataset.

Expression - If it is a computed field, the expression used to derive the field values.

Description - The description of the field that has been added to the Platfora dataset definition.

Example Data - Shows a sampling of the field values from 20 dataset rows. This is not available for
certain types of computed fields, such as measures, event series computed fields, or computed fields
that reference other datasets.

Data Type - The data type: STRING, DATETIME, INTEGER, LONG, FIXED, or DOUBLE.

Default Value - The default value that will be substituted for NULL dimension values when the lens is
built. If n/a, then Platfora will use the defaults of January 1, 1970 for datetimes, NULL for strings,
and 0 for numeric data types.

Estimated Distinct Values - If the dataset has been profiled, this is an estimate of how many unique
values the field has. This information is only applicable to aggregate lenses.

Data Distribution - If the dataset has been profiled, this shows the top 20 values for that field and an
estimation of how the top values are distributed across all the rows of the dataset. This information is
only applicable to aggregate lenses.

Path - If it is a field from a referenced dataset, shows the dataset name, reference name, and field
name.
Choose Segment Fields (Aggregate Lens)
Segments are members of a referenced dataset that have some behavior in common. After a segment is
created in a visualization, the segment field is available to include in any lens that references that
dataset. You might want to include a segment field in a lens if it is commonly used in visualizations or
if you want to increase viz query performance.
1. Expand references to see the segments available in referenced datasets.
2. Expand Segments to see the segments available for a particular referenced dataset.
3. Segment fields are denoted by a cut-out cube icon.
4. Click the plus icon to add a segment field to your lens.
5. Click Add+ or Add- to add or remove all of the segment fields grouped under a particular
referenced dataset. Add+ and Add- do not apply to nested references. You must select fields from
each referenced dataset independently.
6. Click a segment field to see the details about it. The following information is available about a
segment field:

Field Type - Always Segment.

Field Name - The name of the segment field as defined in the segment.

Built On - The date of the special segment lens build that populated the current members of the segment.

Segment Of - The referenced dimension dataset of which the segment values are members. This dataset
matches the referenced dataset under which the segment field is located.

Occurring in Dataset - The fact dataset that includes the behaviors the segment members have in common.
This dataset may be the focus dataset in the current lens, or it may be from a different dataset that
references this dimension dataset.

Origin Lens - The lens used in the vizboard in which the segment was originally created.

Segment Conditions - The conditions defined in the segment that determine the segment members.

"IN" Value Label - The value labels for records that are members of the segment.

"NOT IN" Value Label - The value labels for records that are not members of the segment.

Selected Members - The number of segment members out of the total number of records in the referenced
dataset.
Choose Fields for Event Series Lenses
For an event series lens, field selections are mostly dimension and timestamp fields. You can choose
dimension fields from the currently selected dataset, and any fields from the event datasets it references.
Measure fields (aggregated variables) are not applicable to event series analysis, since data is not
aggregated in an event series lens.
1. Click Add+ or Add- to add or remove all of the fields grouped under a dataset, reference, or event
reference. Note that this does not apply to nested references and events. You must select fields from
each referenced dataset independently.
2. Click the plus icon to add a field to your lens. The plus sign means the field is not in the lens.
3. Click the minus icon to remove the field from your lens. The minus sign means the field is in the
lens.
4. Open the quick field selector to confirm the selections are appropriate for an event series lens.
In an event series lens, aggregated measures (such as SUM or AVG) are not applicable. For example, if
you want to do funnel analysis on some metric of the dataset, make sure that Original Value (the
default) is selected. This means the field will be included in the lens as a dimension.
5. Expand event references (or regular references) to find additional dimension fields.
6. Use the Fields added to lens tab to confirm the field selections you have made.
Timestamp Fields and Event Series Lenses
Timestamp fields have a special purpose in an event series lens. They are used to order all fact records
included in the lens, including fact records coming from multiple datasets. Event series lenses have a
global Timestamp field that applies to all event records included in the lens. There are also global
Timestamp Date and Timestamp Time references, which can be used to filter records on
different granularities of date and time.
Dataset records are not aggregated in an event series lens. Records are partitioned (or grouped) by the
key of the focus dataset and ordered by a datetime field in the event dataset(s).
For example, suppose you built an event series lens based on a customer dataset that had event
references to a purchases dataset and a returns dataset. The lens would partition the event records by
customer and order both types of events (purchases and returns) by the timestamp of the event record.
Event series lenses have a global Timestamp field, and global Timestamp Date and Timestamp
Time references that apply to all event records included in the lens. This is especially relevant if the
lens includes links to multiple event datasets.
Because event series lenses order records by a designated event time (represented by the global
Timestamp), other references to date and time may or may not be relevant to your event series
analysis.
For example, suppose you were building an event series lens based on customers that contained both
purchase and return events. The global Timestamp represents the purchase timestamp or the return
timestamp of the corresponding event record. As an attribute of a customer, suppose you also had the
date the customer first registered on your web site. This customer registration date may be useful if you
want to group or filter customers by how long they have been customers. For example, you might ask how
the purchase behavior of new customers differs from that of customers who registered over a year ago.
Measure Fields and Event Series Lenses
In Platfora, measure fields are always the result of an aggregate calculation. Since event series lenses
do not contain aggregated data, measure fields are not always applicable to event series analysis.
Measure fields may be included in an event series lens, however they may not show up in the vizboard
(depending on the type of analysis you choose).
1. For event series lenses, you can choose measures only from a referenced event dataset, not from
the currently selected dataset.
2. Pre-defined measures are listed at the beginning of an event dataset.
If you add a measure to an event series lens, the aggregation will not be calculated at lens build time.
Measure fields that are added to the lens will not show up in the Funnel viz type in Vizboards.
Even though measure fields are not needed for event series analysis, the lens builder still requires
every lens to have at least one measure. Open an event dataset and choose any measure so you won't
get a No measure fields error when you go to build the lens.
3. For event series lenses, quick field aggregations are not applicable. If you want to use a field for
funnel analysis, make sure that Original Value is selected and any aggregations are unselected.
This adds the field to the lens as a dimension.
Define Lens Filters
One way to limit the size of a lens is to define a filter to constrain the number of rows pulled in from the
data source. You can only define filters on dimension fields - one filter condition per field. Filters are
evaluated independently during a lens build, so the order in which they are added to the lens does not
matter.
1. Select the dimension field to filter on.
2. Click the filter icon to the right of the field name. You can only define one filter per field.
3. Define the Filter expression.
Filter expressions are always Boolean expressions, meaning they must evaluate to either true or false.
Note that the selected field name serves as the first argument of the expression, followed by a
comparison operator or logical operator, and then the comparison value. The comparison value must
be of the same data type as the field you are filtering on.
Some examples of lens filter expressions:
BETWEEN 2012-06-01 AND 2012-07-31
LAST 7 DAYS
LIKE("Plat*")
IN("Larry","Curly","Moe")
NOT IN ("Saturday","Sunday")
< 50.00
>= 21
BETWEEN 2009 AND 2012
IS NOT NULL
4. Click Save.
5. Make sure the filter is added in the Filters panel.
Lens Filters on DATETIME Type Fields
This section contains special considerations for filtering on DATETIME type values. Filter
conditions on DATETIME type fields must be in the format YYYY-MM-DD without enclosing quotes or
any other punctuation. If the date value is in string format rather than a datetime format, the value must
be enclosed in quotes.
Date-based filters can do absolute or relative comparisons.
An absolute date comparison specifies a specific boundary such as:
>= 2013-01-01
A filter expression can also specify a range of dates using specific boundary dates with the BETWEEN
operator:
BETWEEN 2013-06-01 AND 2013-07-31
When specifying a range of dates, the earlier date should always come first.
Relative comparisons are always relative to the current date. Relative date filters use the following
format:
LAST <integer> DAYS
LAST 7 DAYS
LAST 0 DAYS
When using a relative date filter, the filter includes all data from the current day. The current day is
defined as the day in Coordinated Universal Time (UTC) when the lens build began. Therefore, the
expression LAST 0 DAYS includes data from the current day only, and the expression LAST 1 DAYS
includes data from the current day and the previous day.
You can use a relative date filter together with a lens build schedule to define a rolling time window.
For example, you could define a lens filter expression of LAST 7 DAYS and schedule the lens to build
nightly. This way, the lens always contains the previous week’s worth of data.
Lens Filters on Input Partition Fields
An input partitioning field is a field in a dataset that contains information about how to locate the source
files in the remote file system. Defining a filter on these special fields eliminates source data files at the
very first step of the lens build process. Not all datasets have input partitioning fields. If there are
partitioning fields available, the lens page displays a special icon next to them. Look for these special
fields when you build a lens. Adding a filter on these fields reduces the amount of source data to be
scanned and processed during a lens build.
See Understand Input Partitioning Fields for more information about these special fields and how they
affect lens build processing.
Troubleshoot Lens Filter Expressions
Invalid lens filter expressions don't always result in an error in the web application. Some invalid filter
expressions are only caught during a lens build and can cause the lens build to fail. This section has
some common lens filter expression mistakes that can cause an error or a lens build failure.
A comparison expression must compare values of the same data type. For example, if you create a filter
on a field that is an INTEGER data type, you can't specify a comparison argument that is a STRING.
Lens filters often compare a field value to a literal value. Specifying a literal value correctly depends on
its data type (string, numeric, or datetime). For example:
• Date literals must be in the format of yyyy-MM-dd without any enclosing quotation marks or other
punctuation.
• String literals are enclosed in double quotes ("). If the string itself contains a quote, it must be
escaped by doubling the double quote ("").
When specifying a range of dates, the earlier date should always come first. For example, when using
the BETWEEN operator:
use BETWEEN 2013-07-01 AND 2013-07-15 (correct)
not BETWEEN 2013-07-15 AND 2012-07-01 (incorrect)
For relative date filter expressions, the only valid date range keyword is DAY or DAYS. For example:
use LAST 7 DAYS (correct)
not LAST 1 WEEK (incorrect)
Below are some more examples of incorrect lens filter expressions, and their corrected versions.
Date.Year: use Date.Year = 2012, not Date.Year = "2012" (can't compare an Integer field to a String literal)
OrderDate: use BETWEEN 2013-07-01 AND 2013-07-15, not BETWEEN "2013-07-01" AND "2013-07-15" (can't compare a Datetime field to a String literal)
Date.Year: use = 2012, not 2012 (no comparison operator)
Title: use IN("Mrs","Ms","Miss"), not IN(Mrs,Ms,Miss) (string literals must be quoted)
Height: use = "60""", not = "60"" (quotes in a string literal must be escaped by doubling them)
Height: use LIKE("\d\'(\d)+"""), not LIKE("\d\'(\d)+"") (quotes in a regular expression must be escaped)
PurchaseDate: use LAST 7 DAYS, not LAST 1 WEEK (unsupported keyword for relative dates)
PurchaseDate: use BETWEEN 2013-07-01 AND 2013-07-15, not BETWEEN 2013-07-15 AND 2012-07-01 (invalid date range)
Allow Ad-Hoc Segments
When the focus dataset in a lens references other datasets, you can choose whether or not to allow
vizboard users to create and use ad-hoc segments based on the members of the referenced datasets. You
can enable this option per reference in the lens.
A segment is a special type of dimension field that vizboard users can create in a viz to group together
members of a population that meet some defined common criteria.
You might want to allow users to create and use ad-hoc segments so they can use segmentation analysis
to analyze members of a population and perform side-by-side comparisons.
When ad-hoc segments are enabled for a reference in a lens, vizboard users have the ability to create
ad-hoc segments in a viz. Additionally, they can use other segments that have been created for that
reference if they are granted permission on the segment.
Allowing ad-hoc segments may increase the lens size depending on the cardinality of the referenced
dataset. By default, ad-hoc segments are not allowed for references in a lens due to lens size
considerations. If the lens already includes the primary key from the referenced dataset, allowing ad-hoc
segments for that reference doesn't significantly increase the lens size.
After a segment has been created, you can choose to include the segment field in the lens. Segment
fields included in a lens perform faster in a viz than the equivalent ad-hoc segment. For more
information on including a segment field in a lens, see Choose Segment Fields (Aggregate Lens).
To allow vizboard users to create and use segments based on members of a particular referenced dataset,
click Ad-Hoc for that reference in the lens builder.
Estimate Lens Size
The size of an aggregate lens is determined by how much source data you request (the number of input
rows), the number of dimension fields you select, and the cardinality (or number of unique values) of
the dimension fields you select. Platfora can help estimate the size of a lens by profiling the data in the
dataset.
About Dataset Profiles
Dataset profiling takes a sampling of rows (50,000 by default) to determine the characteristics of the
data, such as the number of distinct values per field, the distribution of values, and the size of the various
fields in the dataset.
Profiling a dataset runs a series of MapReduce jobs in Hadoop, and builds a special purpose lens called
a profile lens. This lens cannot be opened or used in a vizboard like a regular lens. The sole purpose of
the profile lens is to scan the source data and capture the data heuristics. These heuristics are then used
to estimate lens output size in the lens builder.
Having more information about the source data can guide users to make better choices when they create
a lens, which can reduce the overall lens size and build time.
Some good things to know about dataset profiles:
• When you profile a dataset, any of its referenced datasets are profiled too.
• You do not need to rerun dataset profile jobs every time new source data arrives in Hadoop. The data
characteristics of a dataset typically do not change that often.
• You must have Define Lens from Dataset object permission on a dataset in order to profile it.
• The time it takes to run a profile job depends on the amount of source data in Hadoop. If there is a lot
of data to scan and sample, it can take a while.
• Profile lenses use the naming convention dataset_name profile.
• The data heuristics collected during profiling are only applicable to estimating the output size of an
aggregate lens. There is no lens output size estimation for event series lenses.
Profile a Dataset
You can profile a dataset as long as you have data access and Define Lens from Dataset
permissions on the dataset. Profiling a dataset initiates a special lens build to sample the source data and
collect its data characteristics.
1. Go to the Data Catalog and choose the Datasets tab.
2. Use the List view or the Quick Find to locate the dataset you want to profile.
3. Choose Profile Dataset from the dataset action menu.
4. Click Confirm.
5. To check the status of the profile job, go to the System page and choose the Activities tab.
After the profile job is finished, you can review the results that are collected when you build or edit a
lens from that dataset. Dataset profile information can only be seen in the lens builder panel, and is only
applicable when building aggregate lenses.
About Lens Size Estimates
Platfora uses the information collected by the dataset profile job to estimate the input and output size
of a lens. Dataset profile and estimation information is shown in the lens builder workspace. Lens size
estimates dynamically change as you add or remove fields and filters in the lens definition.
Lens estimates are calculated before the lens is built. As a result, they may
differ from the actual size of a built lens. Profile data is not available for certain
kinds of computed fields that require multi-row or multi-dataset processing, such
as aggregated computed fields (measures), event series processing computed
fields, or computed fields that refer to fields in other datasets. As a result,
estimates may be off by a large factor when there are lots of these types of
computed fields in a lens. This is especially true for DISTINCT measures.
Lens input size estimations apply to both aggregate and event series lens types.
However, lens output size estimates are only applicable to aggregate lenses.
Lens Input Size Estimates
Lens input size estimates reflect how much source data will be scanned in the very first stage of a lens
build. Lens input size estimates are available on all datasets (even datasets that have not been profiled
yet). Input size estimation is applicable to all lens types.
1. Platfora looks at the source files in the Hadoop data source location to estimate the overall dataset
size. You do not need to profile a dataset to estimate this number. If the source data files are
compressed, this represents the compressed file size.
2. The size of the input data to a lens build can be reduced if the dataset has input partitioning fields.
The partitioning fields of a dataset are denoted with a special filter icon.
Not all datasets will have partitioning fields, but if they do, the names of those fields are shown
under the Input data size estimate.
3. After applying a lens filter, the lens builder estimates the percentage of total dataset size that will be
excluded from the lens based on the filter.
Lens Output Size Estimates
Lens output size refers to how big the final lens will be after it is built. Lens output size estimates are
only available for datasets that have been profiled. Output size estimation is only applicable to aggregate
lenses (not to event series lenses).
1. Based on the fields you have added to the lens, Platfora will use the profile data to estimate the
output size of the lens. The estimate is shown as a range.
2. The relative dimension size helps you identify which dimension fields are the largest (have the most
distinct values), and therefore are contributing most to the overall lens size. Hover your mouse over a
mark in the distribution chart to see the field it represents.
3. Dimension fields are marked with small/medium/large size icons to denote the cost of adding that
field to a lens. Small means the field has fewer than 1,000 unique values. Medium means the field has
between 1,000 and 10,000 unique values. Large means the field has more than 10,000 unique values.
4. When you select a dimension field, you can see the data characteristics for that field, such as:
• An estimate of how many unique values the field has.
• The top 20 values for that field and an estimation of how the top values are distributed across all
the rows of the dataset.
• A sampling of the field values from 20 dataset rows. This is not available for certain types of
computed fields, such as event series computed fields, or computed fields that reference other
datasets.
Manage Lenses
After a lens is defined, you can edit it, build it, check its status, update/rebuild it (refresh its data), or
delete it. Lenses can be managed from the Data Catalog or System page.
Edit a Lens Definition
After working with a lens, you may realize that there are fields that you do not need, or additional fields
that you'd like to add to the lens. You can edit the lens definition to add or remove fields as long as you
have data access and edit permissions on the lens. Lens definition changes will not be available in a
vizboard until the lens is rebuilt.
You can edit a lens definition from the Data Catalog page or from a Vizboard. To edit a lens, you
must be at least an Analyst role or above. You must have Edit object permissions on the lens and data
access permissions on the dataset, as well as any datasets that are included in the lens by reference.
Some things to know about changing a lens definition:
• If you remove fields from a lens definition, it may break visualizations that are using that lens and
those fields.
• Changing the lens definition requires a full lens rebuild. Rather than just incrementally processing the
most recently added source data, all of the source data must be re-processed.
Update Lens Data
Once a lens has been built, you can update it at any time to refresh its data. You may want to update a
lens if new data has arrived in the source Hadoop system, or if you have changed the lens definition to
include additional fields. Depending on what has changed since the last build, subsequent lens builds are
usually a lot faster.
To rebuild a lens, you must be at least an Analyst role or above. You must have Edit object
permissions on the lens and data access permissions on the dataset, as well as any datasets that are
included in the lens by reference.
1. Go to the Data Catalog and choose the Lenses tab.
2. Use the List view or the Quick Find to locate the lens you want to update.
3. Choose Rebuild from the lens action menu.
4. Click Confirm.
5. To check the status of the lens build job, go to the System page and choose the Activities tab.
Delete or Unbuild a Lens
Deleting or unbuilding a lens is a way to free up space in Platfora for lenses that are no longer needed
or used. Deleting a lens removes the lens definition as well as the lens data. Unbuilding a lens only
removes the lens data (but keeps the lens definition in case you want to rebuild the lens at a later time).
To delete or unbuild a lens, you must be at least an Analyst role or above and have Own object
permissions on the lens.
Deleting or unbuilding a lens should be done with care, as doing so will invalidate
all of the visualizations that depend on that lens.
1. Go to the Data Catalog and choose the Lenses tab.
2. Use the List view or the Quick Find to locate the lens you want to delete or unbuild.
3. Choose Unbuild or Delete from the lens action menu.
4. Click Confirm or Delete (depending on the action you chose).
When you unbuild or delete a lens, the lens data is immediately cleared from the Platfora servers' disk
and memory. However, the built lens data remains on disk in Hadoop for a period of time (the
default is 24 hours) in case you change your mind and decide to rebuild it.
Check the Status of a Lens Build
Depending on the size of the source data requested, it may take a while to process the requested data in
Hadoop to build a Platfora lens. The lens is not available for use in visualizations until the lens build is
complete. You can check the status of a lens build on the System page.
1. Go to the System page.
2. Go to the Lens Builds tab.
3. In the Activities section, click In Progress.
4. Find the lens in the list and expand it to see the build jobs running in Hadoop.
5. Click on a build status message to see more detailed messages about the tasks of a particular job. This
shows the messages from the Hadoop MapReduce JobTracker.
Manage Lens Notifications
You can configure a lens so Platfora sends an email message to users when it detects an anomaly in lens
data. Do this by defining a lens notification. You might want to define a lens notification so data analysts
know when to view a vizboard to analyze the data further.
When Platfora builds a lens, it queries the lens data to search for data that meets the defined criteria.
When it detects that some data meets the criteria, it notifies users with the results of the query it ran
against the lens. Analysts might then choose to view a vizboard based on the same lens so they can
investigate the data further.
Consider the following rules and guidelines when working with lens notifications:
• Platfora must be configured to connect to an email server using SMTP.
• To define a rule, the lens must be built already.
• You can define multiple notification rules per lens. Each rule results in a different email message.
• The rule can only return results for one measure in the lens. However, it can filter on multiple
measures.
• The email message only contains data on the fields chosen in the lens notification rule.
• Platfora sends email notifications after a lens is built, not after the lens notification is created or
edited.
• You can disable a lens notification after it has been defined. You might want to do that to temporarily
stop notifying users while retaining the logic defined in the notification rule.
Add a Lens Notification Rule
Define a lens notification rule so Platfora sends an email to notify users when the data in a lens build
meets specified criteria. You can define multiple rules per lens.
To add a lens notification rule, you must be at least an Analyst role or above. You must have Edit
object permissions on the lens and data access permissions on the dataset, as well as any datasets that are
included in the lens by reference.
1. Go to the Lenses tab in Data Catalog and find the lens for which you want to define a lens
notification rule.
2. Choose Notifications from the lens action menu.
3. In the Notifications dialog, click Create.
A dialog appears where you can define a lens notification rule.
4. Enter a name for this notification rule.
5. Define the query to run against the lens after the lens is built.
You must select one measure field, and you may select zero or more dimension fields. You can group
by dimension fields, and filter the scope of rows by either dimension or measure fields. Click the
icon to include more fields in the query.
6. Define the criteria in the query results that triggers Platfora to send the email notification.
Select the icon to define additional criteria.
7. Enter one or more email addresses that should receive the notification messages.
Separate multiple email addresses with commas (,).
8. Choose whether the lens notification email should be sent when the criteria defined here are met or
not met.
9. Click Save.
The lens notification rule is created and enabled by default.
Disable a Lens Notification Rule
Disabling a lens notification rule allows you to temporarily stop notifications while retaining the logic
defined in the notification rule.
1. Go to the Lenses tab in Data Catalog and find the lens for which you want to disable a lens
notification rule.
2. Choose Notifications from the lens action menu.
3. In the Notifications dialog, clear the check box for the notification rule you want to disable.
4. Click Close.
Schedule Lens Builds
You can configure a lens to be built at specific times on specific days. By default, lenses are built
on demand, but when you define a schedule for a lens, it is built automatically at the times and days
specified in the schedule. You might want to define a schedule so the lens is built nightly, outside of
regular working hours.
About Lens Schedules
When you create or edit a schedule, you define one or more rules. A rule is a set of times and days that
specify when to build a lens. You might want to create multiple rules so the lens builds at different times
on different days. For example, you might want to build the lens at 1:00 a.m. on weekdays, and 8:00
p.m. on weekends.
Lens build start times are determined by the clock on the Platfora server.
Users who have permission to build a lens can define and edit its schedule.
Lens builds are run by the user who last updated or created the lens build schedule. This is important
because the user's lens build size limit applies to the lens build. For example, if a user with a role type
that has permission to build unlimited size lenses creates a schedule, and then a user with a role type that
has permission to build 100 GB size lenses edits the schedule, the lens will only successfully build if it is
less than 100 GB.
If a scheduled lens build occurs while the same lens is already being built, the scheduled lens build is
skipped and the in-progress lens build continues.
You can define rules with the following day and time repeat patterns:
• Specific days of the week at specific times. For example, every Monday, Tuesday, Wednesday,
Thursday, and Friday at 11:00 pm.
• Specific days of the week at repeated hourly intervals. For example, every Saturday and Sunday,
every four hours starting at 12:15 am.
• Specific days of the month at specific times. For example, on the first and 15th of every month at
1:00 am.
Create a Lens Schedule
You can configure a schedule for a lens so it is built at specific times on specific days. You define the
lens schedule when you edit the lens. The schedule is saved whether or not you save your changes on the
lens page.
To create a lens schedule, you must be at least an Analyst role or above. You must have Edit object
permissions on the lens and data access permissions on the dataset, as well as any datasets that are
included in the lens by reference.
1. Go to the Lenses tab in Data Catalog and find the lens for which you want to define a schedule.
2. Choose Schedule from the lens action menu.
3. In the Lens Build Schedule dialog, define a rule for the schedule using the Day of Week or
Day of Month rules.
4. (Optional) Click Add another rule to define an additional rule for the schedule.
Lenses are only built once if you define multiple overlapping rules for the same
time and day.
5. (Optional) Select Export this lens once the build completes if you want to export the lens
data in CSV format.
The export files will be created in the remote file system location you specify. For example, to export
to HDFS, the location URL would look something like this:
hdfs://10.80.231.123:8020/platfora/exports
6. Click OK.
View All Scheduled Builds
Users with the role of System Administrator can view all scheduled lens builds. Additionally, they
can pause (and later resume) a scheduled lens build, which might be useful during maintenance windows
or a time of unusually high lens build demand.
1. Go to the System page.
2. Go to the Activities tab.
3. Click View Full Schedule.
4. The Scheduled Activities dialog displays the upcoming scheduled lens builds and vizboard PDF
emails.
a) (Optional) Click a column name to sort by that column.
b) (Optional) Click Pause for a scheduled lens build to prevent the lens from building at the
scheduled time. It will remain paused until someone resumes the schedule.
5. Click OK.
Manage Segments—FAQs
After a segment is defined in a vizboard, you can edit it, update the segment members, delete it, schedule
updates, and show the data lineage. After creating a segment, the segment appears as a catalog object
on the Data Catalog > Segments page. This topic answers some frequently asked questions about
managing segments.
How do I view a segment definition?
Any user with data access permission on the underlying datasets can view a segment definition. Click
the segment on the Data Catalog > Segments page. You can view the segment and its conditions,
but cannot edit it.
How do I edit a segment definition?
To edit a segment, you must be at least an Analyst role or above. You must have Edit object
permissions on the segment and data access permissions on the datasets used in the segment.
You can edit a segment definition from the Data Catalog page or from a vizboard. When editing a
segment, you can add or remove conditions and edit the segment value labels for members and non-members.
How do I update the segment members when the source data changes?
Once a segment is created, Platfora creates its own special type of lens behind the scenes to create and
populate the members of the segment. To update the segment members from the latest source data, you
rebuild the segment lens. To rebuild the segment lens, you must have Edit object permission on the segment.
Choose Rebuild from the segment's menu on the Data Catalog > Segments page.
Can I delete a segment?
Yes. Deleting a segment removes the segment definition and its data from the Platfora catalog. To delete
a segment, you must have Own object permission on the segment.
Any visualization using the segment in an analysis will show an error if the segment is deleted from the
dataset. To use these visualizations without error, remove the deleted segment from the drop zone that
contains it.
You can't delete segments that are currently included in a lens. To delete a segment that is included in a
lens, remove it from the lens and then rebuild the lens.
Choose Delete from the segment's menu on the Data Catalog > Segments page.
Segments created from an aggregate lens can also be deleted using the Delete
button when editing the segment from a viz or from the Data Catalog page.
Can I configure a segment lens build to build on a defined schedule?
Yes, you can configure segment lenses to be built at specific times on specific days like other lens
builds. To schedule a segment, you must have Edit object permission on the segment. Choose
Schedule from the segment's menu on the Data Catalog > Segments page.
For more details on how to define a schedule for a segment lens build, see Schedule Lens Builds.
Can I show the data lineage for a segment?
To show the data lineage for a segment, you must have Edit object permission on the segment.
The data lineage report for a segment shows the following types of information:
• Segment conditions
• Lens field names
• Reference field names
• Filter expressions
• Field expressions
• Lens names
• Dataset names
• Data source names
• Data source locations
• Lens build-specific source file names, including their paths
• Timestamps
Choose Show info & lineage from the segment's menu on the Data Catalog > Segments page.
Chapter 6: Export Lens Data
Platfora allows you to export lens data for use in other tools or applications. Full lens data can be exported in
comma-separated values (csv) format to a remote file system such as HDFS or Amazon S3. You can also get a
portion of lens data out of Platfora by exporting the results of a lens query or visualization.
Topics:
• Export an Entire Lens as CSV
• Export a Partial Lens as CSV
• Query a Lens Using the REST API
• FAQs - Lens Export Basics
Export an Entire Lens as CSV
Exporting a lens writes out the data in parallel to a distributed file system such as HDFS or S3. You can
export an entire lens from the Platfora data catalog.
Make sure you have the correct URL and path information for the remote file
system, and that Platfora has write permissions to the specified export location.
Also make sure there is enough free space in the export location. If the export
location does not exist, Platfora will create it if it has the appropriate permissions.
1. Go to the Lenses tab in the Data Catalog.
2. From the lens Actions menu, select Export Data as CSV.
3. Enter the Export Destination, which is a URI of the export location in the remote file system.
The format of the URI is:
native_filesystem_protocol://hostname:port/path-to-export-location
For example, a URI to a location in HDFS:
hdfs://10.80.231.123:8020/platfora/exports
For example, a URI to a location in an S3 bucket:
s3n://your-bucket-name/platfora/exports
If exporting to S3, make sure Platfora also has your Amazon access key id and secret key entered in
the properties platfora.S3.accesskey and platfora.S3.secretkey. Platfora needs these
to authenticate to your Amazon Web Services (AWS) account.
4. Click Write.
5. A notification message will appear when the lens export completes.
In the remote file system, a directory is created in the specified export location using the directory
naming convention:
export-location/lens-name/timestamp
The lens data is exported in parallel and is usually split across multiple export files. The export location
contains a series of csv.gz lens data files and a .success file if the export completed successfully.
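For example, an export of a lens named Flights might produce a layout like the following (the file names
and timestamp format shown here are hypothetical; the exact names depend on the build):
hdfs://10.80.231.123:8020/platfora/exports/Flights/2015-06-28-02-00/
  part-00000.csv.gz
  part-00001.csv.gz
  .success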
Export a Partial Lens as CSV
Exporting a partial lens writes out the results of a lens query to a distributed file system such as HDFS or
S3. You can export a partial lens from a single viz in a vizboard.
Make sure you have the correct URL and path information for the remote file
system, and that Platfora has write permissions to the specified export location.
Also make sure there is enough free space in the export location. If the export
location does not exist, Platfora will create it if it has the appropriate permissions.
1. Go to Vizboards and open the vizboard containing the data you want to export.
2. From the viz export menu, select Export Data as CSV.
3. Enter the Export Destination, which is a URI of the export location in the remote file system.
The format of the URI is:
native_filesystem_protocol://hostname:port/path-to-export-location
4. Click Write.
5. A notification message will appear when the lens export completes.
Query a Lens Using the REST API
Platfora provides a SQL-like query language that you can use to programmatically access data in a lens.
You can submit a SELECT statement using the REST API, and the query results are returned in CSV
format.
The syntax used to define a lens query is similar to a SQL SELECT statement. Here is an overview of the
syntax used to define a lens query:
[ DEFINE new-computed-field_alias AS computed_field_expression ]
SELECT measure-fields, dimension-fields
FROM aggregate-lens-name
[ WHERE dimension-filter-expression ]
GROUP BY dimension-fields
[ SORT BY measure-field [ASC | DESC] [LIMIT number] ]
[ HAVING measure-filter-expression ]
The LIMIT clause applies to the group formed by the GROUP BY clause, not the
entire lens.
If you have been using Platfora vizboards, you have already been generating lens queries by creating
visualizations; the query language clauses map to actions in the viz builder.
For more information about the lens query language syntax and usage, see the Lens Query Language
Reference.
1. Write a lens query SELECT statement.
For example:
SELECT [Total Records],[Lease Status],Carrier.Name
FROM Planes
WHERE Carrier.Name NOT IN ("NULL")
GROUP BY [Lease Status],Carrier.Name
HAVING [Total Records] > 100
Notice how the lens field names containing spaces are escaped by enclosing them in brackets. Also
notice the dot notation to refer to a field from a referenced dataset.
2. Depending on the REST client you are using, you may need to URL encode the query before
submitting it via the REST API.
For example, here is the URL-encoded version of the previous lens query:
SELECT+%5BTotal+Records%5D%2C%5BLease+Status%5D%2CCarrier.Name+FROM
+Planes+WHERE+Carrier.Name+NOT+IN+%28%22NULL%22%29+GROUP+BY+%5BLease
+Status%5D%2CCarrier.Name+HAVING+%5BTotal+Records%5D+%3E+100
3. Submit the encoded query string via the REST API.
For example, using the cURL command-line utility:
curl -u admin:admin "http://localhost:8001/api/v1/query?query=SELECT
+%5BTotal+Records%5D%2C%5BLease+Status%5D%2CCarrier.Name+FROM
+Planes+WHERE+Carrier.Name+NOT+IN+%28%22NULL%22%29+GROUP+BY+%5BLease
+Status%5D%2CCarrier.Name+HAVING+%5BTotal+Records%5D+%3E+100" >
query_output.csv
Notice the first part of the URL specifies the Platfora server hostname and port. This example is
connecting to localhost using the default admin username and password.
Notice the latter part of the URL, which specifies the REST API endpoint: /api/v1/query
The GET method for this endpoint expects one input parameter, query, which is the encoded query
string.
The output is returned in CSV format, which you can redirect to a file if you want to save the query
results.
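If your REST client can encode parameters for you, you can skip the manual encoding step. For example,
here is a sketch using cURL's -G and --data-urlencode options (this variant is not shown in the original
examples and assumes a cURL version that supports --data-urlencode):
curl -G -u admin:admin http://localhost:8001/api/v1/query \
  --data-urlencode 'query=SELECT [Total Records],[Lease Status],Carrier.Name FROM Planes WHERE Carrier.Name NOT IN ("NULL") GROUP BY [Lease Status],Carrier.Name HAVING [Total Records] > 100' \
  -o query_output.csv
The -G option appends the URL-encoded parameter to the URL as a GET request, and -o writes the CSV
response to a file.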
FAQs - Lens Export Basics
Lens data exports allow you to copy data out of Platfora. This topic answers some frequently asked
questions about lens data exports.
How can I allow or prevent exporting lens data?
Lens exports and downloads are controlled by two configuration settings:
platfora.permit.export.data and platfora.permit.lens.to.desktop.
The platfora.permit.export.data setting is a global setting that controls the display
of all data download and export GUI controls. When this setting is true, users can export
an entire lens as CSV from the Data Catalog to a DFS (for example, HDFS or S3). The
platfora.permit.export.data setting also allows users to export or download individual viz data
from the Vizboards interface. Downloading or exporting viz data from a vizboard exports only
a portion of a lens, not the entire lens.
When both the platfora.permit.export.data setting and the
platfora.permit.lens.to.desktop setting are true, users can download the full lens from the
Data Catalog to their desktop as CSV. The platfora.permit.lens.to.desktop setting is
an experimental setting because downloading a large amount of lens data to a desktop can cause the
Platfora application to run out of memory. Use this setting with caution.
To control the ability of individual users or groups to export lens data, you must use permissions.
What kind of permission is needed to export lens data?
You must be an Analyst Limited system role or above to export or download lens data in CSV
format, as well as have data access permissions to all of the datasets included in the lens.
To export the lens data to a location in a remote file system (such as HDFS or S3), Platfora must have
write permissions to the export directory you specify.
How much lens data can I export?
For lens data that you download to your desktop, there is a default maximum row limit of one million
rows. This limit can be adjusted using the platfora.csv.download.max.rows property. Of
course, the size of a lens row can vary greatly - you may have 'wide' rows (lots of lens fields) or 'skinny'
rows (just a few lens fields). The download row limit is just a general safety threshold to prevent too
much export data from crashing your browser client.
For lens data that you export to a remote file system, there is no hard size limit, but you are limited by
the amount of disk space in the remote export location and the amount of memory in the Platfora cluster.
Very large lenses can be costly to export in terms of memory usage. To prevent a large lens export from
using all of the memory on the Platfora servers, only one concurrent lens export query can execute at a
time.
Can I download an entire lens to my desktop?
When both the platfora.permit.export.data setting and the
platfora.permit.lens.to.desktop setting are true, users can download a lens from the Data
Catalog to their desktop as CSV. The platfora.permit.lens.to.desktop setting is an
experimental setting; use it with caution. Downloading a large amount of lens data to a desktop can cause
the Platfora application to run out of memory.
Users can always download a partial lens from a viz in a vizboard. This requires that the
platfora.permit.export.data setting is true.
What is the performance impact of a large lens export?
In order to export data from a lens, Platfora uses the in-memory query engine to select the export data.
This is the same query engine that other Platfora users rely upon to build visualizations. Lens export data
is treated just like any other lens query. This means that large lens exports can impact performance for
other vizboard users by competing for in-memory resources.
Is there a way to request just some of the lens data rather than the entire
lens?
Yes. If you don't want to export all of the data in a lens, you can use the vizboard to construct a viz that
limits the number of fields to export and filters the number of rows requested. Then you can export just
the data comprising that viz.
You can also programmatically export lens data by submitting a lens query using Platfora's REST API.
What is the file format of the exported data?
Lens data is exported in comma-separated values (csv) format, and when the files are exported to a
remote file system they are compressed using gzip (gz).
The first row of an export file is a header row containing the lens field names. Measure field names are
enclosed in brackets []. Data values are enclosed in double-quotes (") and separated by commas. If a
data value contains a double-quote character, then it is escaped using a double double-quote ("").
The column order in the export file is dimension fields first (in alphabetical order) followed by measure
fields (in alphabetical order).
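For illustration, an export with two dimension fields and one measure might look something like this
(hypothetical field names and values; the exact quoting of the header row may vary):
"city","state","[Total Records]"
"San Mateo","CA","1024"
"St. ""Louis""","MO","512"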
Where are the export files located?
When you choose to export lens data via a download to your local computer, a single export file is
created on your Desktop (for Windows) or in Downloads (for Mac). The file naming convention is:
dataset-name_lens-name_epoch-timestamp.csv
When you choose to export lens data to a remote file system such as HDFS or S3, a directory is created
in the specified export location using the directory naming convention:
export-location/lens-name/timestamp
When exporting data to a remote file system, the lens data is exported in parallel and is usually split
across multiple export files. The export location contains a series of csv.gz lens data files and a
.success file if the export completed successfully.
Can I automate data exports following a scheduled lens build?
Yes. When you create a lens build schedule, there is an option to export the lens data after the lens build
completes. You must supply a location in the remote file system to copy the export files.
Chapter 7: Platfora Expressions
Platfora comes with a powerful, flexible built-in expression language that you can use to transform, manipulate,
and query data. This section describes Platfora's expression language, and describes how to use it to define
dataset computed fields, vizboard computed fields, measures, lens filters, and lens query statements.
Topics:
• Expression Building Blocks
• PARTITION Expressions and Event Series Processing (ESP)
• ROLLUP Measures and Window Expressions
• Computed Field Examples
• Troubleshoot Computed Field Errors
• Write a Lens Query
• FAQs - Expression Basics
• Platfora Expression Language Reference
Expression Building Blocks
This section explains the building blocks of an expression, and the general rules for constructing a valid
expression.
Functions in an Expression
Functions perform common data processing tasks. While not all expressions contain functions, most do.
This section describes basic concepts you need to know to use functions.
Function Inputs and Outputs
Functions take one or more input values and return an output value. Input values can be a literal value
or the name of a field that contains a value. In both cases, the function expects the input value to be a
particular data type such as STRING or INTEGER. For example, the CONCAT() function combines
STRING inputs and outputs a new STRING.
This example shows how to use the CONCAT() function to concatenate the values in the month, day,
and year fields separated by the literal forward slash character:
CONCAT(month,"/",day,"/",year)
A function's return value may be the same as its input type or it may be an entirely new data type. For
example, the TO_DATE() function takes a STRING as input, but outputs a DATETIME value. If a
function expects a STRING, but is passed another data type as input, the function returns an error.
Typically, functions are classified by what data type they take or what purpose they serve. For example,
CONCAT() is a string function and TO_DATE() is a data type conversion function. You'll find a
complete list of functions by type in Platfora's Expression Language Reference.
Nesting Functions
Functions can take other functions as arguments. For example, you can use the CONCAT function as an
argument to the TO_DATE() function. The final result is a DATETIME value parsed from a string in the format 10/31/2014.
TO_DATE(CONCAT(month,"/",day,"/",year),"MM/dd/yyyy")
The nested function must return the correct data type. So, because TO_DATE() expects string input and
CONCAT() returns a string, the nesting succeeds.
Only row functions allow nesting. Aggregate functions do not allow nested expressions as input.
Aggregate Functions versus Row Functions
Most functions process one value from a single row at a time. These are called row functions. Aggregate
functions are a special class of functions. Unlike row functions, aggregate functions process the values
from multiple rows together into a single return value. Some examples of aggregate functions are:
• SUM()
• MIN()
• VARIANCE()
Aggregate functions are also special because you use them to define measures. Measures always return
numeric values that serve as the quantitative data in an analysis. Aggregate expressions are often referred
to as measure expressions in Platfora.
Limitations of Aggregation Functions
Unlike row functions, aggregate functions can only take simple expressions as input (such as field names
or literal values). Aggregate functions cannot take row functions as arguments. You also cannot use an
aggregate function as input into a row function. You cannot mix aggregate functions and row functions
together in one expression.
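For example, DISTINCT(CONCAT(first_name," ",last_name)) is not a valid expression because it
passes a row function to an aggregate function (the field names here are hypothetical). Instead, first
define a computed field, for example full_name, as:
CONCAT(first_name," ",last_name)
Then define the measure on that interim field:
DISTINCT(full_name)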
Finally, while you can build expressions in both the dataset and the vizboard, only the following aggregate
functions are allowed in a vizboard computed field expression:
• DISTINCT()
• MIN()
• MAX()
• ROLLUP
Operators in an Expression
Platfora has a number of built-in operators for doing arithmetic, logical, and comparison operations.
Often, you'll use operators to combine or compare values. The values can be literal values, field values,
or even other expressions.
Arithmetic Operators
Arithmetic operators perform basic math operations on two values of the same data type. For example,
you could calculate the gross profit margin percentage using the values of a total_revenue and
total_cost field as follows:
((total_revenue - total_cost) / total_cost) * 100
Or you can use the plus (+) operator to combine STRING values:
"Firstname" + " " + "Lastname"
You can use the plus (+) and minus (-) operators to add or subtract DATETIME values. The following
table lists the math operators:
Operator   Description      Example
+          Addition         amount + 10 (add 10 to the value of the amount field)
-          Subtraction      amount - 10 (subtract 10 from the value of the amount field)
*          Multiplication   amount * 100 (multiply the value of the amount field by 100)
/          Division         bytes / 1024 (divide the value of the bytes field by 1024 and return the quotient)
Comparison Operators
Comparison operators are used to define Boolean (true / false) expressions. They test whether two values
are equivalent. Comparisons return 1 for true, 0 for false. If the comparison is invalid, for example
comparing a STRING to an INTEGER, the comparison operator returns NULL.
For example, you could use comparison operators within a CASE expression:
CASE WHEN age <= 25 THEN "0-25"
WHEN age <= 50 THEN "26-50"
ELSE "over 50" END
This expression compares the value in the age field to a literal number value. If true, it returns the
appropriate STRING value.
You cannot use comparison operators to test for equality between DATETIME values. The following
table lists the comparison operators:
Operator         Meaning                    Example Expression
= or ==          Equal to                   order_date = "12/22/2011"
>                Greater than               age > 18
!>               Not greater than           age !> 8
<                Less than                  age < 30
!<               Not less than              age !< 12
>=               Greater than or equal to   age >= 20
<=               Less than or equal to      age <= 29
<> or != or ^=   Not equal to               age <> 30
Logical Operators
Logical operators are used in expressions to test for a condition. Logical operators are often used in
lens filters, CASE expressions, and PARTITION expressions. Filters test if a field or value meets some
condition. For example, this expression tests whether a date falls between two other dates:
BETWEEN 2013-06-01 AND 2013-07-31
Logical operators are also used to construct WHERE clauses in Platfora's query language. The following
table lists the logical operators:
AND
Test whether two conditions are true.
OR
Test whether either of two conditions is true.
BETWEEN min_value AND max_value
Test whether a date or numeric value is within the min and max values (inclusive). For example,
year BETWEEN 2000 AND 2012.
IN(list)
Test whether a value is within a set. For example, product_type IN("tablet","phone","laptop").
LIKE("pattern")
Simple inclusive case-insensitive character pattern matching. The * character matches any number of
characters. The ? character matches exactly one character. For example, last_name LIKE("?utch*")
matches Kutcher and hutch but not Krutcher or crutch, and company_name LIKE("platfora")
matches Platfora or platfora.
value IS NULL
Check whether a field value or expression is null (empty). For example, ship_date IS NULL
evaluates to true when the ship_date field is empty.
NOT
Reverses the value of other operators. For example: year NOT BETWEEN 2000 AND 2012;
first_name NOT LIKE("Jo?n*") excludes John and jonny but not Jon or Joann;
Date.Weekday NOT IN("Saturday","Sunday"); purchase_date IS NOT NULL evaluates to true
when the purchase_date field is not empty.
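As an illustration of combining conditions (a sketch with hypothetical field names, not an example from
the original guide), logical operators can join multiple tests inside a CASE expression:
CASE WHEN age >= 21 AND Date.Weekday NOT IN("Saturday","Sunday")
THEN "weekday adult" ELSE "other" END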
Fields in an Expression
Expressions often operate on the values of a field. This section explains how to use field names in
expressions.
Referring to Fields in the Current Dataset
When you specify a field name in an expression, if the field name does not contain spaces or special
characters, you can simply refer to the field by its name. For example, the following expression sums the
values of the sales field:
SUM(sales)
Enclose field names with square brackets ([]) if they contain spaces, special characters, reserved
keywords (such as function names), or start with numeric characters. For example:
SUM([Sale Amount])
SUM([2013_data])
SUM([count])
If a field name contains a ] (closing square bracket), you must escape the closing square bracket by
doubling it ]]. So if the field name is:
Min([crs_flight_duration])
You enclose the entire field name in square brackets and escape the closing bracket that is part of the
actual field name:
[Min([crs_flight_duration]])]
If you are using the expression builder, it provides the correct escapes for you.
Field is a synonym for dataset column. The documentation uses the word field
because that is the terminology used in Platfora's user interface.
Use Dot Notation for Fields in a Referenced Dataset
Your expression might refer to a field in the focus dataset. (Focus dataset is simply the current dataset
you are working with.) You also might include a field in a referenced dataset. When including fields
in a referenced dataset, you must qualify the field name with the proper notation. The convention is
reference_name.field_name.
Don't confuse a reference name with the dataset name; they are not the same. When you create a
reference link in a dataset, you give that reference its own name. Use . (dot) notation to separate the two
components.
For example, consider the Airports dataset, which goes by the Departure Airport reference name. To
refer to the City field of the Departure Airport reference to the Airports dataset, you would use the
notation:
[Departure Airport].City
Just as with field names, you must escape reference names if they contain spaces, special characters,
reserved keywords (such as function names), or start with numeric characters.
Aggregated Functions and Fields in a Referenced Dataset
Aggregate functions can only operate on fields in the current focus dataset. You cannot directly calculate
a measure on a field belonging to a referenced dataset. For example, the following expression is not
allowed:
DISTINCT([Departure Airport].City)
Instead, use a two-step process to 'pull up' a referenced field into the current dataset. First, define a
Departure Airport City computed field whose expression is just the path to the referenced dataset field:
[Departure Airport].City
Then, you can use the interim Departure Airport City computed field as an argument to the aggregate
expression. For example:
DISTINCT([Departure Airport City])
Literal Values in an Expression
Sometimes you need to use a literal value in an expression, as opposed to a field value. How you specify
a literal value depends on its data type (text, numeric, or date). This section explains how to use literals
in expressions.
Literal STRING Values
To specify a literal or actual STRING value, enclose the value in double quotes ("). For example, this
expression converts the values of a gender field to the literal values of male, female, or unknown:
CASE WHEN gender="M" THEN "male" WHEN gender="F" THEN "female"
ELSE "unknown" END
To escape a literal quote within a literal value itself, double the literal quote character. For example:
CASE WHEN height="60""" THEN "5 feet" WHEN height="72""" THEN "6 feet"
ELSE "other" END
The REGEX() function is a special case. In the REGEX() function, string expressions are also enclosed
in quotes. When a string expression contains literal quotes, double the literal quote character. For
example:
REGEX(height, "\d\'(\d)+""")
Literal DATE and DATETIME Values
To refer to a DATETIME value in a lens filter expression, the date format must be yyyy-MM-dd without
any enclosing quotation marks or other punctuation.
order_date BETWEEN 2012-12-01 AND 2012-12-31
To refer to a literal date value in a computed field expression, you must specify the format of the date
and time components using TO_DATE, which takes a string literal argument and a format string. For
example:
CASE WHEN order_date=TO_DATE("2013-01-01 00:00:59 PST","yyyy-MM-dd
HH:mm:ss z") THEN "free shipping" ELSE "standard shipping" END
Literal Numeric Values
For literal numeric values, you can just specify the number itself without any special escaping or
formatting. For example:
CASE WHEN is_married=1 THEN "married" WHEN is_married=0 THEN "not_married"
ELSE NULL END
PARTITION Expressions and Event Series Processing (ESP)
Computed fields that contain a PARTITION expression are considered event series processing (ESP)
computed fields. You can add ESP computed fields to Platfora datasets only (not vizboards).
Event series processing is also referred to as pattern matching or event correlation. Use event series
processing (ESP) to partition the rows of a dataset, order the rows sequentially (typically by a
timestamp), and search for matching patterns among the rows.
ESP fields evaluate multiple rows in the dataset, and output one value (or column) per row. You can use
the results of an ESP computed field in other expressions or (after lens build processing) in a viz.
How Event Series Processing Works
This section explains how event series processing works by walking you through a simple use of the
PARTITION expression.
This example uses some weblog page view data. Each row represents a page view at a given point in
time within a user session. Each session is unique and belongs to only one user. Users can have multiple
sessions. Within any session a user can visit any page one or more times.
SessionID  UserID  Timestamp        Page
2A         2       3/4/13 2:02 AM   products.html
1A         1       12/1/13 9:00 AM  home.html
1A         1       12/1/13 9:10 AM  products.html
1A         1       12/1/13 9:05 AM  company.html
1B         1       3/1/13 9:45 PM   products.html
1B         1       3/1/13 9:40 PM   home.html
2A         2       3/4/13 2:56 AM   checkout.html
1B         1       3/1/13 9:46 PM   checkout.html
1A         1       12/1/13 9:20 AM  checkout.html
2A         2       3/4/13 2:20 AM   home.html
2A         2       3/4/13 2:33 AM   blogs.html
1A         1       12/1/13 9:15 AM  blogs.html
Consider the following partial PARTITION expression:
PARTITION BY SessionID
ORDER BY Timestamp
...
This partitions the rows by SessionID. Within each partition, the function orders each row by
Timestamp in ascending order (the default order).
Suppose you wanted to find sessions where users traversed the pages in order from home.html to
products.html and then to the checkout.html page. To look for this page view pattern, you
complete the expression like this.
PARTITION BY SessionID
ORDER BY Timestamp
PATTERN (A,B,C)
DEFINE A AS Page = "home.html",
B AS Page = "products.html",
C AS Page = "checkout.html"
OUTPUT "TRUE"
The PATTERN clause describes the sequence, and the DEFINE clauses assign values to the PATTERN
elements. This pattern says that there is a match whenever there are 3 consecutive rows that meet
criteria A, then B, then C. If the computed field containing this PARTITION expression were called
Path=home,product,checkout, you would get output that looks like this:
SessionID  UserID  Timestamp        Page           Path=home,product,checkout
1A         1       12/1/13 9:00 AM  home.html      NULL
1A         1       12/1/13 9:05 AM  company.html   NULL
1A         1       12/1/13 9:10 AM  products.html  NULL
1A         1       12/1/13 9:15 AM  blogs.html     NULL
1A         1       12/1/13 9:20 AM  checkout.html  NULL
1B         1       3/1/13 9:40 PM   home.html      NULL
1B         1       3/1/13 9:45 PM   products.html  NULL
1B         1       3/1/13 9:46 PM   checkout.html  TRUE
2A         2       3/4/13 2:02 AM   products.html  NULL
2A         2       3/4/13 2:20 AM   home.html      NULL
2A         2       3/4/13 2:33 AM   blogs.html     NULL
2A         2       3/4/13 2:56 AM   checkout.html  NULL
The lens build processing that happens to produce these results is as follows:
1. Partition (or group) the rows of the dataset by session.
2. Order the rows in each partition by time (in ascending order by default).
3. Evaluate the rows against each DEFINE clause and bind the row to the symbol where there is a
match.
4. Check if the PATTERN clause conditions are met in the specified order and frequency.
5. If the PATTERN criteria are met, output TRUE as the result value for the last row that caused the
pattern to be true, and write the output results to a new computed field: Path=home,product,checkout. If
a row does not cause the pattern to be true, output nothing (NULL).
Understand Pattern Match Processing Order
During lens processing, the build evaluates patterns row by row, starting from the partition's top row
and moving downward. A pattern match is evaluated based on the current row and any rows that come
before it (in terms of their position in the partition). The pattern match only looks back from the current
row; it does not look ahead to the next row in the partition.
Order processing is important to consider when you want to look for events that happened later
or next (chronologically speaking). With the default sort order (ascending), the build sorts rows
within a partition from oldest to most recent. This means that you can only pattern match backwards
chronologically (or look for events that happened previously in time).
For example, to answer a question such as "what page did a user visit before they visited the product
page?", the following expression would return the previous (chronologically) viewed page before the
product page:
PARTITION BY SessionID
ORDER BY Timestamp ASC
PATTERN (^product_page?,A)
DEFINE product_page AS Page = "product.html",
A AS TRUE
OUTPUT A.Page
If you want to pattern match forwards chronologically (or look for events that happened later in time),
you would specify DESC sort order in the ORDER BY clause of your PARTITION expression. For
example, to answer a question such as "what page did a user visit after they visited the product page?",
the following expression would return the next (chronologically) viewed page after the product page:
PARTITION BY SessionID
ORDER BY Timestamp DESC
PATTERN (^product_page?,A)
DEFINE product_page AS Page = "product.html",
A AS TRUE
OUTPUT A.Page
Understand Pattern Match Precedence
By default, pattern expressions are matched from left to right. The innermost parenthetical expressions
are evaluated first, moving outward from there.
For example, the pattern:
PATTERN (((A,B)|(C,D)),E)
Would evaluate differently than:
PATTERN (A,B|C,D,E)
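To make the grouping concrete, here is a hedged sketch built on the weblog example above (the page names are illustrative). With explicit parentheses, the alternation applies to whole sequences: the pattern matches either home.html followed by products.html, or blogs.html followed by products.html, and in both cases checkout.html must follow:

PARTITION BY SessionID
ORDER BY Timestamp
PATTERN (((A,B)|(C,D)),E)
DEFINE A AS Page = "home.html",
B AS Page = "products.html",
C AS Page = "blogs.html",
D AS Page = "products.html",
E AS Page = "checkout.html"
OUTPUT "TRUE"

Without the inner parentheses, the same symbols group differently under left-to-right matching, so use parentheses whenever the intended grouping is not obvious.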
Understand Regex-Style Quantifiers (Greedy and Reluctant)
The PATTERN clause can use regex-style quantifiers to denote the frequency of a match.
By default, quantifiers are greedy. This means a symbol matches as many rows as possible. For example:
PATTERN (A*,B?)
Causes symbol A to match zero or more rows, and symbol B to match zero or one row.
Adding an additional question mark ? to a quantifier makes it reluctant. This means the symbol matches
a row only when the row cannot match any subsequent match criteria in the pattern. For example:
PATTERN (A*?,B)
Causes symbol A to match zero or more rows, but only rows that symbol B does not match. You
can use reluctant quantifiers to break ties when there is more than one possible match to the pattern.
A quantifier applies to a single match criteria symbol only. You cannot apply quantifiers to parenthetical
expressions. For example, you cannot write ((A,B,C)*, D) to indicate that the asterisk quantifier
applies to the whole (A,B,C) expression.
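As a hedged sketch of a reluctant quantifier in practice (the session and page fields are the illustrative ones from the earlier weblog example), the following matches any number of page views, but only up to the first checkout.html row in each session:

PARTITION BY SessionID
ORDER BY Timestamp
PATTERN (A*?,B)
DEFINE A AS TRUE,
B AS Page = "checkout.html"
OUTPUT "TRUE"

Because A*? is reluctant, rows bind to A only when they cannot bind to B, so the match ends at the first checkout row rather than the last one.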
Best Practices for Event Series Processing (ESP)
Event series processing (ESP) computed fields, unlike other computed fields, require advanced
processing during lens builds. This means they require more compute resources on your Hadoop cluster.
This section discusses what to consider when adding event series computed fields to your dataset
definitions, and the best practices when using this feature.
Use Helpful Field Names and Descriptions
In the Data Catalog and Vizboards areas of the Platfora application, event series computed fields
look just like any other dataset field. When defining event series computed fields, give them names and
descriptions that help users understand the field's purpose. This cues users on how to use a field in an
analysis.
For example, if describing an event series computed field that computes Next Page Viewed, it may be
helpful for users to know that this field is best used in conjunction with the Page field. Whatever the
current value is for the Page field, the Next Page Viewed field has the value of Page for the next click
record immediately following the current page.
Increase Partition Limit for Larger Event Series Processing Jobs
The global configuration property platfora.max.pattern.events sets the maximum number of
rows in a partition to evaluate for a pattern match. The default is one million rows.
If a partition exceeds this number of rows, the result of the PARTITION function is NULL for all the
rows that exceed the limit. For example, if you had an event series computed field that partitioned by
UserID and ordered by Timestamp, the build processes only the first million rows and ignores any rows
beyond that, so the event series computed field is NULL for those rows.
If you are noticing a lot of default values in your lens data (for example: ‘January 1, 1970’ for dates or
‘NULL’ for strings), you may want to increase platfora.max.pattern.events so that all of the
rows are processed. Keep in mind that increasing this limit will consume more memory resources on the
Hadoop cluster during lens processing.
Filter Partitioning Fields to Restrict Lens Build Scope
Platfora cannot incrementally build lenses that include event series processing fields. Due to the nature
of pattern matching logic, lenses with ESP fields require full lens builds that scan all of a dataset's input
data. You can limit the scope of these lens builds and improve processing time by adding a lens filter on
a dataset partitioning field.
A dataset partitioning field is different from the partition criteria of the ESP field. For Hive data sources,
partitioning fields are defined on the data source by the Hive administrator. For HDFS or S3 data
sources, partitioning fields are defined in a Platfora dataset. If there are partitioning fields available in a
lens, the lens builder displays a special icon next to them.
Consider How Lens Filters Impact Event Series Processing Results
Lens builds always apply lens filters on dataset partitioning fields as the first step of a lens build. This
means a build excludes some source data before processing any computed field expressions. If your lens
includes both lens filters on partitioning fields and ESP computed fields, you should take this behavior
into consideration, as it can change the results of PARTITION expressions, and ultimately, your analysis
conclusions.
For example, suppose you are analyzing web page visits by user on data from 2012 and 2013:
SessionID  UserID  Timestamp (partition field)  Page
1A         1       12/1/12 9:00 AM              home.html
1A         1       12/1/12 9:05 AM              company.html
1A         1       12/1/12 9:10 AM              products.html
1A         1       12/1/12 9:15 AM              blogs.html
1B         1       3/1/13 9:40 PM               home.html
1B         1       3/1/13 9:45 PM               products.html
1B         1       3/1/13 9:46 PM               checkout.html
2A         2       3/4/13 2:02 AM               products.html
2A         2       3/4/13 2:20 AM               home.html
2A         2       3/4/13 2:33 AM               blogs.html
2A         2       3/4/13 2:56 AM               checkout.html
Timestamp is a partitioning field and it has a filter that excludes 2012 sessions. Then, you create a
computed field with an event series PARTITION function that returns a user's first visit date. When the
lens builds, the PARTITION expression would process this filtered data:
SessionID  UserID  Timestamp       Page
1B         1       3/1/13 9:40 PM  home.html
1B         1       3/1/13 9:45 PM  products.html
1B         1       3/1/13 9:46 PM  checkout.html
2A         2       3/4/13 2:02 AM  products.html
2A         2       3/4/13 2:20 AM  home.html
2A         2       3/4/13 2:33 AM  blogs.html
2A         2       3/4/13 2:56 AM  checkout.html
As a result, the analysis would say UserID 1 had a first visit date of 3/1/13 even though the user's
first visit was actually 12/1/12. This discrepancy results from the build processing the lens filter on
the partitioning field (Timestamp) before the event series processing field.
Lens filters on other, non-partitioning dataset fields are applied after event series
processing.
ROLLUP Measures and Window Expressions
This section explains how to write ROLLUP and window expressions to calculate complex measures,
such as running totals, benchmark comparisons, rank ordering, percentiles, and so on.
Understand ROLLUP Measures
ROLLUP is a modifier to a measure (or aggregate) expression that allows you to operate on a subset of
rows within the overall result set of a query. Using ROLLUP you can build a frame around one or more
rows in a dataset or query result, and then compute an aggregate result in relation to that frame only.
The result of a ROLLUP expression is always a measure. However, instead of just doing a simple
aggregation, it does more complex aggregate processing over a specified set of rows (or marks in a viz).
If you are familiar with SQL, a ROLLUP expression in Platfora is equivalent to the OVER clause in SQL.
For example, this SQL statement:
SELECT SUM(distance) OVER (PARTITION BY departure_date)
would be equivalent to this ROLLUP expression in Platfora:
ROLLUP SUM(Distance) TO [Departure Date]
What is the difference between a measure and a ROLLUP measure?
A measure is the result of an aggregate function (such as SUM) applied to a group of input data rows. For
example, using the Flights tutorial data that comes with your Platfora installation, suppose you wanted
to calculate the total distance flown by an airline. You could create a measure called Distance(Sum) with
an aggregate expression such as this:
SUM(Distance)
The group of input records passed into this aggregate calculation is then determined by the dimension(s)
used in a visualization or lens query. Records that have the same dimension members are grouped
together in a single row, which then gets represented as a mark in a viz. For example, in a viz of
Distance(Sum) by Carrier and Week, there is one group or mark for each Carrier/Week combination in the input data.
A ROLLUP clause modifies another aggregate function to define additional partitioning, ordering, and
window frame criteria. Like a regular aggregate function, ROLLUP also computes aggregate values over
groups of input rows. However, a ROLLUP measure then partitions the overall rows returned by the
viz query into subsets or buckets, and then computes the aggregate expression separately within each
individual bucket.
A ROLLUP is useful when you want to compute an aggregation over a subset of rows (or marks)
independently of the overall result of the viz query. The ROLLUP function specifies how to partition the
subset of rows and how to compute the aggregation within that subset.
For example, suppose you wanted to calculate the percentage of all miles that were flown in a given
week. You could write a ROLLUP expression that calculates the percent of total distance within the
partition of a week (total distance for the week is 100%). The ROLLUP expression to define such a
calculation would look something like this:
100 * [Distance(Sum)] / ROLLUP [Distance(Sum)] TO ([Departure Date].Week)
Then when this ROLLUP expression is used in a viz, the group of input records passed into the aggregate
calculation is determined by the dimension(s) used in the viz (such as Carrier in this case); however, the
aggregation is calculated independently within each week. In this case, you can see the percentage that
each carrier contributed to the total distance flown in a given week.
How to calculate a ROLLUP over an 'adaptive' partition
A ROLLUP expression can have fixed or adaptive partitioning criteria. When you define the ROLLUP
measure expression, the TO clause of the expression specifies how to partition the data. You can either
specify an exact field name (fixed), a reference field name (adaptive), or no field name at all (adaptive).
In the previous example, the ROLLUP expression used a fixed partition of [Departure Date].Week.
If we changed the partition criteria to use just [Departure Date] (a reference), the partition criteria
becomes adaptive to any field of that reference that is used in a viz. The expression to define an adaptive
date partition might look something like this:
100 * [Distance(Sum)] / ROLLUP [Distance(Sum)] TO ([Departure Date])
Since Departure Date is a reference that points to the Date dimension, the calculation dynamically
changes if you drill down from week to day in the viz. This expression can then be used to partition
by any granularity of Departure Date without having to rewrite the ROLLUP expression. The ROLLUP
expression adapts to any granularity of Departure Date used in a viz.
Understand ROLLUP Window Expressions
Adding an ORDER BY plus an optional RANGE or ROWS clause to a ROLLUP expression turns it into a
window expression. These clauses are used to specify an order inside of each partition, and a window
frame around all, one, or several rows over which to compute the aggregate calculation. The window
frame defines how to crop, shift, or fix the row set in relation to the position of the current row.
For example, suppose you wanted to calculate a cumulative total on a day to day basis. You could do
this by adding a window frame to your ROLLUP expression that ordered the rows in each partition by
date (using the ORDER BY clause), and then summed up the current row and all the days that came
before it (using a ROWS UNBOUNDED PRECEDING clause). In the Flights tutorial data, an expression
ROLLUP [Total Records] TO () ORDER BY ([Departure Date].Date) ROWS
UNBOUNDED PRECEDING
When this ROLLUP expression is used in a viz, the Total Records measure is computed cumulatively
by day for each partition group (the Date and Cancel Status dimensions in this case), allowing us to see
the progression of cancelled flights in the month of October 2012. This allows us to see unusual growth
patterns in the data, such as the dramatic spike in cancellations at the end of the month.
The RANK, DENSE_RANK, and NTILE functions are considered exclusively window functions because
they can only be used in a ROLLUP expression, and they always require an ordered set of rows (or
window) over which to compute their result.
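As a hedged sketch using the Flights tutorial data (and assuming the Distance(Sum) measure defined earlier is in the lens), the following ranks marks by total distance flown, from highest to lowest:

ROLLUP RANK() TO () ORDER BY ([Distance(Sum)] DESC) ROWS UNBOUNDED PRECEDING

DENSE_RANK and NTILE follow the same ROLLUP ... TO ... ORDER BY pattern, as shown in the quick reference later in this chapter.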
Computed Field Examples
This section contains examples of some common data processing tasks you can accomplish using
Platfora computed fields.
The Expression Language Reference has examples for all of the built-in functions that Platfora provides.
Finding and Replacing Values
You may have particular values in your data that you want to find and change to something else, or
reformat so they are all consistent. For example, find and replace values in a name field
where name values are formatted as firstname lastname and replace them with name values
formatted as lastname, firstname:
REGEX_REPLACE(name,"(.*) (.*)","$2, $1")
Or you may have field values that are not formatted exactly the same, and want to change them so that
like values can be grouped and sorted together. For example, change all profession_title field values that
contain the word "Retired" anywhere in the string to just be a value of "Retired":
REGEX_REPLACE(profession_title,".*(Retired).*","Retired")
Extracting Information from File Names and Directories
You may have a dataset where the information you need is not inside the source files, but in the Hadoop
file name or directory path, such as dates or server names.
Suppose your dataset is based on daily log files that are organized into directories by date, and the file
names are the server IP address of the server that produced the log file.
For example, the URI path to a log file produced by server 172.12.131.118 on July 4, 2012 is:
hdfs://myhdfs-server.com/data/logs/20120704/172.12.131.118.log
The following expression uses FILE_PATH() in combination with REGEX() and TO_DATE() to
create a date field from the date directory name:
TO_DATE(REGEX(FILE_PATH(),"hdfs://myhdfs-server.com/data/logs/(\d{8})/(?:\d{1,3}\.*)+\.log"),"yyyyMMdd")
And the following expression uses FILE_NAME() and REGEX() to extract the server IP address from
the file name:
REGEX(FILE_NAME(),"(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})\.log")
Extracting a Portion of Field Values
You may have field values where only part of the value contains useful information. You can pull out a
portion of a field value to define a new field. For example, suppose you had an email_address field with
values in the format of [email protected], and you wanted to extract just the provider portion
of the email address:
REGEX(email_address,"^[a-zA-Z0-9._%+-]+@([a-zA-Z0-9._-]+)\.[a-zA-Z]{2,4}$")
Renaming Field Values
Sometimes field values are not very user-friendly. For example, a Boolean field may have values of 0
and 1 that you want to change to more human-readable values.
CASE WHEN cancelled=0 THEN "Not Cancelled" WHEN cancelled=1 THEN
"Cancelled" ELSE NULL END
Deriving a New Field from Other Fields
You may want to combine the values of other fields to create a new field. For example, you could
combine a month, day, and year field into a single date field. This would then allow you to reference
Platfora's built-in Date dimension dataset.
TO_DATE(CONCAT(month,"/",day,"/",year),"MM/dd/yyyy")
You can also use the values of other fields to calculate a new value. For example, you could calculate a
gross profit margin percentage using the values of a revenue and cost field as follows:
((revenue - cost) / cost) * 100
Cleansing and Casting Field Values
Sometimes the data values in a column need to be transformed and cast to another data type in order
to allow for further calculations on the data. For example, you might have some numeric data that you
want to use as a measure; however, it has string values of "NA" to represent what should really be NULL
values. You could transform the "NA" values to NULL and then cast the column to a numeric data type.
TO_INT(CASE WHEN delay_minutes="NA" THEN NULL ELSE delay_minutes END)
Troubleshoot Computed Field Errors
When you create a computed field, Platfora catches any syntax error in your expression when you try to
save the field. This section describes the most common causes of expression syntax errors.
Function Arguments Don't Match the Expected Data Type
Functions expect input arguments to be of a certain data type. When a function uses another field as its
input argument, and that field is not of the expected data type, you might see an error such as:
Function REGEX takes 2 arguments with types STRING, STRING, but one
argument of type INTEGER was provided.
Look at the function's arguments that appear in the error message and verify they are the proper data
types. If the argument is a field, you might need to change the data type of the base field or use a data
type conversion function to convert the argument to the expected data type within the expression itself.
See also: Functions in an Expression
Not Escaping Field or Dataset Names
Field and dataset names used in an expression must be enclosed in square brackets ([ ]) if they contain
spaces, special characters, reserved keywords, or start with numeric characters. When an expression
contains a field or dataset name that meets one of these criteria and is not enclosed in square brackets,
you might see an error such as:
Platfora expected the string `)', but instead received `F'.
TO_LONG(New Field)
Look at the bolded character in the expression to find the location of the error. Note the text that comes
after this position. If it is part of a field or dataset name, you need to enclose the name with square
brackets. To correct the expression in this example, use: TO_LONG([New Field])
See also: Escaping Spaces or Special Characters in Field and Dataset Names
Not Specifying the Full Path to Fields of a Referenced Dataset
Functions can use a field that is in a dataset referenced from the focus dataset. You must specify the field's
full path by including the referenced dataset's reference name. If you forget to use the full path, you might
see an error like:
Field not found: carrier_name
When you see the Field not found error, make sure the field is qualified with the reference name.
In this example, carrier_name is a field in a referenced dataset. The reference name in this example is
carriers. To correct this expression, use: carriers.carrier_name for the field name.
See also: Referring to Fields in a Referenced Dataset
Unenclosed Literal Strings
You can include a literal string value as a function argument, but it must be enclosed in double quotes
("). When an expression uses a literal string that isn't enclosed in double quotes, you might see an error
such as:
Field not found: Platfora
When you see the Field not found error, one possibility is that the supposed field is actually meant
to be a literal string and needs to be enclosed in double quotes. To correct this expression, use: "Platfora"
for the string.
See also: Literal Values in an Expression
Unescaped Special Characters
Field and dataset names may contain a right square bracket (]), but it must be preceded by another right
square bracket (]]). Literal strings may contain a double quote ("), but it must be preceded by another
double quote (""). Suppose you want to concatenate the strings "Hello and world." to make the
string "Hello world.". The double quotes in each string are special characters and must be escaped
in the expression. If not, you might see an error like:
Platfora expected the string `)', but instead received `H'.
CONCAT(""Hello", " world."")
Look at the bolded character in the expression to find the location of the error. To correct this error,
escape the double quotes with another double quote:
CONCAT("""Hello", " world.""")
Invalid Syntax
Functions have specific requirements, including required arguments and keywords. When an expression
is missing a keyword, you might see an error such as:
Platfora expected a string matching the regular expression
`(?i)\Qend\E', but instead received end of source.
CASE WHEN cancel_code=0 THEN "Not Cancelled" WHEN cancel_code=1 THEN
"Cancelled" ELSE NULL
Look at the bolded character in the expression to find the location of the error. In this example, it
expected the string END (indicated by (?i)\Qend\E), but instead it reached the end of the expression.
The CASE function requires the END keyword at the end of its syntax string. To correct this error, add
END to the end of the expression:
CASE WHEN cancel_code=0 THEN "Not Cancelled" WHEN cancel_code=1 THEN
"Cancelled" ELSE NULL END
See also: Expression Language Reference
Using Row and Aggregate Functions Together in the Same Expression
Aggregate functions (functions used to define measures) cannot use nested expressions as their input
arguments. Aggregate functions can only accept field names as input. You also cannot use an aggregate
expression as input to a row function expression. Aggregate functions and row functions cannot be
mixed together in one expression.
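For example, an expression such as SUM(revenue - cost) is not allowed, because the aggregate's input is a nested expression rather than a field name. A sketch of the usual two-step workaround (using hypothetical revenue and cost fields): first define a computed field for the row-level arithmetic:

revenue - cost

Then, if that computed field is named profit, define the measure on the field itself:

SUM(profit)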
Write a Lens Query
Platfora includes a programmatic query access feature you can use to query a lens. This section
describes support for querying lenses using Platfora's lens query language and the REST API.
Platfora allows you to make a query against an aggregate lens in your Platfora instance. This feature is
not meant as an end-user feature. Rather, it is intended to allow you to write programs that issue SQL-like
queries to a Platfora lens. For example, you could write a simple command-line client for querying
a lens. Since programmatic query access is meant for use by programs rather than people, a caller makes
the queries through REST API calls.
A query consists of a SELECT statement with one or more optional clauses. The statement and its
clauses use the same expression language elements you encounter when building a computed field
expression and/or a lens filter expression.
[ DEFINE alias-name AS expression [ DEFINE ... ] ]
SELECT measure-field [ AS alias-name ] | measure-expression AS alias-name
  [ , { dimension-field [ AS alias-name ] | row-expression AS alias-name } [ , ... ] ]
FROM lens-name
[ WHERE filter-expression [ AND filter-expression ] ]
[ GROUP BY dimension-field [ , group-ordering ] ]
[ HAVING measure-filter-expression ]
For example, you make a query like the following:
SELECT [device].[manufacturer], [user].[gender], [Num Users]
FROM bo_view2G_PSM
WHERE video.genre %3D "Action/Comedy"
AND user.gender !%3D "male"
GROUP BY [device].[manufacturer], [user].[gender]
Once you know the query structure, you make a REST call to the query endpoint. You can pass the
query as a parameter to a GET or as a JSON body to a POST.
https://hostname:port/api/v1/query?query="HTML-encoded SELECT
statement ..."
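For example, the statement SELECT [Num Users] FROM bo_view2G_PSM (reusing the lens from the earlier example) would be URL-encoded, with spaces as %20 and square brackets as %5B and %5D, into a GET request like this sketch (the hostname and port are placeholders for your Platfora instance):

https://myplatfora.example.com:8001/api/v1/query?query=SELECT%20%5BNum%20Users%5D%20FROM%20bo_view2G_PSM

You could issue such a request with the cURL command line or Chrome's Postman extension, as mentioned in the FAQ below; remember to encode any = characters in the query as %3D.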
Considerations for Using Programmatic Query Access
Here are some considerations to keep in mind when constructing lens queries:
• You can only query aggregate lenses. You cannot query event series lenses.
• Queries run against the currently built version of the lens.
• Queries that once worked can later fail because the underlying dataset or lens changed.
• You cannot do a SELECT * on a lens.
FAQs - Expression Basics
This section covers the basic concepts and common questions about the Platfora expression language.
What is an expression?
An expression computes or produces a value by combining fields (or columns), constant values,
operators, and functions. An expression outputs a value of a particular data type, such as numeric, string,
datetime, or Boolean (true/false) values. Simple expressions can be a single constant value, the values of
a given column or field, or a function call. You can use operators to join two or more simple expressions
into a complex expression.
How are expressions used in the Platfora application?
Platfora expressions allow you to select, process, transform, and manipulate data. Expressions are used
in several ways in the Platfora application:
• In Datasets, they are used to define computed fields and measures that operate on the raw source
data.
• In Lenses, they are used to define lens filters that limit the scope of raw data requested from Hadoop.
• In Vizboards, they are used to define computed fields that further manipulate the prepared data in a
lens.
• In the Lens Query Language via the REST API, they are used to programmatically access and
manipulate the prepared data in a lens from external applications or plugins.
What is the expression builder?
The expression builder helps you create computed field expressions in the Platfora application. It
shows the available fields in the dataset or lens you are working with, plus the list of Platfora's built-in
functions and statements. It validates your expressions for correct syntax, input data types, and so on.
You can also access the help to view correct syntax and examples for all of the built-in functions and
statements.
What is a computed field expression?
A computed field expression generates its values based on a calculation or condition, and returns a value
for each input row. Computed field expressions can contain values from other fields, constants,
mathematical operators, comparison operators, or built-in row functions.
What is a measure expression?
A measure expression generates its values as the result of an aggregate function. It takes input values
from multiple rows and returns a single aggregated value.
How are expressions used in programmatic lens queries?
Platfora's lens query language does not have a graphical user interface like the expression builder.
Instead, you can use the cURL command line, Chrome's Postman extension, or write your own plugin
extension to submit a SQL-like SELECT query statement through Platfora's REST API.
The lens query language makes use of expressions in its SELECT statement, DEFINE clause, WHERE
clause and HAVING clause.
Programmatic lens queries are subject to some of the same expression limitations as vizboard computed
fields, since they also operate on the pre-processed data in a lens.
Platfora Expression Language Reference
An expression computes or produces a value by combining field or column values, constant values,
operators, and functions. Platfora has a built-in expression language. You use the language's functions
and operators in dataset computed fields, vizboard computed fields, lens filters, and programmatic lens
queries.
Expression Quick Reference
An expression is a combination of columns (or fields), constant values, operators, and functions used
to evaluate, transform, or produce a value. Simple expressions can be combined to make more complex
expressions. This quick reference describes the functions and operators that can be used to write
expressions.
Platfora's built-in statements, functions and operators are divided into the following categories:
• Conditional and NULL Processing
• Event Series Processing
• String Processing
• Date and Time Processing
• URL Processing
• IP Address Processing
• Mathematical Processing
• Data Type Conversion
• Aggregation and Measure Processing
• ROLLUP and Window Calculations
• User Defined Functions
• Comparison Operators
• Logical Operators
• Arithmetic Operators
Conditional and NULL Processing
Conditional and NULL processing allows you to transform or manipulate data values based on certain
defined conditions. Conditional processing (CASE) can be done at either the dataset or vizboard level.
NULL processing (COALESCE and IS_VALID) is only applicable at the dataset level. During a lens
build, any NULL values in the source data are converted to default values, so lenses and vizboards have
no concept of NULL values.
CASE
Evaluates each row in the dataset according to one or more input conditions, and outputs the specified result when the input conditions are met.
Example: CASE WHEN gender = "M" THEN "Male" WHEN gender = "F" THEN "Female" ELSE "Unknown" END

COALESCE
Returns the first valid value (NOT NULL value) from a comma-separated list of expressions.
Example: COALESCE(hourly_wage * 40 * 52, salary)

IS_VALID
Returns 0 if the returned value is NULL, and 1 if the returned value is NOT NULL.
Example: IS_VALID(sale_amount)
Event Series Processing
Event series processing allows you to partition rows of input data, order the rows sequentially (typically
by a timestamp), and search for matching patterns in a set of rows. Computed fields that are defined
in a dataset using a PARTITION expression are considered event series processing computed fields.
Event series processing computed fields are processed differently than regular computed fields. Instead
of computing values from the input of a single row, they compute values from inputs of multiple rows
in the dataset. Event series processing computed fields can only be defined in the dataset - not in the
vizboard.
PACK_VALUES
Returns multiple output values packed into a single string of key/value pairs separated by the Platfora default key and pair separators. Useful when the OUTPUT clause of a PARTITION expression returns multiple output values.
Example: PACK_VALUES("ID",custid,"Age",age)

PARTITION
Partitions the rows of a dataset, orders the rows sequentially (typically by a timestamp), and searches for matching patterns in a set of rows.
Example: PARTITION BY SessionID ORDER BY Timestamp PATTERN (A,B,C) DEFINE A AS Page = "home.html", B AS Page = "products.html", C AS Page = "checkout.html" OUTPUT "TRUE"
String Functions
String functions allow you to manipulate and transform textual data, such as combining string values or
extracting a portion of a string value.
ARRAY_CONTAINS
Performs a whole string match against a string containing delimited values and returns 1 or 0 depending on whether or not the string contains the search value.
Example: ARRAY_CONTAINS(device,",","iPad")

CONCAT
Concatenates (combines together) the results of multiple string expressions.
Example: CONCAT(month,"/",day,"/",year)

FILE_NAME
Returns the original file name from the source file system.
Example: TO_DATE(SUBSTRING(FILE_NAME(),0,8),"yyyyMMdd")

FILE_PATH
Returns the full URI path from the source file system.
Example: TO_DATE(REGEX(FILE_PATH(),"hdfs://myhdfs-server.com/data/logs/(\d{8})/(?:\d{1,3}\.*)+\.log"),"yyyyMMdd")

EXTRACT_COOKIE
Extracts the value of the given cookie identifier from a semicolon-delimited list of cookie key=value pairs.
Example: EXTRACT_COOKIE("SSID=ABC; vID=44", "vID") returns 44

EXTRACT_VALUE
Extracts the value for the given key from a string containing delimited key/value pairs.
Example: EXTRACT_VALUE("firstname;daria|lastname;hutch","lastname",";","|") returns hutch

INSTR
Returns an integer indicating the position of a character within a string that is the first character of the occurrence of a substring.
Example: INSTR(url,"http://",-1,1)

JAVA_STRING
Returns the unescaped version of a Java unicode character escape sequence as a string value.
Example: CASE WHEN currency == JAVA_STRING("\u00a5") THEN "yes" ELSE "no" END

JOIN_STRINGS
Concatenates (combines together) the results of multiple string expressions with the separator in between each non-null value.
Example: JOIN_STRINGS("/",month,day,year)

JSON_ARRAY_CONTAINS
Performs a whole string match against a string formatted as a JSON array and returns 1 or 0 depending on whether or not the string contains the search value.
Example: JSON_ARRAY_CONTAINS(software,"platfora")

JSON_DOUBLE
Extracts a DOUBLE value from a field in a JSON object.
Example: JSON_DOUBLE(top_scores,"test_scores.2")

JSON_FIXED
Extracts a FIXED value from a field in a JSON object.
Example: JSON_FIXED(top_scores,"test_scores.2")

JSON_INTEGER
Extracts an INTEGER value from a field in a JSON object.
Example: JSON_INTEGER(top_scores,"test_scores.2")

JSON_LONG
Extracts a LONG value from a field in a JSON object.
Example: JSON_LONG(top_scores,"test_scores.2")

JSON_STRING
Extracts a STRING value from a field in a JSON object.
Example: JSON_STRING(misc,"hobbies.0")

LENGTH
Returns the count of characters in a string value.
Example: LENGTH(name)

REGEX
Performs a whole string match against a string value with a regular expression and returns the portion of the string matching the first capturing group of the regular expression.
Example: REGEX(weblog.request_line,"GET\s/([a-zA-Z0-9._%-]+\.[html])\sHTTP/[0-9.]+")

REGEX_REPLACE
Evaluates a string value against a regular expression to determine if there is a match, and replaces matched strings with the specified replacement value.
Example: REGEX_REPLACE(phone_number,"([0-9]{3})\.([0-9]{3})\.([0-9]{4})","\($1\) $2-$3")

SPLIT
Breaks down a delimited input string into sections and returns the specified section of the string.
Example: SPLIT("Restaurants>Location>San Francisco",">", -1) returns San Francisco

SUBSTRING
Returns the specified characters of a string value based on the given start and end position.
Example: SUBSTRING(name,0,1)

TO_LOWER
Converts all alphabetic characters in a string to lower case.
Example: TO_LOWER("123 Main Street") returns 123 main street

TO_UPPER
Converts all alphabetic characters in a string to upper case.
Example: TO_UPPER("123 Main Street") returns 123 MAIN STREET

TRIM
Removes leading and trailing spaces from a string value.
Example: TRIM(area_code)

XPATH_STRING
Takes an XML-formatted string and returns the first string matching the given XPath expression.
Example: XPATH_STRING(address,"//address[@type='home']/zipcode")

XPATH_STRINGS
Takes an XML-formatted string and returns a newline-separated array of strings matching the given XPath expression.
Example: XPATH_STRINGS(address,"/list/address[1]/street")

XPATH_XML
Takes an XML-formatted string and returns an XML-formatted string matching the given XPath expression.
Example: XPATH_XML(address,"//address[last()]")
Date and Time Functions
Date and time functions allow you to manipulate and transform datetime values, such as calculating time
differences between two datetime values, or extracting a portion of a datetime value.
DAYS_BETWEEN
Calculates the whole number of days (ignoring time) between two DATETIME values.
Example: DAYS_BETWEEN(ship_date,order_date)

DATE_ADD
Adds the specified time interval to a DATETIME value.
Example: DATE_ADD(invoice_date,45,"day")

HOURS_BETWEEN
Calculates the whole number of hours (ignoring minutes, seconds, and milliseconds) between two DATETIME values.
Example: HOURS_BETWEEN(NOW(),impressions.adview_timestamp)

EXTRACT
Returns the specified portion of a DATETIME value.
Example: EXTRACT("hour",order_date)

MILLISECONDS_BETWEEN
Calculates the whole number of milliseconds between two DATETIME values.
Example: MILLISECONDS_BETWEEN(request_timestamp,response_timestamp)

MINUTES_BETWEEN
Calculates the whole number of minutes (ignoring seconds and milliseconds) between two DATETIME values.
Example: MINUTES_BETWEEN(impression_timestamp,conversion_timestamp)

NOW
Returns the current system date and time as a DATETIME value.
Example: YEAR_DIFF(NOW(),users.birthdate)

SECONDS_BETWEEN
Calculates the whole number of seconds (ignoring milliseconds) between two DATETIME values.
Example: SECONDS_BETWEEN(impression_timestamp,conversion_timestamp)

TRUNC
Truncates a DATETIME value to the specified format.
Example: TRUNC(TO_DATE(order_date,"MM/dd/yyyy HH:mm:ss"),"day")

YEAR_DIFF
Calculates the fractional number of years between two DATETIME values.
Example: YEAR_DIFF(NOW(),users.birthdate)
URL Functions
URL functions allow you to extract different portions of a URL string, and decode text that is URL-encoded.

URL_AUTHORITY
Returns the authority portion of a URL string.
Example: URL_AUTHORITY("http://user:password@mycompany.com:8012/mypage.html") returns user:password@mycompany.com:8012

URL_FRAGMENT
Returns the fragment portion of a URL string.
Example: URL_FRAGMENT("http://platfora.com/news.php?topic=press#Platfora%20News") returns Platfora%20News

URL_HOST
Returns the host, domain, or IP address portion of a URL string.
Example: URL_HOST("http://user:password@mycompany.com:8012/mypage.html") returns mycompany.com

URL_PATH
Returns the path portion of a URL string.
Example: URL_PATH("http://platfora.com/company/contact.html") returns /company/contact.html

URL_PORT
Returns the port portion of a URL string.
Example: URL_PORT("http://user:password@mycompany.com:8012/mypage.html") returns 8012

URL_PROTOCOL
Returns the protocol (or URI scheme name) portion of a URL string.
Example: URL_PROTOCOL("http://www.platfora.com") returns http

URL_QUERY
Returns the query portion of a URL string.
Example: URL_QUERY("http://platfora.com/news.php?topic=press&timeframe=today") returns topic=press&timeframe=today

URLDECODE
Decodes a string that has been encoded with the application/x-www-form-urlencoded media type.
Example: URLDECODE("N%2FA%20or%20%22not%20applicable%22") returns N/A or "not applicable"
IP Address Functions
IP address functions allow you to manipulate and transform STRING data consisting of IP address
values.
CIDR_MATCH
Compares two STRING arguments representing a CIDR mask and an IP address, and returns 1 if the IP address falls within the specified subnet mask or 0 if it does not.
Example: CIDR_MATCH("60.145.56.0/24","60.145.56.246") returns 1

HEX_TO_IP
Converts a hexadecimal-encoded STRING to a text representation of an IP address.
Example: HEX_TO_IP(AB20FE01) returns 171.32.254.1
Math Functions
Math functions allow you to perform basic math calculations on numeric values. You can also use the
arithmetic operators to perform simple math calculations, such as addition, subtraction, division and
multiplication.
DIV
Divides two LONG values and returns a quotient value of type LONG.
Example: DIV(TO_LONG(file_size),1024)

EXP
Raises the mathematical constant e to the power (exponent) of a numeric value and returns a value of type DOUBLE.
Example: EXP(Value)

FLOOR
Returns the largest integer that is less than or equal to the input argument.
Example: FLOOR(32.6789) returns 32.0

HASH
Evenly partitions data values into the specified number of buckets.
Example: HASH(username,20)

LN
Returns the natural logarithm of a number.
Example: LN(2.718281828) returns 1

MOD
Divides two LONG values and returns the remainder value of type LONG.
Example: MOD(TO_LONG(file_size),1024)

POW
Raises a numeric value to the power (exponent) of another numeric value and returns a value of type DOUBLE.
Example: 100 * POW(end_value/start_value, 0.2) - 1

ROUND
Rounds a DOUBLE value to the specified number of decimal places.
Example: ROUND(32.4678954,2) returns 32.47
Data Type Conversion Functions
Data type conversion functions allow you to cast data values from one data type to another. These
functions are used implicitly whenever you set the data type of a field or column in the Platfora user
interface. The supported data types are: INTEGER, LONG, DOUBLE, FIXED, DATETIME, and STRING.
EPOCH_MS_TO_DATE
Converts LONG values to DATETIME values, where the input number represents the number of milliseconds since the epoch.
Example: EPOCH_MS_TO_DATE(1360260240000) returns 2013-02-07T18:04:00:000Z

TO_FIXED
Converts STRING, INTEGER, LONG, or DOUBLE values to fixed-decimal values.
Example: TO_FIXED(opening_price)

TO_DATE
Converts STRING values to DATETIME values, and specifies the format of the date and time elements in the string.
Example: TO_DATE(order_date,"yyyy.MM.dd 'at' HH:mm:ss z")

TO_DOUBLE
Converts STRING, INTEGER, LONG, or DOUBLE values to DOUBLE (decimal) values.
Example: TO_DOUBLE(average_rating)

TO_INT
Converts STRING, INTEGER, LONG, or DOUBLE values to INTEGER (whole number) values.
Example: TO_INT(average_rating)

TO_LONG
Converts STRING, INTEGER, LONG, or DOUBLE values to LONG (whole number) values.
Example: TO_LONG(average_rating)

TO_STRING
Converts values of other data types to STRING (character) values.
Example: TO_STRING(sku_number)
Aggregate Functions
An aggregate function groups the values of multiple rows together based on some defined input
expression. Aggregate functions return one value for a group of rows, and are only valid for defining
measures in Platfora. In the dataset, measures can be defined using any of the aggregate functions. In the
vizboard, only the DISTINCT, MAX, or MIN aggregate functions are allowed.
AVG
Returns the average of all valid numeric values.
Example: AVG(sale_amount)

COUNT
Returns the number of rows in a dataset.
Example: COUNT(sales.customers)

COUNT_VALID
Returns the number of rows for which the given expression is valid.
Example: COUNT_VALID(page_views)

DISTINCT
Returns the number of distinct values for the given expression.
Example: DISTINCT(user_id)

MAX
Returns the biggest value from the given input expression.
Example: MAX(sale_amount)

MIN
Returns the smallest value from the given input expression.
Example: MIN(sale_amount)

SUM
Returns the total of all values from the given input expression.
Example: SUM(sale_amount)

STDDEV
Calculates the population standard deviation for a group of numeric values.
Example: STDDEV(sale_amount)

VARIANCE
Calculates the population variance for a group of numeric values.
Example: VARIANCE(sale_amount)
ROLLUP and Window Functions
ROLLUP is a modifier to an aggregate expression that turns an aggregate into a windowed aggregate.
Window functions (RANK, DENSE_RANK and NTILE) can only be used within a ROLLUP statement.
The ROLLUP statement defines the partitioning and ordering of a rowset before the associated aggregate
function or window function is applied.
ROLLUP defines a window, or user-specified set of rows, within a query result set. A window function
then computes a value for each row in the window. You can use window functions to compute
aggregated values such as moving averages, cumulative aggregates, running totals, or top-N-per-group
results.
ROLLUP statements can be specified in either the dataset or the vizboard. When using a ROLLUP in a
vizboard, the measure for which you are calculating the ROLLUP must already exist in the lens you are
using in the vizboard.
DENSE_RANK
Assigns the rank (position) of each row in a group (partition) of rows and does not skip rank numbers in the event of a tie.
Example: ROLLUP DENSE_RANK() TO () ORDER BY ([Sales(Sum)] DESC) ROWS UNBOUNDED PRECEDING

NTILE
Divides a partitioned group of rows into the specified number of buckets, and returns the bucket number to which the current row belongs.
Example: ROLLUP NTILE(100) TO () ORDER BY ([Total Records] DESC) ROWS UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING

RANK
Assigns the rank (position) of each row in a group (partition) of rows and skips rank numbers in the event of a tie.
Example: ROLLUP RANK() TO () ORDER BY ([Sales(Sum)] DESC) ROWS UNBOUNDED PRECEDING

ROLLUP
A modifier to an aggregate function that turns a regular aggregate function into a windowed, partitioned, or adaptive aggregate function.
Example: 100 * COUNT(Flights) / ROLLUP COUNT(Flights) TO ([Departure Date])

ROW_NUMBER
Assigns a sequential number (starting from 1) to each row in a group (partition) of rows according to the specified ordering.
Example: ROLLUP ROW_NUMBER() TO (Quarter) ORDER BY (Sum_Sales DESC) ROWS UNBOUNDED PRECEDING
User Defined Functions
User defined functions (UDFs) allow you to define your own per-row processing logic, and then expose
that functionality to users in the Platfora application expression builder. See User Defined Functions
(UDFs) for more information.
Comparison Operators
Comparison operators are used to compare the equivalency of two expressions of the same data type.
The result of a comparison expression is a Boolean value (returns 1 for true, 0 for false, or NULL for
invalid). Boolean expressions are most often used to specify data processing conditions or filters.
Operator         Meaning                    Example Expression
= or ==          Equal to                   order_date = "12/22/2011"
>                Greater than               age > 18
!>               Not greater than           age !> 8
<                Less than                  age < 30
!<               Not less than              age !< 12
>=               Greater than or equal to   age >= 20
<=               Less than or equal to      age <= 29
<> or != or ^=   Not equal to               age <> 30
Logical Operators
Logical operators are used to define Boolean (true / false) expressions. Logical operators are used
in expressions to test for a condition, and return 1 if the condition is true or 0 if it is false. Logical
operators are often used in lens filters, CASE expressions, PARTITION expressions, and WHERE clauses
of queries.
AND
Test whether two conditions are true.

OR
Test if either of two conditions are true.

value BETWEEN min_value AND max_value
Test whether a date or numeric value is within the min and max values (inclusive).
Example: year BETWEEN 2000 AND 2012

IN(list)
Test whether a value is within a set.
Example: product_type IN("tablet","phone","laptop")

LIKE("pattern")
Simple inclusive case-insensitive character pattern matching. The * character matches any number of characters. The ? character matches exactly one character.
Examples: last_name LIKE("?utch*") matches Kutcher and hutch, but not Krutcher or crutch; company_name LIKE("platfora") matches Platfora or platfora

IS NULL
Check whether a field value or expression is null (empty).
Example: ship_date IS NULL evaluates to true when the ship_date field is empty

NOT
Reverses the value of other operators.
Examples: year NOT BETWEEN 2000 AND 2012; first_name NOT LIKE("Jo?n*") excludes John and jonny, but not Jon or Joann; Date.Weekday NOT IN("Saturday","Sunday"); purchase_date IS NOT NULL evaluates to true when the purchase_date field is not empty
Arithmetic Operators
Arithmetic operators perform basic math operations on two expressions of the same data type resulting
in a numeric value. The plus (+) and minus (-) operators can also be used to perform arithmetic
operations on DATETIME values.
Operator   Description      Example
+          Addition         amount + 10 (add 10 to the value of the amount field)
-          Subtraction      amount - 10 (subtract 10 from the value of the amount field)
*          Multiplication   amount * 100 (multiply the value of the amount field by 100)
/          Division         bytes / 1024 (divide the value of the bytes field by 1024 and return the quotient)
Comparison Operators
Comparison operators are used to compare the equivalency of two expressions of the same data type.
The result of a comparison expression is a Boolean value (returns 1 for true, 0 for false, or NULL for
invalid). Boolean expressions are most often used to specify data processing conditions or filter criteria.
Operator Definitions
Operator         Meaning                    Example Expression
= or ==          Equal to                   order_date = "12/22/2011"
>                Greater than               age > 18
!>               Not greater than           age !> 8
<                Less than                  age < 30
!<               Not less than              age !< 12
>=               Greater than or equal to   age >= 20
<=               Less than or equal to      age <= 29
<> or != or ^=   Not equal to               age <> 30
If you are writing queries with REST and the query string includes an = (equal) character, you must
URL-encode it as %3D. Failure to encode the character can result in this error:
string matching regex `(?i)\Qnot\E\b' expected but end of source found.
Logical Operators
Logical operators are used to define Boolean (true / false) expressions. Logical operators are used
in expressions to test for a condition, and return 1 if the condition is true or 0 if it is false. Logical
operators are often used in lens filters, CASE expressions, PARTITION expressions, and WHERE clauses
of queries.
AND
Test whether two conditions are true.

OR
Test if either of two conditions are true.

value BETWEEN min_value AND max_value
Test whether a date or numeric value is within the min and max values (inclusive).
Example: year BETWEEN 2000 AND 2012

IN(list)
Test whether a value is within a set.
Example: product_type IN("tablet","phone","laptop")

LIKE("pattern")
Simple inclusive case-insensitive character pattern matching. The * character matches any number of characters. The ? character matches exactly one character.
Examples: last_name LIKE("?utch*") matches Kutcher and hutch, but not Krutcher or crutch; company_name LIKE("platfora") matches Platfora or platfora

IS NULL
Check whether a field value or expression is null (empty).
Example: ship_date IS NULL evaluates to true when the ship_date field is empty

NOT
Reverses the value of other operators.
Examples: year NOT BETWEEN 2000 AND 2012; first_name NOT LIKE("Jo?n*") excludes John and jonny, but not Jon or Joann; Date.Weekday NOT IN("Saturday","Sunday"); purchase_date IS NOT NULL evaluates to true when the purchase_date field is not empty
Arithmetic Operators
Arithmetic operators perform basic math operations on two expressions of the same data type resulting
in a numeric value. The plus (+) and minus (-) operators can also be used to perform arithmetic
operations on DATETIME values.
Operator   Description      Example
+          Addition         amount + 10 (add 10 to the value of the amount field)
-          Subtraction      amount - 10 (subtract 10 from the value of the amount field)
*          Multiplication   amount * 100 (multiply the value of the amount field by 100)
/          Division         bytes / 1024 (divide the value of the bytes field by 1024 and return the quotient)
Conditional and NULL Processing
Conditional and NULL processing allows you to transform or manipulate data values based on certain
defined conditions. Conditional processing (CASE) can be done at either the dataset or vizboard level.
NULL processing (COALESCE and IS_VALID) is only applicable at the dataset level. During a lens
build, any NULL values in the source data are converted to default values, so lenses and vizboards have
no concept of NULL values.
CASE
CASE is a row function that evaluates each row in the dataset according to one or more input conditions,
and outputs the specified result when the input conditions are met.
CASE WHEN input_condition [AND|OR input_condition] THEN
output_expression [...] [ELSE other_output_expression] END
Returns one value per row of the same type as the output expression. All output expressions must return
the same data type.
If there are multiple output expressions that return different data types, then you will need to enclose
your entire CASE expression in one of the data type conversion functions to explicitly cast all output
values to a particular data type.
WHEN input_condition
Required. The WHEN keyword is used to specify one or more Boolean expressions (see Platfora's
supported conditional operators). If an input value meets the condition, then the output expression
is applied. Input conditions can include other row functions in their expression, but cannot contain
aggregate functions or measure expressions. You can use the AND or OR keywords to combine multiple
input conditions.
THEN output_expression
Required. The THEN keyword is used to specify an output expression when the specified conditions
are met. Output expressions can include other row functions in their expression, but cannot contain
aggregate functions or measure expressions.
ELSE other_output_expression
Optional. The ELSE keyword can be used to specify an alternate output expression to use when the
specified conditions are not met. If an ELSE expression is not supplied, ELSE NULL is the default.
END
Required. Denotes the end of CASE function processing.
Convert values in the age column into range-based groupings (binning):
CASE WHEN age <= 25 THEN "0-25" WHEN age <= 50 THEN "26-50" ELSE "over
50" END
Transform values in the gender column from one string to another:
CASE WHEN gender = "M" THEN "Male" WHEN gender = "F" THEN "Female" ELSE
"Unknown" END
The vehicle column contains the following values: truck, bus, car, scooter, wagon, bike, tricycle, and
motorcycle. The following example converts multiple values in the vehicle column into a single value:
CASE WHEN vehicle IN ("bike","scooter","motorcycle") THEN "two-wheelers"
ELSE "other" END
COALESCE
COALESCE is a row function that returns the first valid value (NOT NULL value) from a comma-separated list of expressions.
COALESCE(expression[,expression][,...])
Returns one value per row of the same type as the first valid input expression.
expression
At least one required. A field name or expression.
The following example shows an expression to calculate employee yearly income for exempt employees
that have a salary and non-exempt employees that have an hourly_wage. This expression checks the
values of both fields for each row, and returns the value of the first expression that is valid (NOT NULL).
COALESCE(hourly_wage * 40 * 52, salary)
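COALESCE can also supply a literal default value so that a field never contributes NULL downstream. A small sketch, assuming a hypothetical middle_name field:
COALESCE(middle_name,"none")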
IS_VALID
IS_VALID is a row function that returns 0 if the returned value is NULL, and 1 if the returned value is
NOT NULL. This is useful for computing other calculations where you want to exclude NULL values
(such as when computing averages).
IS_VALID(expression)
Returns 0 if the returned value is NULL, and 1 if the returned value is NOT NULL.
expression
Required. A field name or expression.
Define a computed field using IS_VALID. This returns a row count only for the rows where this field
value is NOT NULL. If a value is NULL, it returns 0 for that row. In this example, we create a computed
field (sale_amount_not_null) using the sale_amount field as the basis.
IS_VALID(sale_amount)
Then you can use the sale_amount_not_null computed field to calculate an accurate average for
sale_amount that excludes NULL values:
SUM(sale_amount)/SUM(sale_amount_not_null)
This is what happens automatically when you use the AVG function.
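In other words, using the fields from this example, the following two expressions should produce the same result:
AVG(sale_amount)
SUM(sale_amount)/SUM(sale_amount_not_null)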
Event Series Processing
Event series processing allows you to partition rows of input data, order the rows sequentially (typically
by a timestamp), and search for matching patterns in a set of rows. Computed fields that are defined
in a dataset using a PARTITION expression are considered event series processing computed fields.
Event series processing computed fields are processed differently than regular computed fields. Instead
of computing values from the input of a single row, they compute values from inputs of multiple rows
in the dataset. Event series processing computed fields can only be defined in the dataset - not in the
vizboard or a lens query.
PARTITION
PARTITION is an event series processing language that partitions the rows of a dataset, orders the
rows sequentially (typically by a timestamp), and searches for matching patterns in a set of rows.
Computed fields that are defined in a dataset using a PARTITION expression are considered event
series processing computed fields. Event series processing computed fields are processed differently
than regular computed fields. Instead of computing values from the input of a single row, they compute
values from inputs of multiple rows in the dataset.
The PARTITION function can only be used to define a computed field in the
dataset definition (pre-lens build). PARTITION cannot be used to define a
vizboard computed field. Unlike other expressions, a PARTITION expression
cannot be embedded within other functions or expressions; it must be a
top-level expression.
PARTITION BY field_name
ORDER BY field_name [ASC|DESC]
PATTERN (pattern_expression)
DEFINE symbol_1 AS filter_expression
[,symbol_n AS filter_expression ]
[, ...]
OUTPUT output_expression
To understand how event series processing works, we'll walk through a simple example of a
PARTITION expression.
This is a simple example of some weblog page view data. Each row represents a page view by a user at
a given point in time. Session IDs are used to group together page views that happened in the same user
session:
Suppose you wanted to know how many sessions included the path of page visits to ‘home.html’ then
‘products.html’ then ‘checkout.html’. You could define a PARTITION expression that groups the rows
by session, orders by time, and then iterates through the rows from top to bottom to find sessions that
match the pattern:
PARTITION BY SessionID
ORDER BY Timestamp
PATTERN (A,B,C)
DEFINE A AS Page = "home.html",
B AS Page = "products.html",
C AS Page = "checkout.html"
OUTPUT "TRUE"
1. The PARTITION BY clause partitions (or groups) the rows of the dataset by session.
2. Within each partition, the ORDER BY clause sorts the rows by time (in ascending order by default).
3. Each DEFINE clause specifies a condition used to evaluate a row, and binds that condition to a
symbol that is then used in the PATTERN clause.
4. The PATTERN clause checks if the conditions are met in the specified order and frequency. This
pattern says that there is a match whenever there are 3 consecutive rows that meet criteria A then B
then C.
5. For a row that satisfies all of the PATTERN criteria, the value of the OUTPUT clause is applied.
Otherwise the output is NULL for rows that don’t meet all of the PATTERN criteria.
Returns one value per row of the same type as the output_expression for rows that match the
defined match pattern, otherwise returns NULL for rows that do not match the pattern.
Output values are calculated during the lens build process using a special
event series MapReduce job. Therefore, sample output values for a PARTITION
computed field cannot be shown in the dataset workspace.
PARTITION BY field_name
Required. The PARTITION BY clause is used to specify a field in the current dataset by
which to partition the rows. Rows that share the same value for this field will be grouped
together, and each group will then be processed independently according to the matching
pattern criteria.
The partition field cannot be a field of a referenced dataset; it must be a field in
the current focus dataset.
ORDER BY field_name
Optional. The ORDER BY clause specifies a field by which to sort the rows within each
partition before applying the match pattern criteria. For event series processing, records are
typically ordered by a DATETIME type field, such as a date or a timestamp. The default
sort order is ascending (first to last or low to high).
The ordering field cannot be a field of a referenced dataset; it must be a field in
the current focus dataset.
PATTERN (pattern_expression)
Required. The PATTERN clause specifies the matching pattern to search for within a
partition of rows. The pattern_expression is expressed in a format similar to a regular
expression. The pattern_expression can include:
• A symbol that represents some match criteria (as declared in the DEFINE clause).
• A symbol followed by one of the following regex quantifiers:
? (matches once or not at all - greedy construct)
?? (matches once or not at all - reluctant construct)
* (matches zero or more times - greedy construct)
*? (matches zero or more times - reluctant construct)
+ (matches one or more times - greedy construct)
+? (matches one or more times - reluctant construct)
** (matches the empty sequence, or one or more of the quantified symbol, with gaps
allowed in between. The match need not begin or end with the quantified symbol)
*+ (matches the empty sequence, or one or more of the quantified symbol, with gaps
allowed in between. The match must end with the quantified symbol)
++ (matches the quantified symbol, followed by zero or more of the quantified symbol,
with gaps allowed in between. The match must end with the quantified symbol)
+* (matches the quantified symbol, followed by zero or more of the quantified symbol,
with gaps allowed in between. The match need not end with the quantified symbol)
• A symbol or pattern of symbols anchored by the regex special characters for the
beginning of string.
^ (marks the beginning of the set of rows that match to the pattern)
• patternA|patternB - The alternation operator (pipe symbol) between two symbols or
patterns signifies an OR match.
• patternA,patternB - The concatenation operator (comma) between two symbols or
patterns signifies a match when pattern B immediately follows pattern A.
• patternA->patternB - The follows operator (minus and greater-than sign) between
two symbols or patterns signifies a match when pattern B eventually follows pattern A.
• (pattern_expression) - By default, pattern expressions are matched from left to
right. If parentheses are used to group sub-expressions, the sub-expression within the
parentheses is evaluated first.
You cannot use quantifiers outside of parentheses. For example, you cannot write
((A,B,C)*) to indicate that the asterisk quantifier applies to the whole (A,B,C)
expression.
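For example, a hypothetical pattern built from the constructs above, matching a partition that begins with a row satisfying A, followed by zero or more rows satisfying B, and ending with a row satisfying C:
PATTERN (^A, B*, C)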
DEFINE symbol AS filter_expression
Required. The DEFINE clause is used to enumerate symbols used in the PATTERN clause
(or in the filter_expression of a subsequent symbol definition).
A symbol is a name used to refer to some pattern matching criteria. This can be any name
or token that follows Platfora's object naming rules. For example, if the name contains
spaces, special characters, keywords, or starts with a number, you must enclose the name
in brackets [] to escape it. Otherwise, this can be any logical name that helps you identify a
piece of pattern matching logic in your expression.
The filter_expression is a Boolean (true or false) expression that operates on each row of
the partition.
A filter_expression can contain:
• The special expression TRUE or 1, meaning allow the match to occur for any row in the
partition.
• Any field_name in the current dataset.
• symbol.field_name - A field from the dataset qualified by the name of a symbol
that (1) appears only once in the PATTERN clause, (2) precedes this symbol in the
PATTERN clause, and (3) is not followed by a repetition quantifier in the PATTERN
clause.
For example:
PATTERN (A, B) DEFINE A AS TRUE, B AS product = A.product
This means that the expression for symbol B will match to a row if the product field
for that row is also equal to the product field for the row that is bound to symbol A.
• Any of the comparison operators, such as greater than, less than, equals, and so on.
• The keywords AND or OR (for combining multiple criteria in a single filter expression)
• FIRST|LAST(symbol.field_name) - A field from the dataset, qualified by the name
of a symbol that (1) only appears once in the PATTERN clause, (2) precedes this symbol
in the PATTERN clause, and (3) is followed by a repetition quantifier in the PATTERN
clause (*,*?,+, or +?). This returns the field value for the first or last row when the
pattern matches to a set of rows.
For example:
PATTERN (A+) DEFINE A AS product = FIRST(A.product) OR COUNT(A)=0
The pattern A+ will match to a series of consecutive rows that all have the same value
for the product field as the first row in the sequence. If the current row happens to be
the first row in the sequence, then it will also be included in the match.
A FIRST or LAST expression evaluates to NULL if it refers to a
symbol that ends up matching an empty sequence. Make sure
your expression handles the row at the beginning or end of a
sequence if you want that row to match as well.
• Any computed expression that operates on the fields or expressions listed above and/or
on literal values.
OUTPUT output_expression
Required. An expression that specifies what the output value should be. The output
expression can refer to:
• The field declared in the PARTITION BY clause.
• symbol.field_name - A field from the dataset, qualified by the name of a symbol that
(1) appears only once in the PATTERN clause, and (2) is not followed by a repetition
quantifier in the PATTERN clause. This will output the matching field value.
• COUNT(symbol) where symbol (1) appears only once in the PATTERN clause, and
(2) is followed by a repetition quantifier in the PATTERN clause. This will output the
sequence number of the row that matched the symbol pattern.
• FIRST | LAST | SUM | COUNT | AVG(symbol.field_name) where symbol (1)
appears only once in the PATTERN clause, and (2) is followed by a repetition quantifier
in the PATTERN clause. This will output an aggregated value for a set of rows that
matched the symbol pattern.
• Since you can only output a single column value, you can use the PACK_VALUES
function to output multiple results in a single column as key/value pairs.
'Session Start Time' Expression
Calculate a user session by partitioning by user and ordering by time. The matching logic represented
by symbol A checks if the time of the current row is less than 30 minutes from the preceding row. If
it is, then it is considered part of the same session as the previous row. Otherwise, the current row is
considered the start of a new session. The PATTERN (A+) means that the matching logic represented
by symbol A must be true for one or more consecutive rows. The output then returns the time of the first
row in a session.
PARTITION BY UserID
ORDER BY Timestamp
PATTERN (A+)
DEFINE A AS COUNT(A)=0
OR MINUTES_BETWEEN(Timestamp,LAST(A.Timestamp)) < 30
OUTPUT FIRST(A.Timestamp)
'Click Number in Session' Expression
Calculate where a click happened in a session by partitioning by session and ordering by time. The
matching logic represented by symbol A simply matches to any row in the session. The PATTERN (A+)
means that the matching logic represented by symbol A must be true for one or more consecutive
rows. The output then returns the count of the row within the partition (based on its order or position in
the partition).
PARTITION BY [Session ID]
ORDER BY Timestamp
PATTERN (A+)
DEFINE A AS TRUE
OUTPUT COUNT(A)
'Path to Page' Expression
This is a complicated expression that looks back from the current row's position to determine the
previous 4 pages viewed in a session. Since a PARTITION expression can only output one column value
as its result, the OUTPUT clause uses the PACK_VALUES function to return the previous page positions
1,2,3, and 4 in one output value. You can then use a series of EXTRACT_VALUE expressions to create
individual columns for each prior page view in the path.
PARTITION BY SessionID
ORDER BY Timestamp
PATTERN (^OtherPreviousPages*?, Page4Back??, Page3Back??, Page2Back??,
Page1Back??, CurrentPage)
DEFINE OtherPreviousPages AS TRUE,
Page4Back AS TRUE,
Page3Back AS TRUE,
Page2Back AS TRUE,
Page1Back AS TRUE,
CurrentPage AS TRUE
OUTPUT PACK_VALUES("Back4",Page4Back.Page, "Back3",Page3Back.Page,
"Back2",Page2Back.Page, "Back1",Page1Back.Page)
‘Page -1 Back’ Expression
Use the output from the Path to Page expression and extract the last page viewed before the current
page.
EXTRACT_VALUE([Path to Page],"Back1")
PACK_VALUES
PACK_VALUES is a row function that returns multiple output values packed into a single string of key/
value pairs separated by the Platfora default key and pair separators. This is useful when the OUTPUT
clause of a PARTITION expression returns multiple output values. The string returned is in a format that
can be read by the EXTRACT_VALUE function. PACK_VALUES uses the same key and pair separator
values that EXTRACT_VALUE uses (the Unicode escape sequences \u0003 and \u0002, respectively).
PACK_VALUES(key_string,value_expression[,key_string,value_expression]
[,...])
Returns one value per row of type STRING. If the value for either key_string or
value_expression of a pair is null or contains either of the two separators, the full key/value pair is
omitted from the return value.
key_string
At least one required. A field name of any type, a literal string or number, or an expression that returns
any value.
value_expression
At least one required. A field name of any type, a literal string or number, or an expression that returns
any value. The expression must include one value_expression instance for each key_string instance.
Combine the values of the custid and age fields into a single string field.
PACK_VALUES("ID",custid,"Age",age)
The following expression returns ID\u00035555\u0002Age\u000329 when the value of the custid field is
5555 and the value of the age field is 29:
PACK_VALUES("ID",custid,"Age",age)
The following expression returns Age\u000329 when the value of the age field is 29:
PACK_VALUES("ID",NULL,"Age",age)
The following expression returns 29 as a STRING value when the age field is an INTEGER and its value
is 29:
EXTRACT_VALUE(PACK_VALUES("ID",custid,"Age",age),"Age")
You might want to use the PACK_VALUES function to combine multiple field values into a single value
in the OUTPUT clause of the PARTITION (event series processing) function. Then you can use the
EXTRACT_VALUE function in a different computed field in the dataset to get one of the values returned
by the PARTITION function. For example, in the example below, the PARTITION function creates a set
of rows that defines the previous five web pages accessed in a particular user session:
PARTITION BY Session
ORDER BY Time DESC
PATTERN (A?, B?, C?, D?, E)
DEFINE A AS true, B AS true, C AS true, D AS true, E AS true
OUTPUT PACK_VALUES("A", A.Page, "B", B.Page, "C", C.Page, "D", D.Page)
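A separate computed field can then unpack one of these values with EXTRACT_VALUE; a sketch assuming the PARTITION computed field above is saved under the illustrative name [Previous Pages]:
EXTRACT_VALUE([Previous Pages],"A")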
String Functions
String functions allow you to manipulate and transform textual data, such as combining string values or
extracting a portion of a string value.
CONCAT
CONCAT is a row function that returns a string by concatenating (combining together) the results of
multiple string expressions.
CONCAT(value_expression[,value_expression][,...])
Returns one value per row of type STRING.
value_expression
At least one required. A field name of any type, a literal string or number, or an expression that returns
any value.
Combine the values of the month, day, and year fields into a single date field formatted as MM/DD/
YYYY.
CONCAT(month,"/",day,"/",year)
ARRAY_CONTAINS
ARRAY_CONTAINS is a row function that performs a whole string match against a string containing
delimited values and returns a 1 or 0 depending on whether or not the string contains the search value.
ARRAY_CONTAINS(array_string,"delimiter","search_string")
Returns one value per row of type INTEGER. A return value of 1 indicates a positive match, and a return
value of 0 indicates no match.
array_string
Required. The name of a field or expression of type STRING (or a literal string) that contains a valid
array.
delimiter
Required. The delimiter used between values in the array string. This can be a name of a field or
expression of type STRING.
search_string
Required. The literal string that you want to search for. This can be a name of a field or expression of
type STRING.
If you had a device field that contained a comma delimited list formatted like this:
Safari,iPad
You could determine whether or not the device used was an iPad using the following expression:
ARRAY_CONTAINS(device,",","iPad")
The following expressions return 1:
ARRAY_CONTAINS("platfora","|","platfora")
ARRAY_CONTAINS("platfora|hadoop|2.3","|","hadoop")
The following expressions return 0:
ARRAY_CONTAINS("platfora","|","plat")
ARRAY_CONTAINS("platfora,hadoop","|","platfora")
FILE_NAME
FILE_NAME is a row function that returns the original file name from the source file system. This is
useful when the source data that comprises a dataset comes from multiple files, and there is useful
information in the file names themselves (such as dates or server names). You can use FILE_NAME in
combination with other string processing functions to extract useful information from the file name.
FILE_NAME()
Returns one value per row of type STRING.
Your dataset is based on daily log files that use an 8 character date as part of the file name. For example,
20120704.log is the file name used for the log file created on July 4, 2012. The following expression
uses FILE_NAME in combination with SUBSTRING and TO_DATE to create a date field from the first 8
characters of the file name.
TO_DATE(SUBSTRING(FILE_NAME(),0,8),"yyyyMMdd")
Your dataset is based on log files that use the server IP address as part of the file name. For example,
172.12.131.118.log is the log file name for server 172.12.131.118. The following expression uses
FILE_NAME in combination with REGEX to extract the IP address from the file name.
REGEX(FILE_NAME(),"(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})\.log")
FILE_PATH
FILE_PATH is a row function that returns the full URI path from the source file system. This is
useful when the source data that comprises a dataset comes from multiple files, and there is useful
information in the directory names or file names themselves (such as dates or server names). You can
use FILE_PATH in combination with other string processing functions to extract useful information
from the file path.
FILE_PATH()
Returns one value per row of type STRING.
Your dataset is based on daily log files that are organized into directories by date on the source file
system, and the file names are the server IP address of the server that produced the log file. For
example, the URI path to a log file produced by server 172.12.131.118 on July 4, 2012 is hdfs://myhdfs-server.com/data/logs/20120704/172.12.131.118.log.
The following expression uses FILE_PATH in combination with REGEX and TO_DATE to create a date
field from the date directory name.
TO_DATE(REGEX(FILE_PATH(),"hdfs://myhdfs-server.com/data/logs/(\d{8})/(?:\d{1,3}\.*)+\.log"),"yyyyMMdd")
And the following expression uses FILE_NAME and REGEX to extract the server IP address from the file
name:
REGEX(FILE_NAME(),"(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})\.log")
EXTRACT_COOKIE
EXTRACT_COOKIE is a row function that extracts the value of the given cookie identifier from a
semi-colon delimited list of cookie key=value pairs. This function can be used to extract a particular
cookie value from a combined web access log Cookie column.
EXTRACT_COOKIE("cookie_list_string",cookie_key_string)
Returns the value of the specified cookie key as type STRING.
cookie_list_string
Required. A field or literal string that has a semi-colon delimited list of cookie key=value pairs.
cookie_key_string
Required. The cookie key name for which to extract the cookie value.
Extract the value of the vID cookie from a literal cookie string:
EXTRACT_COOKIE("SSID=ABC; vID=44", "vID") returns 44
Extract the value of the vID cookie from a field named Cookie:
EXTRACT_COOKIE(Cookie,"vID")
EXTRACT_VALUE
EXTRACT_VALUE is a row function that extracts the value for the given key from a string containing
delimited key/value pairs.
EXTRACT_VALUE(string,key_name [,delimiter] [,pair_delimiter])
Returns the value of the specified key as type STRING.
string
Required. A field or literal string that contains a delimited list of key/value pairs.
key_name
Required. The key name for which to extract the value.
delimiter
Optional. The delimiter used between the key and the value. If not specified, the value \u0003 is used.
This is the Unicode escape sequence for the end-of-text character (which is Hive's default map-key
delimiter).
pair_delimiter
Optional. The delimiter used between key/value pairs when the input string contains more than one key/
value pair. If not specified, the value \u0002 is used. This is the Unicode escape sequence for the
start-of-text character (which is Hive's default collection-items delimiter).
Extract the value of the lastname key from a literal string of key/value pairs:
EXTRACT_VALUE("firstname;daria|lastname;hutch","lastname",";","|")
returns hutch
Extract the value of the email key from a string field named contact_info that contains strings in the
format of key:value,key:value:
EXTRACT_VALUE(contact_info,"email",":",",")
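When the input string was produced by PACK_VALUES, both delimiter arguments can be omitted because the two functions share the same default separators; for example, this returns 5555 when the value of the custid field is 5555:
EXTRACT_VALUE(PACK_VALUES("ID",custid,"Age",age),"ID")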
INSTR
INSTR is a row function that returns an integer indicating the position of a character within a string that
is the first character of the occurrence of a substring. Platfora's INSTR function is similar to the FIND
function in Excel, except that the first letter is position 0 and the order of the arguments is reversed.
INSTR(string,substring,position,occurrence)
Returns one value per row of type INTEGER. The first position is indicated with the value of zero (0).
string
Required. The name of a field or expression of type STRING (or a literal string).
substring
Required. A literal string or name of a field that specifies the substring to search for in string.
position
Optional. An integer that specifies at which character in string to start searching for substring. A value
of 0 (zero) starts the search at the beginning of string. Use a positive integer to start searching from
the beginning of string, and use a negative integer to start searching from the end of string. When no
position is specified, INSTR searches at the beginning of the string (0).
occurrence
Optional. A positive integer that specifies which occurrence of substring to search for. When no
occurrence is specified, INSTR searches for the first occurrence of the substring (1).
Return the position of the first occurrence of the substring "http://" starting at the end of the url field:
INSTR(url,"http://",-1,1)
The following expression searches for the second occurrence of the substring "st" starting at the
beginning of the string "bestteststring". INSTR finds that the substring starts at the seventh character in
the string, so it returns 6:
INSTR("bestteststring","st",0,2)
The following expression searches backward for the second occurrence of the substring "st" starting at 7
characters before the end of the string "bestteststring". INSTR finds that the substring starts at the third
character in the string, so it returns 2:
INSTR("bestteststring","st",-7,2)
JAVA_STRING
JAVA_STRING is a row function that returns the unescaped version of a Java unicode character escape
sequence as a string value. This is useful when you want to specify unicode characters in an expression.
For example, you can use JAVA_STRING to specify the unicode value representing a control character.
JAVA_STRING(unicode_escape_sequence)
Returns the unescaped version of the specified unicode character, one value per row of type STRING.
unicode_escape_sequence
Required. A STRING value containing a unicode character expressed as a Java unicode escape
sequence. Unicode escape sequences consist of a backslash '\' (ASCII character 92, hex 0x5c), a
'u' (ASCII 117, hex 0x75), optionally one or more additional 'u' characters, and four hexadecimal digits
(the characters '0' through '9' or 'a' through 'f' or 'A' through 'F'). Such sequences represent the UTF-16
encoding of a Unicode character. For example, the letter 'a' is equivalent to '\u0061'.
Evaluates whether the currency field is equal to the yen symbol.
CASE WHEN currency == JAVA_STRING("\u00a5") THEN "yes" ELSE "no" END
JOIN_STRINGS
JOIN_STRINGS is a row function that returns a string by concatenating (combining together) the results
of multiple values with the separator in between each non-null value.
JOIN_STRINGS(separator,value_expression[,value_expression][,...])
Returns one value per row of type STRING.
separator
Required. A field name of type STRING, a literal string, or an expression that returns a string.
value_expression
At least one required. A field name of any type, a literal string or number, or an expression that returns
any value.
Combine the values of the month, day, and year fields into a single date field formatted as MM/DD/
YYYY.
JOIN_STRINGS("/",month,day,year)
The following expression returns NULL:
JOIN_STRINGS("+",NULL,NULL,NULL)
The following expression returns a+b:
JOIN_STRINGS("+","a","b",NULL)
JSON_ARRAY_CONTAINS
JSON_ARRAY_CONTAINS is a row function that performs a whole string match against a string
formatted as a JSON array and returns a 1 or 0 depending on whether or not the string contains the
search value.
JSON_ARRAY_CONTAINS(json_array_string,"search_string")
Returns one value per row of type INTEGER. A return value of 1 indicates a positive match, and a return
value of 0 indicates no match.
json_array_string
Required. The name of a field or expression of type STRING (or a literal string) that contains a valid
JSON array. A JSON array is an ordered sequence of values separated by commas and enclosed in
square brackets.
search_string
Required. The literal string that you want to search for. This can be a name of a field or expression of
type STRING.
If you have a software field that contains a JSON array formatted like this:
["hadoop","platfora"]
The following expression returns 1:
JSON_ARRAY_CONTAINS(software,"platfora")
JSON_DOUBLE
JSON_DOUBLE is a row function that extracts a DOUBLE value from a field in a JSON object.
JSON_DOUBLE(json_string,"json_field")
Returns one value per row of type DOUBLE.
json_string
Required. The name of a field or expression of type STRING (or a literal string) that contains a valid
JSON object.
json_field
Required. The key or name of the field value you want to extract.
For top-level fields, specify the name identifier (key) of the field.
To access fields within a nested object, specify a dot-separated path of field names (for example
top_level_field_name.nested_field_name).
To extract a value from an array, specify the dot-separated path of field names and the array position
starting at 0 for the first value in an array, 1 for the second value, and so on (for example, field_name.0).
If the name identifier contains dot or period characters within the name itself, escape the name by
enclosing it in brackets (for example, [field.name.with.dot].[another.dot.field.name]).
If the field name is null (empty), use brackets with nothing in between as the identifier, for example [].
If you had a top_scores field that contained a JSON object formatted like this (with the values contained
in an array):
{"practice_scores":["538.67","674.99","1021.52"], "test_scores":
["753.21","957.88","1032.87"]}
You could extract the third value of the test_scores array using the expression:
JSON_DOUBLE(top_scores,"test_scores.2")
JSON_FIXED
JSON_FIXED is a row function that extracts a FIXED value from a field in a JSON object.
JSON_FIXED(json_string,"json_field")
Returns one value per row of type FIXED.
json_string
Required. The name of a field or expression of type STRING (or a literal string) that contains a valid
JSON object.
json_field
Required. The key or name of the field value you want to extract.
For top-level fields, specify the name identifier (key) of the field.
To access fields within a nested object, specify a dot-separated path of field names (for example
top_level_field_name.nested_field_name).
To extract a value from an array, specify the dot-separated path of field names and the array position
starting at 0 for the first value in an array, 1 for the second value, and so on (for example, field_name.0).
If the name identifier contains dot or period characters within the name itself, escape the name by
enclosing it in brackets (for example, [field.name.with.dot].[another.dot.field.name]).
If the field name is null (empty), use brackets with nothing in between as the identifier, for example [].
If you had a top_scores field that contained a JSON object formatted like this (with the values contained
in an array):
{"practice_scores":["538.67","674.99","1021.52"], "test_scores":
["753.21","957.88","1032.87"]}
You could extract the third value of the test_scores array using the expression:
JSON_FIXED(top_scores,"test_scores.2")
JSON_INTEGER
JSON_INTEGER is a row function that extracts an INTEGER value from a field in a JSON object.
JSON_INTEGER(json_string,"json_field")
Returns one value per row of type INTEGER.
json_string
Required. The name of a field or expression of type STRING (or a literal string) that contains a valid
JSON object.
json_field
Required. The key or name of the field value you want to extract.
For top-level fields, specify the name identifier (key) of the field.
To access fields within a nested object, specify a dot-separated path of field names (for example
top_level_field_name.nested_field_name).
To extract a value from an array, specify the dot-separated path of field names and the array position
starting at 0 for the first value in an array, 1 for the second value, and so on (for example, field_name.0).
If the name identifier contains dot or period characters within the name itself, escape the name by
enclosing it in brackets (for example, [field.name.with.dot].[another.dot.field.name]).
If the field name is null (empty), use brackets with nothing in between as the identifier, for example [].
If you had an address field that contained a JSON object formatted like this:
{"street_address":"123 B Street", "city":"San Mateo", "state":"CA",
"zip_code":"94403"}
You could extract the zip_code value using the expression:
JSON_INTEGER(address,"zip_code")
If you had a top_scores field that contained a JSON object formatted like this (with the values contained
in an array):
{"practice_scores":["538","674","1021"], "test_scores":
["753","957","1032"]}
You could extract the third value of the test_scores array using the expression:
JSON_INTEGER(top_scores,"test_scores.2")
JSON_LONG
JSON_LONG is a row function that extracts a LONG value from a field in a JSON object.
JSON_LONG(json_string,"json_field")
Returns one value per row of type LONG.
json_string
Required. The name of a field or expression of type STRING (or a literal string) that contains a valid
JSON object.
json_field
Required. The key or name of the field value you want to extract.
For top-level fields, specify the name identifier (key) of the field.
To access fields within a nested object, specify a dot-separated path of field names (for example
top_level_field_name.nested_field_name).
To extract a value from an array, specify the dot-separated path of field names and the array position
starting at 0 for the first value in an array, 1 for the second value, and so on (for example, field_name.0).
If the name identifier contains dot or period characters within the name itself, escape the name by
enclosing it in brackets (for example, [field.name.with.dot].[another.dot.field.name]).
If the field name is null (empty), use brackets with nothing in between as the identifier, for example [].
If you had a top_scores field that contained a JSON object formatted like this (with the values contained
in an array):
{"practice_scores":["538","674","1021"], "test_scores":
["753","957","1032"]}
You could extract the third value of the test_scores array using the expression:
JSON_LONG(top_scores,"test_scores.2")
JSON_STRING
JSON_STRING is a row function that extracts a STRING value from a field in a JSON object.
JSON_STRING(json_string,"json_field")
Returns one value per row of type STRING.
json_string
Required. The name of a field or expression of type STRING (or a literal string) that contains a valid
JSON object.
json_field
Required. The key or name of the field value you want to extract.
For top-level fields, specify the name identifier (key) of the field.
To access fields within a nested object, specify a dot-separated path of field names (for example
top_level_field_name.nested_field_name).
To extract a value from an array, specify the dot-separated path of field names and the array position
starting at 0 for the first value in an array, 1 for the second value, and so on (for example, field_name.0).
If the name identifier contains dot or period characters within the name itself, escape the name by
enclosing it in brackets (for example, [field.name.with.dot].[another.dot.field.name]).
If the field name is null (empty), use brackets with nothing in between as the identifier, for example [].
If you had an address field that contained a JSON object formatted like this:
{"street_address":"123 B Street", "city":"San Mateo", "state":"CA",
"zip":"94403"}
You could extract the state value using the expression:
JSON_STRING(address,"state")
If you had a misc field that contained a JSON object formatted like this (with the values contained in an
array):
{"hobbies":["sailing","hiking","cooking"], "interests":
["art","music","travel"]}
You could extract the first value of the hobbies array using the expression:
JSON_STRING(misc,"hobbies.0")
LENGTH
LENGTH is a row function that returns the count of characters in a string value.
LENGTH(string)
Returns one value per row of type INTEGER.
string
Required. The name of a field or expression of type STRING (or a literal string).
Return the count of characters from values in the name field. For example, the value Bob would return a
length of 3, Julie would return a length of 5, and so on:
LENGTH(name)
REGEX
REGEX is a row function that performs a whole string match against a string value with a regular
expression and returns the portion of the string matching the first capturing group of the regular
expression.
REGEX(string_expression,"regex_matching_pattern")
Returns the matched STRING value of the first capturing group of the regular expression. If there is no
match, returns NULL.
string_expression
Required. The name of a field or expression of type STRING (or a literal string).
regex_matching_pattern
Required. A regular expression pattern based on the regular expression pattern matching syntax of the
Java programming language. To return a non-NULL value, the regular expression pattern must match
the entire string value.
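For example, a minimal sketch of the whole-string requirement: the first expression returns 2 because the pattern matches the entire value, while the second returns NULL because the pattern matches only part of it:
REGEX("file_2.txt","file_(\d+)\.txt")
REGEX("file_2.txt","(\d+)")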
This section lists a summary of the most commonly used constructs for defining a regular expression
matching pattern. See the Regular Expression Reference for more information about regular expression
support in Platfora.
Literal and Special Characters
The most basic form of pattern matching is the match of literal characters. For example, if the regular
expression is foo and the input string is foo, the match will succeed because the strings are identical.
Certain characters are reserved for special use in regular expressions. These special characters are often
called metacharacters. If you want to use special characters as literal characters, they must be escaped.
You can escape a single character using a \ (backslash), or escape a character sequence by enclosing it
in \Q ... \E.
To escape literal double-quotes, double the double-quotes ("").
Character Name       Character  Reserved For
opening bracket      [          start of a character class
closing bracket      ]          end of a character class
hyphen               -          character ranges within a character class
backslash            \          general escape character
caret                ^          beginning of string, negation of a character class
dollar sign          $          end of string
period               .          matching any single character
pipe                 |          alternation (OR) operator
question mark        ?          optional quantifier, quantifier minimizer
asterisk             *          zero or more quantifier
plus sign            +          once or more quantifier
opening parenthesis  (          start of a subexpression group
closing parenthesis  )          end of a subexpression group
opening brace        {          start of min/max quantifier
closing brace        }          end of min/max quantifier
Character Class Constructs
A character class allows you to specify a set of characters, enclosed in square brackets, that can produce
a single character match. There are also a number of special predefined character classes (backslash
character sequences that are shorthand for the most common character sets).
Construct     Type          Description
[abc]         simple        matches a or b or c
[^abc]        negation      matches any character except a or b or c
[a-zA-Z]      range         matches a through z, or A through Z (inclusive)
[a-d[m-p]]    union         matches a through d, or m through p
[a-z&&[def]]  intersection  matches d, e, or f
[a-z&&[^xq]]  subtraction   matches a through z, except for x and q
Predefined Character Classes
Predefined character classes offer convenient shorthands for commonly used regular expressions.
Construct  Description
.          matches any single character (except newline)
           Example: .at matches "cat", "hat", and also "bat" in the phrase "batch files"
\d         matches any digit character (equivalent to [0-9])
           Example: \d matches "3" in "C3PO" and "2" in "file_2.txt"
\D         matches any non-digit character (equivalent to [^0-9])
           Example: \D matches "S" in "900S" and "Q" in "Q45"
\s         matches any single white-space character (equivalent to [ \t\n\x0B\f\r])
           Example: \sbook matches "book" in "blue book" but nothing in "notebook"
\S         matches any single non-white-space character
           Example: \Sbook matches "book" in "notebook" but nothing in "blue book"
\w         matches any alphanumeric character, including underscore (equivalent to [A-Za-z0-9_])
           Example: r\w* matches "rm" and "root"
\W         matches any non-alphanumeric character (equivalent to [^A-Za-z0-9_])
           Example: \W matches "&" in "stmd &", "%" in "100%", and "$" in "$HOME"
Line and Word Boundaries
Boundary matching constructs are used to specify where in a string to apply a matching pattern. For
example, you can search for a particular pattern within a word boundary, or search for a pattern at the
beginning or end of a line.
Construct  Description
^          matches from the beginning of a line (multi-line matches are currently not supported)
           Example: ^172 will match the "172" in IP address "172.18.1.11" but not in "192.172.2.33"
$          matches from the end of a line (multi-line matches are currently not supported)
           Example: d$ will match the "d" in "maid" but not in "made"
\b         matches within a word boundary
           Example: \bis\b matches the word "is" in "this is my island", but not the "is" part of "this" or "island". \bis matches both "is" and the "is" in "island", but not in "this".
\B         matches within a non-word boundary
           Example: \Bb matches "b" in "sbin" but not in "bash"
Quantifiers
Quantifiers specify how often the preceding regular expression construct should match. There are three
classes of quantifiers: greedy, reluctant, and possessive. The difference between greedy, reluctant, and
possessive quantifiers involves what part of the string to try for the initial match, and how to retry if the
initial attempt does not produce a match.
Greedy  Reluctant  Possessive  Description
?       ??         ?+          matches the previous character or construct once or not at all
                               Example: st?on matches "son" in "johnson" and "ston" in "johnston" but nothing in "clinton" or "version"
*       *?         *+          matches the previous character or construct zero or more times
                               Example: if* matches "if", "iff" in "diff", or "i" in "print"
+       +?         ++          matches the previous character or construct one or more times
                               Example: if+ matches "if", "iff" in "diff", but nothing in "print"
{n}     {n}?       {n}+        matches the previous character or construct exactly n times
                               Example: o{2} matches "oo" in "lookup" and the first two o's in "fooooo" but nothing in "mount"
{n,}    {n,}?      {n,}+       matches the previous character or construct at least n times
                               Example: o{2,} matches "oo" in "lookup", all five o's in "fooooo", but nothing in "mount"
{n,m}   {n,m}?     {n,m}+      matches the previous character or construct at least n times, but no more than m times
                               Example: F{2,4} matches "FF" in "#FF0000" and the last four F's in "#FFFFFF"
Groups are specified by a pair of parentheses around a subpattern in the regular expression. A pattern can
have more than one group and the groups can be nested. The groups are numbered 1-n from left to right,
starting with the first opening parenthesis. There is always an implicit group 0, which contains the entire
match. For example, the pattern:
(a(b*))+(c)
contains three groups:
group 1: (a(b*))
group 2: (b*)
group 3: (c)
Capturing Groups
By default, a group captures the text that produces a match, and only the most recent match is captured.
The REGEX function returns the string that matches the first capturing group in the regular expression.
For example, if the input string to the expression above was abc, the entire REGEX function would
match to abc, but only return the result of group 1, which is ab.
Non-Capturing Groups
In some cases, you may want to use parentheses to group subpatterns, but not capture text. A
non-capturing group starts with (?: (a question mark and colon following the opening parenthesis). For
example, h(?:a|i|o)t matches hat or hit or hot, but does not capture the a, i, or o from the
subexpression.
Match all possible email address strings in the format of username@provider.domain, but only return
the provider portion of the email address from the email field:
Match the request line of a web log, where the value is in the format of:
GET /some_page.html HTTP/1.1
and return just the requested HTML page names:
REGEX(weblog.request_line,"GET\s/([a-zA-Z0-9._%-]+\.html)\sHTTP/[0-9.]+")
Extract the inches portion from a height field where example values are 6'2", 5'11" (notice the
escaping of the literal quote with a double double-quote):
REGEX(height, "\d\'(\d+)""")
Extract all of the contents of the device field when the value is either iPod, iPad, or iPhone:
REGEX(device,"(iP[ao]d|iPhone)")
REGEX_REPLACE
REGEX_REPLACE is a row function that evaluates a string value against a regular expression to
determine if there is a match, and replaces matched strings with the specified replacement value.
REGEX_REPLACE(string_expression,"regex_match_pattern","regex_replace_pattern")
Returns the regex_replace_pattern as a STRING value when regex_match_pattern produces a match. If
there is no match, returns the value of string_expression as a STRING.
string_expression
Required. The name of a field or expression of type STRING (or a literal string).
regex_match_pattern
Required. A string literal or regular expression pattern based on the regular expression pattern matching
syntax of the Java programming language. You can use capturing groups to create backreferences that
can be used in the regex_replace_pattern. You might want to use a string literal to make a case-sensitive
match. For example, when you enter jane as the match value, the function matches jane but not Jane.
The function matches all occurrences of a string literal in the string expression.
regex_replace_pattern
Required. A string literal or regular expression pattern based on the regular expression pattern
matching syntax of the Java programming language. You can refer to backreferences from the
regex_match_pattern using the syntax $n (where n is the group number).
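For example, a minimal sketch of backreferences in the replacement pattern, swapping two capture groups:
REGEX_REPLACE("a-b","(a)-(b)","$2-$1") returns b-a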
This section lists a summary of the most commonly used constructs for defining a regular expression
matching pattern. See the Regular Expression Reference for more information.
Literal and Special Characters
The most basic form of pattern matching is the match of literal characters. For example, if the regular
expression is foo and the input string is foo, the match will succeed because the strings are identical.
Certain characters are reserved for special use in regular expressions. These special characters are often
called metacharacters. If you want to use special characters as literal characters, they must be escaped.
You can escape a single character using a \ (backslash), or escape a character sequence by enclosing it
in \Q ... \E.
Character Name       Character  Reserved For
opening bracket      [          start of a character class
closing bracket      ]          end of a character class
hyphen               -          character ranges within a character class
backslash            \          general escape character
caret                ^          beginning of string, negation of a character class
dollar sign          $          end of string
period               .          matching any single character
pipe                 |          alternation (OR) operator
question mark        ?          optional quantifier, quantifier minimizer
asterisk             *          zero or more quantifier
plus sign            +          once or more quantifier
opening parenthesis  (          start of a subexpression group
closing parenthesis  )          end of a subexpression group
opening brace        {          start of min/max quantifier
closing brace        }          end of min/max quantifier
Character Class Constructs
A character class allows you to specify a set of characters, enclosed in square brackets, that can produce
a single character match. There are also a number of special predefined character classes (backslash
character sequences that are shorthand for the most common character sets).
Construct     Type          Description
[abc]         simple        matches a or b or c
[^abc]        negation      matches any character except a or b or c
[a-zA-Z]      range         matches a through z, or A through Z (inclusive)
[a-d[m-p]]    union         matches a through d, or m through p
[a-z&&[def]]  intersection  matches d, e, or f
[a-z&&[^xq]]  subtraction   matches a through z, except for x and q
Predefined Character Classes
Predefined character classes offer convenient shorthands for commonly used regular expressions.
Construct  Description
.          matches any single character (except newline)
           Example: .at matches "cat", "hat", and also "bat" in the phrase "batch files"
\d         matches any digit character (equivalent to [0-9])
           Example: \d matches "3" in "C3PO" and "2" in "file_2.txt"
\D         matches any non-digit character (equivalent to [^0-9])
           Example: \D matches "S" in "900S" and "Q" in "Q45"
\s         matches any single white-space character (equivalent to [ \t\n\x0B\f\r])
           Example: \sbook matches "book" in "blue book" but nothing in "notebook"
\S         matches any single non-white-space character
           Example: \Sbook matches "book" in "notebook" but nothing in "blue book"
\w         matches any alphanumeric character, including underscore (equivalent to [A-Za-z0-9_])
           Example: r\w* matches "rm" and "root"
\W         matches any non-alphanumeric character (equivalent to [^A-Za-z0-9_])
           Example: \W matches "&" in "stmd &", "%" in "100%", and "$" in "$HOME"
Line and Word Boundaries
Boundary matching constructs are used to specify where in a string to apply a matching pattern. For
example, you can search for a particular pattern within a word boundary, or search for a pattern at the
beginning or end of a line.
Construct  Description
^          matches from the beginning of a line (multi-line matches are currently not supported)
           Example: ^172 will match the "172" in IP address "172.18.1.11" but not in "192.172.2.33"
$          matches from the end of a line (multi-line matches are currently not supported)
           Example: d$ will match the "d" in "maid" but not in "made"
\b         matches within a word boundary
           Example: \bis\b matches the word "is" in "this is my island", but not the "is" part of "this" or "island". \bis matches both "is" and the "is" in "island", but not in "this".
\B         matches within a non-word boundary
           Example: \Bb matches "b" in "sbin" but not in "bash"
Quantifiers
Quantifiers specify how often the preceding regular expression construct should match. There are three
classes of quantifiers: greedy, reluctant, and possessive. The difference between greedy, reluctant, and
possessive quantifiers involves what part of the string to try for the initial match, and how to retry if the
initial attempt does not produce a match.
Greedy  Reluctant  Possessive  Description
?       ??         ?+          matches the previous character or construct once or not at all
                               Example: st?on matches "son" in "johnson" and "ston" in "johnston" but nothing in "clinton" or "version"
*       *?         *+          matches the previous character or construct zero or more times
                               Example: if* matches "if", "iff" in "diff", or "i" in "print"
+       +?         ++          matches the previous character or construct one or more times
                               Example: if+ matches "if", "iff" in "diff", but nothing in "print"
{n}     {n}?       {n}+        matches the previous character or construct exactly n times
                               Example: o{2} matches "oo" in "lookup" and the first two o's in "fooooo" but nothing in "mount"
{n,}    {n,}?      {n,}+       matches the previous character or construct at least n times
                               Example: o{2,} matches "oo" in "lookup", all five o's in "fooooo", but nothing in "mount"
{n,m}   {n,m}?     {n,m}+      matches the previous character or construct at least n times, but no more than m times
                               Example: F{2,4} matches "FF" in "#FF0000" and the last four F's in "#FFFFFF"
Match the values in a phone_number field where phone number values are formatted as
xxx.xxx.xxxx and replace them with phone number values formatted as (xxx) xxx-xxxx:
REGEX_REPLACE(phone_number,"([0-9]{3})\.([0-9]{3})\.([0-9]{4})","($1) $2-$3")
Match the values in a name field where name values are formatted as firstname lastname and
replace them with name values formatted as lastname, firstname:
REGEX_REPLACE(name,"(.*) (.*)","$2, $1")
Match the string literal mrs in a title field and replace it with the string literal Mrs.
REGEX_REPLACE(title,"mrs","Mrs")
SPLIT
SPLIT is a row function that breaks down a delimited input string into sections and returns the specified
section of the string. A section is considered any sub-string between the specified delimiter.
SPLIT(input_string_expression,"delimiter_string",position_integer)
Returns one value per row of type STRING.
input_string_expression
Required. The name of a field or expression of type STRING (or a literal string).
delimiter_string
Required. A literal string representing the delimiter used to separate values in the input string. The
delimiter can be a single character or multiple characters.
position_integer
Required. An integer representing the position of the section in the input string that you want to extract.
Positive integers count the position from the beginning of the string, and negative integers count the
position from the end of the string. A value of 0 returns NULL.
Return the last section of the literal delimited string Restaurants>Location>San Francisco:
SPLIT("Restaurants>Location>San Francisco",">", -1) returns San Francisco
Return the first section of a phone_number field where phone number values are in the format of
123-456-7890:
SPLIT(phone_number,"-",1)
SUBSTRING
SUBSTRING is a row function that returns the specified characters of a string value based on the given
start and end position.
SUBSTRING(string,start,end)
Returns one value per row of type STRING.
string
Required. The name of a field or expression of type STRING (or a literal string).
start
Required. An integer that specifies where the returned characters start (inclusive), with 0 being the first
character of the string. If start is greater than the number of characters, then an empty string is returned.
If start is greater than end, then an empty string is returned.
end
Required. A positive integer that specifies where the returned characters end (exclusive), with the end
character not being part of the return value. If end is greater than the number of characters, the whole
string value (from start) is returned.
Return the first letter of the name field:
SUBSTRING(name,0,1)
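Based on the boundary rules above, if end exceeds the string length the whole remaining string is returned; for example, when the value of name is Bob, the following returns Bob:
SUBSTRING(name,0,100)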
TO_LOWER
TO_LOWER is a row function that converts all alphabetic characters in a string to lower case.
TO_LOWER(string_expression)
Returns one value per row of type STRING.
string_expression
Required. The name of a field or expression of type STRING (or a literal string).
Return the literal input string 123 Main Street in all lower case letters:
TO_LOWER("123 Main Street") returns 123 main street
TO_UPPER
TO_UPPER is a row function that converts all alphabetic characters in a string to upper case.
TO_UPPER(string_expression)
Returns one value per row of type STRING.
string_expression
Required. The name of a field or expression of type STRING (or a literal string).
Return the literal input string 123 Main Street in all upper case letters:
TO_UPPER("123 Main Street") returns 123 MAIN STREET
TRIM
TRIM is a row function that removes leading and trailing spaces from a string value.
TRIM(string_expression)
Returns one value per row of type STRING.
string_expression
Required. The name of a field or expression of type STRING (or a literal string).
Return the value of the area_code field without any leading or trailing spaces. For example, if the input
string is " 650 ", then the return value would be "650":
TRIM(area_code)
Return the value of the phone_number field without any leading or trailing spaces. For example, if the
input string is " 650 123-4567 ", then the return value would be "650 123-4567" (note that the extra
spaces in the middle of the string are not removed, only the spaces at the beginning and end of the
string):
TRIM(phone_number)
XPATH_STRING
XPATH_STRING is a row function that takes an XML-formatted string and returns the first string
matching the given XPath expression.
XPATH_STRING(xml_formatted_string,"xpath_expression")
Returns one value per row of type STRING.
If the XPath expression matches more than one string in the given XML node, this function will return
the first match only. To return all matches, use XPATH_STRINGS instead.
xml_formatted_string
Required. The name of a field or a literal string that contains a valid XML node (a snippet of XML
consisting of a parent element and one or more child nodes).
xpath_expression
Required. An XPath expression that refers to a node, element, or attribute within the XML string passed
to this expression. Any XPath expression that complies to the XML Path Language (XPath) Version 1.0
specification is valid.
These example XPATH_STRING expressions assume you have a field in your dataset named address
that contains XML-formatted strings such as this:
<list>
<address type="work">
<street>1300 So. El Camino Real</street>
<street>Suite 600</street>
<city>San Mateo</city>
<state>CA</state>
<zipcode>94403</zipcode>
</address>
<address type="home">
<street>123 Oakdale Street</street>
<street/>
<city>San Francisco</city>
<state>CA</state>
<zipcode>94123</zipcode>
</address>
</list>
Get the zipcode value from any address element where the type attribute equals home:
XPATH_STRING(address,"//address[@type='home']/zipcode")
returns: 94123
Get the city value from the second address element:
XPATH_STRING(address,"/list/address[2]/city")
returns: San Francisco
Get the values from all child elements of the first address element (as one string):
XPATH_STRING(address,"/list/address")
returns: 1300 So. El Camino RealSuite 600 San MateoCA94403
XPATH_STRINGS
XPATH_STRINGS is a row function that takes an XML-formatted string and returns a newline-separated
array of strings matching the given XPath expression.
XPATH_STRINGS(xml_formatted_string,"xpath_expression")
Returns one value per row of type STRING.
If the XPath expression matches more than one string in the given XML node, this function will return
all matches separated by a newline (you cannot specify a different delimiter).
xml_formatted_string
Required. The name of a field or a literal string that contains a valid XML node (a snippet of XML
consisting of a parent element and one or more child nodes).
xpath_expression
Required. An XPath expression that refers to a node, element, or attribute within the XML string passed
to this expression. Any XPath expression that complies to the XML Path Language (XPath) Version 1.0
specification is valid.
These example XPATH_STRINGS expressions assume you have a field in your dataset named address
that contains XML-formatted strings such as this:
<list>
<address type="work">
<street>1300 So. El Camino Real</street>
<street>Suite 600</street>
<city>San Mateo</city>
<state>CA</state>
<zipcode>94403</zipcode>
</address>
<address type="home">
<street>123 Oakdale Street</street>
<street/>
<city>San Francisco</city>
<state>CA</state>
<zipcode>94123</zipcode>
</address>
</list>
Get all zipcode values from all address elements:
XPATH_STRINGS(address,"//address/zipcode")
returns:
94123
94403
Get all street values from the first address element:
XPATH_STRINGS(address,"/list/address[1]/street")
returns:
1300 So. El Camino Real
Suite 600
Get the values from all child elements of all address elements (as one string per line):
XPATH_STRINGS(address,"/list/address")
returns:
123 Oakdale StreetSan FranciscoCA94123
1300 So. El Camino RealSuite 600 San MateoCA94403
XPATH_XML
XPATH_XML is a row function that takes an XML-formatted string and returns an XML-formatted string
matching the given XPath expression.
XPATH_XML(xml_formatted_string,"xpath_expression")
Returns one value per row of type STRING in XML format.
xml_formatted_string
Required. The name of a field or a literal string that contains a valid XML node (a snippet of XML
consisting of a parent element and one or more child nodes).
xpath_expression
Required. An XPath expression that refers to a node, element, or attribute within the XML string passed
to this expression. Any XPath expression that complies to the XML Path Language (XPath) Version 1.0
specification is valid.
These example XPATH_XML expressions assume you have a field in your dataset named address
that contains XML-formatted strings such as this:
<list>
<address type="work">
<street>1300 So. El Camino Real</street>
<street>Suite 600</street>
<city>San Mateo</city>
<state>CA</state>
<zipcode>94403</zipcode>
</address>
<address type="home">
<street>123 Oakdale Street</street>
<street/>
<city>San Francisco</city>
<state>CA</state>
<zipcode>94123</zipcode>
</address>
</list>
Get the last address node and its child nodes in XML format:
XPATH_XML(address,"//address[last()]")
returns:
<address type="home">
<street>123 Oakdale Street</street>
<street/>
<city>San Francisco</city>
<state>CA</state>
<zipcode>94123</zipcode>
</address>
Get the city value from the second address node in XML format:
XPATH_XML(address,"/list/address[2]/city")
returns: <city>San Francisco</city>
Get the first address node and its child nodes in XML format:
XPATH_XML(address,"/list/address[1]")
returns:
<address type="work">
<street>1300 So. El Camino Real</street>
<street>Suite 600</street>
<city>San Mateo</city>
<state>CA</state>
<zipcode>94403</zipcode>
</address>
URL Functions
URL functions allow you to extract different portions of a URL string, and decode text that is URL-encoded.
URL_AUTHORITY
URL_AUTHORITY is a row function that returns the authority portion of a URL string. The authority
portion of a URL is the part that has the information on how to locate and connect to the server.
URL_AUTHORITY(string)
Returns the authority portion of a URL as a STRING value, or NULL if the input string is not a valid
URL.
For example, in the string http://www.platfora.com/company/contact.html, the authority
portion is www.platfora.com.
In the string http://user:[email protected]:8012/mypage.html, the authority
portion is user:[email protected]:8012.
In the string mailto:[email protected]?subject=Topic, the authority portion is
NULL.
string
Required. A field or expression that returns a STRING value in URI (uniform resource identifier) format
of: protocol:authority[/path][?query][#fragment].
The authority portion of the URL contains the host information, which can be specified as a domain
name (www.platfora.com), a host name (localhost), or an IP address (127.0.0.1). The host
information can be preceded by optional user information terminated with @ (for example,
username:[email protected]), and followed by an optional port number preceded by a colon
(for example, localhost:8001).
Return the authority portion of URL string values in the referrer field:
URL_AUTHORITY(referrer)
Return the authority portion of a literal URL string:
URL_AUTHORITY("http://user:[email protected]:8012/mypage.html")
returns user:[email protected]:8012
URL_FRAGMENT
URL_FRAGMENT is a row function that returns the fragment portion of a URL string.
URL_FRAGMENT(string)
Returns the fragment portion of a URL as a STRING value, NULL if the URL does not contain a
fragment, or NULL if the input string is not a valid URL.
For example, in the string http://www.platfora.com/contact.html#phone, the fragment
portion is phone.
In the string http://www.platfora.com/contact.html, the fragment portion is NULL.
In the string http://platfora.com/news.php?topic=press#Platfora%20News, the
fragment portion is Platfora%20News.
string
Required. A field or expression that returns a STRING value in URI (uniform resource identifier) format
of: protocol:authority[/path][?query][#fragment].
The optional fragment portion of the URL is separated by a hash mark (#) and provides direction to a
secondary resource, such as a heading or anchor identifier.
Return the fragment portion of URL string values in the request field:
URL_FRAGMENT(request)
Return the fragment portion of a literal URL string:
URL_FRAGMENT("http://platfora.com/news.php?topic=press#Platfora%20News")
returns Platfora%20News
Return and decode the fragment portion of a literal URL string:
URLDECODE(URL_FRAGMENT("http://platfora.com/news.php?
topic=press#Platfora%20News")) returns Platfora News
URL_HOST
URL_HOST is a row function that returns the host, domain, or IP address portion of a URL string.
URL_HOST(string)
Returns the host portion of a URL as a STRING value, or NULL if the input string is not a valid URL.
For example, in the string http://www.platfora.com/company/contact.html, the host
portion is www.platfora.com.
In the string http://admin:[email protected]:8001/index.html, the host portion is
127.0.0.1.
In the string mailto:[email protected]?subject=Topic, the host portion is NULL.
string
Required. A field or expression that returns a STRING value in URI (uniform resource identifier) format
of: protocol:authority[/path][?query][#fragment].
The authority portion of the URL contains the host information, which can be specified as a domain
name (www.platfora.com), a host name (localhost), or an IP address (127.0.0.1).
Return the host portion of URL string values in the referrer field:
URL_HOST(referrer)
Return the host portion of a literal URL string:
URL_HOST("http://user:[email protected]:8012/mypage.html") returns
mycompany.com
URL_PATH
URL_PATH is a row function that returns the path portion of a URL string.
URL_PATH(string)
Returns the path portion of a URL as a STRING value, NULL if the URL does not contain a path, or
NULL if the input string is not a valid URL.
For example, in the string http://www.platfora.com/company/contact.html, the path
portion is /company/contact.html.
In the string http://admin:[email protected]:8001/index.html, the path portion is /
index.html.
In the string mailto:[email protected]?subject=Topic, the path portion is
[email protected].
string
Required. A field or expression that returns a STRING value in URI (uniform resource identifier) format
of: protocol:authority[/path][?query][#fragment].
The optional path portion of the URL is a sequence of resource location segments separated by a
forward slash (/), conceptually similar to a directory path.
Return the path portion of URL string values in the request field:
URL_PATH(request)
Return the path portion of a literal URL string:
URL_PATH("http://platfora.com/company/contact.html") returns /company/
contact.html
URL_PORT
URL_PORT is a row function that returns the port portion of a URL string.
URL_PORT(string)
Returns the port portion of a URL as an INTEGER value. If the URL does not specify a port, then returns
-1. If the input string is not a valid URL, returns NULL.
For example, in the string http://localhost:8001, the port portion is 8001.
string
Required. A field or expression that returns a STRING value in URI (uniform resource identifier) format
of: protocol:authority[/path][?query][#fragment].
The authority portion of the URL contains the host information, which can be specified as a
domain name (www.platfora.com), a host name (localhost), or an IP address (127.0.0.1). The
host information can be followed by an optional port number preceded by a colon (for example,
localhost:8001).
Return the port portion of URL string values in the referrer field:
URL_PORT(referrer)
Return the port portion of a literal URL string:
URL_PORT("http://user:[email protected]:8012/mypage.html") returns
8012
URL_PROTOCOL
URL_PROTOCOL is a row function that returns the protocol (or URI scheme name) portion of a URL
string.
URL_PROTOCOL(string)
Returns the protocol portion of a URL as a STRING value, or NULL if the input string is not a valid
URL.
For example, in the string http://www.platfora.com, the protocol portion is http.
In the string ftp://ftp.platfora.com/articles/platfora.pdf, the protocol portion is
ftp.
string
Required. A field or expression that returns a STRING value in URI (uniform resource identifier) format
of: protocol:authority[/path][?query][#fragment]
The protocol portion of a URL consists of a sequence of characters beginning with a letter and followed
by any combination of letter, number, plus (+), period (.), or hyphen (-) characters, followed by a colon
(:). For example: http:, ftp:, mailto:
Return the protocol portion of URL string values in the referrer field:
URL_PROTOCOL(referrer)
Return the protocol portion of the literal URL string:
URL_PROTOCOL("http://www.platfora.com") returns http
URL_QUERY
URL_QUERY is a row function that returns the query portion of a URL string.
URL_QUERY(string)
Returns the query portion of a URL as a STRING value, NULL if the URL does not contain a query, or
NULL if the input string is not a valid URL.
For example, in the string http://www.platfora.com/contact.html, the query portion is
NULL.
In the string http://platfora.com/news.php?
topic=press&timeframe=today#Platfora%20News, the query portion is
topic=press&timeframe=today.
In the string mailto:[email protected]?subject=Topic, the query portion is
subject=Topic.
string
Required. A field or expression that returns a STRING value in URI (uniform resource identifier) format
of: protocol:authority[/path][?query][#fragment].
The optional query portion of the URL is separated by a question mark (?) and typically contains an
unordered list of key=value pairs separated by an ampersand (&) or semicolon (;).
Return the query portion of URL string values in the request field:
URL_QUERY(request)
Return the query portion of a literal URL string:
URL_QUERY("http://platfora.com/news.php?topic=press&timeframe=today")
returns topic=press&timeframe=today
URLDECODE
URLDECODE is a row function that decodes a string that has been encoded with the application/
x-www-form-urlencoded media type. URL encoding, also known as percent-encoding, is a
mechanism for encoding information in a Uniform Resource Identifier (URI). When sent in an HTTP
GET request, application/x-www-form-urlencoded data is included in the query component
of the request URI. When sent in an HTTP POST request, the data is placed in the body of the message,
and the name of the media type is included in the message Content-Type header.
URLDECODE(string)
Returns a value of type STRING with characters decoded as follows:
• Alphanumeric characters (a-z, A-Z, 0-9) remain unchanged.
• The special characters hyphen (-), comma (,), underscore (_), period (.), and asterisk (*) remain
unchanged.
• The plus sign (+) character is converted to a space character.
• The percent character (%) is interpreted as the start of a special escaped sequence, where in the
sequence %HH, HH represents the hexadecimal value of the byte. For example, some common escape
sequences are:
• %20 - space
• %0A, %0D, or %0D%0A - newline
• %22 - double quote (")
• %25 - percent (%)
• %2D - hyphen (-)
• %2E - period (.)
• %3C - less than (<)
• %3E - greater than (>)
• %5C - backslash (\)
• %7C - pipe (|)
string
Required. A field or expression that returns a STRING value. It is assumed that all characters in the
input string are one of the following: lower-case letters (a-z), upper-case letters (A-Z), numeric digits
(0-9), or the hyphen (-), comma (,), underscore (_), period (.) or asterisk (*) character. The percent
character (%) is allowed, but is interpreted as the start of a special escaped sequence. The plus character
(+) is allowed, but is interpreted as a space character.
Decode the values of the url_query field:
URLDECODE(url_query)
Convert a literal URL-encoded string (N%2FA%20or%20%22not%20applicable%22) to a human-readable value (N/A or "not applicable"):
URLDECODE("N%2FA%20or%20%22not%20applicable%22") returns N/A or "not
applicable"
IP Address Functions
IP address functions allow you to manipulate and transform STRING data consisting of IP address
values.
CIDR_MATCH
CIDR_MATCH is a row function that compares two STRING arguments representing a CIDR mask and
an IP address, and returns 1 if the IP address falls within the specified subnet mask or 0 if it does not.
CIDR_MATCH(CIDR_string, IP_string)
Returns an INTEGER value of 1 if the IP address falls within the subnet indicated by the CIDR mask
and 0 if it does not.
CIDR_string
Required. A field or expression that returns a STRING value containing either an IPv4 or IPv6 CIDR
mask (Classless InterDomain Routing subnet notation). An IPv4 CIDR mask can only successfully
match IPv4 addresses, and an IPv6 CIDR mask can only successfully match IPv6 addresses.
IP_string
Required. A field or expression that returns a STRING value containing either an IPv4 or IPv6 internet
protocol (IP) address.
Compare an IPv4 CIDR subnet mask to an IPv4 IP address:
CIDR_MATCH("60.145.56.0/24","60.145.56.246") returns 1
CIDR_MATCH("60.145.56.0/30","60.145.56.246") returns 0
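In the first example, the /24 mask requires only the first 24 bits (60.145.56.x) to match, so 60.145.56.246 falls within the subnet. The /30 mask leaves only two host bits, covering just 60.145.56.0 through 60.145.56.3, so the same address falls outside it.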
Compare an IPv6 CIDR subnet mask to an IPv6 IP address:
CIDR_MATCH("fe80::/70","FE80::0202:B3FF:FE1E:8329") returns 1
CIDR_MATCH("fe80::/72","FE80::0202:B3FF:FE1E:8329") returns 0
HEX_TO_IP
HEX_TO_IP is a row function that converts a hexadecimal-encoded STRING to a text representation of
an IP address.
HEX_TO_IP(string)
Returns a value of type STRING representing either an IPv4 or IPv6 address. The type of IP address
returned depends on the input string. An 8 character hexadecimal string will return an IPv4 address. A
32 character long hexadecimal string will return an IPv6 address. IPv6 addresses are represented in full
length, without removing any leading zeros and without using the compressed :: notation.
For example, 2001:0db8:0000:0000:0000:ff00:0042:8329 rather than
2001:db8::ff00:42:8329. Input strings that do not contain either 8 or 32 valid hexadecimal
characters will return NULL.
string
Required. A field or expression that returns a hexadecimal-encoded STRING value. The hexadecimal
string must be either 8 characters long (in which case it is converted to an IPv4 address) or 32 characters
long (in which case it is converted to an IPv6 address).
Return a plain text IP address for each hexadecimal-encoded string value in the byte_encoded_ips
column:
HEX_TO_IP(byte_encoded_ips)
Convert an 8 character hexadecimal-encoded string to a plain text IPv4 address:
HEX_TO_IP("AB20FE01") returns 171.32.254.1
Convert a 32 character hexadecimal-encoded string to a plain text IPv6 address:
HEX_TO_IP("FE800000000000000202B3FFFE1E8329") returns
fe80:0000:0000:0000:0202:b3ff:fe1e:8329
Date and Time Functions
Date and time functions allow you to manipulate and transform datetime values, such as calculating time
differences between two datetime values, or extracting a portion of a datetime value.
DAYS_BETWEEN
DAYS_BETWEEN is a row function that calculates the whole number of days (ignoring time) between
two DATETIME values (value1-value2).
DAYS_BETWEEN(datetime_1,datetime_2)
Returns one value per row of type INTEGER.
datetime_1
Required. A field or expression of type DATETIME.
datetime_2
Required. A field or expression of type DATETIME.
Calculate the number of days to ship a product by subtracting the value of the order_date field from the
ship_date field:
DAYS_BETWEEN(ship_date,order_date)
Calculate the number of days since a product's release by subtracting the value of the release_date field
in the product dataset from the current date (the result of the expression):
DAYS_BETWEEN(NOW(),product.release_date)
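As a worked illustration with literal dates (constructed here with TO_DATE, described later in this chapter), nine whole days separate January 10 and January 1:
DAYS_BETWEEN(TO_DATE("2013-01-10","yyyy-MM-dd"),TO_DATE("2013-01-01","yyyy-MM-dd")) returns 9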
DATE_ADD
DATE_ADD is a row function that adds the specified time interval to a DATETIME value.
DATE_ADD(datetime,quantity,"interval")
Returns a value of type DATETIME.
datetime
Required. A field name or expression that returns a DATETIME value.
quantity
Required. An integer value. To add time intervals, use a positive integer. To subtract time intervals, use
a negative integer.
interval
Required. One of the following time intervals:
• millisecond - Adds the specified number of milliseconds to a datetime value.
• second - Adds the specified number of seconds to a datetime value.
• minute - Adds the specified number of minutes to a datetime value.
• hour - Adds the specified number of hours to a datetime value.
• day - Adds the specified number of days to a datetime value.
• week - Adds the specified number of weeks to a datetime value.
• month - Adds the specified number of months to a datetime value.
• quarter - Adds the specified number of quarters to a datetime value.
• year - Adds the specified number of years to a datetime value.
• weekyear - Adds the specified number of weekyears to a datetime value.
Add 45 days to the value of the invoice_date field to calculate the date a payment is due:
DATE_ADD(invoice_date,45,"day")
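Because a negative quantity subtracts the interval, the same function can shift a date backwards. For example, subtract one month from the invoice_date field:
DATE_ADD(invoice_date,-1,"month")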
HOURS_BETWEEN
HOURS_BETWEEN is a row function that calculates the whole number of hours (ignoring minutes,
seconds, and milliseconds) between two DATETIME values (value1-value2).
HOURS_BETWEEN(datetime_1,datetime_2)
Returns one value per row of type INTEGER.
datetime_1
Required. A field or expression of type DATETIME.
datetime_2
Required. A field or expression of type DATETIME.
Calculate the number of hours to ship a product by subtracting the value of the order_date field from the
ship_date field:
HOURS_BETWEEN(ship_date,order_date)
Calculate the number of hours since an advertisement was viewed by subtracting the value of the
adview_timestamp field in the impressions dataset from the current date and time (the result of the
expression):
HOURS_BETWEEN(NOW(),impressions.adview_timestamp)
EXTRACT
EXTRACT is a row function that returns the specified portion of a DATETIME value.
EXTRACT("extract_value",datetime)
Returns the specified extracted value as type INTEGER. EXTRACT removes leading zeros. For example,
the month of April returns a value of 4, not 04.
extract_value
Required. One of the following extract values:
• millisecond - Returns the millisecond portion of a datetime value. For example, an input datetime
value of 2012-08-15 20:38:40.213 would return an integer value of 213.
• second - Returns the second portion of a datetime value. For example, an input datetime value of
2012-08-15 20:38:40.213 would return an integer value of 40.
• minute - Returns the minute portion of a datetime value. For example, an input datetime value of
2012-08-15 20:38:40.213 would return an integer value of 38.
• hour - Returns the hour portion of a datetime value. For example, an input datetime value of
2012-08-15 20:38:40.213 would return an integer value of 20.
• day - Returns the day portion of a datetime value. For example, an input datetime value of
2012-08-15 would return an integer value of 15.
• week - Returns the ISO week number for the input datetime value. For example, an input datetime
value of 2012-01-02 would return an integer value of 1 (the first ISO week of 2012 starts on
Monday January 2). An input datetime value of 2012-01-01 would return an integer value of 52
(January 1, 2012 is part of the last ISO week of 2011).
• month - Returns the month portion of a datetime value. For example, an input datetime value of
2012-08-15 would return an integer value of 8.
• quarter - Returns the quarter number for the input datetime value, where quarters start on January 1,
April 1, July 1, or October 1. For example, an input datetime value of 2012-08-15 would return an
integer value of 3.
• year - Returns the year portion of a datetime value. For example, an input datetime value of
2012-01-01 would return an integer value of 2012.
• weekyear - Returns the year value that corresponds to the ISO week number of the input datetime
value. For example, an input datetime value of 2012-01-02 would return an integer value of 2012
(the first ISO week of 2012 starts on Monday January 2). An input datetime value of 2012-01-01
would return an integer value of 2011 (January 1, 2012 is part of the last ISO week of 2011).
datetime
Required. A field name or expression that returns a DATETIME value.
Extract the hour portion from the order_date datetime field:
EXTRACT("hour",order_date)
Cast the value of the order_date string field to a datetime value using TO_DATE, and extract the ISO
week year:
EXTRACT("weekyear",TO_DATE(order_date,"MM/dd/yyyy HH:mm:ss"))
MILLISECONDS_BETWEEN
MILLISECONDS_BETWEEN is a row function that calculates the whole number of milliseconds between
two DATETIME values (value1-value2).
MILLISECONDS_BETWEEN(datetime_1,datetime_2)
Returns one value per row of type INTEGER.
datetime_1
Required. A field or expression of type DATETIME.
datetime_2
Required. A field or expression of type DATETIME.
Calculate the number of milliseconds it took to serve a web page by subtracting the value of the
request_timestamp field from the response_timestamp field:
MILLISECONDS_BETWEEN(request_timestamp,response_timestamp)
MINUTES_BETWEEN
MINUTES_BETWEEN is a row function that calculates the whole number of minutes (ignoring seconds
and milliseconds) between two DATETIME values (value1-value2).
MINUTES_BETWEEN(datetime_1,datetime_2)
Returns one value per row of type INTEGER.
datetime_1
Required. A field or expression of type DATETIME.
datetime_2
Required. A field or expression of type DATETIME.
Calculate the number of minutes it took for a user to click on an advertisement by subtracting the value
of the impression_timestamp field from the conversion_timestamp field:
MINUTES_BETWEEN(impression_timestamp,conversion_timestamp)
Calculate the number of minutes since a user last logged in by subtracting the login_timestamp field in
the weblogs dataset from the current date and time (the result of the expression):
MINUTES_BETWEEN(NOW(),weblogs.login_timestamp)
NOW
NOW is a scalar function that returns the current system date and time as a DATETIME value. It can be
used in other expressions involving DATETIME type fields, such as DAYS_BETWEEN or YEAR_DIFF. Note that the value of NOW is
only evaluated at the time a lens is built (it is not re-evaluated with each query).
NOW()
Returns the current system date and time as a DATETIME value.
Calculate a user's age using YEAR_DIFF to subtract the value of the birthdate field in the users dataset from the
current date:
YEAR_DIFF(NOW(),users.birthdate)
Calculate the number of days since a product's release using DAYS_BETWEEN to subtract the value of the release_date
field from the current date:
DAYS_BETWEEN(NOW(),release_date)
SECONDS_BETWEEN
SECONDS_BETWEEN is a row function that calculates the whole number of seconds (ignoring
milliseconds) between two DATETIME values (value1-value2).
SECONDS_BETWEEN(datetime_1,datetime_2)
Returns one value per row of type INTEGER.
datetime_1
Required. A field or expression of type DATETIME.
datetime_2
Required. A field or expression of type DATETIME.
Calculate the number of seconds it took for a user to click on an advertisement by subtracting the value
of the impression_timestamp field from the conversion_timestamp field:
SECONDS_BETWEEN(impression_timestamp,conversion_timestamp)
Calculate the number of seconds since a user last logged in by subtracting the login_timestamp field in
the weblogs dataset from the current date and time (the result of the expression):
SECONDS_BETWEEN(NOW(),weblogs.login_timestamp)
TRUNC
TRUNC is a row function that truncates a DATETIME value to the specified format.
TRUNC(datetime,"format")
Returns a value of type DATETIME truncated to the specified format.
datetime
Required. A field or expression that returns a DATETIME value.
format
Required. One of the following format values:
• millisecond - Returns a datetime value truncated to millisecond granularity. Has no effect since
millisecond is already the most granular format for datetime values. For example, an input datetime
value of 2012-08-15 20:38:40.213 would return a datetime value of 2012-08-15 20:38:40.213.
• second - Returns a datetime value truncated to second granularity. For example, an input datetime
value of 2012-08-15 20:38:40.213 would return a datetime value of 2012-08-15 20:38:40.000.
• minute - Returns a datetime value truncated to minute granularity. For example, an input datetime
value of 2012-08-15 20:38:40.213 would return a datetime value of 2012-08-15 20:38:00.000.
• hour - Returns a datetime value truncated to hour granularity. For example, an input datetime value
of 2012-08-15 20:38:40.213 would return a datetime value of 2012-08-15 20:00:00.000.
• day - Returns a datetime value truncated to day granularity. For example, an input datetime value of
2012-08-15 20:38:40.213 would return a datetime value of 2012-08-15 00:00:00.000.
• week - Returns a datetime value truncated to the first day of the week (starting on a Monday). For
example, an input datetime value of 2012-08-15 (a Wednesday) would return a datetime value of
2012-08-13 (the Monday prior).
• month - Returns a datetime value truncated to the first day of the month. For example, an input
datetime value of 2012-08-15 would return a datetime value of 2012-08-01.
• quarter - Returns a datetime value truncated to the first day of the quarter (January 1, April 1, July 1,
or October 1). For example, an input datetime value of 2012-08-15 20:38:40.213 would return a
datetime value of 2012-07-01.
• year - Returns a datetime value truncated to the first day of the year (January 1). For example, an
input datetime value of 2012-08-15 would return a datetime value of 2012-01-01.
• weekyear - Returns a datetime value truncated to the first day of the ISO weekyear (the ISO week
starting with the Monday which is nearest in time to January 1). For example, an input datetime value
of 2008-08-15 would return a datetime value of 2007-12-31. The first day of the ISO weekyear for
2008 is December 31, 2007 (the prior Monday closest to January 1).
Truncate the order_date datetime field to day granularity:
TRUNC(order_date,"day")
Cast the value of the order_date string field to a datetime value using TO_DATE, and truncate it to day
granularity:
TRUNC(TO_DATE(order_date,"MM/dd/yyyy HH:mm:ss"),"day")
YEAR_DIFF
YEAR_DIFF is a row function that calculates the fractional number of years between two DATETIME
values (value1-value2).
YEAR_DIFF(datetime_1,datetime_2)
Returns one value per row of type DOUBLE.
datetime_1
Required. A field or expression of type DATETIME.
datetime_2
Required. A field or expression of type DATETIME.
Calculate the number of years a user has been a customer by subtracting the value of the
registration_date field from the current date (the result of the expression):
YEAR_DIFF(NOW(),registration_date)
Calculate a user's age by subtracting the value of the birthdate field in the users dataset from the current
date (the result of the expression):
YEAR_DIFF(NOW(),users.birthdate)
Math Functions
Math functions allow you to perform basic math calculations on numeric values. You can also use
arithmetic operators to perform simple math calculations.
DIV
DIV is a row function that divides two LONG values and returns a quotient value of type LONG (the result
is truncated to 0 decimal places).
DIV(dividend,divisor)
Returns one value per row of type LONG.
dividend
Required. A field or expression of type LONG.
divisor
Required. A field or expression of type LONG.
Cast the value of the file_size field to LONG and divide by 1024:
DIV(TO_LONG(file_size),1024)
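As a worked illustration of the truncating division (17 / 4 = 4.25, truncated to 4):
DIV(17,4) returns 4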
EXP
EXP is a row function that raises the mathematical constant e to the power (exponent) of a numeric value
and returns a value of type DOUBLE.
EXP(power)
Returns one value per row of type DOUBLE.
power
Required. A field or expression of a numeric type.
Raise e to the power given by the Value field:
EXP(Value)
When the Value field value is 2.0, the result is equal to 7.3890 when truncated to four decimal places.
FLOOR
FLOOR is a row function that returns the largest integer that is less than or equal to the input argument.
FLOOR(double)
Returns one value per row of type DOUBLE.
double
Required. A field or expression of type DOUBLE.
Return the floor value of 32.6789:
FLOOR(32.6789) returns 32.0
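Because the result is the largest integer less than or equal to the input, negative values round toward negative infinity:
FLOOR(-32.6789) returns -33.0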
HASH
HASH is a row function that evenly partitions data values into the specified number of buckets. It creates
a hash of the input value and assigns that value a bucket number. Equal values will always hash to the
same bucket number.
HASH(field_name,integer)
Returns one value per row of type INTEGER corresponding to the bucket number that the input value
hashes to.
field_name
Required. The name of the field whose values you want to partition.
integer
Required. The desired number of buckets. This parameter can be a numeric value of any data type, but
when it is a non-integer value, Platfora truncates the value to an integer. When the value is zero, the
function returns NULL. When the value is negative, the function uses absolute value.
Partition the values of the username field into 20 buckets:
HASH(username,20)
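Since equal values always hash to the same bucket, HASH can be used to assign records to a fixed number of stable groups. For example, a sketch (not from the original guide) that splits users into two consistent groups, such as for an A/B comparison:
HASH(user_id,2)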
LN
LN is a row function that returns the natural logarithm of a number. The natural logarithm is the
logarithm to the base e, where e (Euler's number) is a mathematical constant approximately equal to
2.718281828. The natural logarithm of a number x is the power to which the constant e must be raised in
order to equal x.
LN(positive_number)
Returns the exponent to which base e must be raised to obtain the input value, where e denotes the
constant number 2.718281828. The return value is the same data type as the input value.
For example, LN(7.389) is 2, because e to the power of 2 is approximately 7.389.
positive_number
Required. A field or expression that returns a number greater than 0. Inputs can be of type INTEGER,
LONG, DOUBLE, or FIXED.
Return the natural logarithm of the base number e, which is approximately 2.718281828:
LN(2.718281828) returns 1
LN(3.0000) returns 1.098612
LN(300.0000) returns 5.703782
MOD
MOD is a row function that divides two LONG values and returns the remainder value of type LONG (the
result is truncated to 0 decimal places).
MOD(dividend,divisor)
Returns one value per row of type LONG.
dividend
Required. A field or expression of type LONG.
divisor
Required. A field or expression of type LONG.
Cast the value of the file_size field to LONG and return the remainder after dividing by 1024:
MOD(TO_LONG(file_size),1024)
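As a worked illustration of the remainder calculation (17 = 4 × 4 + 1):
MOD(17,4) returns 1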
POW
POW is a row function that raises a numeric value to the power (exponent) of another numeric value
and returns a value of type DOUBLE.
POW(index,power)
Returns one value per row of type DOUBLE.
index
Required. A field or expression of a numeric type.
power
Required. A field or expression of a numeric type.
Calculate the compound annual growth rate (CAGR) percentage for a given investment over a five year
span (the 0.2 exponent is 1/5, one divided by the number of years):
100 * (POW(end_value/start_value, 0.2) - 1)
Calculate the square of the Value field.
POW(Value,2)
Calculate the square root of the Value field.
POW(Value,0.5)
The following expression returns 1.
POW(0,0)
ROUND
ROUND is a row function that rounds a DOUBLE value to the specified number of decimal places.
ROUND(double,number_decimal_places)
Returns one value per row of type DOUBLE.
double
Required. A field or expression of type DOUBLE.
number_decimal_places
Required. An integer that specifies the number of decimal places to round to.
Round the number 32.4678954 to two decimal places:
ROUND(32.4678954,2) returns 32.47
Data Type Conversion Functions
Data type conversion functions allow you to cast data values from one data type to another. These
functions are used implicitly whenever you set the data type of a field or column in the Platfora user
interface. The supported data types are: INTEGER, LONG, DOUBLE, FIXED, DATETIME, and STRING.
EPOCH_MS_TO_DATE
EPOCH_MS_TO_DATE is a row function that converts LONG values to DATETIME values, where the
input number represents the number of milliseconds since the epoch.
EPOCH_MS_TO_DATE(long_expression)
Returns one value per row of type DATETIME in UTC format yyyy-MM-dd HH:mm:ss:SSS Z.
long_expression
Required. A field or expression of type LONG representing the number of milliseconds since the epoch
datetime (January 1, 1970 00:00:00:000 GMT).
Convert a number representing the number of milliseconds from the epoch to a human-readable date and
time:
EPOCH_MS_TO_DATE(1360260240000) returns 2013-02-07T18:04:00:000Z or February 7,
2013 18:04:00:000 GMT
Or if your data is in seconds instead of milliseconds:
EPOCH_MS_TO_DATE(1360260240 * 1000) returns 2013-02-07T18:04:00:000Z or February
7, 2013 18:04:00:000 GMT
TO_CURRENCY
This function is deprecated. Use the TO_FIXED function instead.
TO_DATE
TO_DATE is a row function that converts STRING values to DATETIME values, and specifies the format
of the date and time elements in the string.
TO_DATE(string_expression,"date_format")
Returns one value per row of type DATETIME (which by definition is in UTC).
string_expression
Required. A field or expression of type STRING.
date_format
Required. A pattern that describes how the date is formatted.
Use the following pattern symbols to define your date format. The count and ordering of the pattern
letters determines the datetime format. Any characters in the pattern that are not in the ranges of a-z and
A-Z are treated as quoted delimiter text. For instance, characters such as slash (/) or colon (:) will appear
in the resulting output even if they are not escaped with single quotes.
Table 2: Date Pattern Symbols
• G - era. Presentation: text. Example: AD
• C - century of era (0 or greater). Presentation: number. Example: 20
• Y - year of era (0 or greater). Presentation: year. Example: 1996. Note: numeric presentation for year and week year fields is handled specially; for example, if the count of 'y' is 2, the year is displayed as the zero-based year of the century, which is two digits.
• x - week year. Presentation: year. Example: 1996. Note: see Y.
• w - week number of week year. Presentation: number. Example: 27
• e - day of week (number). Presentation: number. Example: 2
• E - day of week (name). Presentation: text. Example: Tuesday; Tue. Note: if the number of pattern letters is 4 or more, the full form is used; otherwise a short or abbreviated form is used.
• y - year. Presentation: year. Example: 1996. Note: see Y.
• D - day of year. Presentation: number. Example: 189
• M - month of year. Presentation: month. Example: July; Jul; 07. Note: if the number of pattern letters is 3 or more, the text form is used; otherwise the number is used.
• d - day of month. Presentation: number. Example: 10
• a - half day of day. Presentation: text. Example: PM
• K - hour of half day (0-11). Presentation: number. Example: 0
• h - clock hour of half day (1-12). Presentation: number. Example: 12
• H - hour of day (0-23). Presentation: number. Example: 0
• k - clock hour of day (1-24). Presentation: number. Example: 24
• m - minute of hour. Presentation: number. Example: 30
• s - second of minute. Presentation: number. Example: 55
• S - fraction of second. Presentation: number. Example: 978
• z - time zone. Presentation: text. Example: Pacific Standard Time; PST. Note: if the number of pattern letters is 4 or more, the full form is used; otherwise a short or abbreviated form is used.
• Z - time zone offset/id. Presentation: zone. Example: -0800; -08:00; America/Los_Angeles. Note: 'Z' outputs the offset without a colon, 'ZZ' outputs the offset with a colon, 'ZZZ' or more outputs the zone id.
• ' - escape character for text-based delimiters
• '' - literal representation of a single quote. Example: '
Define a new DATETIME computed field based on the order_date base field, which contains timestamps
in the format of: 2014.07.10 at 15:08:56 PDT:
TO_DATE(order_date,"yyyy.MM.dd 'at' HH:mm:ss z")
Define a new DATETIME computed field by first combining individual month, day, year, and
depart_time fields (using CONCAT), and performing a transformation on depart_time to make sure three-digit times are converted to four-digit times (using REGEX_REPLACE):
TO_DATE(CONCAT(month,"/",day,"/",year,":",REGEX_REPLACE(depart_time,"\b(\d{3})\b","0$1")),"MM/dd/yyyy:HHmm")
Define a new DATETIME computed field based on the created_at base field, which contains timestamps
in the format of: Sat Jan 25 16:35:23 +0800 2014 (this is the timestamp format returned by Twitter's
API):
TO_DATE(created_at,"EEE MMM dd HH:mm:ss Z yyyy")
TO_DOUBLE
TO_DOUBLE is a row function that converts STRING, INTEGER, LONG, or DOUBLE values to DOUBLE
(decimal) values.
TO_DOUBLE(expression)
Returns one value per row of type DOUBLE.
expression
Required. A field or expression of type STRING (must be numeric characters), INTEGER, LONG, or
DOUBLE.
Convert the values of the average_rating field to a double data type:
TO_DOUBLE(average_rating)
Convert the average_rating field to a double data type, but first transform the occurrence of any NA
values to NULL values using a CASE expression:
TO_DOUBLE(CASE WHEN average_rating="N/A" then NULL ELSE average_rating
END)
TO_FIXED
TO_FIXED is a row function that converts STRING, INTEGER, LONG, or DOUBLE values to fixed-decimal values. Using a FIXED data type to represent monetary values allows you to calculate and
aggregate monetary values with accuracy to a ten-thousandth of a monetary unit.
TO_FIXED(expression)
Returns one value per row of type FIXED (fixed-decimal value to 10000th accuracy).
expression
Required. A field or expression of type STRING (must be numeric characters), INTEGER, LONG, or
DOUBLE.
Convert the opening_price field to a fixed decimal data type:
TO_FIXED(opening_price)
Convert the sale_price field to a fixed decimal data type, but first transform the occurrence of any N/A
string values to NULL values using a CASE expression:
TO_FIXED(CASE WHEN sale_price="N/A" then NULL ELSE sale_price END)
TO_INT
TO_INT is a row function that converts STRING, INTEGER, LONG, or DOUBLE values to INTEGER
(whole number) values. When converting DOUBLE values, everything after the decimal will be truncated
(not rounded up or down).
TO_INT(expression)
Returns one value per row of type INTEGER.
expression
Required. A field or expression of type STRING (must be numeric characters), INTEGER, LONG, or
DOUBLE.
Convert the values of the average_rating field to an integer data type:
TO_INT(average_rating)
Convert the flight_duration field to an integer data type, but first transform the occurrence of any NA
values to NULL values using a CASE expression:
TO_INT(CASE WHEN flight_duration="N/A" then NULL ELSE flight_duration
END)
TO_LONG
TO_LONG is a row function that converts STRING, INTEGER, LONG, or DOUBLE values to LONG (whole
number) values. When converting DOUBLE values, everything after the decimal will be truncated (not
rounded up or down).
TO_LONG(expression)
Returns one value per row of type LONG.
expression
Required. A field or expression of type STRING (must be numeric characters), INTEGER, LONG, or
DOUBLE.
Convert the values of the average_rating field to a long data type:
TO_LONG(average_rating)
Convert the average_rating field to a long data type, but first transform the occurrence of any NA values
to NULL values using a CASE expression:
TO_LONG(CASE WHEN average_rating="N/A" then NULL ELSE average_rating
END)
TO_STRING
TO_STRING is a row function that converts values of other data types to STRING (character) values.
TO_STRING(expression)
TO_STRING(datetime_expression,date_format)
Returns one value per row of type STRING.
expression
A field or expression of type FIXED, STRING, INTEGER, LONG, or DOUBLE.
datetime_expression
A field or expression of type DATETIME.
date_format
If converting a DATETIME to a string, a pattern that describes how the date is formatted. See TO_DATE
for the date format patterns.
Convert the values of the sku_number field to a string data type:
TO_STRING(sku_number)
Convert values in the age column into range-based groupings (binning), and cast the output values to a
STRING:
TO_STRING(CASE WHEN age <= 25 THEN "0-25" WHEN age <= 50 THEN "26-50"
ELSE "over 50" END)
Convert the values of a timestamp datetime field to a string, where the timestamp values are in the
format of: 2002.07.10 at 15:08:56 PDT:
TO_STRING(timestamp,"yyyy.MM.dd 'at' HH:mm:ss z")
Aggregate Functions
An aggregate function groups the values of multiple rows together based on some defined input
expression. Aggregate functions return one value for a group of rows, and are only valid for defining
measures in Platfora. Aggregate functions cannot be combined with row functions.
AVG
AVG is an aggregate function that returns the average of all valid numeric values. It sums all values in
the provided expression and divides by the number of valid (NOT NULL) rows. If you want to compute
an average that includes all values in the row count (including NULL values), you can use a SUM/COUNT
expression instead.
AVG(numeric_field)
Returns a value of type DOUBLE.
numeric_field
Required. A field of type INTEGER, LONG, DOUBLE, or FIXED. Unlike row functions, aggregate
functions can only take field names as input.
Get the average of the valid sale_amount field values:
AVG(sale_amount)
Get the average of the valid net_worth field values in the billionaires dataset, which resides in the
samples namespace:
AVG([(samples) billionaires].net_worth)
Get the average of all page_views field values in the web_logs dataset (including NULL values):
SUM(page_views)/COUNT(web_logs)
COUNT
COUNT is an aggregate function that returns the number of rows in a dataset.
COUNT([namespace_name]dataset_name)
Returns a value of type INTEGER.
namespace_name
Optional. The name of the namespace in which the dataset resides. If not specified, uses the default
namespace.
dataset_name
Required. The name of the dataset for which to obtain a count of rows. If you want to count rows of a
down-stream dataset that is related to the current dataset, you can specify the hierarchy of dataset names
in the format of:
parent_dataset_name.child_dataset_name.[...]
Count the rows in the sales dataset:
COUNT(sales)
Count the rows in the billionaires dataset, which resides in the samples namespace:
COUNT([(samples) billionaires])
Count the rows in the customer dataset, which is a related dataset down-stream of sales:
COUNT(sales.customers)
COUNT_VALID
COUNT_VALID is an aggregate function that returns the number of rows for which the given expression
is valid (excludes NULL values).
COUNT_VALID(field)
Returns a numeric value of type INTEGER.
field
Required. A field name. Unlike row functions, aggregate functions can only take field names as input.
Count the valid values in the page_views field:
COUNT_VALID(page_views)
DISTINCT
DISTINCT is an aggregate function that returns the number of distinct values for the given expression.
DISTINCT(field)
Returns a numeric value of type INTEGER.
field
Required. A field name. Unlike row functions, aggregate functions can only take field names as input.
Count the unique values of the user_id field in the currently selected dataset:
DISTINCT(user_id)
Count the unique values of the name field in the billionaires dataset, which resides in the samples
namespace:
DISTINCT([(samples) billionaires].name)
Count the unique values of the customer_id field in the customer dataset, which is a related dataset
down-stream of web sales:
DISTINCT([web sales].customers.customer_id)
MAX
MAX is an aggregate function that returns the largest value from the given input expression.
MAX(numeric_or_datetime_field)
Returns a numeric or datetime value of the same type as the input expression.
numeric_or_datetime_field
Required. A field of type INTEGER, LONG, DOUBLE, FIXED, or DATETIME. Unlike row
functions, aggregate functions can only take field names as input.
Get the highest value from the sale_amount field:
MAX(sale_amount)
Get the latest date from the Session Timestamp datetime field:
MAX([Session Timestamp])
MIN
MIN is an aggregate function that returns the smallest value from the given input expression.
MIN(numeric_or_datetime_field)
Returns a numeric or datetime value of the same type as the input expression.
numeric_or_datetime_field
Required. A field of type INTEGER, LONG, DOUBLE, FIXED, or DATETIME. Unlike row
functions, aggregate functions can only take field names as input.
Get the lowest value from the sale_amount field:
MIN(sale_amount)
Get the earliest date from the Session Timestamp datetime field:
MIN([Session Timestamp])
SUM
SUM is an aggregate function that returns the total of all values from the given input expression.
SUM(numeric_field)
Returns a numeric value of the same type as the input expression.
numeric_field
Required. A field of type INTEGER, LONG, DOUBLE, or FIXED. Unlike row functions, aggregate
functions can only take field names as input.
Add the values of the sale_amount field:
SUM(sale_amount)
Add values of the session count field in the users dataset, which is a related dataset down-stream of
clicks:
SUM(clicks.users.[session count])
STDDEV
STDDEV is an aggregate function that calculates the population standard deviation for a group of
numeric values. Standard deviation is the square root of the variance.
STDDEV(numeric_field)
Returns a value of type DOUBLE. If there are fewer than two values in the input group, returns NULL.
numeric_field
Required. A field of type INTEGER, LONG, DOUBLE, or FIXED. Unlike row functions, aggregate
functions can only take field names as input.
Calculate the standard deviation of the values contained in the sale_amount field:
STDDEV(sale_amount)
VARIANCE
VARIANCE is an aggregate function that calculates the population variance for a group of numeric
values. Variance measures the amount by which all values in a group vary from the average value of
the group. Data with low variance contains values that are identical or similar. Data with high variance
contains values that are not similar. Variance is calculated as the average of the squares of the deviations
from the mean. Squaring the deviations ensures that negative and positive deviations do not cancel each
other out.
VARIANCE(numeric_field)
Returns a value of type DOUBLE. If there are fewer than two values in the input group, returns NULL.
numeric_field
Required. A field of type INTEGER, LONG, DOUBLE, or FIXED. Unlike row functions, aggregate
functions can only take field names as input.
Get the population variance of the values contained in the sale_amount field:
VARIANCE(sale_amount)
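As a worked illustration with values chosen for clarity (not from the original guide): for the eight values 2, 4, 4, 4, 5, 5, 7, 9, the mean is 5, the squared deviations are 9, 1, 1, 1, 0, 0, 4, 16, and the population variance is their average, 32/8 = 4. The standard deviation returned by STDDEV for the same group would be the square root, 2.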
ROLLUP and Window Functions
Window functions can only be used in conjunction with ROLLUP. ROLLUP is a modifier to an aggregate
expression that determines the partitioning and ordering of a rowset before the associated aggregate
function or window function is applied. ROLLUP defines a window or user-specified set of rows within
a query result set. A window function then computes a value for each row in the window. You can
use window functions to compute aggregated values such as moving averages, cumulative aggregates,
running totals, or top-N-per-group results.
ROLLUP
ROLLUP is a modifier to an aggregate function that turns a regular aggregate function into a windowed,
partitioned, or adaptive aggregate function. This is useful when you want to compute an aggregation
over a subset of rows within the overall result of a viz query.
ROLLUP aggregate_expression [ WHERE input_group_condition [...] ]
[ TO ([partitioning_columns])
[ ORDER BY (ordering_column [ASC | DESC])
ROWS|RANGE window_boundary [window_boundary]
| BETWEEN window_boundary AND window_boundary ]
]
where window_boundary can be one of:
UNBOUNDED PRECEDING
value PRECEDING
value FOLLOWING
UNBOUNDED FOLLOWING
A regular measure is the result of an aggregation (such as SUM or AVG) applied to some fact or metric
column of a dataset. For example, suppose we had a dataset with the following rows and columns:
Date | Sale Amount | Product | Region
05/01/2013 | 100 | gadget | west
05/01/2013 | 200 | widget | east
06/01/2013 | 100 | gadget | east
06/01/2013 | 400 | widget | west
07/01/2013 | 300 | widget | west
07/01/2013 | 200 | gadget | east
To define a regular measure called Total Sales, we would use the expression:
SUM([Sale Amount])
When this measure is used in a visualization, the group of input records passed into the aggregate
calculation is determined by the dimensions selected by the user when they create the viz. For example,
if the user chose Region as a dimension in the viz, there would be two input groups for which the
measure would be calculated:
Region | Total Sales
east | 500
west | 800
If an aggregate expression includes a ROLLUP clause, the column(s) specified in the TO clause of the
ROLLUP expression determine the additional partitions over which to compute the aggregate expression.
It divides the overall rows returned by the viz query into subsets or buckets, and then computes the
aggregate expression within each bucket. Every ROLLUP expression has implicit partitioning defined: an
absent TO clause treats the entire result set as one partition; an empty TO clause partitions by whatever
dimension columns are present in the viz query.
The WHERE clause is used to filter the input rows that flow into each partition. Input rows that meet the
WHERE clause criteria are included in the partitions; rows that do not are excluded from the aggregate calculation.
The ORDER BY clause with a ROWS or RANGE specification is used to define a window frame within each partition
over which to compute the aggregate expression.
When a ROLLUP measure is used in a visualization, the aggregate calculation is computed across a
set of input rows that are related to, but separate from, the other dimension(s) used in the viz. This is
similar to the type of calculation that is done with a regular measure. However, unlike a regular measure,
a ROLLUP measure does not cause the input rows to be grouped into a single result set; the input rows
still retain their separate identities. The ROLLUP clause determines how the input rows are split up for
processing by the ROLLUP's aggregate function.
ROLLUP expressions can be written to make the partitioning adaptive to whatever dimension columns
are selected in the visualization. This is done by using a reference name as the partitioning column, as
opposed to a regular column. For example, suppose we wanted to be able to calculate the total sales for
any granularity of date. We could create an adaptive measure called Rollup Sales to Date that partitions
total sales by date as follows:
ROLLUP SUM([Sale Amount]) TO (Date)
When this measure is used in a visualization, the group of input records passed into the aggregate
calculation is determined by the dimension fields selected by the user in the viz, but partitioned by the
granularity of Date selected by the user. For example, if the user chose the dimensions Date.Month and
Region in the viz, then total sales would be grouped by month and region, but the ROLLUP measure
expression would aggregate the sales by month only.
Notice that the results for the east and west regions are the same, because the aggregation
expression only considers rows that share the same month when calculating the sum of sales:
Month | Rollup Sales to Date (east) | Rollup Sales to Date (west)
May 2013 | 300 | 300
June 2013 | 500 | 500
July 2013 | 500 | 500
Suppose that, within each date partition, we wanted to calculate the cumulative total from day to day. We could
define a window measure called Running Total to Date that looks at each day and all preceding days as
follows:
ROLLUP SUM([Sale Amount]) TO (Date) ORDER BY (Date.Date) ROWS UNBOUNDED
PRECEDING
When this measure is used in a visualization, the group of input records passed into the aggregate
calculation is determined by the dimension fields selected by the user in the viz, and partitioned by the
granularity of Date selected by the user. Within each partition the rows are ordered chronologically (by
Date.Date), and the sum amount is then calculated per date partition by looking at the current row (or
mark), and all rows that come before it within the partition. For example, if the user chose the dimension
Date.Month in the viz, then the ROLLUP measure expression would cumulatively aggregate the sales
within each month.
Month       Date.Date    Running Total to Date
May 2013    2013-05-01   300
June 2013   2013-06-01   500
July 2013   2013-07-01   500
Returns a numeric value per partition based on the output type of the aggregate_expression.
aggregate_expression
Required. An expression containing an aggregate or window function. Simple aggregate
functions such as COUNT, AVG, SUM, MIN, and MAX are supported. Window functions
such as RANK, DENSE_RANK, and NTILE are supported and can only be used in
conjunction with ROLLUP.
Complex aggregate functions such as STDDEV and VARIANCE are not supported.
WHERE input_group_condition
The WHERE clause limits the group of input rows over which to compute the aggregate
expression. The input group condition is a Boolean (true or false) condition defined
using a comparison operator expression. Any row that does not satisfy the condition will
be excluded from the input group used to calculate the aggregated measure value. For
example (note that datetime values must be specified in yyyy-MM-dd format):
WHERE Date.Date BETWEEN 2012-06-01 AND 2012-07-31
WHERE Date.Year BETWEEN 2009 AND 2013
WHERE Company LIKE("Plat*")
WHERE Code IN("a","b","c")
WHERE Sales < 50.00
WHERE Age >= 21
You can specify multiple WHERE clauses in a ROLLUP expression.
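For example, the following sketch combines two of the conditions above in a single measure (field names are reused from the examples in this section):
ROLLUP SUM([Sale Amount]) WHERE Date.Year BETWEEN 2009 AND 2013 WHERE Age >= 21 TO ()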
TO ([partitioning_columns])
The TO clause is used to specify the dimension column(s) used to partition a group of
input rows. This allows you to calculate a measure value for a specific dimension group
(a subset of input rows) that are somehow related to the other dimension groups used in a
visualization (all input rows). It is possible to define an empty group (meaning all rows) by
using empty parenthesis.
When used in a visualization, measure values are computed for groups of input rows that
return the same value for the columns specified in the partitioning list. For example, if the
Date.Month column is used as a partitioning column, then all records that have the same
value for Date.Month will be grouped together in order to calculate the measure value.
The aggregate expression is applied to the group specified in the TO clause independently
of the other dimension groupings used in the visualization. Note that the partitioning
column(s) specified in the TO clause of an adaptive measure expression must also be
included as dimensions (or grouping columns) in the visualization.
A partitioning column can also be the name of a reference field. Using a reference field allows the
partition criteria to dynamically adapt based on any field of the referenced dataset that is used in a viz.
For example, if the partition column is a reference field pointing to the Date dimension, then any subfield of Date (Date.Year, Date.Month, etc.) can be used as the partitioning column by selecting it in a
viz.
A TO clause with an empty partitioning list treats each mark in the result set as an input
group. For example, if the viz includes the Month and Region columns, then TO() would
be equivalent to TO(Month,Region).
ORDER BY (ordering_column)
The optional ORDER BY clause orders the input rows using the values in the specified
column within each partition identified in the TO clause. Use the ORDER BY clause
along with the ROWS or RANGE clauses to define windows over which to compute the
aggregate function. This is useful for computing moving averages, cumulative aggregates,
running totals, or a top value per group of input rows. The ordering column specified in the
ORDER BY clause can be a dimension, measure, or an aggregate expression (for example
ORDER BY (SUM(Sales))). If the ordering column is a dimension, it must be included in
the viz.
By default, rows are sorted in ascending order (low to high values). You can use the DESC
keyword to sort in descending order (high to low values).
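For example, the following sketch computes a running total over marks ordered by their aggregate value, highest first:
ROLLUP SUM([Sale Amount]) TO () ORDER BY (SUM([Sale Amount]) DESC) ROWS UNBOUNDED PRECEDING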
ROWS | RANGE
Required when using ORDER BY. Further limits the rows within the partition by
specifying start and end points within the partition. This is done by specifying a range of
rows with respect to the current row either by logical association (RANGE) or physical
association (ROWS). Use either a ROWS or RANGE clause to express the window
boundary (the set of input rows in each partition, relative to the current row, over which to
compute the aggregate expression). The window boundary can include one, several, or all
rows of the partition.
When using the RANGE clause, the ordering column used in the ORDER BY clause must
be a sub-column of a reference to Platfora's built-in Date dimension dataset.
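For example, the following sketch uses a RANGE window, assuming the partitioning column is a reference to the built-in Date dataset and the viz orders on Date.Date:
ROLLUP SUM([Sale Amount]) TO (Date) ORDER BY (Date.Date) RANGE UNBOUNDED PRECEDING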
window_boundary
A window boundary is required when using either ROWS or RANGE. This defines the set
of rows, relative to the current row, over which to compute the aggregate expression. The
row order is based on the ordering specified in the ORDER BY clause.
A PRECEDING clause defines a lower window boundary (the number of rows to include
before the current row). The FOLLOWING clause defines an upper window boundary
(the number of rows to include after the current row). The window boundary expression
must include either a PRECEDING or FOLLOWING clause, or both. If PRECEDING
is omitted, the current row is considered the first row in the window. Similarly, if
FOLLOWING is omitted, the current row is considered the last row in the window. The
UNBOUNDED keyword includes all rows in the direction specified. When you need to
specify both a start and end of a window, use the BETWEEN and AND keywords.
For example:
ROWS 2 PRECEDING means that the window is three rows in size, starting with two
rows preceding until and including the current row.
ROWS BETWEEN 2 PRECEDING AND 5 FOLLOWING means that the window is eight
rows in size, starting with two rows preceding, the current row, and five rows following
the current row. The current row is included in the set of rows by default.
You can exclude the current row from the window by specifying a window start and end
point before or after the current row. For example:
ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING starts the window
with all rows that come before the current row, and ends the window one row before the
current row, thereby excluding the current row from the window.
Calculate the percentage of flight records in the same departure date period. Note that the
departure_date field is a reference to the Date dataset, meaning that the group to which the
measure is applied can adapt to any downstream field of departure_date (departure_date.Year,
departure_date.Month, and so on). When used in a viz, this will calculate the percentage of flights for
each dimension group in the viz that share the same value for departure_date:
100 * COUNT(Flights) / ROLLUP COUNT(Flights) TO ([Departure Date])
Normalize the number of flights using the carrier American Airlines (AA) as the benchmark. This will
allow you to compare the number of flights for other carriers against the fixed baseline number of flights
for AA (if AA = 100 percent, then all other carriers will fall either above or below that percentage):
100 * COUNT(Flights) / ROLLUP COUNT(Flights) WHERE [Carrier Code]="AA"
Calculate a generic percentage of total sales. When this measure is used in a visualization, it will show
the percentage of total sales that a mark in the viz is contributing to the total for all marks in the viz. The
input rows depend on the dimensions selected in the viz.
100 * SUM(sales) / ROLLUP SUM(sales) TO ()
Calculate the cumulative total of sales for a given year on a month-to-month basis (year-to-month sales
totals):
ROLLUP SUM(sales) TO (Date.Year) ORDER BY (Date.Month) ROWS UNBOUNDED
PRECEDING
Calculate the cumulative total of sales (for all input rows) for all previous years, but exclude the current
year from the total.
ROLLUP SUM(sales) TO () ORDER BY (Date.Year) ROWS BETWEEN UNBOUNDED
PRECEDING AND 1 PRECEDING
DENSE_RANK
DENSE_RANK is a windowing aggregate function that orders rows by a measure value and assigns a
rank number to each row in the given partition. Rank positions are not skipped in the event of a tie.
DENSE_RANK must be used within a ROLLUP expression.
ROLLUP DENSE_RANK()
TO ([partitioning_column])
ORDER BY (measure_expression [ASC | DESC])
ROWS|RANGE [ window_boundary | BETWEEN window_boundary AND window_boundary ]
where window_boundary can be one of:
UNBOUNDED PRECEDING
value PRECEDING
value FOLLOWING
UNBOUNDED FOLLOWING
DENSE_RANK is a window aggregate function used to assign a ranking number to each row in a group.
If multiple rows have the same ranking value (there is a tie), then the tied rows are given the same rank
value and subsequent rank positions are not skipped.
The TO clause of the ROLLUP is used to specify the dimension column(s) used to partition a group of
input rows. To define a global ranking that can adapt to any dimension groupings used in a viz, specify
an empty TO clause.
The ORDER BY clause of the ROLLUP expression determines how to order the rows before they are
ranked. The ORDER BY clause should specify the measure field for which you want to calculate the
ranks. The ranked rows in the partition are numbered starting at one.
For example, suppose we had a dataset with the following rows and columns and you want to rank the
Quarters and Regions according to the values in the Sales column.
Quarter   Region   Sales
2010 Q1   North    100
2010 Q1   South    200
2010 Q1   East     300
2010 Q1   West     400
2010 Q2   North    400
2010 Q2   South    250
2010 Q2   East     150
2010 Q2   West     250
Supposing the lens has an existing measure field called Sales(Sum), you could then define a measure
called Sales_Dense_Rank using the following expression:
ROLLUP DENSE_RANK() TO () ORDER BY ([Sales(Sum)] DESC) ROWS UNBOUNDED
PRECEDING
When you include the Quarter, Region, and Sales_Dense_Rank columns in the viz, you get the
following data points. Notice that tied values are given the same rank number and no rank positions are
skipped:
Quarter   Region   Sales_Dense_Rank
2010 Q1   North    6
2010 Q1   South    4
2010 Q1   East     2
2010 Q1   West     1
2010 Q2   North    1
2010 Q2   South    3
2010 Q2   East     5
2010 Q2   West     3
Returns a value of type LONG.
ROLLUP
Required. DENSE_RANK must be used within a ROLLUP expression in place of the
aggregate_expression of the ROLLUP.
The TO clause of the ROLLUP expression specifies the dimension group(s) over which to calculate the
window function. An empty TO calculates the window function over all rows in the query as one group.
The ORDER BY clause of the ROLLUP expression specifies a measure field or aggregate expression.
Rank the sum of all sales in descending order, so the highest sales is given the ranking of 1.
ROLLUP DENSE_RANK() TO () ORDER BY ([Sales(Sum)] DESC) ROWS UNBOUNDED
PRECEDING
Rank the sum of all sales within a given quarter in descending order, so the highest sales in each quarter
is given the ranking of 1.
ROLLUP DENSE_RANK() TO (Quarter) ORDER BY ([Sales(Sum)] DESC) ROWS
UNBOUNDED PRECEDING
NTILE
NTILE is a windowing aggregate function that divides a partitioned group of rows into the specified
number of buckets, and returns the bucket number to which the current row belongs. NTILE must be
used within a ROLLUP expression.
ROLLUP NTILE(integer)
TO ([partitioning_column])
ORDER BY (measure_expression [ASC | DESC])
ROWS|RANGE [ window_boundary | BETWEEN window_boundary AND window_boundary ]
where window_boundary can be one of:
UNBOUNDED PRECEDING
value PRECEDING
value FOLLOWING
UNBOUNDED FOLLOWING
NTILE is a window aggregate function typically used to calculate percentiles. A percentile (or centile)
is a measure used in statistics indicating the value below which a given percentage of records in a group
falls. For example, the 20th percentile is the value (or score) below which 20 percent of the records may
be found. The term percentile is often used in the reporting of test scores. For example, if a score is in
the 86th percentile, it is higher than 86% of the other scores. The 25th percentile is also known as the
first quartile (Q1), the 50th percentile as the median or second quartile (Q2), and the 75th percentile as
the third quartile (Q3). In general, percentiles, deciles and quartiles are specific types of ntiles.
NTILE must be used within a ROLLUP expression in place of the aggregate_expression
of the ROLLUP.
The TO clause of the ROLLUP is used to specify a fixed dimension column used to partition a group of
input rows. To define a global NTILE ranking that can adapt to any dimension groupings used in a viz,
specify an empty TO clause.
The ORDER BY clause of the ROLLUP expression determines how to order the rows before they are
divided into buckets. The ORDER BY clause should specify the measure field for which you want to
calculate NTILE bucket values. A centile would be 100 buckets, a decile would be 10 buckets, a quartile
4 buckets, and so on. The buckets in the partition are numbered starting at one.
For example, suppose we had a dataset with the following rows and columns and you want to divide
the year-to-date sales into four buckets (quartiles) with the highest quartile ranked as 1 and the
lowest ranked as 4. Supposing a measure field has been defined called Sum_YTD_Sales, defined as
SUM([Sales YTD]), you could then define a measure called YTD_Sales_Quartile using the following
expression:
ROLLUP NTILE(4) TO () ORDER BY (Sum_YTD_Sales DESC) ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
Name      Gender   Sales YTD   YTD_Sales_Quartile
Chen      F        3,500,000   1
John      M        3,100,000   1
Pete      M        2,900,000   1
Daria     F        2,500,000   2
Jennie    F        2,200,000   2
Mary      F        2,100,000   2
Mike      M        1,900,000   3
Brian     M        1,700,000   3
Molly     F        1,500,000   3
Theresa   F        1,200,000   4
Hans      M        900,000     4
Ben       M        500,000     4
Because the TO clause of the ROLLUP expression is empty, the quartile partitioning adapts to whatever
dimensions are used in the viz. For example, if you include the Gender dimension field in the viz, the
quartiles would then be computed per gender. The following example divides each gender into buckets
with each gender having 6 year-to-date sales values. The two extra values (the remainder of 6 / 4) are
allocated to buckets 1 and 2, which therefore have one more value than buckets 3 or 4.
Name      Gender   Sales YTD   YTD_Sales_Quartile (partitioned by Gender)
Chen      F        3,500,000   1
Daria     F        2,500,000   1
Jennie    F        2,200,000   2
Mary      F        2,100,000   2
Molly     F        1,500,000   3
Theresa   F        1,200,000   4
John      M        3,100,000   1
Pete      M        2,900,000   1
Mike      M        1,900,000   2
Brian     M        1,700,000   2
Hans      M        900,000     3
Ben       M        500,000     4
Returns a value of type LONG.
ROLLUP
Required. NTILE must be used within a ROLLUP expression in place of the
aggregate_expression of the ROLLUP.
The TO clause of the ROLLUP expression specifies the dimension group(s) over which to calculate the
window function. An empty TO calculates the window function over all rows in the query as one group.
The ORDER BY clause of the ROLLUP expression specifies a measure field or aggregate expression.
integer
Required. An integer that specifies the number of buckets to divide the partitioned rows into.
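For example, a decile sketch (10 buckets) using the Total Records measure that appears in the examples below:
ROLLUP NTILE(10) TO () ORDER BY ([Total Records] DESC) ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING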
Perhaps the most common use case for NTILE is to get a global ranking of result rows. For example,
if you wanted to get the percentile of Total Records per City, you may think the expression to use is:
ROLLUP NTILE(100) TO (City) ORDER BY ([Total Records] DESC) ROWS BETWEEN UNBOUNDED
PRECEDING AND UNBOUNDED FOLLOWING.
However, by leaving the TO clause blank, the percentile buckets can adapt to whatever dimension(s)
you use in the viz. To calculate the Total Records percentiles by City, you could define a global
Total_Records_Percentiles measure and then use this measure in conjunction with the City dimension in
the viz (or any other dimension for that matter).
ROLLUP NTILE(100) TO () ORDER BY ([Total Records] DESC) ROWS BETWEEN UNBOUNDED
PRECEDING AND UNBOUNDED FOLLOWING
RANK
RANK is a windowing aggregate function that orders rows by a measure value and assigns a rank number
to each row in the given partition. Rank positions are skipped in the event of a tie. RANK must be used
within a ROLLUP expression.
ROLLUP RANK()
TO ([partitioning_column])
ORDER BY (measure_expression [ASC | DESC])
ROWS|RANGE [ window_boundary | BETWEEN window_boundary AND window_boundary ]
where window_boundary can be one of:
UNBOUNDED PRECEDING
value PRECEDING
value FOLLOWING
UNBOUNDED FOLLOWING
RANK is a window aggregate function used to assign a ranking number to each row in a group. If
multiple rows have the same ranking value (there is a tie), then the tied rows are given the same rank
value and the subsequent rank position is skipped.
The TO clause of the ROLLUP is used to specify the dimension column(s) used to partition a group of
input rows. To define a global ranking that can adapt to any dimension groupings used in a viz, specify
an empty TO clause.
The ORDER BY clause of the ROLLUP expression determines how to order the rows before they are
ranked. The ORDER BY clause should specify the measure field for which you want to calculate the
ranks. The ranked rows in the partition are numbered starting at one.
For example, suppose we had a dataset with the following rows and columns and you want to rank the
Quarters and Regions according to the values in the Sales column.
Quarter   Region   Sales
2010 Q1   North    100
2010 Q1   South    200
2010 Q1   East     300
2010 Q1   West     400
2010 Q2   North    400
2010 Q2   South    250
2010 Q2   East     150
2010 Q2   West     250
Supposing the lens has an existing measure field called Sales(Sum), you could then define a measure
called Sales_Rank using the following expression:
ROLLUP RANK() TO () ORDER BY ([Sales(Sum)] DESC) ROWS UNBOUNDED
PRECEDING
When you include the Quarter, Region, and Sales_Rank columns in the viz, you get the following
data points. Notice that tied values are given the same rank number and the rank positions 2 and 5 are
skipped:
Quarter   Region   Sales_Rank
2010 Q1   North    8
2010 Q1   South    6
2010 Q1   East     3
2010 Q1   West     1
2010 Q2   North    1
2010 Q2   South    4
2010 Q2   East     7
2010 Q2   West     4
Returns a value of type LONG.
ROLLUP
Required. RANK must be used within a ROLLUP expression in place of the
aggregate_expression of the ROLLUP.
The TO clause of the ROLLUP expression specifies the dimension group(s) over which to calculate the
window function. An empty TO calculates the window function over all rows in the query as one group.
The ORDER BY clause of the ROLLUP expression specifies a measure field or aggregate expression.
Rank the sum of all sales in descending order, so the highest sales is given the ranking of 1.
ROLLUP RANK() TO () ORDER BY ([Sales(Sum)] DESC) ROWS UNBOUNDED
PRECEDING
Rank the sum of all sales within a given quarter in descending order, so the highest sales in each quarter
is given the ranking of 1.
ROLLUP RANK() TO (Quarter) ORDER BY ([Sales(Sum)] DESC) ROWS UNBOUNDED
PRECEDING
ROW_NUMBER
ROW_NUMBER is a windowing aggregate function that assigns a unique, sequential number to each row
in a group (partition) of rows, starting at 1 for the first row in each partition. ROW_NUMBER must be used
within a ROLLUP expression, which acts as a modifier for ROW_NUMBER. Use the ORDER BY clause of the
ROLLUP expression to specify the column that determines the row numbering.
ROLLUP ROW_NUMBER()
TO ([partitioning_column])
ORDER BY (ordering_column [ASC | DESC])
ROWS|RANGE [ window_boundary | BETWEEN window_boundary AND window_boundary ]
where window_boundary can be one of:
UNBOUNDED PRECEDING
value PRECEDING
value FOLLOWING
UNBOUNDED FOLLOWING
For example, suppose we had a dataset with the following rows and columns:
Quarter   Region   Sales
2010 Q1   North    100
2010 Q1   South    200
2010 Q1   East     300
2010 Q1   West     400
2010 Q2   North    400
2010 Q2   South    250
2010 Q2   East     150
2010 Q2   West     250
Suppose you want to assign a unique ID to the sales of each region by quarter in descending order. In
this example, a measure field is defined called Sum_Sales with the expression SUM(Sales). You could
then define a measure called SalesNumber using the following expression:
ROLLUP ROW_NUMBER() TO (Quarter) ORDER BY (Sum_Sales DESC) ROWS
UNBOUNDED PRECEDING
When you include the Quarter, Region, and SalesNumber columns in the viz, you get the following data
points:
Quarter   Region   SalesNumber
2010 Q1   North    4
2010 Q1   South    3
2010 Q1   East     2
2010 Q1   West     1
2010 Q2   North    1
2010 Q2   South    2
2010 Q2   East     4
2010 Q2   West     3
Returns a value of type LONG.
None. ROW_NUMBER takes no arguments.
Assign a unique ID to the sales of each region by quarter in descending order, so the highest sales is
given the number of 1.
ROLLUP ROW_NUMBER() TO (Quarter) ORDER BY (Sum_Sales DESC) ROWS
UNBOUNDED PRECEDING
User Defined Functions (UDFs)
User defined functions (UDFs) allow you to define your own per-row processing logic, and then expose
that functionality to users in the Platfora application expression builder.
User defined functions can only be used to implement new row functions, not
aggregate functions. If a computed field that uses a UDF is included in a lens, the
UDF will be executed once for each row during the lens build process. This is good
to keep in mind when writing UDF Java programs, so you do not write programs
that negatively impact lens build resources or execution times.
Writing a Platfora UDF Java Program
User defined functions (UDFs) are written in the Java programming language and implement the
Platfora-provided Java interface, com.platfora.udf.UserDefinedFunction.
Verify that any JAR file that the UDF will use is compatible with the existing libraries Platfora uses.
You can find those libraries in $PLATFORA_HOME/lib.
To define a user defined function for Platfora, you must have the Java Development Kit (JDK) version 6
or 7 installed on the machine where you plan to do your development.
You will also need the com.platfora.udf.UserDefinedFunction interface Java code from
your Platfora master server installation. If you go to the $PLATFORA_HOME/tools/udf directory of
your Platfora master server installation, you will find two files:
• platfora-udf.jar – This is the compiled code for the
com.platfora.udf.UserDefinedFunction interface. You must link to this jar file (place it
in the CLASSPATH) when you compile your UDF Java program.
• /com/platfora/udf/UserDefinedFunction.java – This is the source code for the
Java interface that your UDF classes need to implement. The source code is provided as reference
documentation of the Platfora UserDefinedFunction interface. You can refer to this file when
writing your UDF Java programs.
1. Copy the file $PLATFORA_HOME/tools/udf/platfora-udf.jar to a directory on the
machine where you plan to develop and compile your UDF program.
2. Write a Java program that implements com.platfora.udf.UserDefinedFunction interface.
For example, here is a sample Java program that defines a REPEAT_STRING user defined function.
This simple function repeats an input string a specified number of times.
import java.util.List;

/**
 * Sample user-defined function implementation that demonstrates
 * how to create a REPEAT_STRING function.
 */
public class RepeatString implements com.platfora.udf.UserDefinedFunction {

    /**
     * Returns the name of the user-defined function.
     * The first character in the name must be a letter,
     * and subsequent characters must be either letters,
     * digits, or underscores. You cannot name your function
     * the same name as an existing Platfora
     * built-in function. Names are case-insensitive.
     */
    @Override
    public String getFunctionName() {
        return "REPEAT_STRING";
    }

    /**
     * Returns one of the following values, reflecting the
     * return type of the user-defined function:
     * DATETIME, DOUBLE, FIXED, INTEGER, LONG, or STRING.
     */
    @Override
    public String getReturnType() {
        return "STRING";
    }

    /**
     * Returns an array of Strings, one for each of the
     * input arguments to the user-defined function,
     * specifying the required data type for each argument.
     * The Strings should be of the following values:
     * DATETIME, DOUBLE, FIXED, INTEGER, LONG, STRING.
     */
    @Override
    public String[] getArgumentTypes() {
        return new String[] { "STRING", "INTEGER" };
    }

    /**
     * Returns a human-readable description of what the function
     * does, to be displayed to Platfora users in the
     * Expression Builder. May return null.
     */
    @Override
    public String getDescription() {
        return "The REPEAT_STRING function returns an input string " +
               "repeated a specified number of times.";
    }

    /**
     * Returns a human-readable description explaining the
     * value that the function returns, to be displayed to
     * Platfora users in the Expression Builder. May return null.
     */
    @Override
    public String getReturnValueDescription() {
        return "Returns one value per row of type STRING";
    }

    /**
     * Returns a human-readable example of the function syntax,
     * to be displayed to Platfora users in the Expression
     * Builder. May return null.
     */
    @Override
    public String getExampleUsage() {
        return "CONCAT(\"It's a \", REPEAT_STRING(\"Mad \",4), \"World\")";
    }

    /**
     * The compute method performs the actual work of evaluating
     * the user-defined function. The method should operate on the
     * argument values provided to calculate the function return value
     * and return a Java object of the appropriate type to represent
     * the return value. The following mapping describes the Java
     * object type that is used to represent each Platfora data type:
     * DATETIME -> java.util.Date
     * DOUBLE -> java.lang.Double
     * FIXED -> java.lang.Long
     * INTEGER -> java.lang.Integer
     * LONG -> java.lang.Long
     * STRING -> java.lang.String
     * Note on FIXED type: fixed-precision numbers in Platfora
     * are represented as Longs that have been scaled by a
     * factor of 10,000.
     *
     * In the event that the user-defined function
     * encounters invalid inputs, or the function return value is not
     * defined given the inputs provided, the compute method should
     * return null rather than throwing an exception. The compute
     * method should avoid throwing any exceptions.
     *
     * @param arguments The values of the function inputs.
     *
     * The entries in this list will match the specification
     * provided by the getArgumentTypes method in type, number, and
     * order: for example, if getArgumentTypes returned an array of
     * length 3 with the values STRING, DOUBLE, STRING, then
     * the arguments parameter will be a list of 3 Java
     * objects: a java.lang.String, a java.lang.Double, and a
     * java.lang.String. Any of the values within the
     * arguments List may be null.
     */
    @Override
    public String compute(List arguments) {
        // cast the inputs to the correct types
        final String toRepeat = (String) arguments.get(0);
        final Integer numberOfRepeats = (Integer) arguments.get(1);
        // check for invalid inputs
        if (toRepeat == null || numberOfRepeats == null || numberOfRepeats < 0)
            return null;
        // repeat the input string the specified number of times
        final StringBuilder builder = new StringBuilder();
        for (int i = 0; i < numberOfRepeats; i++) {
            builder.append(toRepeat);
        }
        return builder.toString();
    }
}
3. Compile your .java UDF program file into a .class file (make sure to link to the platfora-udf.jar file or place it in your Java CLASSPATH).
The target Java version must be Java 1.6. Compiling with a target of Java 1.7 will result in an error
when the UDF is used.
For example, to compile the RepeatString.java program using Java 1.6:
javac -source 1.6 -target 1.6 -cp platfora-udf.jar RepeatString.java
4. Create a Java archive file (.jar) containing your .class file.
For example:
jar cf repeat-string-udf.jar RepeatString.class
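Optionally, you can verify that the class file made it into the archive by listing its contents:
jar tf repeat-string-udf.jar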
After you have written and compiled your UDF Java program, you must then install and enable it on the
Platfora master server. See Adding a UDF to the Platfora Expression Builder.
Adding a UDF to the Platfora Expression Builder
After you have written and compiled a user defined function (UDF) Java class, you must install your
class on the Platfora master server and enable it so that it can be seen and used in the Platfora expression
builder.
This task is performed on the Platfora master server.
Before you begin, you must have written and compiled a Java class for your user defined function. See
Writing a Platfora UDF Java Program.
1. Create a directory named extlib in the Platfora data directory on the Platfora master server.
For example:
$ mkdir $PLATFORA_DATA_DIR/extlib
2. Copy the Java archive (.jar) file containing your UDF class to the $PLATFORA_DATA_DIR/
extlib directory on the Platfora master server.
For example:
$ cp repeat-string-udf.jar $PLATFORA_DATA_DIR/extlib/
3. Set the Platfora server configuration property, platfora.udf.class.names, so it contains
the name of your UDF Java class. If you have more than one class, separate the class names with a
comma.
For example, to set this property using the platfora-config command-line utility:
$ $PLATFORA_HOME/bin/platfora-config set --key
platfora.udf.class.names --value RepeatString
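To register two classes at once (MyOtherUdf is a hypothetical second class name), the same command would look like:
$ $PLATFORA_HOME/bin/platfora-config set --key platfora.udf.class.names --value RepeatString,MyOtherUdf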
4. Restart the Platfora server:
$ platfora-services restart
The user defined function will then be available for defining computed field expressions in the Add
Field dialog of the Platfora application.
Due to the way some web browsers cache Javascript files, the newly added
function may not appear in the Functions list for up to 24 hours. However, the
function is immediately available for use and recognized by the Expression autocomplete feature.
Regular Expression Reference
Regular expressions vary in complexity using a combination of basic constructs to describe a string
matching pattern. This reference describes the most common regular expression matching patterns, but
is not a comprehensive list.
Regular expressions, also referred to as regex or regexp, are a standardized collection of special
characters and constructs used for matching strings of text. They provide a flexible and precise language
for matching particular characters, words, or patterns of characters.
Platfora regular expressions are based on the pattern matching syntax of the Java programming
language. For more in depth information on writing valid regular expressions, refer to the Java regular
expression pattern documentation.
Platfora makes use of regular expressions in the following contexts:
• In computed field expressions that use the REGEX or REGEX_REPLACE functions.
• In PARTITION expression statements for event series processing computed fields.
• In the Regex file parser in data ingest.
• In the data source location path descriptor in data ingest.
• In lens filter expressions.
Regex Literal and Special Characters
The most basic form of regular expression pattern matching is the match of a literal character or string.
Regular expressions also have a number of special characters that affect the way a pattern is matched.
This section describes the regular expression syntax for referring to literal characters, special characters,
non-printable characters (such as a tab or a newline), and special character escaping.
The most basic form of pattern matching is the match of literal characters. For example, if the regular
expression is foo and the input string is foo, the match will succeed because the strings are identical.
Certain characters are reserved for special use in regular expressions. These special characters are often
called metacharacters. If you want to use special characters as literal characters, they must be escaped.
Character Name        Character   Reserved For
opening bracket       [           start of a character class
closing bracket       ]           end of a character class
hyphen                -           character ranges within a character class
backslash             \           general escape character
caret                 ^           beginning of string, negation of a character class
dollar sign           $           end of string
period                .           matching any single character
pipe                  |           alternation (OR) operator
question mark         ?           optional quantifier, quantifier minimizer
asterisk              *           zero or more quantifier
plus sign             +           once or more quantifier
opening parenthesis   (           start of a subexpression group
closing parenthesis   )           end of a subexpression group
opening brace         {           start of min/max quantifier
closing brace         }           end of min/max quantifier
There are two ways to force a special character to be treated as an ordinary character:
• Precede the special character with a \ (backslash character). For example, to specify an asterisk as a
literal character instead of a quantifier, use \*.
• Enclose the special character(s) within \Q (starting quote) and \E (ending quote). Everything
between \Q and \E is then treated as literal characters (see the sketch following this list).
• To escape literal double-quotes in a REGEX() expression, double the double-quotes (""). For
example, to extract the inches portion from a height field where example values are 6'2", 5'11":
REGEX(height, "\'(\d+)""$")
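For example, the following sketch uses \Q and \E to treat a prefix containing metacharacters as literal text (price is a hypothetical field whose values look like USD$19.99):
REGEX(price, "\QUSD$\E([0-9.]+)")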
You can use special character sequence constructs to specify non-printable characters in a regular
expression. Some of the most commonly used constructs are:
Construct   Matches
\n          newline character
\r          carriage return character
\t          tab character
\f          form feed character
Regex Character Classes
A character class allows you to specify a set of characters, enclosed in square brackets, that can produce
a single character match. There are also a number of special predefined character classes (backslash
character sequences that are shorthand for the most common character sets).
A character class matches only to a single character. For example, gr[ae]y will match to gray or
grey, but not to graay or graey. The order of the characters inside the brackets does not matter.
You can use a hyphen inside a character class to specify a range of characters. For example,
[a-z] matches a single lower-case letter between a and z. You can also use more than one range, or a
combination of ranges and single characters. For example, [0-9X] matches a numeric digit or the letter
X. Again, the order of the characters and the ranges does not matter.
A caret following an opening bracket specifies characters to exclude from a match. For example,
[^abc] will match any character except a, b, or c.
Construct        Type           Description
[abc]            simple         matches a or b or c
[^abc]           negation       matches any character except a or b or c
[a-zA-Z]         range          matches a through z, or A through Z (inclusive)
[a-d[m-p]]       union          matches a through d, or m through p
[a-z&&[def]]     intersection   matches d, e, or f
[a-z&&[^xq]]     subtraction    matches a through z, except for x and q
Predefined character classes offer convenient shorthands for commonly used regular expressions.
Construct   Description                                                                             Example
.           matches any single character (except newline)                                           .at matches "cat", "hat", and also "bat" in the phrase "batch files"
\d          matches any digit character (equivalent to [0-9])                                       \d matches "3" in "C3PO" and "2" in "file_2.txt"
\D          matches any non-digit character (equivalent to [^0-9])                                  \D matches "S" in "900S" and "Q" in "Q45"
\s          matches any single white-space character (equivalent to [ \t\n\x0B\f\r])                \sbook matches "book" in "blue book" but nothing in "notebook"
\S          matches any single non-white-space character                                            \Sbook matches "book" in "notebook" but nothing in "blue book"
\w          matches any alphanumeric character, including underscore (equivalent to [A-Za-z0-9_])   r\w* matches "rm" and "root"
\W          matches any non-alphanumeric character (equivalent to [^A-Za-z0-9_])                    \W matches "&" in "stmd &", "%" in "100%", and "$" in "$HOME"
POSIX has a set of character classes that denote certain common ranges. They are similar to bracket and
predefined character classes, except they take into account the locale (the local language/coding system).
Construct    Description
\p{Lower}    a lower-case alphabetic character, [a-z]
\p{Upper}    an upper-case alphabetic character, [A-Z]
\p{ASCII}    an ASCII character, [\x00-\x7F]
\p{Alpha}    an alphabetic character, [a-zA-Z]
\p{Digit}    a decimal digit, [0-9]
\p{Alnum}    an alphanumeric character, [a-zA-Z0-9]
\p{Punct}    a punctuation character, one of !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
\p{Graph}    a visible character, [\p{Alnum}\p{Punct}]
\p{Print}    a printable character, [\p{Graph}\x20]
\p{Blank}    a space or tab, [ \t]
\p{Cntrl}    a control character, [\x00-\x1F\x7F]
\p{XDigit}   a hexadecimal digit, [0-9a-fA-F]
\p{Space}    a whitespace character, [ \t\n\x0B\f\r]
Regex Line and Word Boundaries
Boundary matching constructs are used to specify where in a string to apply a matching pattern. For
example, you can search for a particular pattern within a word boundary, or search for a pattern at the
beginning or end of a line.
Construct   Description                                                                             Example
^           matches from the beginning of a line (multi-line matches are currently not supported)   ^172 will match the "172" in IP address "172.18.1.11" but not in "192.172.2.33"
$           matches from the end of a line (multi-line matches are currently not supported)         d$ will match the "d" in "maid" but not in "made"
\b          matches within a word boundary                                                          \bis\b matches the word "is" in "this is my island", but not the "is" in "this" or "island"; \bis matches both "is" and the "is" in "island", but not in "this"
\B          matches within a non-word boundary                                                      \Bb matches "b" in "sbin" but not in "bash"
Regex Quantifiers
Quantifiers specify how often the preceding regular expression construct should match. There are three
classes of quantifiers: greedy, reluctant, and possessive. The difference between greedy, reluctant, and
possessive quantifiers involves what part of the string to try for the initial match, and how to retry if the
initial attempt does not produce a match.
By default, quantifiers are greedy. A greedy quantifier will first try for a match with the entire input
string. If that produces a match, then the match is considered a success, and the engine can move on to
the next construct in the regular expression. If the first try does not produce a match, the engine backs
off one character at a time until a match is found. So a greedy quantifier checks for possible matches in
order from the longest possible input string to the shortest possible input string, recursively trying from
right to left.
Adding a ? (question mark) to a greedy quantifier makes it reluctant. A reluctant quantifier will first try
for a match from the beginning of the input string, starting with the shortest possible piece of the string
that matches the regex construct. If that produces a match, then the match is considered a success, and
the engine can move on to the next construct in the regular expression. If the first try does not produce
a match, the engine adds one character at a time until a match is found. So a reluctant quantifier checks
for possible matches in order from the shortest possible input string to the longest possible input string,
recursively trying from left to right.
Adding a + (plus sign) to a greedy quantifier makes it possessive. A possessive quantifier is like a greedy
quantifier on the first attempt (it tries for a match with the entire input string). The difference is that
unlike a greedy quantifier, a possessive quantifier does not retry a shorter string if a match is not found.
If the initial match fails, the possessive quantifier reports a failed match. It does not make any more
attempts.
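For example, given the input <em>text</em>, the greedy pattern <.+> matches the entire string <em>text</em>; the reluctant pattern <.+?> stops at the first closing bracket and matches just <em>; and the possessive pattern <.++> fails to match at all, because .++ consumes through the end of the input (including the final >) and never backs off to let the closing > of the pattern match.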
Greedy   Reluctant   Possessive   Description                                                                              Example
?        ??          ?+           matches the previous character or construct once or not at all                           st?on matches "son" in "johnson" and "ston" in "johnston" but nothing in "clinton" or "version"
*        *?          *+           matches the previous character or construct zero or more times                           if* matches "if", "iff" in "diff", or "i" in "print"
+        +?          ++           matches the previous character or construct one or more times                            if+ matches "if", "iff" in "diff", but nothing in "print"
{n}      {n}?        {n}+         matches the previous character or construct exactly n times                              o{2} matches "oo" in "lookup" and the first two o's in "fooooo" but nothing in "mount"
{n,}     {n,}?       {n,}+        matches the previous character or construct at least n times                             o{2,} matches "oo" in "lookup", all five o's in "fooooo", but nothing in "mount"
{n,m}    {n,m}?      {n,m}+       matches the previous character or construct at least n times, but no more than m times   F{2,4} matches "FF" in "#FF0000" and the last four F's in "#FFFFFF"
Regex Capturing Groups
Groups are specified by a pair of parentheses around a subpattern in the regular expression. By placing
part of a regular expression inside parentheses, you group that part of the regular expression together.
This allows you to apply regex operators and quantifiers to the entire group at once. Besides grouping
part of a regular expression together, parentheses also create a capturing group. Capturing groups are
used to determine which matching values to save or return from your regular expression.
A regular expression can have more than one group and the groups can be nested. The groups are
numbered 1-n from left to right, starting with the first opening parenthesis. There is always an implicit
group 0, which contains the entire match. For example, the pattern:
(a(b*))+(c)
contains three groups:
group 1: (a(b*))
group 2: (b*)
group 3: (c)
By default, a group captures the text that produces a match. The portion of the string matched by the
grouped subexpression is captured in memory for later retrieval or use, for example as a backreference.
Capturing Groups and the Regex Line Parser
When you choose the Regex line parser during the Parse Data phase of the data ingest process,
Platfora uses capturing groups to determine what parts of the regular expression to return as columns.
The Regex line parser applies the user-supplied regular expression against each line in the source file,
and returns each capturing group in the regular expression as a column value.
For example, suppose you had user records in a file, and the lines were formatted like this:
Name: John Smith Address: 123 Main St. Age: 25 Comment: Active
Name: Sally R. Jones Address: 2 E. El Camino Real Age: 32
Name: Rod Rogers Address: 55 Elm Street Comment: Suspended
You could use the following regular expression to extract the Full Name, Last Name, Address, Age, and
Comment column values:
Name: (.*\s(\p{Alpha}+)) Address:\s+(.*) Age:\s+([0-9]+)(?:\s+Comment:\s+(.*))?
Capturing Groups and the REGEX Function
The REGEX function can be used to extract a portion of a string value. For the REGEX function, only the
value of the first capturing group is returned. For example, if you wanted to match all possible email
address strings with a pattern of username@provider.domain, but only return the provider portion of the
email address from the email field:
REGEX(email,"^[a-zA-Z0-9._%+-]+@([a-zA-Z0-9._-]+)\.[a-zA-Z]{2,4}$")
Capturing Groups and the REGEX_REPLACE Function
The REGEX_REPLACE function is used to match a string value, and replace matched strings with
another value. The REGEX_REPLACE function takes three arguments: an input string, a matching regex,
and a replacement regex. Capturing groups can be used to capture backreferences (see Backreferences),
but do not control what portions of the match are returned (the entire match is always returned).
Backreferences allow you to capture and reuse a subexpression match inside the same regular
expression. You can reuse a capturing group as a backreference by referring to its group number
preceded by a backslash (for example, \1 refers to capturing group 1, \2 refers to capturing group 2,
and so on).
For example, if you wanted to match a pair of HTML tags and their enclosed text, you could capture the
opening tag into a backreference, and then reuse it to match the corresponding closing tag:
(<([A-Z][A-Z0-9]*)\b[^>]*>.*?</\2>)
This regular expression contains two capturing groups, the outermost capturing group (which captures
the entire string), and one which captures the string matched by [A-Z][A-Z0-9]* into backreference
number two. This backreference can then be reused with \2 (backslash two) to match the corresponding
closing HTML tag.
When referring to capturing groups in the previous regular expression, the backreference syntax is
slightly different. The backreference group number is preceded by a dollar sign instead of a backslash
(for example, $1 refers to capturing group 1 of the previous expression). An example of this would be
the REGEX_REPLACE function, which takes two regular expressions: one for the matching string, and
one for the replacement string.
The following example matches the values in a phone_number field where phone number values are
formatted as xxx.xxx.xxxx, and replaces them with phone number values formatted as (xxx) xxx-xxxx.
Notice the backreferences in the replacement expression; they refer to the capturing groups of the
previous matching expression:
REGEX_REPLACE(phone_number,"([0-9]{3})\.([0-9]{3})\.([0-9]{4})","\($1\) $2-$3")
In some cases, you may want to use parentheses to group subpatterns, but not capture text. A
non-capturing group starts with (?: (a question mark and colon following the opening parenthesis). For
example, h(?:a|i|o)t matches hat or hit or hot, but does not capture the a, i, or o from the
subexpression.
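For example, the following sketch combines a capturing group with a non-capturing group (word is a hypothetical field). Against the value hats, the whole pattern matches, but the function returns hat, the text captured by the first capturing group:
REGEX(word, "(h(?:a|i|o)t)s?")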
Appendix A: Platfora Expression Language Reference
An expression computes or produces a value by combining field or column values, constant values, operators,
and functions. Platfora has a built-in expression language. You use the language's functions and operators in
dataset computed fields, vizboard computed fields, lens filters, and programmatic lens queries.
Topics:
• Expression Quick Reference
• Comparison Operators
• Logical Operators
• Arithmetic Operators
• Conditional and NULL Processing
• Event Series Processing
• String Functions
• URL Functions
• IP Address Functions
• Date and Time Functions
• Math Functions
• Data Type Conversion Functions
• Aggregate Functions
• ROLLUP and Window Functions
• User Defined Functions (UDFs)
• Regular Expression Reference
Expression Quick Reference
An expression is a combination of columns (or fields), constant values, operators, and functions used
to evaluate, transform, or produce a value. Simple expressions can be combined to make more complex
expressions. This quick reference describes the functions and operators that can be used to write
expressions.
Platfora's built-in statements, functions and operators are divided into the following categories:
• Conditional and NULL Processing
• Event Series Processing
• String Processing
• Date and Time Processing
• URL Processing
• IP Address Processing
• Mathematical Processing
• Data Type Conversion
• Aggregation and Measure Processing
• ROLLUP and Window Calculations
• User Defined Functions
• Comparison Operators
• Logical Operators
• Arithmetic Operators
Conditional and NULL Processing
Conditional and NULL processing allows you to transform or manipulate data values based on certain
defined conditions. Conditional processing (CASE) can be done at either the dataset or vizboard level.
NULL processing (COALESCE and IS_VALID) is only applicable at the dataset level. During a lens
build, any NULL values in the source data are converted to default values, so lenses and vizboards have
no concept of NULL values.
CASE
  evaluates each row in the dataset according to one or more input conditions, and outputs the
  specified result when the input conditions are met
  Example: CASE WHEN gender = "M" THEN "Male" WHEN gender = "F" THEN "Female" ELSE "Unknown" END
COALESCE
  returns the first valid value (NOT NULL value) from a comma-separated list of expressions
  Example: COALESCE(hourly_wage * 40 * 52, salary)
IS_VALID
  returns 0 if the returned value is NULL, and 1 if the returned value is NOT NULL
  Example: IS_VALID(sale_amount)
Event Series Processing
Event series processing allows you to partition rows of input data, order the rows sequentially (typically
by a timestamp), and search for matching patterns in a set of rows. Computed fields that are defined
in a dataset using a PARTITION expression are considered event series processing computed fields.
Event series processing computed fields are processed differently than regular computed fields. Instead
of computing values from the input of a single row, they compute values from inputs of multiple rows
in the dataset. Event series processing computed fields can only be defined in the dataset - not in the
vizboard.
PACK_VALUES
  returns multiple output values packed into a single string of key/value pairs separated by the
  Platfora default key and pair separators - useful when the OUTPUT clause of a PARTITION
  expression returns multiple output values
  Example: PACK_VALUES("ID",custid,"Age",age)
PARTITION
  partitions the rows of a dataset, orders the rows sequentially (typically by a timestamp), and
  searches for matching patterns in a set of rows
  Example: PARTITION BY SessionID ORDER BY Timestamp PATTERN (A,B,C) DEFINE A AS Page =
    "home.html", B AS Page = "product.html", C AS Page = "checkout.html" OUTPUT "TRUE"
String Functions
String functions allow you to manipulate and transform textual data, such as combining string values or
extracting a portion of a string value.
ARRAY_CONTAINS
  performs a whole string match against a string containing delimited values and returns a 1 or 0
  depending on whether or not the string contains the search value
  Example: ARRAY_CONTAINS(device,",","iPad")
CONCAT
  concatenates (combines together) the results of multiple string expressions
  Example: CONCAT(month,"/",day,"/",year)
FILE_NAME
  returns the original file name from the source file system
  Example: TO_DATE(SUBSTRING(FILE_NAME(),0,8),"yyyyMMdd")
FILE_PATH
  returns the full URI path from the source file system
  Example: TO_DATE(REGEX(FILE_PATH(),"hdfs://myhdfs-server.com/data/logs/(\d{8})/(?:\d{1,3}\.*)+\.log"),"yyyyMMdd")
EXTRACT_COOKIE
  extracts the value of the given cookie identifier from a semi-colon delimited list of cookie
  key=value pairs
  Example: EXTRACT_COOKIE("SSID=ABC; vID=44", "vID") returns 44
EXTRACT_VALUE
  extracts the value for the given key from a string containing delimited key/value pairs
  Example: EXTRACT_VALUE("firstname;daria|lastname;hutch","lastname",";","|") returns hutch
INSTR
  returns an integer indicating the position of a character within a string that is the first
  character of the occurrence of a substring
  Example: INSTR(url,"http://",-1,1)
JAVA_STRING
  returns the unescaped version of a Java unicode character escape sequence as a string value
  Example: CASE WHEN currency == JAVA_STRING("\u00a5") THEN "yes" ELSE "no" END
JOIN_STRINGS
  concatenates (combines together) the results of multiple string expressions with the separator
  in between each non-null value
  Example: JOIN_STRINGS("/",month,day,year)
JSON_ARRAY_CONTAINS
  performs a whole string match against a string formatted as a JSON array and returns a 1 or 0
  depending on whether or not the string contains the search value
  Example: JSON_ARRAY_CONTAINS(software,"platfora")
JSON_DOUBLE
  extracts a DOUBLE value from a field in a JSON object
  Example: JSON_DOUBLE(top_scores,"test_scores.2")
JSON_FIXED
  extracts a FIXED value from a field in a JSON object
  Example: JSON_FIXED(top_scores,"test_scores.2")
JSON_INTEGER
  extracts an INTEGER value from a field in a JSON object
  Example: JSON_INTEGER(top_scores,"test_scores.2")
JSON_LONG
  extracts a LONG value from a field in a JSON object
  Example: JSON_LONG(top_scores,"test_scores.2")
JSON_STRING
  extracts a STRING value from a field in a JSON object
  Example: JSON_STRING(misc,"hobbies.0")
LENGTH
  returns the count of characters in a string value
  Example: LENGTH(name)
REGEX
  performs a whole string match against a string value with a regular expression and returns the
  portion of the string matching the first capturing group of the regular expression
  Example: REGEX(weblog.request_line,"GET\s/([a-zA-Z0-9._%-]+\.[html])\sHTTP/[0-9.]+")
REGEX_REPLACE
  evaluates a string value against a regular expression to determine if there is a match, and
  replaces matched strings with the specified replacement value
  Example: REGEX_REPLACE(phone_number,"([0-9]{3})\.([0-9]{3})\.([0-9]{4})","\($1\) $2-$3")
SPLIT
  breaks down a delimited input string into sections and returns the specified section of the string
  Example: SPLIT("Restaurants>Location>San Francisco",">", -1) returns San Francisco
SUBSTRING
  returns the specified characters of a string value based on the given start and end position
  Example: SUBSTRING(name,0,1)
TO_LOWER
  converts all alphabetic characters in a string to lower case
  Example: TO_LOWER("123 Main Street") returns 123 main street
TO_UPPER
  converts all alphabetic characters in a string to upper case
  Example: TO_UPPER("123 Main Street") returns 123 MAIN STREET
TRIM
  removes leading and trailing spaces from a string value
  Example: TRIM(area_code)
XPATH_STRING
  takes an XML-formatted string and returns the first string matching the given XPath expression
  Example: XPATH_STRING(address,"//address[@type='home']/zipcode")
XPATH_STRINGS
  takes an XML-formatted string and returns a newline-separated array of strings matching the
  given XPath expression
  Example: XPATH_STRINGS(address,"/list/address[1]/street")
XPATH_XML
  takes an XML-formatted string and returns an XML-formatted string matching the given XPath
  expression
  Example: XPATH_XML(address,"//address[last()]")
Date and Time Functions
Date and time functions allow you to manipulate and transform datetime values, such as calculating time
differences between two datetime values, or extracting a portion of a datetime value.
DAYS_BETWEEN
  calculates the whole number of days (ignoring time) between two DATETIME values
  Example: DAYS_BETWEEN(ship_date,order_date)
DATE_ADD
  adds the specified time interval to a DATETIME value
  Example: DATE_ADD(invoice_date,45,"day")
HOURS_BETWEEN
  calculates the whole number of hours (ignoring minutes, seconds, and milliseconds) between two
  DATETIME values
  Example: HOURS_BETWEEN(NOW(),impressions.adview_timestamp)
EXTRACT
  returns the specified portion of a DATETIME value
  Example: EXTRACT("hour",order_date)
MILLISECONDS_BETWEEN
  calculates the whole number of milliseconds between two DATETIME values
  Example: MILLISECONDS_BETWEEN(request_timestamp,response_timestamp)
MINUTES_BETWEEN
  calculates the whole number of minutes (ignoring seconds and milliseconds) between two DATETIME
  values
  Example: MINUTES_BETWEEN(impression_timestamp,conversion_timestamp)
NOW
  returns the current system date and time as a DATETIME value
  Example: YEAR_DIFF(NOW(),users.birthdate)
SECONDS_BETWEEN
  calculates the whole number of seconds (ignoring milliseconds) between two DATETIME values
  Example: SECONDS_BETWEEN(impression_timestamp,conversion_timestamp)
TRUNC
  truncates a DATETIME value to the specified format
  Example: TRUNC(TO_DATE(order_date,"MM/dd/yyyy HH:mm:ss"),"day")
YEAR_DIFF
  calculates the fractional number of years between two DATETIME values
  Example: YEAR_DIFF(NOW(),users.birthdate)
URL Functions
URL functions allow you to extract different portions of a URL string, and decode text that is URLencoded.
URL_AUTHORITY: returns the authority portion of a URL string. Example: URL_AUTHORITY("http://user:password@mycompany.com:8012/mypage.html") returns user:password@mycompany.com:8012
URL_FRAGMENT: returns the fragment portion of a URL string. Example: URL_FRAGMENT("http://platfora.com/news.php?topic=press#Platfora%20News") returns Platfora%20News
URL_HOST: returns the host, domain, or IP address portion of a URL string. Example: URL_HOST("http://user:password@mycompany.com:8012/mypage.html") returns mycompany.com
URL_PATH: returns the path portion of a URL string. Example: URL_PATH("http://platfora.com/company/contact.html") returns /company/contact.html
URL_PORT: returns the port portion of a URL string. Example: URL_PORT("http://user:password@mycompany.com:8012/mypage.html") returns 8012
URL_PROTOCOL: returns the protocol (or URI scheme name) portion of a URL string. Example: URL_PROTOCOL("http://www.platfora.com") returns http
URL_QUERY: returns the query portion of a URL string. Example: URL_QUERY("http://platfora.com/news.php?topic=press&timeframe=today") returns topic=press&timeframe=today
URLDECODE: decodes a string that has been encoded with the application/x-www-form-urlencoded media type. Example: URLDECODE("N%2FA%20or%20%22not%20applicable%22")
IP Address Functions
IP address functions allow you to manipulate and transform STRING data consisting of IP address
values.
CIDR_MATCH: compares two STRING arguments representing a CIDR mask and an IP address, and returns 1 if the IP address falls within the specified subnet mask or 0 if it does not. Example: CIDR_MATCH("60.145.56.0/24","60.145.56.246") returns 1
HEX_TO_IP: converts a hexadecimal-encoded STRING to a text representation of an IP address. Example: HEX_TO_IP(AB20FE01) returns 171.32.254.1
Math Functions
Math functions allow you to perform basic math calculations on numeric values. You can also use the
arithmetic operators to perform simple math calculations, such as addition, subtraction, division and
multiplication.
DIV: divides two LONG values and returns a quotient value of type LONG. Example: DIV(TO_LONG(file_size),1024)
EXP: raises the mathematical constant e to the power (exponent) of a numeric value and returns a value of type DOUBLE. Example: EXP(Value)
FLOOR: returns the largest integer that is less than or equal to the input argument. Example: FLOOR(32.6789) returns 32.0
HASH: evenly partitions data values into the specified number of buckets. Example: HASH(username,20)
LN: returns the natural logarithm of a number. Example: LN(2.718281828) returns 1
MOD: divides two LONG values and returns the remainder value of type LONG. Example: MOD(TO_LONG(file_size),1024)
POW: raises a numeric value to the power (exponent) of another numeric value and returns a value of type DOUBLE. Example: 100 * POW(end_value/start_value, 0.2) - 1
ROUND: rounds a DOUBLE value to the specified number of decimal places. Example: ROUND(32.4678954,2) returns 32.47
Data Type Conversion Functions
Data type conversion functions allow you to cast data values from one data type to another. These
functions are used implicitly whenever you set the data type of a field or column in the Platfora user
interface. The supported data types are: INTEGER, LONG, DOUBLE, FIXED, DATETIME, and STRING.
EPOCH_MS_TO_DATE: converts LONG values to DATETIME values, where the input number represents the number of milliseconds since the epoch. Example: EPOCH_MS_TO_DATE(1360260240000) returns 2013-02-07T18:04:00.000Z
TO_FIXED: converts STRING, INTEGER, LONG, or DOUBLE values to fixed-decimal values. Example: TO_FIXED(opening_price)
TO_DATE: converts STRING values to DATETIME values, and specifies the format of the date and time elements in the string. Example: TO_DATE(order_date,"yyyy.MM.dd 'at' HH:mm:ss z")
TO_DOUBLE: converts STRING, INTEGER, LONG, or DOUBLE values to DOUBLE (decimal) values. Example: TO_DOUBLE(average_rating)
TO_INT: converts STRING, INTEGER, LONG, or DOUBLE values to INTEGER (whole number) values. Example: TO_INT(average_rating)
TO_LONG: converts STRING, INTEGER, LONG, or DOUBLE values to LONG (whole number) values. Example: TO_LONG(average_rating)
TO_STRING: converts values of other data types to STRING (character) values. Example: TO_STRING(sku_number)
Aggregate Functions
An aggregate function groups the values of multiple rows together based on some defined input
expression. Aggregate functions return one value for a group of rows, and are only valid for defining
measures in Platfora. In the dataset, measures can be defined using any of the aggregate functions. In the
vizboard, only the DISTINCT, MAX, or MIN aggregate functions are allowed.
AVG: returns the average of all valid numeric values. Example: AVG(sale_amount)
COUNT: returns the number of rows in a dataset. Example: COUNT(sales.customers)
COUNT_VALID: returns the number of rows for which the given expression is valid. Example: COUNT_VALID(page_views)
DISTINCT: returns the number of distinct values for the given expression. Example: DISTINCT(user_id)
MAX: returns the biggest value from the given input expression. Example: MAX(sale_amount)
MIN: returns the smallest value from the given input expression. Example: MIN(sale_amount)
SUM: returns the total of all values from the given input expression. Example: SUM(sale_amount)
STDDEV: calculates the population standard deviation for a group of numeric values. Example: STDDEV(sale_amount)
VARIANCE: calculates the population variance for a group of numeric values. Example: VARIANCE(sale_amount)
ROLLUP and Window Functions
ROLLUP is a modifier to an aggregate expression that turns an aggregate into a windowed aggregate. Window functions (RANK, DENSE_RANK, NTILE, and ROW_NUMBER) can only be used within a ROLLUP statement. The ROLLUP statement defines the partitioning and ordering of a rowset before the associated aggregate function or window function is applied.
ROLLUP defines a window, or user-specified set of rows, within a query result set. A window function then computes a value for each row in the window. You can use window functions to compute aggregated values such as moving averages, cumulative aggregates, running totals, or top-N-per-group results.
ROLLUP statements can be specified in either the dataset or the vizboard. When using a ROLLUP in a vizboard, the measure for which you are calculating the ROLLUP must already exist in the lens you are using in the vizboard.
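For example, a percent-of-total calculation can be sketched with a windowed aggregate (the Sales measure name here is an illustrative assumption, not a field from this guide):
100 * SUM(Sales) / ROLLUP SUM(Sales) TO ()
Because the TO () clause is empty, the ROLLUP aggregate is computed over the entire result set, so each row's sales are divided by the grand total.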
DENSE_RANK: assigns the rank (position) of each row in a group (partition) of rows and does not skip rank numbers in the event of a tie. Example: ROLLUP DENSE_RANK() TO () ORDER BY ([Sales(Sum)] DESC) ROWS UNBOUNDED PRECEDING
NTILE: divides a partitioned group of rows into the specified number of buckets, and returns the bucket number to which the current row belongs. Example: ROLLUP NTILE(100) TO () ORDER BY ([Total Records] DESC) ROWS UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
RANK: assigns the rank (position) of each row in a group (partition) of rows and skips rank numbers in the event of a tie. Example: ROLLUP RANK() TO () ORDER BY ([Sales(Sum)] DESC) ROWS UNBOUNDED PRECEDING
ROLLUP: a modifier to an aggregate function that turns a regular aggregate function into a windowed, partitioned, or adaptive aggregate function. Example: 100 * COUNT(Flights) / ROLLUP COUNT(Flights) TO ([Departure Date])
ROW_NUMBER: assigns a sequential number (position) to each row in a group (partition) of rows, according to the specified ordering. Example: ROLLUP ROW_NUMBER() TO (Quarter) ORDER BY (Sum_Sales DESC) ROWS UNBOUNDED PRECEDING
User Defined Functions
User defined functions (UDFs) allow you to define your own per-row processing logic, and then expose
that functionality to users in the Platfora application expression builder. See User Defined Functions
(UDFs) for more information.
Comparison Operators
Comparison operators are used to compare the equivalency of two expressions of the same data type.
The result of a comparison expression is a Boolean value (returns 1 for true, 0 for false, or NULL for
invalid). Boolean expressions are most often used to specify data processing conditions or filters.
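For instance, a comparison typically drives a filter or a CASE expression. A small sketch (the age field is an illustrative assumption, not a field from this guide):
CASE WHEN age >= 18 THEN "adult" ELSE "minor" END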
= or ==: Equal to. Example: order_date = "12/22/2011"
>: Greater than. Example: age > 18
!>: Not greater than. Example: age !> 8
<: Less than. Example: age < 30
!<: Not less than. Example: age !< 12
>=: Greater than or equal to. Example: age >= 20
<=: Less than or equal to. Example: age <= 29
<> or != or ^=: Not equal to. Example: age <> 30
Logical Operators
Logical operators are used to define Boolean (true / false) expressions. Logical operators are used
in expressions to test for a condition, and return 1 if the condition is true or 0 if it is false. Logical
operators are often used in lens filters, CASE expressions, PARTITION expressions, and WHERE clauses
of queries.
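For example, multiple conditions can be combined in a single Boolean expression. A small sketch (the age and product_type fields are illustrative assumptions, not fields from this guide):
age >= 20 AND product_type IN("tablet","phone","laptop")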
AND: Test whether two conditions are true.
OR: Test if either of two conditions are true.
value BETWEEN min_value AND max_value: Test whether a date or numeric value is within the min and max values (inclusive). Example: year BETWEEN 2000 AND 2012
IN(list): Test whether a value is within a set. Example: product_type IN("tablet","phone","laptop")
LIKE("pattern"): Simple inclusive case-insensitive character pattern matching. The * character matches any number of characters. The ? character matches exactly one character. Examples: last_name LIKE("?utch*") matches Kutcher, hutch but not Krutcher or crutch; company_name LIKE("platfora") matches Platfora or platfora
value IS NULL: Check whether a field value or expression is null (empty). Example: ship_date IS NULL evaluates to true when the ship_date field is empty
NOT: Reverses the value of other operators. Examples:
• year NOT BETWEEN 2000 AND 2012
• first_name NOT LIKE("Jo?n*") excludes John, jonny but not Jon or Joann
• Date.Weekday NOT IN("Saturday","Sunday")
• purchase_date IS NOT NULL evaluates to true when the purchase_date field is not empty
Arithmetic Operators
Arithmetic operators perform basic math operations on two expressions of the same data type resulting
in a numeric value. The plus (+) and minus (-) operators can also be used to perform arithmetic
operations on DATETIME values.
+: Addition. Example: amount + 10 (add 10 to the value of the amount field)
-: Subtraction. Example: amount - 10 (subtract 10 from the value of the amount field)
*: Multiplication. Example: amount * 100 (multiply the value of the amount field by 100)
/: Division. Example: bytes / 1024 (divide the value of the bytes field by 1024 and return the quotient)
Comparison Operators
Comparison operators are used to compare the equivalency of two expressions of the same data type.
The result of a comparison expression is a Boolean value (returns 1 for true, 0 for false, or NULL for
invalid). Boolean expressions are most often used to specify data processing conditions or filter criteria.
Operator Definitions
= or ==: Equal to. Example: order_date = "12/22/2011"
>: Greater than. Example: age > 18
!>: Not greater than. Example: age !> 8
<: Less than. Example: age < 30
!<: Not less than. Example: age !< 12
>=: Greater than or equal to. Example: age >= 20
<=: Less than or equal to. Example: age <= 29
<> or != or ^=: Not equal to. Example: age <> 30
If you are writing queries with REST and the query string includes an = (equal) character, you must URL encode it as %3D. Failure to encode the character can result in this error:
string matching regex `(?i)\Qnot\E\b' expected but end of source found.
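For example, a filter expression such as age = 30 (the field name is an illustrative assumption) would appear in an encoded REST query string as age%20%3D%2030, with the = character replaced by %3D.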
Logical Operators
Logical operators are used to define Boolean (true / false) expressions. Logical operators are used
in expressions to test for a condition, and return 1 if the condition is true or 0 if it is false. Logical
operators are often used in lens filters, CASE expressions, PARTITION expressions, and WHERE clauses
of queries.
AND: Test whether two conditions are true.
OR: Test if either of two conditions are true.
value BETWEEN min_value AND max_value: Test whether a date or numeric value is within the min and max values (inclusive). Example: year BETWEEN 2000 AND 2012
IN(list): Test whether a value is within a set. Example: product_type IN("tablet","phone","laptop")
LIKE("pattern"): Simple inclusive case-insensitive character pattern matching. The * character matches any number of characters. The ? character matches exactly one character. Examples: last_name LIKE("?utch*") matches Kutcher, hutch but not Krutcher or crutch; company_name LIKE("platfora") matches Platfora or platfora
value IS NULL: Check whether a field value or expression is null (empty). Example: ship_date IS NULL evaluates to true when the ship_date field is empty
NOT: Reverses the value of other operators. Examples:
• year NOT BETWEEN 2000 AND 2012
• first_name NOT LIKE("Jo?n*") excludes John, jonny but not Jon or Joann
• Date.Weekday NOT IN("Saturday","Sunday")
• purchase_date IS NOT NULL evaluates to true when the purchase_date field is not empty
Arithmetic Operators
Arithmetic operators perform basic math operations on two expressions of the same data type resulting
in a numeric value. The plus (+) and minus (-) operators can also be used to perform arithmetic
operations on DATETIME values.
+: Addition. Example: amount + 10 (add 10 to the value of the amount field)
-: Subtraction. Example: amount - 10 (subtract 10 from the value of the amount field)
*: Multiplication. Example: amount * 100 (multiply the value of the amount field by 100)
/: Division. Example: bytes / 1024 (divide the value of the bytes field by 1024 and return the quotient)
Conditional and NULL Processing
Conditional and NULL processing allows you to transform or manipulate data values based on certain
defined conditions. Conditional processing (CASE) can be done at either the dataset or vizboard level.
NULL processing (COALESCE and IS_VALID) is only applicable at the dataset level. During a lens
build, any NULL values in the source data are converted to default values, so lenses and vizboards have
no concept of NULL values.
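For example, a dataset computed field can substitute an explicit default before the lens build converts NULLs. A small sketch (the middle_name field is an illustrative assumption, not a field from this guide):
COALESCE(middle_name,"none")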
CASE
CASE is a row function that evaluates each row in the dataset according to one or more input conditions,
and outputs the specified result when the input conditions are met.
Syntax
CASE WHEN input_condition [AND|OR input_condition] THEN output_expression [...] [ELSE other_output_expression] END
Return Value
Returns one value per row of the same type as the output expression. All output expressions must return
the same data type.
If there are multiple output expressions that return different data types, then you will need to enclose
your entire CASE expression in one of the data type conversion functions to explicitly cast all output
values to a particular data type.
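For example, the following sketch (the status_code field is an illustrative assumption) wraps the entire CASE expression in TO_STRING so that the string literal and the numeric fallback are both cast to STRING:
TO_STRING(CASE WHEN status_code = 200 THEN "OK" ELSE status_code END)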
Input Parameters
WHEN input_condition
Required. The WHEN keyword is used to specify one or more Boolean expressions (see Platfora's
supported conditional operators). If an input value meets the condition, then the output expression
is applied. Input conditions can include other row functions in their expression, but cannot contain
aggregate functions or measure expressions. You can use the AND or OR keywords to combine multiple
input conditions.
THEN output_expression
Required. The THEN keyword is used to specify an output expression when the specified conditions
are met. Output expressions can include other row functions in their expression, but cannot contain
aggregate functions or measure expressions.
ELSE other_output_expression
Optional. The ELSE keyword can be used to specify an alternate output expression to use when the
specified conditions are not met. If an ELSE expression is not supplied, ELSE NULL is the default.
END
Required. Denotes the end of CASE function processing.
Examples
Convert values in the age column into range-based groupings (binning):
CASE WHEN age <= 25 THEN "0-25" WHEN age <= 50 THEN "26-50" ELSE "over
50" END
Transform values in the gender column from one string to another:
CASE WHEN gender = "M" THEN "Male" WHEN gender = "F" THEN "Female" ELSE
"Unknown" END
The vehicle column contains the following values: truck, bus, car, scooter, wagon, bike, tricycle, and
motorcycle. The following example converts multiple values in the vehicle column into a single value:
CASE WHEN vehicle IN ("bike","scooter","motorcycle") THEN "two-wheelers"
ELSE "other" END
COALESCE
COALESCE is a row function that returns the first valid value (NOT NULL value) from a comma-separated list of expressions.
Syntax
COALESCE(expression[,expression][,...])
Return Value
Returns one value per row of the same type as the first valid input expression.
Input Parameters
expression
At least one required. A field name or expression.
Examples
The following example shows an expression to calculate employee yearly income for exempt employees
that have a salary and non-exempt employees that have an hourly_wage. This expression checks the
values of both fields for each row, and returns the value of the first expression that is valid (NOT NULL).
COALESCE(hourly_wage * 40 * 52, salary)
IS_VALID
IS_VALID is a row function that returns 0 if the returned value is NULL, and 1 if the returned value is
NOT NULL. This is useful for computing other calculations where you want to exclude NULL values
(such as when computing averages).
Syntax
IS_VALID(expression)
Return Value
Returns 0 if the returned value is NULL, and 1 if the returned value is NOT NULL.
Input Parameters
expression
Required. A field name or expression.
Examples
Define a computed field using IS_VALID. This returns a row count only for the rows where this field
value is NOT NULL. If a value is NULL, it returns 0 for that row. In this example, we create a computed
field (sale_amount_not_null) using the sale_amount field as the basis.
IS_VALID(sale_amount)
Then you can use the sale_amount_not_null computed field to calculate an accurate average for
sale_amount that excludes NULL values:
SUM(sale_amount)/SUM(sale_amount_not_null)
This is what happens automatically when you use the AVG function.
Event Series Processing
Event series processing allows you to partition rows of input data, order the rows sequentially (typically
by a timestamp), and search for matching patterns in a set of rows. Computed fields that are defined
in a dataset using a PARTITION expression are considered event series processing computed fields.
Event series processing computed fields are processed differently than regular computed fields. Instead
of computing values from the input of a single row, they compute values from inputs of multiple rows
in the dataset. Event series processing computed fields can only be defined in the dataset - not in the
vizboard or a lens query.
PARTITION
PARTITION is an event series processing function that partitions the rows of a dataset, orders the
rows sequentially (typically by a timestamp), and searches for matching patterns in a set of rows.
Computed fields that are defined in a dataset using a PARTITION expression are considered event
series processing computed fields. Event series processing computed fields are processed differently
than regular computed fields. Instead of computing values from the input of a single row, they compute
values from inputs of multiple rows in the dataset.
The PARTITION function can only be used to define a computed field in the
dataset definition (pre-lens build). PARTITION cannot be used to define a
vizboard computed field. Unlike other expressions, PARTITION expressions
cannot be embedded within other functions or expressions; they must be top-level
expressions.
Syntax
PARTITION BY field_name
ORDER BY field_name [ASC|DESC]
PATTERN (pattern_expression)
DEFINE symbol_1 AS filter_expression
[,symbol_n AS filter_expression ]
[, ...]
OUTPUT output_expression
Description
To understand how event series processing works, we'll walk through a simple example of a
PARTITION expression.
This is a simple example of some weblog page view data. Each row represents a page view by a user at
a given point in time. Session IDs are used to group together page views that happened in the same user
session.
Suppose you wanted to know how many sessions included the path of page visits to ‘home.html’ then
‘product.html’ then ‘checkout.html’. You could define a PARTITION expression that groups the rows
by session, orders by time, and then iterates through the rows from top to bottom to find sessions that
match the pattern:
PARTITION BY SessionID
ORDER BY Timestamp
PATTERN (A,B,C)
DEFINE A AS Page = "home.html",
B AS Page = "product.html",
C AS Page = "checkout.html"
OUTPUT "TRUE"
1. The PARTITION BY clause partitions (or groups) the rows of the dataset by session.
2. Within each partition, the ORDER BY clause sorts the rows by time (in ascending order by default).
3. Each DEFINE clause specifies a condition used to evaluate a row, and binds that condition to a
symbol that is then used in the PATTERN clause.
4. The PATTERN clause checks if the conditions are met in the specified order and frequency. This
pattern says that there is a match whenever there are 3 consecutive rows that meet criteria A then B
then C.
5. For a row that satisfies all of the PATTERN criteria, the value of the OUTPUT clause is applied.
Otherwise the output is NULL for rows that don’t meet all of the PATTERN criteria.
Return Value
Returns one value per row of the same type as the output_expression for rows that match the
defined match pattern, otherwise returns NULL for rows that do not match the pattern.
Output values are calculated during the lens build process using a special
event series MapReduce job. Therefore, sample output values for a PARTITION
computed field cannot be shown in the dataset workspace.
Input Parameters
PARTITION BY field_name
Required. The PARTITION BY clause is used to specify a field in the current dataset by
which to partition the rows. Rows that share the same value for this field will be grouped
together, and each group will then be processed independently according to the matching
pattern criteria.
The partition field cannot be a field of a referenced dataset; it must be a field in
the current focus dataset.
ORDER BY field_name
Optional. The ORDER BY clause specifies a field by which to sort the rows within each
partition before applying the match pattern criteria. For event series processing, records are
typically ordered by a DATETIME type field, such as a date or a timestamp. The default
sort order is ascending (first to last or low to high).
The ordering field cannot be a field of a referenced dataset; it must be a field in
the current focus dataset.
PATTERN (pattern_expression)
Required. The PATTERN clause specifies the matching pattern to search for within a
partition of rows. The pattern_expression is expressed in a format similar to a regular
expression. The pattern_expression can include:
• A symbol that represents some match criteria (as declared in the DEFINE clause).
• A symbol followed by one of the following regex quantifiers:
? (matches once or not at all - greedy construct)
?? (matches once or not at all - reluctant construct)
* (matches zero or more times - greedy construct)
*? (matches zero or more times - reluctant construct)
+ (matches one or more times - greedy construct)
+? (matches one or more times - reluctant construct)
** (matches the empty sequence, or one or more of the quantified symbol, with gaps
allowed in between. The match need not begin or end with the quantified symbol)
*+ (matches the empty sequence, or one or more of the quantified symbol, with gaps
allowed in between. The match must end with the quantified symbol)
++ (matches the quantified symbol, followed by zero or more of the quantified symbol,
with gaps allowed in between. The match must end with the quantified symbol)
+* (matches the quantified symbol, followed by zero or more of the quantified symbol,
with gaps allowed in between. The match need not end with the quantified symbol)
• A symbol or pattern of symbols anchored by the regex special characters for the
beginning of string.
^ (marks the beginning of the set of rows that match to the pattern)
• patternA|patternB - The alternation operator (pipe symbol) between two symbols or
patterns signifies an OR match.
• patternA,patternB - The concatenation operator (comma) between two symbols or
patterns signifies a match when pattern B immediately follows pattern A.
• patternA->patternB - The follows operator (minus and greater-than sign) between
two symbols or patterns signifies a match when pattern B eventually follows pattern A.
• (pattern_expression) - By default, pattern expressions are matched from left to
right. If parentheses are used to group sub-expressions, the sub-expression within the
parentheses is evaluated first.
You cannot apply a quantifier to a parenthesized group. For example, you cannot write
((A,B,C)*) to indicate that the asterisk quantifier applies to the whole (A,B,C)
expression.
DEFINE symbol AS filter_expression
Required. The DEFINE clause is used to enumerate symbols used in the PATTERN clause
(or in the filter_expression of a subsequent symbol definition).
A symbol is a name used to refer to some pattern matching criteria. This can be any name
or token that follows Platfora's object naming rules. For example, if the name contains
spaces, special characters, keywords, or starts with a number, you must enclose the name
in brackets [] to escape it. Otherwise, this can be any logical name that helps you identify a
piece of pattern matching logic in your expression.
The filter_expression is a Boolean (true or false) expression that operates on each row of
the partition.
A filter_expression can contain:
• The special expression TRUE or 1, meaning allow the match to occur for any row in the
partition.
• Any field_name in the current dataset.
• symbol.field_name - A field from the dataset qualified by the name of a symbol
that (1) appears only once in the PATTERN clause, (2) precedes this symbol in the
PATTERN clause, and (3) is not followed by a repetition quantifier in the PATTERN
clause.
For example:
PATTERN (A, B) DEFINE A AS TRUE, B AS product = A.product
This means that the expression for symbol B will match to a row if the product field
for that row is also equal to the product field for the row that is bound to symbol A.
• Any of the comparison operators, such as greater than, less than, equals, and so on.
• The keywords AND or OR (for combining multiple criteria in a single filter expression)
• FIRST|LAST(symbol.field_name) - A field from the dataset, qualified by the name
of a symbol that (1) only appears once in the PATTERN clause, (2) precedes this symbol
in the PATTERN clause, and (3) is followed by a repetition quantifier in the PATTERN
clause (*,*?,+, or +?). This returns the field value for the first or last row when the
pattern matches to a set of rows.
For example:
PATTERN (A+) DEFINE A AS product = FIRST(A.product) OR COUNT(A)=0
The pattern A+ will match to a series of consecutive rows that all have the same value
for the product field as the first row in the sequence. If the current row happens to be
the first row in the sequence, then it will also be included in the match.
A FIRST or LAST expression evaluates to NULL if it refers to a
symbol that ends up matching an empty sequence. Make sure
your expression handles the row at the beginning or end of a
sequence if you want that row to match as well.
• Any computed expression that operates on the fields or expressions listed above and/or
on literal values.
OUTPUT output_expression
Required. An expression that specifies what the output value should be. The output
expression can refer to:
• The field declared in the PARTITION BY clause.
• symbol.field_name - A field from the dataset, qualified by the name of a symbol that
(1) appears only once in the PATTERN clause, and (2) is not followed by a repetition
quantifier in the PATTERN clause. This will output the matching field value.
• COUNT(symbol) where symbol (1) appears only once in the PATTERN clause, and
(2) is followed by a repetition quantifier in the PATTERN clause. This will output the
sequence number of the row that matched the symbol pattern.
• FIRST | LAST | SUM | COUNT | AVG(symbol.field_name) where symbol (1)
appears only once in the PATTERN clause, and (2) is followed by a repetition quantifier
in the PATTERN clause. This will output an aggregated value for a set of rows that
matched the symbol pattern.
• Since you can only output a single column value, you can use the PACK_VALUES
function to output multiple results in a single column as key/value pairs.
Examples
'Session Start Time' Expression
Calculate a user session by partitioning by user and ordering by time. The matching logic represented
by symbol A checks if the time of the current row is less than 30 minutes from the preceding row. If
it is, then it is considered part of the same session as the previous row. Otherwise, the current row is
considered the start of a new session. The PATTERN (A+) means that the matching logic represented
by symbol A must be true for one or more consecutive rows. The output then returns the time of the first
row in a session.
PARTITION BY UserID
ORDER BY Timestamp
PATTERN (A+)
DEFINE A AS COUNT(A)=0
OR MINUTES_BETWEEN(Timestamp,LAST(A.Timestamp)) < 30
OUTPUT FIRST(A.Timestamp)
'Click Number in Session' Expression
Calculate where a click happened in a session by partitioning by session and ordering by time. The
matching logic represented by symbol A simply matches to any row in the session. The PATTERN (A
+) means that the matching logic represented by symbol A must be true for one or more consecutive
rows. The output then returns the count of the row within the partition (based on its order or position in
the partition).
PARTITION BY [Session ID]
ORDER BY Timestamp
PATTERN (A+)
DEFINE A AS TRUE
OUTPUT COUNT(A)
'Path to Page' Expression
This is a complicated expression that looks back from the current row's position to determine the
previous 4 pages viewed in a session. Since a PARTITION expression can only output one column value
as its result, the OUTPUT clause uses the PACK_VALUES function to return the previous page positions
1,2,3, and 4 in one output value. You can then use a series of EXTRACT_VALUE expressions to create
individual columns for each prior page view in the path.
PARTITION BY SessionID
ORDER BY Timestamp
PATTERN (^OtherPreviousPages*?, Page4Back??, Page3Back??, Page2Back??,
Page1Back??, CurrentPage)
DEFINE OtherPreviousPages AS TRUE,
Page4Back AS TRUE,
Page3Back AS TRUE,
Page2Back AS TRUE,
Page1Back AS TRUE,
CurrentPage AS TRUE
OUTPUT PACK_VALUES("Back4",Page4Back.Page, "Back3",Page3Back.Page,
"Back2",Page2Back.Page, "Back1",Page1Back.Page)
‘Page -1 Back’ Expression
Use the output from the Path to Page expression and extract the last page viewed before the current
page.
EXTRACT_VALUE([Path to Page],"Back1")
PACK_VALUES
PACK_VALUES is a row function that returns multiple output values packed into a single string of key/
value pairs separated by the Platfora default key and pair separators. This is useful when the OUTPUT
clause of a PARTITION expression returns multiple output values. The string returned is in a format that
can be read by the EXTRACT_VALUE function. PACK_VALUES uses the same key and pair separator
values that EXTRACT_VALUE uses (the Unicode escape sequences \u0003 and \u0002, respectively).
Syntax
PACK_VALUES(key_string,value_expression[,key_string,value_expression]
[,...])
Return Value
Returns one value per row of type STRING. If the value for either key_string or
value_expression of a pair is null or contains either of the two separators, the full key/value pair is
omitted from the return value.
Input Parameters
key_string
At least one required. A field name of any type, a literal string or number, or an expression that returns
any value.
value_expression
At least one required. A field name of any type, a literal string or number, or an expression that returns
any value. The expression must include one value_expression instance for each key_string instance.
Examples
Combine the values of the custid and age fields into a single string field.
PACK_VALUES("ID",custid,"Age",age)
The following expression returns ID\u00035555\u0002Age\u000329 when the value of the custid field is
5555 and the value of the age field is 29:
PACK_VALUES("ID",custid,"Age",age)
The following expression returns Age\u000329 when the value of the age field is 29:
PACK_VALUES("ID",NULL,"Age",age)
The following expression returns 29 as a STRING value when the age field is an INTEGER and its value
is 29:
EXTRACT_VALUE(PACK_VALUES("ID",custid,"Age",age),"Age")
You might want to use the PACK_VALUES function to combine multiple field values into a single value
in the OUTPUT clause of the PARTITION (event series processing) function. Then you can use the
EXTRACT_VALUE function in a different computed field in the dataset to get one of the values returned
by the PARTITION function. For example, in the example below, the PARTITION function creates a set
of rows that defines the previous five web pages accessed in a particular user session:
PARTITION BY Session
ORDER BY Time DESC
PATTERN (A?, B?, C?, D?, E)
DEFINE A AS true, B AS true, C AS true, D AS true, E AS true
OUTPUT PACK_VALUES("A", A.Page, "B", B.Page, "C", C.Page, "D", D.Page)
String Functions
String functions allow you to manipulate and transform textual data, such as combining string values or
extracting a portion of a string value.
CONCAT
CONCAT is a row function that returns a string by concatenating (combining together) the results of
multiple string expressions.
Syntax
CONCAT(value_expression[,value_expression][,...])
Return Value
Returns one value per row of type STRING.
Input Parameters
value_expression
At least one required. A field name of any type, a literal string or number, or an expression that returns
any value.
Examples
Combine the values of the month, day, and year fields into a single date field formatted as MM/DD/
YYYY.
CONCAT(month,"/",day,"/",year)
ARRAY_CONTAINS
ARRAY_CONTAINS is a row function that performs a whole string match against a string containing
delimited values and returns a 1 or 0 depending on whether or not the string contains the search value.
Syntax
ARRAY_CONTAINS(array_string,"delimiter","search_string")
Return Value
Returns one value per row of type INTEGER. A return value of 1 indicates a positive match, and a return
value of 0 indicates no match.
Input Parameters
array_string
Required. The name of a field or expression of type STRING (or a literal string) that contains a valid
array.
delimiter
Required. The delimiter used between values in the array string. This can be a name of a field or
expression of type STRING.
search_string
Required. The literal string that you want to search for. This can be a name of a field or expression of
type STRING.
Examples
If you had a device field that contained a comma delimited list formatted like this:
Safari,iPad
You could determine whether or not the device used was an iPad using the following expression:
ARRAY_CONTAINS(device,",","iPad")
The following expressions return 1:
ARRAY_CONTAINS("platfora","|","platfora")
ARRAY_CONTAINS("platfora|hadoop|2.3","|","hadoop")
The following expressions return 0:
ARRAY_CONTAINS("platfora","|","plat")
ARRAY_CONTAINS("platfora,hadoop","|","platfora")
FILE_NAME
FILE_NAME is a row function that returns the original file name from the source file system. This is
useful when the source data that comprises a dataset comes from multiple files, and there is useful
information in the file names themselves (such as dates or server names). You can use FILE_NAME in
combination with other string processing functions to extract useful information from the file name.
Syntax
FILE_NAME()
Return Value
Returns one value per row of type STRING.
Examples
Your dataset is based on daily log files that use an 8 character date as part of the file name. For example,
20120704.log is the file name used for the log file created on July 4, 2012. The following expression
uses FILE_NAME in combination with SUBSTRING and TO_DATE to create a date field from the first 8
characters of the file name.
TO_DATE(SUBSTRING(FILE_NAME(),0,8),"yyyyMMdd")
Your dataset is based on log files that use the server IP address as part of the file name. For example,
172.12.131.118.log is the log file name for server 172.12.131.118. The following expression uses
FILE_NAME in combination with REGEX to extract the IP address from the file name.
REGEX(FILE_NAME(),"(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})\.log")
FILE_PATH
FILE_PATH is a row function that returns the full URI path from the source file system. This is
useful when the source data that comprises a dataset comes from multiple files, and there is useful
information in the directory names or file names themselves (such as dates or server names). You can
use FILE_PATH in combination with other string processing functions to extract useful information
from the file path.
Syntax
FILE_PATH()
Return Value
Returns one value per row of type STRING.
Examples
Your dataset is based on daily log files that are organized into directories by date on the source file
system, and the file names are the server IP address of the server that produced the log file. For
example, the URI path to a log file produced by server 172.12.131.118 on July 4, 2012 is hdfs://myhdfs-server.com/data/logs/20120704/172.12.131.118.log.
The following expression uses FILE_PATH in combination with REGEX and TO_DATE to create a date
field from the date directory name.
TO_DATE(REGEX(FILE_PATH(),"hdfs://myhdfs-server.com/data/logs/(\d{8})/(?:
\d{1,3}\.*)+\.log"),"yyyyMMdd")
And the following expression uses FILE_NAME and REGEX to extract the server IP address from the file
name:
REGEX(FILE_NAME(),"(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})\.log")
EXTRACT_COOKIE
EXTRACT_COOKIE is a row function that extracts the value of the given cookie identifier from a semicolon delimited list of cookie key=value pairs. This function can be used to extract a particular cookie
value from a combined web access log Cookie column.
Syntax
EXTRACT_COOKIE("cookie_list_string",cookie_key_string)
Return Value
Returns the value of the specified cookie key as type STRING.
Input Parameters
cookie_list_string
Required. A field or literal string that has a semi-colon delimited list of cookie key=value pairs.
cookie_key_string
Required. The cookie key name for which to extract the cookie value.
Examples
Extract the value of the vID cookie from a literal cookie string:
EXTRACT_COOKIE("SSID=ABC; vID=44", "vID") returns 44
Extract the value of the vID cookie from a field named Cookie:
EXTRACT_COOKIE(Cookie,"vID")
EXTRACT_VALUE
EXTRACT_VALUE is a row function that extracts the value for the given key from a string containing
delimited key/value pairs.
Syntax
EXTRACT_VALUE(string,key_name [,delimiter] [,pair_delimiter])
Return Value
Returns the value of the specified key as type STRING.
Input Parameters
string
Required. A field or literal string that contains a delimited list of key/value pairs.
key_name
Required. The key name for which to extract the value.
delimiter
Optional. The delimiter used between the key and the value. If not specified, the value \u0003 is used.
This is the Unicode escape sequence for the end of text character (which is the default delimiter used
by Hive).
pair_delimiter
Optional. The delimiter used between key/value pairs when the input string contains more than one key/
value pair. If not specified, the value \u0002 is used. This is the Unicode escape sequence for the start of
text character (which is the default delimiter used by Hive).
Examples
Extract the value of the lastname key from a literal string of key/value pairs:
EXTRACT_VALUE("firstname;daria|lastname;hutch","lastname",";","|")
returns hutch
Extract the value of the email key from a string field named contact_info that contains strings in the
format of key:value,key:value:
EXTRACT_VALUE(contact_info,"email",":",",")
INSTR
INSTR is a row function that returns an integer indicating the position of a character within a string that
is the first character of the occurrence of a substring. Platfora's INSTR function is similar to the FIND
function in Excel, except that the first letter is position 0 and the order of the arguments is reversed.
Syntax
INSTR(string,substring,position,occurrence)
Return Value
Returns one value per row of type INTEGER. The first position is indicated with the value of zero (0).
Input Parameters
string
Required. The name of a field or expression of type STRING (or a literal string).
substring
Required. A literal string or name of a field that specifies the substring to search for in string.
position
Optional. An integer that specifies at which character in string to start searching for substring. A value
of 0 (zero) starts the search at the beginning of string. Use a positive integer to start searching from
the beginning of string, and use a negative integer to start searching from the end of string. When no
position is specified, INSTR searches at the beginning of the string (0).
occurrence
Optional. A positive integer that specifies which occurrence of substring to search for. When no
occurrence is specified, INSTR searches for the first occurrence of the substring (1).
Examples
Return the position of the first occurrence of the substring "http://" starting at the end of the url field:
INSTR(url,"http://",-1,1)
The following expression searches for the second occurrence of the substring "st" starting at the
beginning of the string "bestteststring". INSTR finds that the substring starts at the seventh character in
the string, so it returns 6:
INSTR("bestteststring","st",0,2)
The following expression searches backward for the second occurrence of the substring "st" starting at 7
characters before the end of the string "bestteststring". INSTR finds that the substring starts at the third
character in the string, so it returns 2:
INSTR("bestteststring","st",-7,2)
JAVA_STRING
JAVA_STRING is a row function that returns the unescaped version of a Java unicode character escape
sequence as a string value. This is useful when you want to specify unicode characters in an expression.
For example, you can use JAVA_STRING to specify the unicode value representing a control character.
Syntax
JAVA_STRING(unicode_escape_sequence)
Return Value
Returns the unescaped version of the specified unicode character, one value per row of type STRING.
Input Parameters
unicode_escape_sequence
Required. A STRING value containing a unicode character expressed as a Java unicode escape
sequence. Unicode escape sequences consist of a backslash '\' (ASCII character 92, hex 0x5c), a
'u' (ASCII 117, hex 0x75), optionally one or more additional 'u' characters, and four hexadecimal digits
(the characters '0' through '9' or 'a' through 'f' or 'A' through 'F'). Such sequences represent the UTF-16
encoding of a Unicode character. For example, the letter 'a' is equivalent to '\u0061'.
Examples
Evaluates whether the currency field is equal to the yen symbol.
CASE WHEN currency == JAVA_STRING("\u00a5") THEN "yes" ELSE "no" END
JOIN_STRINGS
JOIN_STRINGS is a row function that returns a string by concatenating (combining together) the results
of multiple values with the separator in between each non-null value.
Syntax
JOIN_STRINGS(separator,value_expression[,value_expression][,...])
Return Value
Returns one value per row of type STRING.
Input Parameters
separator
Required. A field name of type STRING, a literal string, or an expression that returns a string.
value_expression
At least one required. A field name of any type, a literal string or number, or an expression that returns
any value.
Examples
Combine the values of the month, day, and year fields into a single date field formatted as MM/DD/
YYYY.
JOIN_STRINGS("/",month,day,year)
The following expression returns NULL:
JOIN_STRINGS("+",NULL,NULL,NULL)
The following expression returns a+b:
JOIN_STRINGS("+","a","b",NULL)
JSON_ARRAY_CONTAINS
JSON_ARRAY_CONTAINS is a row function that performs a whole string match against a string
formatted as a JSON array and returns a 1 or 0 depending on whether or not the string contains the
search value.
Syntax
JSON_ARRAY_CONTAINS(json_array_string,"search_string")
Return Value
Returns one value per row of type INTEGER. A return value of 1 indicates a positive match, and a return
value of 0 indicates no match.
Input Parameters
json_array_string
Required. The name of a field or expression of type STRING (or a literal string) that contains a valid
JSON array. A JSON array is an ordered sequence of values separated by commas and enclosed in
square brackets.
search_string
Required. The literal string that you want to search for. This can be a name of a field or expression of
type STRING.
Examples
If you have a software field that contains a JSON array formatted like this:
["hadoop","platfora"]
The following expression returns 1:
JSON_ARRAY_CONTAINS(software,"platfora")
JSON_DOUBLE
JSON_DOUBLE is a row function that extracts a DOUBLE value from a field in a JSON object.
Syntax
JSON_DOUBLE(json_string,"json_field")
Return Value
Returns one value per row of type DOUBLE.
Input Parameters
json_string
Required. The name of a field or expression of type STRING (or a literal string) that contains a valid
JSON object.
json_field
Required. The key or name of the field value you want to extract.
For top-level fields, specify the name identifier (key) of the field.
To access fields within a nested object, specify a dot-separated path of field names (for example
top_level_field_name.nested_field_name).
To extract a value from an array, specify the dot-separated path of field names and the array position
starting at 0 for the first value in an array, 1 for the second value, and so on (for example, field_name.0).
If the name identifier contains dot or period characters within the name itself, escape the name by
enclosing it in brackets (for example, [field.name.with.dot].[another.dot.field.name]).
If the field name is null (empty), use brackets with nothing in between as the identifier, for example [].
Examples
If you had a top_scores field that contained a JSON object formatted like this (with the values contained
in an array):
{"practice_scores":["538.67","674.99","1021.52"], "test_scores":
["753.21","957.88","1032.87"]}
You could extract the third value of the test_scores array using the expression:
JSON_DOUBLE(top_scores,"test_scores.2")
JSON_FIXED
JSON_FIXED is a row function that extracts a FIXED value from a field in a JSON object.
Syntax
JSON_FIXED(json_string,"json_field")
Return Value
Returns one value per row of type FIXED.
Input Parameters
json_string
Required. The name of a field or expression of type STRING (or a literal string) that contains a valid
JSON object.
json_field
Required. The key or name of the field value you want to extract.
For top-level fields, specify the name identifier (key) of the field.
To access fields within a nested object, specify a dot-separated path of field names (for example
top_level_field_name.nested_field_name).
To extract a value from an array, specify the dot-separated path of field names and the array position
starting at 0 for the first value in an array, 1 for the second value, and so on (for example, field_name.0).
If the name identifier contains dot or period characters within the name itself, escape the name by
enclosing it in brackets (for example, [field.name.with.dot].[another.dot.field.name]).
If the field name is null (empty), use brackets with nothing in between as the identifier, for example [].
Examples
If you had a top_scores field that contained a JSON object formatted like this (with the values contained
in an array):
{"practice_scores":["538.67","674.99","1021.52"], "test_scores":
["753.21","957.88","1032.87"]}
You could extract the third value of the test_scores array using the expression:
JSON_FIXED(top_scores,"test_scores.2")
JSON_INTEGER
JSON_INTEGER is a row function that extracts an INTEGER value from a field in a JSON object.
Syntax
JSON_INTEGER(json_string,"json_field")
Return Value
Returns one value per row of type INTEGER.
Input Parameters
json_string
Required. The name of a field or expression of type STRING (or a literal string) that contains a valid
JSON object.
json_field
Required. The key or name of the field value you want to extract.
For top-level fields, specify the name identifier (key) of the field.
To access fields within a nested object, specify a dot-separated path of field names (for example
top_level_field_name.nested_field_name).
To extract a value from an array, specify the dot-separated path of field names and the array position
starting at 0 for the first value in an array, 1 for the second value, and so on (for example, field_name.0).
If the name identifier contains dot or period characters within the name itself, escape the name by
enclosing it in brackets (for example, [field.name.with.dot].[another.dot.field.name]).
If the field name is null (empty), use brackets with nothing in between as the identifier, for example [].
Examples
If you had an address field that contained a JSON object formatted like this:
{"street_address":"123 B Street", "city":"San Mateo", "state":"CA",
"zip_code":"94403"}
You could extract the zip_code value using the expression:
JSON_INTEGER(address,"zip_code")
If you had a top_scores field that contained a JSON object formatted like this (with the values contained
in an array):
{"practice_scores":["538","674","1021"], "test_scores":
["753","957","1032"]}
You could extract the third value of the test_scores array using the expression:
JSON_INTEGER(top_scores,"test_scores.2")
JSON_LONG
JSON_LONG is a row function that extracts a LONG value from a field in a JSON object.
Syntax
JSON_LONG(json_string,"json_field")
Return Value
Returns one value per row of type LONG.
Input Parameters
json_string
Required. The name of a field or expression of type STRING (or a literal string) that contains a valid
JSON object.
json_field
Required. The key or name of the field value you want to extract.
For top-level fields, specify the name identifier (key) of the field.
To access fields within a nested object, specify a dot-separated path of field names (for example
top_level_field_name.nested_field_name).
To extract a value from an array, specify the dot-separated path of field names and the array position
starting at 0 for the first value in an array, 1 for the second value, and so on (for example, field_name.0).
If the name identifier contains dot or period characters within the name itself, escape the name by
enclosing it in brackets (for example, [field.name.with.dot].[another.dot.field.name]).
If the field name is null (empty), use brackets with nothing in between as the identifier, for example [].
Examples
If you had a top_scores field that contained a JSON object formatted like this (with the values contained
in an array):
{"practice_scores":["538","674","1021"], "test_scores":
["753","957","1032"]}
You could extract the third value of the test_scores array using the expression:
JSON_LONG(top_scores,"test_scores.2")
JSON_STRING
JSON_STRING is a row function that extracts a STRING value from a field in a JSON object.
Syntax
JSON_STRING(json_string,"json_field")
Return Value
Returns one value per row of type STRING.
Input Parameters
json_string
Required. The name of a field or expression of type STRING (or a literal string) that contains a valid
JSON object.
json_field
Required. The key or name of the field value you want to extract.
For top-level fields, specify the name identifier (key) of the field.
To access fields within a nested object, specify a dot-separated path of field names (for example
top_level_field_name.nested_field_name).
To extract a value from an array, specify the dot-separated path of field names and the array position
starting at 0 for the first value in an array, 1 for the second value, and so on (for example, field_name.0).
If the name identifier contains dot or period characters within the name itself, escape the name by
enclosing it in brackets (for example, [field.name.with.dot].[another.dot.field.name]).
If the field name is null (empty), use brackets with nothing in between as the identifier, for example [].
Examples
If you had an address field that contained a JSON object formatted like this:
{"street_address":"123 B Street", "city":"San Mateo", "state":"CA",
"zip":"94403"}
You could extract the state value using the expression:
JSON_STRING(address,"state")
If you had a misc field that contained a JSON object formatted like this (with the values contained in an
array):
{"hobbies":["sailing","hiking","cooking"], "interests":
["art","music","travel"]}
You could extract the first value of the hobbies array using the expression:
JSON_STRING(misc,"hobbies.0")
LENGTH
LENGTH is a row function that returns the count of characters in a string value.
Syntax
LENGTH(string)
Return Value
Returns one value per row of type INTEGER.
Input Parameters
string
Required. The name of a field or expression of type STRING (or a literal string).
Examples
Return count of characters from values in the name field. For example, the value Bob would return a
length of 3, Julie would return a length of 5, and so on:
LENGTH(name)
REGEX
REGEX is a row function that performs a whole string match against a string value with a regular
expression and returns the portion of the string matching the first capturing group of the regular
expression.
Syntax
REGEX(string_expression,"regex_matching_pattern")
Return Value
Returns the matched STRING value of the first capturing group of the regular expression. If there is no
match, returns NULL.
Input Parameters
string_expression
Required. The name of a field or expression of type STRING (or a literal string).
regex_matching_pattern
Required. A regular expression pattern based on the regular expression pattern matching syntax of the
Java programming language. To return a non-NULL value, the regular expression pattern must match
the entire string value.
Regular Expression Constructs
This section lists a summary of the most commonly used constructs for defining a regular expression
matching pattern. See the Regular Expression Reference for more information about regular expression
support in Platfora.
Literal and Special Characters
The most basic form of pattern matching is the match of literal characters. For example, if the regular
expression is foo and the input string is foo, the match will succeed because the strings are identical.
Certain characters are reserved for special use in regular expressions. These special characters are often
called metacharacters. If you want to use special characters as literal characters, they must be escaped.
You can escape a single character using a \ (backslash), or escape a character sequence by enclosing it
in \Q ... \E.
To escape literal double-quotes, double the double-quotes ("").
Character Name        Character  Reserved For
opening bracket       [          start of a character class
closing bracket       ]          end of a character class
hyphen                -          character ranges within a character class
backslash             \          general escape character
caret                 ^          beginning of string, negation of a character class
dollar sign           $          end of string
period                .          matching any single character
pipe                  |          alternation (OR) operator
question mark         ?          optional quantifier, quantifier minimizer
asterisk              *          zero or more quantifier
plus sign             +          once or more quantifier
opening parenthesis   (          start of a subexpression group
closing parenthesis   )          end of a subexpression group
opening brace         {          start of min/max quantifier
closing brace         }          end of min/max quantifier
Character Class Constructs
A character class allows you to specify a set of characters, enclosed in square brackets, that can produce
a single character match. There are also a number of special predefined character classes (backslash
character sequences that are shorthand for the most common character sets).
Construct     Type          Description
[abc]         simple        matches a or b or c
[^abc]        negation      matches any character except a or b or c
[a-zA-Z]      range         matches a through z, or A through Z (inclusive)
[a-d[m-p]]    union         matches a through d, or m through p
[a-z&&[def]]  intersection  matches d, e, or f
[a-z&&[^xq]]  subtraction   matches a through z, except for x and q
Predefined Character Classes
Predefined character classes offer convenient shorthands for commonly used regular expressions.
.    matches any single character (except newline).
     Example: .at matches "cat", "hat", and also "bat" in the phrase "batch files".
\d   matches any digit character (equivalent to [0-9]).
     Example: \d matches "3" in "C3PO" and "2" in "file_2.txt".
\D   matches any non-digit character (equivalent to [^0-9]).
     Example: \D matches "S" in "900S" and "Q" in "Q45".
\s   matches any single white-space character (equivalent to [ \t\n\x0B\f\r]).
     Example: \sbook matches "book" in "blue book" but nothing in "notebook".
\S   matches any single non-white-space character.
     Example: \Sbook matches "book" in "notebook" but nothing in "blue book".
\w   matches any alphanumeric character, including underscore (equivalent to [A-Za-z0-9_]).
     Example: r\w* matches "rm" and "root".
\W   matches any non-alphanumeric character (equivalent to [^A-Za-z0-9_]).
     Example: \W matches "&" in "stmd &", "%" in "100%", and "$" in "$HOME".
Line and Word Boundaries
Boundary matching constructs are used to specify where in a string to apply a matching pattern. For
example, you can search for a particular pattern within a word boundary, or search for a pattern at the
beginning or end of a line.
^    matches from the beginning of a line (multi-line matches are currently not supported).
     Example: ^172 will match the "172" in IP address "172.18.1.11" but not in "192.172.2.33".
$    matches from the end of a line (multi-line matches are currently not supported).
     Example: d$ will match the "d" in "maid" but not in "made".
\b   matches within a word boundary.
     Example: \bis\b matches the word "is" in "this is my island", but not the "is" part of
     "this" or "island". \bis matches both "is" and the "is" in "island", but not in "this".
\B   matches within a non-word boundary.
     Example: \Bb matches "b" in "sbin" but not in "bash".
Quantifiers
Quantifiers specify how often the preceding regular expression construct should match. There are three
classes of quantifiers: greedy, reluctant, and possessive. The difference between greedy, reluctant, and
possessive quantifiers involves what part of the string to try for the initial match, and how to retry if the
initial attempt does not produce a match.
Greedy  Reluctant  Possessive  Description and Example
?       ??         ?+          matches the previous character or construct once or not at all.
                               Example: st?on matches "son" in "johnson" and "ston" in
                               "johnston" but nothing in "clinton" or "version".
*       *?         *+          matches the previous character or construct zero or more times.
                               Example: if* matches "if", "iff" in "diff", or "i" in "print".
+       +?         ++          matches the previous character or construct one or more times.
                               Example: if+ matches "if", "iff" in "diff", but nothing in
                               "print".
{n}     {n}?       {n}+        matches the previous character or construct exactly n times.
                               Example: o{2} matches "oo" in "lookup" and the first two o's in
                               "fooooo" but nothing in "mount".
{n,}    {n,}?      {n,}+       matches the previous character or construct at least n times.
                               Example: o{2,} matches "oo" in "lookup" and all five o's in
                               "fooooo" but nothing in "mount".
{n,m}   {n,m}?     {n,m}+      matches the previous character or construct at least n times,
                               but no more than m times.
                               Example: F{2,4} matches "FF" in "#FF0000" and the last four F's
                               in "#FFFFFF".
Capturing and Non-Capturing Groups
Groups are specified by a pair of parentheses around a subpattern in the regular expression. A pattern can
have more than one group and the groups can be nested. The groups are numbered 1-n from left to right,
starting with the first opening parenthesis. There is always an implicit group 0, which contains the entire
match. For example, the pattern:
(a(b*))+(c)
contains three groups:
group 1: (a(b*))
group 2: (b*)
group 3: (c)
Capturing Groups
By default, a group captures the text that produces a match, and only the most recent match is captured.
The REGEX function returns the string that matches the first capturing group in the regular expression.
For example, if the input string to the expression above was abc, the entire REGEX function would
match to abc, but only return the result of group 1, which is ab.
Non-Capturing Groups
In some cases, you may want to use parentheses to group subpatterns, but not capture text. A
non-capturing group starts with (?: (a question mark and colon following the opening parenthesis). For
example, h(?:a|i|o)t matches hat or hit or hot, but does not capture the a, i, or o from the
subexpression.
Examples
Match all possible email address strings with a pattern of username@provider.domain, but only return
the provider portion of the email address from the email field:
REGEX(email,"^[a-zA-Z0-9._%+-]+@([a-zA-Z0-9._-]+)\.[a-zA-Z]{2,4}$")
Match the request line of a web log, where the value is in the format of:
GET /some_page.html HTTP/1.1
and return just the requested HTML page names:
REGEX(weblog.request_line,"GET\s/([a-zA-Z0-9._%-]+\.html)\sHTTP/[0-9.]+")
Extract the inches portion from a height field where example values are 6'2", 5'11" (notice the
escaping of the literal quote with a double double-quote):
REGEX(height, "\d\'(\d+)""")
Extract all of the contents of the device field when the value is either iPod, iPad, or iPhone:
REGEX(device,"(iP[ao]d|iPhone)")
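Because the pattern must match the entire string, a pattern that matches only a portion of the value
returns NULL. For example (illustrative literal values, not from the guide):
REGEX("version 4.5","([0-9.]+)") returns NULL
REGEX("version 4.5","version\s([0-9.]+)") returns 4.5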
REGEX_REPLACE
REGEX_REPLACE is a row function that evaluates a string value against a regular expression to
determine if there is a match, and replaces matched strings with the specified replacement value.
Syntax
REGEX_REPLACE(string_expression,"regex_match_pattern","regex_replace_pattern")
Return Value
Returns a STRING value: the value of string_expression with each match of regex_match_pattern
replaced by regex_replace_pattern. If there is no match, returns the value of string_expression
unchanged.
Input Parameters
string_expression
Required. The name of a field or expression of type STRING (or a literal string).
regex_match_pattern
Required. A string literal or regular expression pattern based on the regular expression pattern matching
syntax of the Java programming language. You can use capturing groups to create backreferences that
can be used in the regex_replace_pattern. You might want to use a string literal to make a case-sensitive
match. For example, when you enter jane as the match value, the function matches jane but not Jane.
The function matches all occurrences of a string literal in the string expression.
regex_replace_pattern
Required. A string literal or regular expression pattern based on the regular expression pattern
matching syntax of the Java programming language. You can refer to backreferences from the
regex_match_pattern using the syntax $n (where n is the group number).
Regular Expression Constructs
The regular expression constructs available to REGEX_REPLACE are the same as those available to
REGEX: literal and special characters, character class constructs, predefined character classes, line and
word boundaries, and quantifiers. See the construct summaries in the REGEX section above, and the
Regular Expression Reference, for more information.
Examples
Match the values in a phone_number field where phone number values are formatted as
xxx.xxx.xxxx and replace them with phone number values formatted as (xxx) xxx-xxxx:
REGEX_REPLACE(phone_number,"([0-9]{3})\.([0-9]{3})\.([0-9]{4})","($1) $2-$3")
Match the values in a name field where name values are formatted as firstname lastname and
replace them with name values formatted as lastname, firstname:
REGEX_REPLACE(name,"(.*) (.*)","$2, $1")
Match the string literal mrs in a title field and replace it with the string literal Mrs:
REGEX_REPLACE(title,"mrs","Mrs")
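REGEX_REPLACE replaces every occurrence that matches, not just the first. For example (illustrative
literal values, not from the guide, assuming the documented all-occurrences behavior):
REGEX_REPLACE("a1b22c","[0-9]+","#") returns a#b#c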
SPLIT
SPLIT is a row function that breaks down a delimited input string into sections and returns the specified
section of the string. A section is any sub-string between occurrences of the specified delimiter.
Syntax
SPLIT(input_string_expression,"delimiter_string",position_integer)
Return Value
Returns one value per row of type STRING.
Input Parameters
input_string_expression
Required. The name of a field or expression of type STRING (or a literal string).
delimiter_string
Required. A literal string representing the delimiter used to separate values in the input string. The
delimiter can be a single character or multiple characters.
position_integer
Required. An integer representing the position of the section in the input string that you want to extract.
Positive integers count the position from the beginning of the string, and negative integers count the
position from the end of the string. A value of 0 returns NULL.
Examples
Return the last section (here, the third) of the literal delimited string
Restaurants>Location>San Francisco:
SPLIT("Restaurants>Location>San Francisco",">",-1) returns San Francisco
Return the first section of a phone_number field where phone number values are in the format of
123-456-7890:
SPLIT(phone_number,"-",1)
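The delimiter can also be longer than one character. For example (illustrative literal values, not from
the guide):
SPLIT("2015::06::28","::",2) returns 06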
SUBSTRING
SUBSTRING is a row function that returns the specified characters of a string value based on the given
start and end position.
Syntax
SUBSTRING(string,start,end)
Return Value
Returns one value per row of type STRING.
Input Parameters
string
Required. The name of a field or expression of type STRING (or a literal string).
start
Required. An integer that specifies where the returned characters start (inclusive), with 0 being the first
character of the string. If start is greater than the number of characters, then an empty string is returned.
If start is greater than end, then an empty string is returned.
end
Required. A positive integer that specifies where the returned characters end (exclusive), with the end
character not being part of the return value. If end is greater than the number of characters, the whole
string value (from start) is returned.
Examples
Return the first letter of the name field:
SUBSTRING(name,0,1)
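The positions need not fall inside the value; out-of-range positions follow the rules described above.
For example (illustrative literal values, not from the guide):
SUBSTRING("platfora",0,4) returns plat
SUBSTRING("platfora",4,100) returns fora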
TO_LOWER
TO_LOWER is a row function that converts all alphabetic characters in a string to lower case.
Syntax
TO_LOWER(string_expression)
Return Value
Returns one value per row of type STRING.
Input Parameters
string_expression
Required. The name of a field or expression of type STRING (or a literal string).
Examples
Return the literal input string 123 Main Street in all lower case letters:
TO_LOWER("123 Main Street") returns 123 main street
TO_UPPER
TO_UPPER is a row function that converts all alphabetic characters in a string to upper case.
Syntax
TO_UPPER(string_expression)
Return Value
Returns one value per row of type STRING.
Input Parameters
string_expression
Required. The name of a field or expression of type STRING (or a literal string).
Examples
Return the literal input string 123 Main Street in all upper case letters:
TO_UPPER("123 Main Street") returns 123 MAIN STREET
TRIM
TRIM is a row function that removes leading and trailing spaces from a string value.
Syntax
TRIM(string_expression)
Return Value
Returns one value per row of type STRING.
Input Parameters
string_expression
Required. The name of a field or expression of type STRING (or a literal string).
Examples
Return the value of the area_code field without any leading or trailing spaces. For example, if the input
string is " 650 ", then the return value would be "650":
TRIM(area_code)
Return the value of the phone_number field without any leading or trailing spaces. For example, if the
input string is " 650 123-4567 ", then the return value would be "650 123-4567" (note that the extra
spaces in the middle of the string are not removed, only the spaces at the beginning and end of the
string):
TRIM(phone_number)
XPATH_STRING
XPATH_STRING is a row function that takes an XML-formatted string and returns the first string
matching the given XPath expression.
Syntax
XPATH_STRING(xml_formatted_string,"xpath_expression")
Return Value
Returns one value per row of type STRING.
If the XPath expression matches more than one string in the given XML node, this function will return
the first match only. To return all matches, use XPATH_STRINGS instead.
Input Parameters
xml_formatted_string
Required. The name of a field or a literal string that contains a valid XML node (a snippet of XML
consisting of a parent element and one or more child nodes).
xpath_expression
Required. An XPath expression that refers to a node, element, or attribute within the XML string passed
to this expression. Any XPath expression that complies with the XML Path Language (XPath) Version 1.0
specification is valid.
Examples
These example XPATH_STRING expressions assume you have a field in your dataset named address
that contains XML-formatted strings such as this:
<list>
  <address type="work">
    <street>1300 So. El Camino Real</street>
    <street>Suite 600</street>
    <city>San Mateo</city>
    <state>CA</state>
    <zipcode>94403</zipcode>
  </address>
  <address type="home">
    <street>123 Oakdale Street</street>
    <street/>
    <city>San Francisco</city>
    <state>CA</state>
    <zipcode>94123</zipcode>
  </address>
</list>
Get the zipcode value from any address element where the type attribute equals home:
XPATH_STRING(address,"//address[@type='home']/zipcode")
returns: 94123
Get the city value from the second address element:
XPATH_STRING(address,"/list/address[2]/city")
returns: San Francisco
Get the values from all child elements of the first address element (as one string):
XPATH_STRING(address,"/list/address")
returns: 1300 So. El Camino RealSuite 600 San MateoCA94403
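XPath expressions can also select attribute values. Get the type attribute of the first address element:
XPATH_STRING(address,"/list/address[1]/@type")
returns: work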
XPATH_STRINGS
XPATH_STRINGS is a row function that takes an XML-formatted string and returns a newline-separated
array of strings matching the given XPath expression.
Syntax
XPATH_STRINGS(xml_formatted_string,"xpath_expression")
Return Value
Returns one value per row of type STRING.
If the XPath expression matches more than one string in the given XML node, this function will return
all matches separated by a newline (you cannot specify a different delimiter).
Input Parameters
xml_formatted_string
Required. The name of a field or a literal string that contains a valid XML node (a snippet of XML
consisting of a parent element and one or more child nodes).
xpath_expression
Required. An XPath expression that refers to a node, element, or attribute within the XML string passed
to this expression. Any XPath expression that complies with the XML Path Language (XPath) Version 1.0
specification is valid.
Examples
These example XPATH_STRINGS expressions assume you have a field in your dataset named address
that contains XML-formatted strings such as this:
<list>
  <address type="work">
    <street>1300 So. El Camino Real</street>
    <street>Suite 600</street>
    <city>San Mateo</city>
    <state>CA</state>
    <zipcode>94403</zipcode>
  </address>
  <address type="home">
    <street>123 Oakdale Street</street>
    <street/>
    <city>San Francisco</city>
    <state>CA</state>
    <zipcode>94123</zipcode>
  </address>
</list>
Get all zipcode values from all address elements:
XPATH_STRINGS(address,"//address/zipcode")
returns:
94123
94403
Get all street values from the first address element:
XPATH_STRINGS(address,"/list/address[1]/street")
returns:
1300 So. El Camino Real
Suite 600
Get the values from all child elements of all address elements (as one string per line):
XPATH_STRINGS(address,"/list/address")
returns:
123 Oakdale StreetSan FranciscoCA94123
1300 So. El Camino RealSuite 600 San MateoCA94403
XPATH_XML
XPATH_XML is a row function that takes an XML-formatted string and returns an XML-formatted string
matching the given XPath expression.
Syntax
XPATH_XML(xml_formatted_string,"xpath_expression")
Return Value
Returns one value per row of type STRING in XML format.
Input Parameters
xml_formatted_string
Required. The name of a field or a literal string that contains a valid XML node (a snippet of XML
consisting of a parent element and one or more child nodes).
xpath_expression
Required. An XPath expression that refers to a node, element, or attribute within the XML string passed
to this expression. Any XPath expression that complies with the XML Path Language (XPath) Version 1.0
specification is valid.
Examples
These example XPATH_XML expressions assume you have a field in your dataset named address
that contains XML-formatted strings such as this:
<list>
  <address type="work">
    <street>1300 So. El Camino Real</street>
    <street>Suite 600</street>
    <city>San Mateo</city>
    <state>CA</state>
    <zipcode>94403</zipcode>
  </address>
  <address type="home">
    <street>123 Oakdale Street</street>
    <street/>
    <city>San Francisco</city>
    <state>CA</state>
    <zipcode>94123</zipcode>
  </address>
</list>
Get the last address node and its child nodes in XML format:
XPATH_XML(address,"//address[last()]")
returns:
<address type="home">
  <street>123 Oakdale Street</street>
  <street/>
  <city>San Francisco</city>
  <state>CA</state>
  <zipcode>94123</zipcode>
</address>
Get the city value from the second address node in XML format:
XPATH_XML(address,"/list/address[2]/city")
returns: <city>San Francisco</city>
Get the first address node and its child nodes in XML format:
XPATH_XML(address,"/list/address[1]")
returns:
<address type="work">
  <street>1300 So. El Camino Real</street>
  <street>Suite 600</street>
  <city>San Mateo</city>
  <state>CA</state>
  <zipcode>94403</zipcode>
</address>
URL Functions
URL functions allow you to extract different portions of a URL string, and decode text that is URL-encoded.
URL_AUTHORITY
URL_AUTHORITY is a row function that returns the authority portion of a URL string. The authority
portion of a URL is the part that has the information on how to locate and connect to the server.
Syntax
URL_AUTHORITY(string)
Return Value
Returns the authority portion of a URL as a STRING value, or NULL if the input string is not a valid
URL.
For example, in the string http://www.platfora.com/company/contact.html, the authority
portion is www.platfora.com.
In the string http://user:password@mycompany.com:8012/mypage.html, the authority portion is
user:password@mycompany.com:8012.
In the string mailto:joe@mycompany.com?subject=Topic, the authority portion is NULL.
Input Parameters
string
Required. A field or expression that returns a STRING value in URI (uniform resource identifier) format
of: protocol:authority[/path][?query][#fragment].
The authority portion of the URL contains the host information, which can be specified as a domain
name (www.platfora.com), a host name (localhost), or an IP address (127.0.0.1). The host
information can be preceded by optional user information terminated with @ (for example,
username:password@mycompany.com), and followed by an optional port number preceded by a colon
(for example, localhost:8001).
Examples
Return the authority portion of URL string values in the referrer field:
URL_AUTHORITY(referrer)
Return the authority portion of a literal URL string:
URL_AUTHORITY("http://user:password@mycompany.com:8012/mypage.html")
returns user:password@mycompany.com:8012
URL_FRAGMENT
URL_FRAGMENT is a row function that returns the fragment portion of a URL string.
Syntax
URL_FRAGMENT(string)
Return Value
Returns the fragment portion of a URL as a STRING value. Returns NULL if the URL does not contain
a fragment, or if the input string is not a valid URL.
For example, in the string http://www.platfora.com/contact.html#phone, the fragment
portion is phone.
In the string http://www.platfora.com/contact.html, the fragment portion is NULL.
In the string http://platfora.com/news.php?topic=press#Platfora%20News, the
fragment portion is Platfora%20News.
Input Parameters
string
Required. A field or expression that returns a STRING value in URI (uniform resource identifier) format
of: protocol:authority[/path][?query][#fragment].
The optional fragment portion of the URL is separated by a hash mark (#) and provides direction to a
secondary resource, such as a heading or anchor identifier.
Examples
Return the fragment portion of URL string values in the request field:
URL_FRAGMENT(request)
Return the fragment portion of a literal URL string:
URL_FRAGMENT("http://platfora.com/news.php?topic=press#Platfora%20News")
returns Platfora%20News
Return and decode the fragment portion of a literal URL string:
URLDECODE(URL_FRAGMENT("http://platfora.com/news.php?topic=press#Platfora%20News"))
returns Platfora News
URL_HOST
URL_HOST is a row function that returns the host, domain, or IP address portion of a URL string.
Syntax
URL_HOST(string)
Return Value
Returns the host portion of a URL as a STRING value, or NULL if the input string is not a valid URL.
For example, in the string http://www.platfora.com/company/contact.html, the host
portion is www.platfora.com.
In the string http://admin:password@127.0.0.1:8001/index.html, the host portion is 127.0.0.1.
In the string mailto:joe@mycompany.com?subject=Topic, the host portion is NULL.
Input Parameters
string
Required. A field or expression that returns a STRING value in URI (uniform resource identifier) format
of: protocol:authority[/path][?query][#fragment].
The authority portion of the URL contains the host information, which can be specified as a domain
name (www.platfora.com), a host name (localhost), or an IP address (127.0.0.1).
Examples
Return the host portion of URL string values in the referrer field:
URL_HOST(referrer)
Return the host portion of a literal URL string:
URL_HOST("http://user:password@mycompany.com:8012/mypage.html") returns mycompany.com
URL_PATH
URL_PATH is a row function that returns the path portion of a URL string.
Syntax
URL_PATH(string)
Return Value
Returns the path portion of a URL as a STRING value. Returns NULL if the URL does not contain a
path, or if the input string is not a valid URL.
For example, in the string http://www.platfora.com/company/contact.html, the path
portion is /company/contact.html.
In the string http://admin:password@127.0.0.1:8001/index.html, the path portion is /index.html.
In the string mailto:joe@mycompany.com?subject=Topic, the path portion is joe@mycompany.com.
Input Parameters
string
Required. A field or expression that returns a STRING value in URI (uniform resource identifier) format
of: protocol:authority[/path][?query][#fragment].
The optional path portion of the URL is a sequence of resource location segments separated by a
forward slash (/), conceptually similar to a directory path.
Examples
Return the path portion of URL string values in the request field:
URL_PATH(request)
Return the path portion of a literal URL string:
URL_PATH("http://platfora.com/company/contact.html") returns /company/contact.html
URL_PORT
URL_PORT is a row function that returns the port portion of a URL string.
Syntax
URL_PORT(string)
Return Value
Returns the port portion of a URL as an INTEGER value. If the URL does not specify a port, then returns
-1. If the input string is not a valid URL, returns NULL.
For example, in the string http://localhost:8001, the port portion is 8001.
Input Parameters
string
Required. A field or expression that returns a STRING value in URI (uniform resource identifier) format
of: protocol:authority[/path][?query][#fragment].
The authority portion of the URL contains the host information, which can be specified as a
domain name (www.platfora.com), a host name (localhost), or an IP address (127.0.0.1). The
host information can be followed by an optional port number preceded by a colon (for example,
localhost:8001).
Examples
Return the port portion of URL string values in the referrer field:
URL_PORT(referrer)
Return the port portion of a literal URL string:
URL_PORT("http://user:password@mycompany.com:8012/mypage.html") returns 8012
URL_PROTOCOL
URL_PROTOCOL is a row function that returns the protocol (or URI scheme name) portion of a URL
string.
Syntax
URL_PROTOCOL(string)
Return Value
Returns the protocol portion of a URL as a STRING value, or NULL if the input string is not a valid
URL.
For example, in the string http://www.platfora.com, the protocol portion is http.
In the string ftp://ftp.platfora.com/articles/platfora.pdf, the protocol portion is
ftp.
Input Parameters
string
Required. A field or expression that returns a STRING value in URI (uniform resource identifier) format
of: protocol:authority[/path][?query][#fragment]
The protocol portion of a URL consists of a sequence of characters beginning with a letter and followed
by any combination of letter, number, plus (+), period (.), or hyphen (-) characters, followed by a colon
(:). For example: http:, ftp:, mailto:
Examples
Return the protocol portion of URL string values in the referrer field:
URL_PROTOCOL(referrer)
Return the protocol portion of the literal URL string:
URL_PROTOCOL("http://www.platfora.com") returns http
URL_QUERY
URL_QUERY is a row function that returns the query portion of a URL string.
Syntax
URL_QUERY(string)
Return Value
Returns the query portion of a URL as a STRING value. Returns NULL if the URL does not contain a
query, or if the input string is not a valid URL.
For example, in the string http://www.platfora.com/contact.html, the query portion is
NULL.
In the string http://platfora.com/news.php?topic=press&timeframe=today#Platfora%20News, the
query portion is topic=press&timeframe=today.
In the string mailto:joe@mycompany.com?subject=Topic, the query portion is subject=Topic.
Input Parameters
string
Required. A field or expression that returns a STRING value in URI (uniform resource identifier) format
of: protocol:authority[/path][?query][#fragment].
The optional query portion of the URL is separated by a question mark (?) and typically contains an
unordered list of key=value pairs separated by an ampersand (&) or semicolon (;).
Examples
Return the query portion of URL string values in the request field:
URL_QUERY(request)
Return the query portion of a literal URL string:
URL_QUERY("http://platfora.com/news.php?topic=press&timeframe=today")
returns topic=press&timeframe=today
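To isolate the value of a single key, you can pass the result of URL_QUERY to another string function
such as REGEX. A sketch (illustrative values, not from the guide):
REGEX(URL_QUERY("http://platfora.com/news.php?topic=press&timeframe=today"),
".*topic=([^&]+).*") returns press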
URLDECODE
URLDECODE is a row function that decodes a string that has been encoded with the
application/x-www-form-urlencoded media type. URL encoding, also known as percent-encoding, is a
mechanism for encoding information in a Uniform Resource Identifier (URI). When sent in an HTTP
GET request, application/x-www-form-urlencoded data is included in the query component of the
request URI. When sent in an HTTP POST request, the data is placed in the body of the message, and
the name of the media type is included in the message Content-Type header.
Syntax
URLDECODE(string)
Return Value
Returns a value of type STRING with characters decoded as follows:
• Alphanumeric characters (a-z, A-Z, 0-9) remain unchanged.
• The special characters hyphen (-), comma (,), underscore (_), period (.), and asterisk (*) remain
unchanged.
• The plus sign (+) character is converted to a space character.
• The percent character (%) is interpreted as the start of a special escaped sequence, where in the
sequence %HH, HH represents the hexadecimal value of the byte. For example, some common escape
sequences are:
Percent-encoding sequence    Value
%20                          space
%0A or %0D or %0D%0A         newline
%22                          double quote (")
%25                          percent (%)
%2D                          hyphen (-)
%2E                          period (.)
%3C                          less than (<)
%3E                          greater than (>)
%5C                          backslash (\)
%7C                          pipe (|)
Input Parameters
string
Required. A field or expression that returns a STRING value. It is assumed that all characters in the
input string are one of the following: lower-case letters (a-z), upper-case letters (A-Z), numeric digits
(0-9), or the hyphen (-), comma (,), underscore (_), period (.) or asterisk (*) character. The percent
character (%) is allowed, but is interpreted as the start of a special escaped sequence. The plus character
(+) is allowed, but is interpreted as a space character.
Examples
Decode the values of the url_query field:
URLDECODE(url_query)
Convert a literal URL-encoded string (N%2FA%20or%20%22not%20applicable%22) to a human-readable
value (N/A or "not applicable"):
URLDECODE("N%2FA%20or%20%22not%20applicable%22") returns N/A or "not applicable"
IP Address Functions
IP address functions allow you to manipulate and transform STRING data consisting of IP address
values.
CIDR_MATCH
CIDR_MATCH is a row function that compares two STRING arguments representing a CIDR mask and
an IP address, and returns 1 if the IP address falls within the specified subnet mask or 0 if it does not.
Syntax
CIDR_MATCH(CIDR_string, IP_string)
Return Value
Returns an INTEGER value of 1 if the IP address falls within the subnet indicated by the CIDR mask
and 0 if it does not.
Input Parameters
CIDR_string
Required. A field or expression that returns a STRING value containing either an IPv4 or IPv6 CIDR
mask (Classless InterDomain Routing subnet notation). An IPv4 CIDR mask can only successfully
match IPv4 addresses, and an IPv6 CIDR mask can only successfully match IPv6 addresses.
IP_string
Required. A field or expression that returns a STRING value containing either an IPv4 or IPv6 internet
protocol (IP) address.
Examples
Compare an IPv4 CIDR subnet mask to an IPv4 IP address:
CIDR_MATCH("60.145.56.0/24","60.145.56.246") returns 1
CIDR_MATCH("60.145.56.0/30","60.145.56.246") returns 0
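In the first expression, the /24 mask leaves 8 host bits, so the subnet spans 60.145.56.0 through
60.145.56.255 and contains 60.145.56.246. The /30 mask leaves only 2 host bits (60.145.56.0 through
60.145.56.3), so the same address falls outside it.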
Compare an IPv6 CIDR subnet mask to an IPv6 IP address:
CIDR_MATCH("fe80::/70","FE80::0202:B3FF:FE1E:8329") returns 1
CIDR_MATCH("fe80::/72","FE80::0202:B3FF:FE1E:8329") returns 0
HEX_TO_IP
HEX_TO_IP is a row function that converts a hexadecimal-encoded STRING to a text representation of
an IP address.
Syntax
HEX_TO_IP(string)
Return Value
Returns a value of type STRING representing either an IPv4 or IPv6 address. The type of IP address
returned depends on the input string. An 8 character hexadecimal string will return an IPv4 address. A
32 character hexadecimal string will return an IPv6 address. IPv6 addresses are represented in full
length, without removing any leading zeros and without using the compressed :: notation. For example,
2001:0db8:0000:0000:0000:ff00:0042:8329 rather than 2001:db8::ff00:42:8329. Input strings that do
not contain either 8 or 32 valid hexadecimal characters will return NULL.
Input Parameters
string
Required. A field or expression that returns a hexadecimal-encoded STRING value. The hexadecimal
string must be either 8 characters long (in which case it is converted to an IPv4 address) or 32 characters
long (in which case it is converted to an IPv6 address).
Examples
Return a plain text IP address for each hexadecimal-encoded string value in the byte_encoded_ips
column:
HEX_TO_IP(byte_encoded_ips)
Convert an 8 character hexadecimal-encoded string to a plain text IPv4 address:
HEX_TO_IP("AB20FE01") returns 171.32.254.1
Convert a 32 character hexadecimal-encoded string to a plain text IPv6 address:
HEX_TO_IP("FE800000000000000202B3FFFE1E8329") returns fe80:0000:0000:0000:0202:b3ff:fe1e:8329
Date and Time Functions
Date and time functions allow you to manipulate and transform datetime values, such as calculating time
differences between two datetime values, or extracting a portion of a datetime value.
DAYS_BETWEEN
DAYS_BETWEEN is a row function that calculates the whole number of days (ignoring time) between
two DATETIME values (value1-value2).
Syntax
DAYS_BETWEEN(datetime_1,datetime_2)
Return Value
Returns one value per row of type INTEGER.
Input Parameters
datetime_1
Required. A field or expression of type DATETIME.
datetime_2
Required. A field or expression of type DATETIME.
Examples
Calculate the number of days to ship a product by subtracting the value of the order_date field from the
ship_date field:
DAYS_BETWEEN(ship_date,order_date)
Calculate the number of days since a product's release by subtracting the value of the release_date field
in the product dataset from the current date (the result of the expression):
DAYS_BETWEEN(NOW(),product.release_date)
DATE_ADD
DATE_ADD is a row function that adds the specified time interval to a DATETIME value.
Syntax
DATE_ADD(datetime,quantity,"interval")
Return Value
Returns a value of type DATETIME.
Input Parameters
datetime
Required. A field name or expression that returns a DATETIME value.
quantity
Required. An integer value. To add time intervals, use a positive integer. To subtract time intervals, use
a negative integer.
interval
Required. One of the following time intervals:
• millisecond - Adds the specified number of milliseconds to a datetime value.
• second - Adds the specified number of seconds to a datetime value.
• minute - Adds the specified number of minutes to a datetime value.
• hour - Adds the specified number of hours to a datetime value.
• day - Adds the specified number of days to a datetime value.
• week - Adds the specified number of weeks to a datetime value.
• month - Adds the specified number of months to a datetime value.
• quarter - Adds the specified number of quarters to a datetime value.
• year - Adds the specified number of years to a datetime value.
• weekyear - Adds the specified number of weekyears to a datetime value.
Examples
Add 45 days to the value of the invoice_date field to calculate the date a payment is due:
DATE_ADD(invoice_date,45,"day")
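Because a negative quantity subtracts the interval, the same function can shift a date backward. For
example (an illustrative expression, not from the guide), calculate the date 30 days before the value of
the invoice_date field:
DATE_ADD(invoice_date,-30,"day")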
HOURS_BETWEEN
HOURS_BETWEEN is a row function that calculates the whole number of hours (ignoring minutes,
seconds, and milliseconds) between two DATETIME values (value1-value2).
Syntax
HOURS_BETWEEN(datetime_1,datetime_2)
Return Value
Returns one value per row of type INTEGER.
Input Parameters
datetime_1
Required. A field or expression of type DATETIME.
datetime_2
Required. A field or expression of type DATETIME.
Examples
Calculate the number of hours to ship a product by subtracting the value of the order_date field from
the ship_date field:
HOURS_BETWEEN(ship_date,order_date)
Calculate the number of hours since an advertisement was viewed by subtracting the value of the
adview_timestamp field in the impressions dataset from the current date and time (the result of the
expression):
HOURS_BETWEEN(NOW(),impressions.adview_timestamp)
EXTRACT
EXTRACT is a row function that returns the specified portion of a DATETIME value.
Syntax
EXTRACT("extract_value",datetime)
Return Value
Returns the specified extracted value as type INTEGER. EXTRACT removes leading zeros. For example,
the month of April returns a value of 4, not 04.
Input Parameters
extract_value
Required. One of the following extract values:
• millisecond - Returns the millisecond portion of a datetime value. For example, an input datetime
value of 2012-08-15 20:38:40.213 would return an integer value of 213.
• second - Returns the second portion of a datetime value. For example, an input datetime value of
2012-08-15 20:38:40.213 would return an integer value of 40.
• minute - Returns the minute portion of a datetime value. For example, an input datetime value of
2012-08-15 20:38:40.213 would return an integer value of 38.
• hour - Returns the hour portion of a datetime value. For example, an input datetime value of
2012-08-15 20:38:40.213 would return an integer value of 20.
• day - Returns the day portion of a datetime value. For example, an input datetime value of
2012-08-15 would return an integer value of 15.
• week - Returns the ISO week number for the input datetime value. For example, an input datetime
value of 2012-01-02 would return an integer value of 1 (the first ISO week of 2012 starts on
Monday January 2). An input datetime value of 2012-01-01 would return an integer value of 52
(January 1, 2012 is part of the last ISO week of 2011).
• month - Returns the month portion of a datetime value. For example, an input datetime value of
2012-08-15 would return an integer value of 8.
• quarter - Returns the quarter number for the input datetime value, where quarters start on January 1,
April 1, July 1, or October 1. For example, an input datetime value of 2012-08-15 would return an
integer value of 3.
• year - Returns the year portion of a datetime value. For example, an input datetime value of
2012-01-01 would return an integer value of 2012.
• weekyear - Returns the year value that corresponds to the ISO week number of the input datetime
value. For example, an input datetime value of 2012-01-02 would return an integer value of 2012
(the first ISO week of 2012 starts on Monday January 2). An input datetime value of 2012-01-01
would return an integer value of 2011 (January 1, 2012 is part of the last ISO week of 2011).
datetime
Required. A field name or expression that returns a DATETIME value.
Examples
Extract the hour portion from the order_date datetime field:
EXTRACT("hour",order_date)
Cast the value of the order_date string field to a datetime value using TO_DATE, and extract the ISO
week year:
EXTRACT("weekyear",TO_DATE(order_date,"MM/dd/yyyy HH:mm:ss"))
MILLISECONDS_BETWEEN
MILLISECONDS_BETWEEN is a row function that calculates the whole number of milliseconds between
two DATETIME values (value1-value2).
Syntax
MILLISECONDS_BETWEEN(datetime_1,datetime_2)
Return Value
Returns one value per row of type INTEGER.
Input Parameters
datetime_1
Required. A field or expression of type DATETIME.
datetime_2
Required. A field or expression of type DATETIME.
Examples
Calculate the number of milliseconds it took to serve a web page by subtracting the value of the
request_timestamp field from the response_timestamp field:
MILLISECONDS_BETWEEN(response_timestamp,request_timestamp)
MINUTES_BETWEEN
MINUTES_BETWEEN is a row function that calculates the whole number of minutes (ignoring seconds
and milliseconds) between two DATETIME values (value1-value2).
Syntax
MINUTES_BETWEEN(datetime_1,datetime_2)
Return Value
Returns one value per row of type INTEGER.
Input Parameters
datetime_1
Required. A field or expression of type DATETIME.
datetime_2
Required. A field or expression of type DATETIME.
Examples
Calculate the number of minutes it took for a user to click on an advertisement by subtracting the value
of the impression_timestamp field from the conversion_timestamp field:
MINUTES_BETWEEN(conversion_timestamp,impression_timestamp)
Calculate the number of minutes since a user last logged in by subtracting the login_timestamp field in
the weblogs dataset from the current date and time (the result of the expression):
MINUTES_BETWEEN(NOW(),weblogs.login_timestamp)
NOW
NOW is a scalar function that returns the current system date and time as a DATETIME value. It can be
used in other expressions involving DATETIME type fields, such as DAYS_BETWEEN, HOURS_BETWEEN, or
YEAR_DIFF. Note that the value of NOW is only evaluated at the time a lens is built (it is not
re-evaluated with each query).
Syntax
NOW()
Return Value
Returns the current system date and time as a DATETIME value.
Examples
Calculate a user's age using YEAR_DIFF to subtract the value of the birthdate field in the users
dataset from the current date:
YEAR_DIFF(NOW(),users.birthdate)
Calculate the number of days since a product's release using DAYS_BETWEEN to subtract the value of
the release_date field from the current date:
DAYS_BETWEEN(NOW(),release_date)
SECONDS_BETWEEN
SECONDS_BETWEEN is a row function that calculates the whole number of seconds (ignoring
milliseconds) between two DATETIME values (value1-value2).
Syntax
SECONDS_BETWEEN(datetime_1,datetime_2)
Return Value
Returns one value per row of type INTEGER.
Input Parameters
datetime_1
Required. A field or expression of type DATETIME.
datetime_2
Required. A field or expression of type DATETIME.
Examples
Calculate the number of seconds it took for a user to click on an advertisement by subtracting the value
of the impression_timestamp field from the conversion_timestamp field:
SECONDS_BETWEEN(conversion_timestamp,impression_timestamp)
Calculate the number of seconds since a user last logged in by subtracting the login_timestamp field in
the weblogs dataset from the current date and time (the result of the expression):
SECONDS_BETWEEN(NOW(),weblogs.login_timestamp)
TRUNC
TRUNC is a row function that truncates a DATETIME value to the specified format.
Syntax
TRUNC(datetime,"format")
Return Value
Returns a value of type DATETIME truncated to the specified format.
Input Parameters
datetime
Required. A field or expression that returns a DATETIME value.
format
Required. One of the following format values:
• millisecond - Returns a datetime value truncated to millisecond granularity. Has no effect since
millisecond is already the most granular format for datetime values. For example, an input datetime
value of 2012-08-15 20:38:40.213 would return a datetime value of 2012-08-15 20:38:40.213.
• second - Returns a datetime value truncated to second granularity. For example, an input datetime
value of 2012-08-15 20:38:40.213 would return a datetime value of 2012-08-15 20:38:40.000.
• minute - Returns a datetime value truncated to minute granularity. For example, an input datetime
value of 2012-08-15 20:38:40.213 would return a datetime value of 2012-08-15 20:38:00.000.
• hour - Returns a datetime value truncated to hour granularity. For example, an input datetime value
of 2012-08-15 20:38:40.213 would return a datetime value of 2012-08-15 20:00:00.000.
• day - Returns a datetime value truncated to day granularity. For example, an input datetime value of
2012-08-15 20:38:40.213 would return a datetime value of 2012-08-15 00:00:00.000.
• week - Returns a datetime value truncated to the first day of the week (starting on a Monday). For
example, an input datetime value of 2012-08-15 (a Wednesday) would return a datetime value of
2012-08-13 (the Monday prior).
• month - Returns a datetime value truncated to the first day of the month. For example, an input
datetime value of 2012-08-15 would return a datetime value of 2012-08-01.
• quarter - Returns a datetime value truncated to the first day of the quarter (January 1, April 1, July 1,
or October 1). For example, an input datetime value of 2012-08-15 20:38:40.213 would return a
datetime value of 2012-07-01.
• year - Returns a datetime value truncated to the first day of the year (January 1). For example, an
input datetime value of 2012-08-15 would return a datetime value of 2012-01-01.
• weekyear - Returns a datetime value truncated to the first day of the ISO weekyear (the ISO week
starting with the Monday which is nearest in time to January 1). For example, an input datetime value
of 2008-08-15 would return a datetime value of 2007-12-31. The first day of the ISO weekyear for
2008 is December 31, 2007 (the prior Monday closest to January 1).
Examples
Truncate the order_date datetime field to day granularity:
TRUNC(order_date,"day")
Cast the value of the order_date string field to a datetime value using TO_DATE, and truncate it to day
granularity:
TRUNC(TO_DATE(order_date,"MM/dd/yyyy HH:mm:ss"),"day")
YEAR_DIFF
YEAR_DIFF is a row function that calculates the fractional number of years between two DATETIME
values (value1-value2).
Syntax
YEAR_DIFF(datetime_1,datetime_2)
Return Value
Returns one value per row of type DOUBLE.
Input Parameters
datetime_1
Required. A field or expression of type DATETIME.
datetime_2
Required. A field or expression of type DATETIME.
Examples
Calculate the number of years a user has been a customer by subtracting the value of the
registration_date field from the current date (the result of the expression):
YEAR_DIFF(NOW(),registration_date)
Calculate a user's age by subtracting the value of the birthdate field in the users dataset from the current
date (the result of the expression):
YEAR_DIFF(NOW(),users.birthdate)
Math Functions
Math functions allow you to perform basic math calculations on numeric values. You can also use
arithmetic operators to perform simple math calculations.
DIV
DIV is a row function that divides two LONG values and returns a quotient value of type LONG (the result
is truncated to 0 decimal places).
Syntax
DIV(dividend,divisor)
Return Value
Returns one value per row of type LONG.
Input Parameters
dividend
Required. A field or expression of type LONG.
divisor
Required. A field or expression of type LONG.
Examples
Cast the value of the file_size field to LONG and divide by 1024:
DIV(TO_LONG(file_size),1024)
EXP
EXP is a row function that raises the mathematical constant e to the power (exponent) of a numeric value
and returns a value of type DOUBLE.
Syntax
EXP(power)
Return Value
Returns one value per row of type DOUBLE.
Input Parameters
power
Required. A field or expression of a numeric type.
Examples
Raise e to the power given by the Value field:
EXP(Value)
When the Value field value is 2.0, the result is equal to 7.3890 when truncated to four decimal places.
FLOOR
FLOOR is a row function that returns the largest integer that is less than or equal to the input argument.
Syntax
FLOOR(double)
Return Value
Returns one value per row of type DOUBLE.
Input Parameters
double
Required. A field or expression of type DOUBLE.
Examples
Return the floor value of 32.6789:
FLOOR(32.6789) returns 32.0
HASH
HASH is a row function that evenly partitions data values into the specified number of buckets. It creates
a hash of the input value and assigns that value a bucket number. Equal values will always hash to the
same bucket number.
Syntax
HASH(field_name,integer)
Return Value
Returns one value per row of type INTEGER corresponding to the bucket number that the input value
hashes to.
Input Parameters
field_name
Required. The name of the field whose values you want to partition.
integer
Required. The desired number of buckets. This parameter can be a numeric value of any data type, but
when it is a non-integer value, Platfora truncates the value to an integer. When the value is zero, the
function returns NULL. When the value is negative, the function uses absolute value.
Examples
Partition the values of the username field into 20 buckets:
HASH(username,20)
LN
LN is a row function that returns the natural logarithm of a number. The natural logarithm is the
logarithm to the base e, where e (Euler's number) is a mathematical constant approximately equal to
2.718281828. The natural logarithm of a number x is the power to which the constant e must be raised in
order to equal x.
Syntax
LN(positive_number)
Return Value
Returns the exponent to which base e must be raised to obtain the input value, where e denotes the
constant number 2.718281828. The return value is the same data type as the input value.
For example, LN(7.389) is 2, because e to the power of 2 is approximately 7.389.
Input Parameters
positive_number
Required. A field or expression that returns a number greater than 0. Inputs can be of type INTEGER,
LONG, DOUBLE, or FIXED.
Examples
Return the natural logarithm of base number e, which is approximately 2.718281828:
LN(2.718281828) returns 1
LN(3.0000) returns 1.098612
LN(300.0000) returns 5.703782
MOD
MOD is a row function that divides two LONG values and returns the remainder value of type LONG (the
result is truncated to 0 decimal places).
Syntax
MOD(dividend,divisor)
Return Value
Returns one value per row of type LONG.
Input Parameters
dividend
Required. A field or expression of type LONG.
divisor
Required. A field or expression of type LONG.
Examples
Cast the value of the file_size field to LONG and return the remainder after dividing by 1024:
MOD(TO_LONG(file_size),1024)
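DIV and MOD are complementary: together they split a value into a quotient and a remainder. A sketch
(the duration field is illustrative, not from the guide) that converts a duration in seconds into whole
minutes and leftover seconds:
DIV(TO_LONG(duration),60) returns the whole minutes
MOD(TO_LONG(duration),60) returns the remaining seconds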
POW
POW is a row function that raises a numeric value to the power (exponent) of another numeric value
and returns a value of type DOUBLE.
Syntax
POW(index,power)
Return Value
Returns one value per row of type DOUBLE.
Input Parameters
index
Required. A field or expression of a numeric type.
power
Required. A field or expression of a numeric type.
Examples
Calculate the compound annual growth rate (CAGR) percentage for a given investment over a five year
span.
100 * (POW(end_value/start_value, 0.2) - 1)
Calculate the square of the Value field.
POW(Value,2)
Calculate the square root of the Value field.
POW(Value,0.5)
The following expression returns 1.
POW(0,0)
ROUND
ROUND is a row function that rounds a DOUBLE value to the specified number of decimal places.
Syntax
ROUND(double,number_decimal_places)
Return Value
Returns one value per row of type DOUBLE.
Input Parameters
double
Required. A field or expression of type DOUBLE.
number_decimal_places
Required. An integer that specifies the number of decimal places to round to.
Examples
Round the number 32.4678954 to two decimal places:
ROUND(32.4678954,2) returns 32.47
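Rounding to zero decimal places returns the nearest whole number, still as a DOUBLE:
ROUND(32.4678954,0) returns 32.0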
Data Type Conversion Functions
Data type conversion functions allow you to cast data values from one data type to another. These
functions are used implicitly whenever you set the data type of a field or column in the Platfora user
interface. The supported data types are: INTEGER, LONG, DOUBLE, FIXED, DATETIME, and STRING.
EPOCH_MS_TO_DATE
EPOCH_MS_TO_DATE is a row function that converts LONG values to DATETIME values, where the
input number represents the number of milliseconds since the epoch.
Syntax
EPOCH_MS_TO_DATE(long_expression)
Return Value
Returns one value per row of type DATETIME in UTC format yyyy-MM-dd HH:mm:ss:SSS Z.
Input Parameters
long_expression
Required. A field or expression of type LONG representing the number of milliseconds since the epoch
datetime (January 1, 1970 00:00:00:000 GMT).
Examples
Convert a number representing the number of milliseconds from the epoch to a human-readable date and
time:
EPOCH_MS_TO_DATE(1360260240000) returns 2013-02-07T18:04:00:000Z or February 7,
2013 18:04:00:000 GMT
Or if your data is in seconds instead of milliseconds:
EPOCH_MS_TO_DATE(1360260240 * 1000) returns 2013-02-07T18:04:00:000Z or February
7, 2013 18:04:00:000 GMT
TO_CURRENCY
This function is deprecated. Use the TO_FIXED function instead.
TO_DATE
TO_DATE is a row function that converts STRING values to DATETIME values, and specifies the format
of the date and time elements in the string.
Syntax
TO_DATE(string_expression,"date_format")
Return Value
Returns one value per row of type DATETIME (which by definition is in UTC).
Input Parameters
string_expression
Required. A field or expression of type STRING.
date_format
Required. A pattern that describes how the date is formatted.
Date Pattern Format
Use the following pattern symbols to define your date format. The count and ordering of the pattern
letters determines the datetime format. Any characters in the pattern that are not in the ranges of a-z and
A-Z are treated as quoted delimiter text. For instance, characters such as slash (/) or colon (:) will appear
in the resulting output even if they are not escaped with single quotes.
Table 3: Date Pattern Symbols

Symbol | Meaning | Presentation | Examples | Notes
G | era | text | AD |
C | century of era (0 or greater) | number | 20 |
Y | year of era (0 or greater) | year | 1996 |
x | week year | year | 1996 |
w | week number of week year | number | 27 |
e | day of week (number) | number | 2 |
E | day of week (name) | text | Tuesday; Tue | If the number of pattern letters is 4 or more, the full form is used; otherwise a short or abbreviated form is used.
y | year | year | 1996 | Numeric presentation for year and week year fields is handled specially. For example, if the count of 'y' is 2, the year is displayed as the zero-based year of the century, which is two digits.
D | day of year | number | 189 |
M | month of year | month | July; Jul; 07 | If the number of pattern letters is 3 or more, the text form is used; otherwise the number is used.
d | day of month | number | 10 |
a | half day of day | text | PM |
K | hour of half day (0-11) | number | 0 |
h | clock hour of half day (1-12) | number | 12 |
H | hour of day (0-23) | number | 0 |
k | clock hour of day (1-24) | number | 24 |
m | minute of hour | number | 30 |
s | second of minute | number | 55 |
S | fraction of second | number | 978 |
z | time zone | text | Pacific Standard Time; PST | If the number of pattern letters is 4 or more, the full form is used; otherwise a short or abbreviated form is used.
Z | time zone offset/id | zone | -0800; -08:00; America/Los_Angeles | 'Z' outputs offset without a colon, 'ZZ' outputs the offset with a colon, 'ZZZ' or more outputs the zone id.
' | escape character for text-based delimiters | delimiter | |
'' | literal representation of a single quote | literal | ' |
Examples
Define a new DATETIME computed field based on the order_date base field, which contains timestamps
in the format of: 2014.07.10 at 15:08:56 PDT:
TO_DATE(order_date,"yyyy.MM.dd 'at' HH:mm:ss z")
Define a new DATETIME computed field by first combining individual month, day, year, and
depart_time fields (using CONCAT), and performing a transformation on depart_time to make sure
three-digit times are converted to four-digit times (using REGEX_REPLACE):
TO_DATE(CONCAT(month,"/",day,"/",year,":",REGEX_REPLACE(depart_time,"\b(\d{3})\b","0$1")),"MM/dd/yyyy:HHmm")
Define a new DATETIME computed field based on the created_at base field, which contains timestamps
in the format of: Sat Jan 25 16:35:23 +0800 2014 (this is the timestamp format returned by Twitter's
API):
TO_DATE(created_at,"EEE MMM dd HH:mm:ss Z yyyy")
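As one more sketch, a hypothetical log_date field containing timestamps such as 2014-07-10 15:08:56 (no time zone element) could be converted with:
TO_DATE(log_date,"yyyy-MM-dd HH:mm:ss")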
TO_DOUBLE
TO_DOUBLE is a row function that converts STRING, INTEGER, LONG, or DOUBLE values to DOUBLE
(decimal) values.
Syntax
TO_DOUBLE(expression)
Return Value
Returns one value per row of type DOUBLE.
Input Parameters
expression
Required. A field or expression of type STRING (must be numeric characters), INTEGER, LONG, or
DOUBLE.
Examples
Convert the values of the average_rating field to a double data type:
TO_DOUBLE(average_rating)
Convert the average_rating field to a double data type, but first transform the occurrence of any NA
values to NULL values using a CASE expression:
TO_DOUBLE(CASE WHEN average_rating="N/A" then NULL ELSE average_rating
END)
TO_FIXED
TO_FIXED is a row function that converts STRING, INTEGER, LONG, or DOUBLE values to
fixed-decimal values. Using a FIXED data type to represent monetary values allows you to calculate and
aggregate monetary values with accuracy to a ten-thousandth of a monetary unit.
Syntax
TO_FIXED(expression)
Return Value
Returns one value per row of type FIXED (fixed-decimal value to 10000th accuracy).
Input Parameters
expression
Required. A field or expression of type STRING (must be numeric characters), INTEGER, LONG, or
DOUBLE.
Examples
Convert the opening_price field to a fixed decimal data type:
TO_FIXED(opening_price)
Convert the sale_price field to a fixed decimal data type, but first transform the occurrence of any N/A
string values to NULL values using a CASE expression:
TO_FIXED(CASE WHEN sale_price="N/A" then NULL ELSE sale_price END)
TO_INT
TO_INT is a row function that converts STRING, INTEGER, LONG, or DOUBLE values to INTEGER
(whole number) values. When converting DOUBLE values, everything after the decimal will be truncated
(not rounded up or down).
Syntax
TO_INT(expression)
Return Value
Returns one value per row of type INTEGER.
Input Parameters
expression
Required. A field or expression of type STRING (must be numeric characters), INTEGER, LONG, or
DOUBLE.
Examples
Convert the values of the average_rating field to an integer data type:
TO_INT(average_rating)
Convert the flight_duration field to an integer data type, but first transform the occurrence of any NA
values to NULL values using a CASE expression:
TO_INT(CASE WHEN flight_duration="N/A" then NULL ELSE flight_duration
END)
TO_LONG
TO_LONG is a row function that converts STRING, INTEGER, LONG, or DOUBLE values to LONG (whole
number) values. When converting DOUBLE values, everything after the decimal will be truncated (not
rounded up or down).
Syntax
TO_LONG(expression)
Return Value
Returns one value per row of type LONG.
Input Parameters
expression
Required. A field or expression of type STRING (must be numeric characters), INTEGER, LONG, or
DOUBLE.
Examples
Convert the values of the average_rating field to a long data type:
TO_LONG(average_rating)
Convert the average_rating field to a long data type, but first transform the occurrence of any NA values
to NULL values using a CASE expression:
TO_LONG(CASE WHEN average_rating="N/A" then NULL ELSE average_rating
END)
TO_STRING
TO_STRING is a row function that converts values of other data types to STRING (character) values.
Syntax
TO_STRING(expression)
TO_STRING(datetime_expression,date_format)
Return Value
Returns one value per row of type STRING.
Input Parameters
expression
A field or expression of type FIXED, STRING, INTEGER, LONG, or DOUBLE.
datetime_expression
A field or expression of type DATETIME.
date_format
If converting a DATETIME to a string, a pattern that describes how the date is formatted. See TO_DATE
for the date format patterns.
Examples
Convert the values of the sku_number field to a string data type:
TO_STRING(sku_number)
Convert values in the age column into range-based groupings (binning), and cast the output values to a
STRING:
TO_STRING(CASE WHEN age <= 25 THEN "0-25" WHEN age <= 50 THEN "26-50"
ELSE "over 50" END)
Convert the values of a timestamp datetime field to a string, where the timestamp values are in the
format of: 2002.07.10 at 15:08:56 PDT:
TO_STRING(timestamp,"yyyy.MM.dd 'at' HH:mm:ss z")
Aggregate Functions
An aggregate function groups the values of multiple rows together based on some defined input
expression. Aggregate functions return one value for a group of rows, and are only valid for defining
measures in Platfora. Aggregate functions cannot be combined with row functions.
AVG
AVG is an aggregate function that returns the average of all valid numeric values. It sums all values in
the provided expression and divides by the number of valid (NOT NULL) rows. If you want to compute
an average that includes all values in the row count (including NULL values), you can use a SUM/COUNT
expression instead.
Syntax
AVG(numeric_field)
Return Value
Returns a value of type DOUBLE.
Input Parameters
numeric_field
Required. A field of type INTEGER, LONG, DOUBLE, or FIXED. Unlike row functions, aggregate
functions can only take field names as input.
Examples
Get the average of the valid sale_amount field values:
AVG(sale_amount)
Get the average of the valid net_worth field values in the billionaires data set, which resides in the
samples namespace:
AVG([(samples) billionaires].net_worth)
Get the average of all page_views field values in the web_logs dataset (including NULL values):
SUM(page_views)/COUNT(web_logs)
COUNT
COUNT is an aggregate function that returns the number of rows in a dataset.
Syntax
COUNT([namespace_name]dataset_name)
Return Value
Returns a value of type INTEGER.
Input Parameters
namespace_name
Optional. The name of the namespace in which the dataset resides. If not specified, uses the default
namespace.
dataset_name
Required. The name of the dataset for which to obtain a count of rows. If you want to count rows of a
down-stream dataset that is related to the current dataset, you can specify the hierarchy of dataset names
in the format of:
parent_dataset_name.child_dataset_name.[...]
Examples
Count the rows in the sales dataset:
COUNT(sales)
Count the rows in the billionaires dataset, which resides in the samples namespace:
COUNT([(samples) billionaires])
Count the rows in the customer dataset, which is a related dataset down-stream of sales:
COUNT(sales.customers)
COUNT_VALID
COUNT_VALID is an aggregate function that returns the number of rows for which the given expression
is valid (excludes NULL values).
Syntax
COUNT_VALID(field)
Return Value
Returns a numeric value of type INTEGER.
Input Parameters
field
Required. A field name. Unlike row functions, aggregate functions can only take field names as input.
Examples
Count the valid values in the page_views field:
COUNT_VALID(page_views)
DISTINCT
DISTINCT is an aggregate function that returns the number of distinct values for the given expression.
Syntax
DISTINCT(field)
Return Value
Returns a numeric value of type INTEGER.
Input Parameters
field
Required. A field name. Unlike row functions, aggregate functions can only take field names as input.
Examples
Count the unique values of the user_id field in the currently selected dataset:
DISTINCT(user_id)
Count the unique values of the name field in the billionaires dataset, which resides in the samples
namespace:
DISTINCT([(samples) billionaires].name)
Count the unique values of the customer_id field in the customer dataset, which is a related dataset
down-stream of web sales:
DISTINCT([web sales].customers.customer_id)
MAX
MAX is an aggregate function that returns the largest value from the given input expression.
Syntax
MAX(numeric_or_datetime_field)
Return Value
Returns a numeric or datetime value of the same type as the input expression.
Input Parameters
numeric_or_datetime_field
Required. A field of type INTEGER, LONG, DOUBLE, FIXED, or DATETIME. Unlike row
functions, aggregate functions can only take field names as input.
Examples
Get the highest value from the sale_amount field:
MAX(sale_amount)
Get the latest date from the Session Timestamp datetime field:
MAX([Session Timestamp])
MIN
MIN is an aggregate function that returns the smallest value from the given input expression.
Syntax
MIN(numeric_or_datetime_field)
Return Value
Returns a numeric or datetime value of the same type as the input expression.
Input Parameters
numeric_or_datetime_field
Required. A field of type INTEGER, LONG, DOUBLE, FIXED, or DATETIME. Unlike row
functions, aggregate functions can only take field names as input.
Examples
Get the lowest value from the sale_amount field:
MIN(sale_amount)
Get the earliest date from the Session Timestamp datetime field:
MIN([Session Timestamp])
SUM
SUM is an aggregate function that returns the total of all values from the given input expression.
Syntax
SUM(numeric_field)
Return Value
Returns a numeric value of the same type as the input expression.
Input Parameters
numeric_field
Required. A field of type INTEGER, LONG, DOUBLE, or FIXED. Unlike row functions, aggregate
functions can only take field names as input.
Examples
Add the values of the sale_amount field:
SUM(sale_amount)
Add values of the session count field in the users dataset, which is a related dataset down-stream of
clicks:
SUM(clicks.users.[session count])
STDDEV
STDDEV is an aggregate function that calculates the population standard deviation for a group of
numeric values. Standard deviation is the square root of the variance.
Syntax
STDDEV(numeric_field)
Return Value
Returns a value of type DOUBLE. If there are fewer than two values in the input group, returns NULL.
Input Parameters
numeric_field
Required. A field of type INTEGER, LONG, DOUBLE, or FIXED. Unlike row functions, aggregate
functions can only take field names as input.
Examples
Calculate the standard deviation of the values contained in the sale_amount field:
STDDEV(sale_amount)
VARIANCE
VARIANCE is an aggregate function that calculates the population variance for a group of numeric
values. Variance measures the amount by which all values in a group vary from the average value of
the group. Data with low variance contains values that are identical or similar. Data with high variance
contains values that are not similar. Variance is calculated as the average of the squares of the deviations
from the mean. Squaring the deviations ensures that negative and positive deviations do not cancel each
other out.
Syntax
VARIANCE(numeric_field)
Return Value
Returns a value of type DOUBLE. If there are fewer than two values in the input group, returns NULL.
Input Parameters
numeric_field
Required. A field of type INTEGER, LONG, DOUBLE, or FIXED. Unlike row functions, aggregate
functions can only take field names as input.
Examples
Get the population variance of the values contained in the sale_amount field:
VARIANCE(sale_amount)
ROLLUP and Window Functions
Window functions can only be used in conjunction with ROLLUP. ROLLUP is a modifier to an aggregate
expression that determines the partitioning and ordering of a rowset before the associated aggregate
function or window function is applied. ROLLUP defines a window or user-specified set of rows within
a query result set. A window function then computes a value for each row in the window. You can
use window functions to compute aggregated values such as moving averages, cumulative aggregates,
running totals, or top-N-per-group results.
ROLLUP
ROLLUP is a modifier to an aggregate function that turns a regular aggregate function into a windowed,
partitioned, or adaptive aggregate function. This is useful when you want to compute an aggregation
over a subset of rows within the overall result of a viz query.
Syntax
ROLLUP aggregate_expression [ WHERE input_group_condition [...] ]
[ TO ([partitioning_columns]) ]
[ ORDER BY (ordering_column [ASC | DESC])
ROWS|RANGE window_boundary [window_boundary]
| BETWEEN window_boundary AND window_boundary ]

where window_boundary can be one of:

UNBOUNDED PRECEDING
value PRECEDING
value FOLLOWING
UNBOUNDED FOLLOWING
Description
A regular measure is the result of an aggregation (such as SUM or AVG) applied to some fact or metric
column of a dataset. For example, suppose we had a dataset with the following rows and columns:
Date | Sale Amount | Product | Region
05/01/2013 | 100 | gadget | west
05/01/2013 | 200 | widget | east
06/01/2013 | 100 | gadget | east
06/01/2013 | 400 | widget | west
07/01/2013 | 300 | widget | west
07/01/2013 | 200 | gadget | east
To define a regular measure called Total Sales, we would use the expression:
SUM([Sale Amount])
When this measure is used in a visualization, the group of input records passed into the aggregate
calculation is determined by the dimensions selected by the user when they create the viz. For example,
if the user chose Region as a dimension in the viz, there would be two input groups for which the
measure would be calculated:
Region | Total Sales
east | 500
west | 800
If an aggregate expression includes a ROLLUP clause, the column(s) specified in the TO clause of the
ROLLUP expression determine the additional partitions over which to compute the aggregate expression.
It divides the overall rows returned by the viz query into subsets or buckets, and then computes the
aggregate expression within each bucket. Every ROLLUP expression has implicit partitioning defined: an
absent TO clause treats the entire result set as one partition; an empty TO clause partitions by whatever
dimension columns are present in the viz query.
The WHERE clause is used to filter the input rows that flow into each partition. Input rows that meet the
WHERE clause criteria are included in a partition; rows that do not are excluded.
The ORDER BY with a RANGE or ROW clause is used to define a window frame within each partition
over which to compute the aggregate expression.
When a ROLLUP measure is used in a visualization, the aggregate calculation is computed across a
set of input rows that are related to, but separate from, the other dimension(s) used in the viz. This is
similar to the type of calculation that is done with a regular measure. However, unlike a regular measure,
a ROLLUP measure does not cause the input rows to be grouped into a single result set; the input rows
still retain their separate identities. The ROLLUP clause determines how the input rows are split up for
processing by the ROLLUP's aggregate function.
ROLLUP expressions can be written to make the partitioning adaptive to whatever dimension columns
are selected in the visualization. This is done by using a reference name as the partitioning column, as
opposed to a regular column. For example, suppose we wanted to be able to calculate the total sales for
any granularity of date. We could create an adaptive measure called Rollup Sales to Date that partitions
total sales by date as follows:
ROLLUP SUM([Sale Amount]) TO (Date)
When this measure is used in a visualization, the group of input records passed into the aggregate
calculation is determined by the dimension fields selected by the user in the viz, but partitioned by the
granularity of Date selected by the user. For example, if the user chose the dimensions Date.Month and
Region in the viz, then total sales would be grouped by month and region, but the ROLLUP measure
expression would aggregate the sales by month only.
Notice that the results for the east and west regions are the same - this is because the aggregation
expression is only considering rows that share the same month when calculating the sum of sales.
Month | Rollup Sales to Date (east) | Rollup Sales to Date (west)
May 2013 | 300 | 300
June 2013 | 500 | 500
July 2013 | 500 | 500
Suppose within the date partition, we wanted to calculate the cumulative total day to day. We could
define a window measure called Running Total to Date that looks at each day and all preceding days as
follows:
ROLLUP SUM([Sale Amount]) TO (Date) ORDER BY (Date.Date) ROWS UNBOUNDED
PRECEDING
When this measure is used in a visualization, the group of input records passed into the aggregate
calculation is determined by the dimension fields selected by the user in the viz, and partitioned by the
granularity of Date selected by the user. Within each partition the rows are ordered chronologically (by
Date.Date), and the sum amount is then calculated per date partition by looking at the current row (or
mark), and all rows that come before it within the partition. For example, if the user chose the dimension
Date.Month in the viz, then the ROLLUP measure expression would cumulatively aggregate the sales
within each month.
Month | Date.Date | Running Total to Date
May 2013 | 2013-05-01 | 300
June 2013 | 2013-06-01 | 500
July 2013 | 2013-07-01 | 500
Return Value
Returns a numeric value per partition based on the output type of the aggregate_expression.
Input Parameters
aggregate_expression
Required. An expression containing an aggregate or window function. Simple aggregate
functions such as COUNT, AVG, SUM, MIN, and MAX are supported. Window functions
such as RANK, DENSE_RANK, and NTILE are supported and can only be used in
conjunction with ROLLUP.
Complex aggregate functions such as STDDEV and VARIANCE are not supported.
WHERE input_group_condition
The WHERE clause limits the group of input rows over which to compute the aggregate
expression. The input group condition is a Boolean (true or false) condition defined
using a comparison operator expression. Any row that does not satisfy the condition will
be excluded from the input group used to calculate the aggregated measure value. For
example (note that datetime values must be specified in yyyy-MM-dd format):
WHERE Date.Date BETWEEN 2012-06-01 AND 2012-07-31
WHERE Date.Year BETWEEN 2009 AND 2013
WHERE Company LIKE("Plat*")
WHERE Code IN("a","b","c")
WHERE Sales < 50.00
WHERE Age >= 21
You can specify multiple WHERE clauses in a ROLLUP expression.
TO ([partitioning_columns])
The TO clause is used to specify the dimension column(s) used to partition a group of
input rows. This allows you to calculate a measure value for a specific dimension group
(a subset of input rows) that are somehow related to the other dimension groups used in a
visualization (all input rows). It is possible to define an empty group (meaning all rows) by
using empty parenthesis.
When used in a visualization, measure values are computed for groups of input rows that
return the same value for the columns specified in the partitioning list. For example, if the
Date.Month column is used as a partitioning column, then all records that have the same
value for Date.Month will be grouped together in order to calculate the measure value.
The aggregate expression is applied to the group specified in the TO clause independently
of the other dimension groupings used in the visualization. Note that the partitioning
column(s) specified in the TO clause of an adaptive measure expression must also be
included as dimensions (or grouping columns) in the visualization.
A partitioning column can also be the name of a reference field. Using a reference field allows the
partition criteria to dynamically adapt based on any field of the referenced dataset that is used in a viz.
For example, if the partition column is a reference field pointing to the Date dimension, then any subfield of Date (Date.Year, Date.Month, etc.) can be used as the partitioning column by selecting it in a
viz.
A TO clause with an empty partitioning list treats each mark in the result set as an input
group. For example, if the viz includes the Month and Region columns, then TO() would
be equivalent to TO(Month,Region).
ORDER BY (ordering_column)
The optional ORDER BY clause orders the input rows using the values in the specified
column within each partition identified in the TO clause. Use the ORDER BY clause
along with the ROWS or RANGE clauses to define windows over which to compute the
aggregate function. This is useful for computing moving averages, cumulative aggregates,
running totals, or a top value per group of input rows. The ordering column specified in the
ORDER BY clause can be a dimension, measure, or an aggregate expression (for example
ORDER BY (SUM(Sales))). If the ordering column is a dimension, it must be included in
the viz.
By default, rows are sorted in ascending order (low to high values). You can use the DESC
keyword to sort in descending order (high to low values).
ROWS | RANGE
Required when using ORDER BY. Further limits the rows within the partition by
specifying start and end points within the partition. This is done by specifying a range of
rows with respect to the current row either by logical association (RANGE) or physical
association (ROWS). Use either a ROWS or RANGE clause to express the window
boundary (the set of input rows in each partition, relative to the current row, over which to
compute the aggregate expression). The window boundary can include one, several, or all
rows of the partition.
When using the RANGE clause, the ordering column used in the ORDER BY clause must
be a sub-column of a reference to Platfora's built-in Date dimension dataset.
window_boundary
A window boundary is required when using either ROWS or RANGE. This defines the set
of rows, relative to the current row, over which to compute the aggregate expression. The
row order is based on the ordering specified in the ORDER BY clause.
A PRECEDING clause defines a lower window boundary (the number of rows to include
before the current row). The FOLLOWING clause defines an upper window boundary
(the number of rows to include after the current row). The window boundary expression
must include either a PRECEDING or FOLLOWING clause, or both. If PRECEDING
is omitted, the current row is considered the first row in the window. Similarly, if
FOLLOWING is omitted, the current row is considered the last row in the window. The
UNBOUNDED keyword includes all rows in the direction specified. When you need to
specify both a start and end of a window, use the BETWEEN and AND keywords.
For example:
ROWS 2 PRECEDING means that the window is three rows in size, starting with two
rows preceding until and including the current row.
ROWS BETWEEN 2 PRECEDING AND 5 FOLLOWING means that the window is eight
rows in size, starting with two rows preceding, the current row, and five rows following
the current row. The current row is included in the set of rows by default.
You can exclude the current row from the window by specifying a window start and end
point before or after the current row. For example:
ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING starts the window
with all rows that come before the current row, and ends the window one row before the
current row, thereby excluding the current row from the window.
Examples
Calculate the percentage of flight records in the same departure date period. Note that the
departure_date field is a reference to the Date dataset, meaning that the group to which the
measure is applied can adapt to any downstream field of departure_date (departure_date.Year,
departure_date.Month, and so on). When used in a viz, this will calculate the percentage of flights for
each dimension group in the viz that share the same value for departure_date:
100 * COUNT(Flights) / ROLLUP COUNT(Flights) TO ([Departure Date])
Normalize the number of flights using the carrier American Airlines (AA) as the benchmark. This will
allow you to compare the number of flights for other carriers against the fixed baseline number of flights
for AA (if AA = 100 percent, then all other carriers will fall either above or below that percentage):
100 * COUNT(Flights) / ROLLUP COUNT(Flights) WHERE [Carrier Code]="AA"
Calculate a generic percentage of total sales. When this measure is used in a visualization, it will show
the percentage of total sales that a mark in the viz is contributing to the total for all marks in the viz. The
input rows depend on the dimensions selected in the viz.
100 * SUM(sales) / ROLLUP SUM(sales) TO ()
Calculate the cumulative total of sales for a given year on a month-to-month basis (year-to-month sales
totals):
ROLLUP SUM(sales) TO (Date.Year) ORDER BY (Date.Month) ROWS UNBOUNDED
PRECEDING
Calculate the cumulative total of sales (for all input rows) for all previous years, but exclude the current
year from the total.
ROLLUP SUM(sales) TO () ORDER BY (Date.Year) ROWS BETWEEN UNBOUNDED
PRECEDING AND 1 PRECEDING
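The ROWS clause can also define a moving window. As a sketch (assuming a numeric sales field and a Date reference, as in the examples above), calculate a moving average of sales over the current row and the six rows that precede it, ordered by day:
ROLLUP AVG(sales) TO () ORDER BY (Date.Date) ROWS 6 PRECEDING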
DENSE_RANK
DENSE_RANK is a windowing aggregate function that orders rows by a measure value and assigns a
rank number to each row in the given partition. Rank positions are not skipped in the event of a tie.
DENSE_RANK must be used within a ROLLUP expression.
Syntax
ROLLUP DENSE_RANK()
TO ([partitioning_column])
ORDER BY (measure_expression [ASC | DESC])
[ ROWS|RANGE window_boundary [window_boundary]
| BETWEEN window_boundary AND window_boundary ]

where window_boundary can be one of:

UNBOUNDED PRECEDING
value PRECEDING
value FOLLOWING
UNBOUNDED FOLLOWING
Description
DENSE_RANK is a window aggregate function used to assign a ranking number to each row in a group.
If multiple rows have the same ranking value (there is a tie), then the tied rows are given the same rank
value and subsequent rank positions are not skipped.
The TO clause of the ROLLUP is used to specify the dimension column(s) used to partition a group of
input rows. To define a global ranking that can adapt to any dimension groupings used in a viz, specify
an empty TO clause.
The ORDER BY clause of the ROLLUP expression determines how to order the rows before they are
ranked. The ORDER BY clause should specify the measure field for which you want to calculate the
ranks. The ranked rows in the partition are numbered starting at one.
For example, suppose we had a dataset with the following rows and columns and you want to rank the
Quarters and Regions according to the values in the Sales column.
Quarter | Region | Sales
2010 Q1 | North | 100
2010 Q1 | South | 200
2010 Q1 | East | 300
2010 Q1 | West | 400
2010 Q2 | North | 400
2010 Q2 | South | 250
2010 Q2 | East | 150
2010 Q2 | West | 250
Supposing the lens has an existing measure field called Sales(Sum), you could then define a measure
called Sales_Dense_Rank using the following expression:
ROLLUP DENSE_RANK() TO () ORDER BY ([Sales(Sum)] DESC) ROWS UNBOUNDED
PRECEDING
When you include the Quarter, Region, and Sales_Dense_Rank columns in the viz, you get the
following data points. Notice that tied values are given the same rank number and no rank positions are
skipped:
Quarter | Region | Sales_Dense_Rank
2010 Q1 | North | 6
2010 Q1 | South | 4
2010 Q1 | East | 2
2010 Q1 | West | 1
2010 Q2 | North | 1
2010 Q2 | South | 3
2010 Q2 | East | 5
2010 Q2 | West | 3
Return Value
Returns a value of type LONG.
Input Parameters
ROLLUP
Required. DENSE_RANK must be used within a ROLLUP expression in place of the
aggregate_expression of the ROLLUP.
The TO clause of the ROLLUP expression specifies the dimension group(s) over which to calculate the
window function. An empty TO calculates the window function over all rows in the query as one group.
The ORDER BY clause of the ROLLUP expression specifies a measure field or aggregate expression.
Examples
Rank the sum of all sales in descending order, so the highest sales is given the ranking of 1.
ROLLUP DENSE_RANK() TO () ORDER BY ([Sales(Sum)] DESC) ROWS UNBOUNDED
PRECEDING
Rank the sum of all sales within a given quarter in descending order, so the highest sales in each quarter
is given the ranking of 1.
ROLLUP DENSE_RANK() TO (Quarter) ORDER BY ([Sales(Sum)] DESC) ROWS
UNBOUNDED PRECEDING
NTILE
NTILE is a windowing aggregate function that divides a partitioned group of rows into the specified
number of buckets, and returns the bucket number to which the current row belongs. NTILE must be
used within a ROLLUP expression.
Syntax
ROLLUP NTILE(integer)
TO ([partitioning_column])
ORDER BY (measure_expression [ASC | DESC])
[ ROWS|RANGE window_boundary [window_boundary]
| BETWEEN window_boundary AND window_boundary ]

where window_boundary can be one of:

UNBOUNDED PRECEDING
value PRECEDING
value FOLLOWING
UNBOUNDED FOLLOWING
Description
NTILE is a window aggregate function typically used to calculate percentiles. A percentile (or centile)
is a measure used in statistics indicating the value below which a given percentage of records in a group
falls. For example, the 20th percentile is the value (or score) below which 20 percent of the records may
be found. The term percentile is often used in the reporting of test scores. For example, if a score is in
the 86th percentile, it is higher than 86% of the other scores. The 25th percentile is also known as the
first quartile (Q1), the 50th percentile as the median or second quartile (Q2), and the 75th percentile as
the third quartile (Q3). In general, percentiles, deciles and quartiles are specific types of ntiles.
NTILE must be used within a ROLLUP expression in place of the aggregate_expression
of the ROLLUP.
The TO clause of the ROLLUP is used to specify a fixed dimension column used to partition a group of
input rows. To define a global NTILE ranking that can adapt to any dimension groupings used in a viz,
specify an empty TO clause.
The ORDER BY clause of the ROLLUP expression determines how to order the rows before they are
divided into buckets. The ORDER BY clause should specify the measure field for which you want to
calculate NTILE bucket values. A centile would be 100 buckets, a decile would be 10 buckets, a quartile
4 buckets, and so on. The buckets in the partition are numbered starting at one.
For example, suppose we had a dataset with the following rows and columns and you want to divide
the year-to-date sales into four buckets (quartiles) with the highest quartile ranked as 1 and the
lowest ranked as 4. Supposing a measure field has been defined called Sum_YTD_Sales, defined as
SUM([Sales YTD]), you could then define a measure called YTD_Sales_Quartile using the following
expression:
ROLLUP NTILE(4) TO () ORDER BY (Sum_YTD_Sales DESC) ROWS BETWEEN UNBOUNDED
PRECEDING AND UNBOUNDED FOLLOWING
Name | Gender | Sales YTD | YTD_Sales_Quartile
Chen | F | 3,500,000 | 1
John | M | 3,100,000 | 1
Pete | M | 2,900,000 | 1
Daria | F | 2,500,000 | 2
Jennie | F | 2,200,000 | 2
Mary | F | 2,100,000 | 2
Mike | M | 1,900,000 | 3
Brian | M | 1,700,000 | 3
Molly | F | 1,500,000 | 3
Theresa | F | 1,200,000 | 4
Hans | M | 900,000 | 4
Ben | M | 500,000 | 4
Because the TO clause of the ROLLUP expression is empty, the quartile partitioning adapts to whatever
dimensions are used in the viz. For example, if you include the Gender dimension field in the viz, the
quartiles would then be computed per gender. The following example divides each gender into buckets
with each gender having 6 year-to-date sales values. The two extra values (the remainder of 6 / 4) are
allocated to buckets 1 and 2, which therefore have one more value than buckets 3 or 4.
Name | Gender | Sales YTD | YTD_Sales_Quartile (partitioned by Gender)
Chen | F | 3,500,000 | 1
Daria | F | 2,500,000 | 1
Jennie | F | 2,200,000 | 2
Mary | F | 2,100,000 | 2
Molly | F | 1,500,000 | 3
Theresa | F | 1,200,000 | 4
John | M | 3,100,000 | 1
Pete | M | 2,900,000 | 1
Mike | M | 1,900,000 | 2
Brian | M | 1,700,000 | 2
Hans | M | 900,000 | 3
Ben | M | 500,000 | 4
Return Value
Returns a value of type LONG.
Input Parameters
ROLLUP
Required. NTILE must be used within a ROLLUP expression in place of the
aggregate_expression of the ROLLUP.
The TO clause of the ROLLUP expression specifies the dimension group(s) over which to calculate the
window function. An empty TO calculates the window function over all rows in the query as one group.
The ORDER BY clause of the ROLLUP expression specifies a measure field or aggregate expression.
integer
Required. An integer that specifies the number of buckets to divide the partitioned rows into.
Examples
Perhaps the most common use case for NTILE is to get a global ranking of result rows. For example,
if you wanted to get the percentile of Total Records per City, you may think the expression to use is:
ROLLUP NTILE(100) TO (City) ORDER BY ([Total Records] DESC) ROWS BETWEEN
UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING.
However, by leaving the TO clause blank, the percentile buckets can adapt to whatever dimension(s)
you use in the viz. To calculate the Total Records percentiles by City, you could define a global
Total_Records_Percentiles measure and then use this measure in conjunction with the City dimension in
the viz (or any other dimension for that matter).
ROLLUP NTILE(100) TO () ORDER BY ([Total Records] DESC) ROWS BETWEEN
UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
RANK
RANK is a windowing aggregate function that orders rows by a measure value and assigns a rank number
to each row in the given partition. Rank positions are skipped in the event of a tie. RANK must be used
within a ROLLUP expression.
Syntax
ROLLUP RANK()
TO ([partitioning_column])
ORDER BY (measure_expression [ASC | DESC])
[ ROWS|RANGE window_boundary [window_boundary]
| BETWEEN window_boundary AND window_boundary ]

where window_boundary can be one of:

UNBOUNDED PRECEDING
value PRECEDING
value FOLLOWING
UNBOUNDED FOLLOWING
Description
RANK is a window aggregate function used to assign a ranking number to each row in a group. If
multiple rows have the same ranking value (there is a tie), then the tied rows are given the same rank
value and the subsequent rank position is skipped.
The TO clause of the ROLLUP is used to specify the dimension column(s) used to partition a group of
input rows. To define a global ranking that can adapt to any dimension groupings used in a viz, specify
an empty TO clause.
The ORDER BY clause of the ROLLUP expression determines how to order the rows before they are
ranked. The ORDER BY clause should specify the measure field for which you want to calculate the
ranks. The ranked rows in the partition are numbered starting at one.
For example, suppose we had a dataset with the following rows and columns and you want to rank the
Quarters and Regions according to the values in the Sales column.
Quarter | Region | Sales
2010 Q1 | North | 100
2010 Q1 | South | 200
2010 Q1 | East | 300
2010 Q1 | West | 400
2010 Q2 | North | 400
2010 Q2 | South | 250
2010 Q2 | East | 150
2010 Q2 | West | 250
Supposing the lens has an existing measure field called Sales(Sum), you could then define a measure
called Sales_Rank using the following expression:
ROLLUP RANK() TO () ORDER BY ([Sales(Sum)] DESC) ROWS UNBOUNDED
PRECEDING
When you include the Quarter, Region, and Sales_Rank columns in the viz, you get the following
data points. Notice that tied values are given the same rank number and the rank positions 2 and 5 are
skipped:
Quarter | Region | Sales_Rank
2010 Q1 | North | 8
2010 Q1 | South | 6
2010 Q1 | East | 3
2010 Q1 | West | 1
2010 Q2 | North | 1
2010 Q2 | South | 4
2010 Q2 | East | 7
2010 Q2 | West | 4
Return Value
Returns a value of type LONG.
Input Parameters
ROLLUP
Required. RANK must be used within a ROLLUP expression in place of the
aggregate_expression of the ROLLUP.
The TO clause of the ROLLUP expression specifies the dimension group(s) over which to calculate the
window function. An empty TO calculates the window function over all rows in the query as one group.
The ORDER BY clause of the ROLLUP expression specifies a measure field or aggregate expression.
Examples
Rank the sum of all sales in descending order, so the highest sales is given the ranking of 1.
ROLLUP RANK() TO () ORDER BY ([Sales(Sum)] DESC) ROWS UNBOUNDED
PRECEDING
Rank the sum of all sales within a given quarter in descending order, so the highest sales in each quarter
is given the ranking of 1.
ROLLUP RANK() TO (Quarter) ORDER BY ([Sales(Sum)] DESC) ROWS UNBOUNDED
PRECEDING
ROW_NUMBER
ROW_NUMBER is a windowing aggregate function that assigns a unique, sequential number to each row
in a group (partition) of rows, starting at 1 for the first row in each partition. ROW_NUMBER must be used
within a ROLLUP expression, which acts as a modifier for ROW_NUMBER. Use the ORDER BY clause of
the ROLLUP to specify the column that determines the row numbering.
Syntax
ROLLUP ROW_NUMBER()
TO ([partitioning_column])
ORDER BY (ordering_column [ASC | DESC])
[ ROWS|RANGE window_boundary [window_boundary]
| BETWEEN window_boundary AND window_boundary ]

where window_boundary can be one of:

UNBOUNDED PRECEDING
value PRECEDING
value FOLLOWING
UNBOUNDED FOLLOWING
Description
For example, suppose we had a dataset with the following rows and columns:
Quarter | Region | Sales
2010 Q1 | North | 100
2010 Q1 | South | 200
2010 Q1 | East | 300
2010 Q1 | West | 400
2010 Q2 | North | 400
2010 Q2 | South | 250
2010 Q2 | East | 150
2010 Q2 | West | 250
Suppose you want to assign a unique ID to the sales of each region by quarter in descending order. In
this example, a measure field is defined called Sum_Sales with the expression SUM(Sales). You could
then define a measure called SalesNumber using the following expression:
ROLLUP ROW_NUMBER() TO (Quarter) ORDER BY (Sum_Sales DESC) ROWS
UNBOUNDED PRECEDING
When you include the Quarter, Region, and SalesNumber columns in the viz, you get the following data
points:
Quarter | Region | SalesNumber
2010 Q1 | North | 4
2010 Q1 | South | 3
2010 Q1 | East | 2
2010 Q1 | West | 1
2010 Q2 | North | 1
2010 Q2 | South | 2
2010 Q2 | East | 4
2010 Q2 | West | 3
Return Value
Returns a value of type LONG.
Input Parameters
None
Examples
Assign a unique ID to the sales of each region by quarter in descending order, so the highest sales is
given the number of 1.
ROLLUP ROW_NUMBER() TO (Quarter) ORDER BY (Sum_Sales DESC) ROWS
UNBOUNDED PRECEDING
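To number rows across the entire result set rather than within each quarter, leave the TO clause empty (a sketch using the same hypothetical Sum_Sales measure):
ROLLUP ROW_NUMBER() TO () ORDER BY (Sum_Sales DESC) ROWS UNBOUNDED PRECEDING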
User Defined Functions (UDFs)
User defined functions (UDFs) allow you to define your own per-row processing logic, and then expose
that functionality to users in the Platfora application expression builder.
User defined functions can only be used to implement new row functions, not
aggregate functions. If a computed field that uses a UDF is included in a lens, the
UDF will be executed once for each row during the lens build process. This is good
to keep in mind when writing UDF Java programs, so you do not write programs
that negatively impact lens build resources or execution times.
Writing a Platfora UDF Java Program
User defined functions (UDFs) are written in the Java programming language and implement the
Platfora-provided Java interface, com.platfora.udf.UserDefinedFunction.
Verify that any JAR file that the UDF will use is compatible with the existing libraries Platfora uses.
You can find those libraries in $PLATFORA_HOME/lib.
To define a user defined function for Platfora, you must have the Java Development Kit (JDK) version 6
or 7 installed on the machine where you plan to do your development.
You will also need the com.platfora.udf.UserDefinedFunction interface Java code from
your Platfora master server installation. If you go to the $PLATFORA_HOME/tools/udf directory of
your Platfora master server installation, you will find two files:
• platfora-udf.jar – This is the compiled code for the
com.platfora.udf.UserDefinedFunction interface. You must link to this jar file (place it
in the CLASSPATH) when you compile your UDF Java program.
• /com/platfora/udf/UserDefinedFunction.java – This is the source code for the
Java interface that your UDF classes need to implement. The source code is provided as reference
documentation of the Platfora UserDefinedFunction interface. You can refer to this file when
writing your UDF Java programs.
1. Copy the file $PLATFORA_HOME/tools/udf/platfora-udf.jar to a directory on the
machine where you plan to develop and compile your UDF program.
2. Write a Java program that implements com.platfora.udf.UserDefinedFunction interface.
For example, here is a sample Java program that defines a REPEAT_STRING user defined function.
This simple function repeats an input string a specified number of times.
import java.util.List;
/**
* Sample user-defined function implementation that demonstrates
* how to create a REPEAT_STRING function.
*/
public class RepeatString implements
com.platfora.udf.UserDefinedFunction {
/**
* Returns the name of the user-defined function.
* The first character in the name must be a letter,
* and subsequent characters must be either letters,
* digits, or underscores. You cannot name your function
* the same name as an existing Platfora
* built-in function. Names are case-insensitive.
*/
@Override
public String getFunctionName() {
return "REPEAT_STRING";
}
/**
* Returns one of the following values, reflecting the
* return type of the user-defined function:
* DATETIME, DOUBLE, FIXED, INTEGER, LONG, or STRING.
*/
@Override
public String getReturnType() {
return "STRING";
}
/**
* Returns an array of Strings, one for each of the
* input arguments to the user-defined function,
* specifying the required data type for each argument.
* The Strings should be of the following values:
* DATETIME, DOUBLE, FIXED, INTEGER, LONG, STRING.
*/
@Override
public String[] getArgumentTypes() {
return new String[] { "STRING", "INTEGER" };
}
/**
* Returns a human-readable description of what the function
* does, to be displayed to Platfora users in the
* Expression Builder. May return null.
*/
@Override
public String getDescription() {
return "The REPEAT_STRING function returns an input string
repeated " +
" a specified number of times.";
}
/**
* Returns a human-readable description explaining the
* value that the function returns, to be displayed to
* Platfora users in the Expression Builder. May return null.
*/
@Override
public String getReturnValueDescription() {
return "Returns one value per row of type STRING";
}
/**
* Returns a human-readable example of the function syntax,
* to be displayed to Platfora users in the Expression
* Builder. May return null.
*/
@Override
public String getExampleUsage() {
return "CONCAT(\"It's a \", REPEAT_STRING(\"Mad \",4), \"
World\")";
}
/**
* The compute method performs the actual work of evaluating
* the user-defined function. The method should operate on the
* argument values provided to calculate the function return value
* and return a Java object of the appropriate type to represent
* the return value. The following mapping describes the Java
* object type that is used to represent each Platfora data type:
* DATETIME -> java.util.Date
* DOUBLE -> java.lang.Double
* FIXED -> java.lang.Long
* INTEGER -> java.lang.Integer
* LONG -> java.lang.Long
* STRING -> java.lang.String
* Note on FIXED type: fixed-precision numbers in Platfora
* are represented as Longs that have been scaled by a
* factor of 10,000.
*
* In the event that the user-defined function
* encounters invalid inputs, or the function return value is not
* defined given the inputs provided, the compute method should return
* null rather than throwing an exception. The compute method should
* avoid throwing any exceptions.
*
* @param arguments The values of the function inputs.
*
* The entries in this list will match the specification
* provided by getArgumentTypes method in type, number, and order:
* for example, if getArgumentTypes returned an array of
* length 3 with the values STRING, DOUBLE, STRING, then
* the arguments parameter will be a list of 3 Java
* objects: a java.lang.String, a java.lang.Double, and a
* java.lang.String. Any of the values within the
* arguments List may be null.
*/
@Override
public String compute(List arguments) {
// cast the inputs to the correct types
final String toRepeat = (String) arguments.get(0);
final Integer numberOfRepeats = (Integer) arguments.get(1);
// check for invalid inputs
if (toRepeat == null || numberOfRepeats == null ||
numberOfRepeats < 0)
return null;
// repeat the input string the specified number of times
final StringBuilder builder = new StringBuilder();
for (int i = 0; i < numberOfRepeats; i++) {
builder.append(toRepeat);
}
return builder.toString();
}
}
3. Compile your .java UDF program file into a .class file (make sure to link to the platfora-udf.jar file or place it in your Java CLASSPATH).
The target Java version must be Java 1.6. Compiling with a target of Java 1.7 will result in an error
when the UDF is used.
For example, to compile the RepeatString.java program using Java 1.6:
javac -source 1.6 -target 1.6 -cp platfora-udf.jar RepeatString.java
4. Create a Java archive file (.jar) containing your .class file.
For example:
jar cf repeat-string-udf.jar RepeatString.class
After you have written and compiled your UDF Java program, you must then install and enable it on the
Platfora master server. See Adding a UDF to the Platfora Expression Builder.
Adding a UDF to the Platfora Expression Builder
After you have written and compiled a user defined function (UDF) Java class, you must install your
class on the Platfora master server and enable it so that it can be seen and used in the Platfora expression
builder.
This task is performed on the Platfora master server.
Before you begin, you must have written and compiled a Java class for your user defined function. See
Writing a Platfora UDF Java Program.
1. Create a directory named extlib in the Platfora data directory on the Platfora master server.
For example:
$ mkdir $PLATFORA_DATA_DIR/extlib
2. Copy the Java archive (.jar) file containing your UDF class to the $PLATFORA_DATA_DIR/
extlib directory on the Platfora master server.
For example:
$ cp repeat-string-udf.jar $PLATFORA_DATA_DIR/extlib/
3. Set the Platfora server configuration property, platfora.udf.class.names, so it contains
the name of your UDF Java class. If you have more than one class, separate the class names with a
comma.
For example, to set this property using the platfora-config command-line utility:
$ $PLATFORA_HOME/bin/platfora-config set --key
platfora.udf.class.names --value RepeatString
4. Restart the Platfora server:
$ platfora-services restart
The user defined function will then be available for defining computed field expressions in the Add
Field dialog of the Platfora application.
Due to the way some web browsers cache Javascript files, the newly added
function may not appear in the Functions list for up to 24 hours. However, the
function is immediately available for use and recognized by the Expression autocomplete feature.
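Once enabled, the function can be used in computed field expressions like any built-in row function. For example, based on the example usage string in the sample program above:
CONCAT("It's a ", REPEAT_STRING("Mad ",4), "World") returns It's a Mad Mad Mad Mad World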
Regular Expression Reference
Regular expressions vary in complexity using a combination of basic constructs to describe a string
matching pattern. This reference describes the most common regular expression matching patterns, but
is not a comprehensive list.
Regular expressions, also referred to as regex or regexp, are a standardized collection of special
characters and constructs used for matching strings of text. They provide a flexible and precise language
for matching particular characters, words, or patterns of characters.
Platfora regular expressions are based on the pattern matching syntax of the Java programming
language. For more in depth information on writing valid regular expressions, refer to the Java regular
expression pattern documentation.
Platfora makes use of regular expressions in the following contexts:
• In computed field expressions that use the REGEX or REGEX_REPLACE functions.
• In PARTITION expression statements for event series processing computed fields.
• In the Regex file parser in data ingest.
• In the data source location path descriptor in data ingest.
• In lens filter expressions.
Regex Literal and Special Characters
The most basic form of regular expression pattern matching is the match of a literal character or string.
Regular expressions also have a number of special characters that affect the way a pattern is matched.
This section describes the regular expression syntax for referring to literal characters, special characters,
non-printable characters (such as a tab or a newline), and special character escaping.
Literal Characters
The most basic form of pattern matching is the match of literal characters. For example, if the regular
expression is foo and the input string is foo, the match will succeed because the strings are identical.
Special Characters
Certain characters are reserved for special use in regular expressions. These special characters are often
called metacharacters. If you want to use special characters as literal characters, they must be escaped.
Character Name | Character | Reserved For
opening bracket | [ | start of a character class
closing bracket | ] | end of a character class
hyphen | - | character ranges within a character class
backslash | \ | general escape character
caret | ^ | beginning of string, negating of a character class
dollar sign | $ | end of string
period | . | matching any single character
pipe | | | alternation (OR) operator
question mark | ? | optional quantifier, quantifier minimizer
asterisk | * | zero or more quantifier
plus sign | + | once or more quantifier
opening parenthesis | ( | start of a subexpression group
closing parenthesis | ) | end of a subexpression group
opening brace | { | start of min/max quantifier
closing brace | } | end of min/max quantifier
Escaping Special Characters
There are two ways to force a special character to be treated as an ordinary character:
• Precede the special character with a \ (backslash character). For example, to specify an asterisk as a
literal character instead of a quantifier, use \*.
• Enclose the special character(s) within \Q (starting quote) and \E (ending quote). Everything
between \Q and \E is then treated as literal characters.
• To escape literal double-quotes in a REGEX() expression, double the double-quotes (""). For
example, to extract the inches portion from a height field where example values are 6'2", 5'11":
REGEX(height, "\'(\d)+""$")
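For example, to match a literal dollar sign followed by digits, either escaping form could be used (the price field here is hypothetical):
REGEX(price, "\$(\d+)") or REGEX(price, "\Q$\E(\d+)")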
Non-Printing Characters
You can use special character sequence constructs to specify non-printable characters in a regular
expression. Some of the most commonly used constructs are:
Construct | Matches
\n | newline character
\r | carriage return character
\t | tab character
\f | form feed character
Regex Character Classes
A character class allows you to specify a set of characters, enclosed in square brackets, that can produce
a single character match. There are also a number of special predefined character classes (backslash
character sequences that are shorthand for the most common character sets).
Character Class Constructs
A character class matches only a single character. For example, gr[ae]y will match gray or
grey, but not graay or graey. The order of the characters inside the brackets does not matter.
You can use a hyphen inside a character class to specify a range of characters. For example,
[a-z] matches a single lower-case letter between a and z. You can also use more than one range, or a
combination of ranges and single characters. For example, [0-9X] matches a numeric digit or the letter
X. Again, the order of the characters and the ranges does not matter.
A caret following an opening bracket specifies characters to exclude from a match. For example,
[^abc] will match any character except a, b, or c.
Construct      Type           Description
[abc]          simple         matches a or b or c
[^abc]         negation       matches any character except a, b, or c
[a-zA-Z]       range          matches a through z, or A through Z (inclusive)
[a-d[m-p]]     union          matches a through d, or m through p
[a-z&&[def]]   intersection   matches d, e, or f
[a-z&&[^xq]]   subtraction    matches a through z, except for x and q
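For example, the following sketch (the color field is hypothetical) uses a character class to match
either spelling of the color and return it:
REGEX(color, "(gr[ae]y)")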
Predefined Character Classes
Predefined character classes offer convenient shorthands for commonly used regular expressions.
Construct   Description and Example
.           Matches any single character (except newline). Example: .at matches "cat", "hat", and also "bat" in the phrase "batch files".
\d          Matches any digit character (equivalent to [0-9]). Example: \d matches "3" in "C3PO" and "2" in "file_2.txt".
\D          Matches any non-digit character (equivalent to [^0-9]). Example: \D matches "S" in "900S" and "Q" in "Q45".
\s          Matches any single white-space character (equivalent to [ \t\n\x0B\f\r]). Example: \sbook matches "book" in "blue book" but nothing in "notebook".
\S          Matches any single non-white-space character. Example: \Sbook matches "book" in "notebook" but nothing in "blue book".
\w          Matches any alphanumeric character, including underscore (equivalent to [A-Za-z0-9_]). Example: r\w* matches "rm" and "root".
\W          Matches any non-alphanumeric character (equivalent to [^A-Za-z0-9_]). Example: \W matches "&" in "stmd &", "%" in "100%", and "$" in "$HOME".
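For example, the following sketch (the filename field is hypothetical) combines a literal prefix with
the \d shorthand to extract the file number from values such as file_2.txt:
REGEX(filename, "file_(\d+)\.txt")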
POSIX Character Classes (US-ASCII)
POSIX has a set of character classes that denote certain common ranges. They are similar to bracket and
predefined character classes, except they take into account the locale (the local language/coding system).
\p{Lower}    a lower-case alphabetic character, [a-z]
\p{Upper}    an upper-case alphabetic character, [A-Z]
\p{ASCII}    an ASCII character, [\x00-\x7F]
\p{Alpha}    an alphabetic character, [a-zA-Z]
\p{Digit}    a decimal digit, [0-9]
\p{Alnum}    an alphanumeric character, [a-zA-Z0-9]
\p{Punct}    a punctuation character, one of !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
\p{Graph}    a visible character, [\p{Alnum}\p{Punct}]
\p{Print}    a printable character, [\p{Graph}\x20]
\p{Blank}    a space or tab, [ \t]
\p{Cntrl}    a control character, [\x00-\x1F\x7F]
\p{XDigit}   a hexadecimal digit, [0-9a-fA-F]
\p{Space}    a whitespace character, [ \t\n\x0B\f\r]
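For example, the following sketch (the product_code field and its AB-1234 value format are
hypothetical) uses POSIX classes to return the leading upper-case portion of a code:
REGEX(product_code, "(\p{Upper}+)-\p{Digit}+")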
Regex Line and Word Boundaries
Boundary matching constructs are used to specify where in a string to apply a matching pattern. For
example, you can search for a particular pattern within a word boundary, or search for a pattern at the
beginning or end of a line.
Construct   Description and Example
^           Matches from the beginning of a line (multi-line matches are currently not supported). Example: ^172 will match the "172" in IP address "172.18.1.11" but not in "192.172.2.33".
$           Matches from the end of a line (multi-line matches are currently not supported). Example: d$ will match the "d" in "maid" but not in "made".
\b          Matches within a word boundary. Example: \bis\b matches the word "is" in "this is my island", but not the "is" part of "this" or "island". \bis matches both "is" and the "is" in "island", but not in "this".
\B          Matches within a non-word boundary. Example: \Bb matches "b" in "sbin" but not in "bash".
Regex Quantifiers
Quantifiers specify how often the preceding regular expression construct should match. There are three
classes of quantifiers: greedy, reluctant, and possessive. The difference between greedy, reluctant, and
possessive quantifiers involves what part of the string to try for the initial match, and how to retry if the
initial attempt does not produce a match.
Quantifier Constructs
By default, quantifiers are greedy. A greedy quantifier will first try for a match with the entire input
string. If that produces a match, then the match is considered a success, and the engine can move on to
the next construct in the regular expression. If the first try does not produce a match, the engine backs off one character at a time until a match is found. So a greedy quantifier checks for possible matches in
order from the longest possible input string to the shortest possible input string, recursively trying from
right to left.
Adding a ? (question mark) to a greedy quantifier makes it reluctant. A reluctant quantifier will first try
for a match from the beginning of the input string, starting with the shortest possible piece of the string
that matches the regex construct. If that produces a match, then the match is considered a success, and
the engine can move on to the next construct in the regular expression. If the first try does not produce
a match, the engine adds one character at a time until a match is found. So a reluctant quantifier checks
for possible matches in order from the shortest possible input string to the longest possible input string,
recursively trying from left to right.
Adding a + (plus sign) to a greedy quantifier makes it possessive. A possessive quantifier is like a greedy
quantifier on the first attempt (it tries for a match with the entire input string). The difference is that
unlike a greedy quantifier, a possessive quantifier does not retry a shorter string if a match is not found.
If the initial match fails, the possessive quantifier reports a failed match. It does not make any more
attempts.
Greedy    Reluctant   Possessive   Description and Example
?         ??          ?+           Matches the previous character or construct once or not at all. Example: st?on matches "son" in "johnson" and "ston" in "johnston" but nothing in "clinton" or "version".
*         *?          *+           Matches the previous character or construct zero or more times. Example: if* matches "if", "iff" in "diff", or "i" in "print".
+         +?          ++           Matches the previous character or construct one or more times. Example: if+ matches "if", "iff" in "diff", but nothing in "print".
{n}       {n}?        {n}+         Matches the previous character or construct exactly n times. Example: o{2} matches "oo" in "lookup" and the first two o's in "fooooo" but nothing in "mount".
{n,}      {n,}?       {n,}+        Matches the previous character or construct at least n times. Example: o{2,} matches "oo" in "lookup", all five o's in "fooooo", but nothing in "mount".
{n,m}     {n,m}?      {n,m}+       Matches the previous character or construct at least n times, but no more than m times. Example: F{2,4} matches "FF" in "#FF0000" and the last four F's in "#FFFFFF".
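The practical difference between greedy and reluctant quantifiers shows up when extracting delimited
text. As a sketch, assume a hypothetical html field containing <b>one</b> and <b>two</b>. The
reluctant form below stops at the first closing tag and captures "one"; the greedy form <b>(.*)</b>
would instead capture "one</b> and <b>two":
REGEX(html, "<b>(.*?)</b>.*")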
Regex Capturing Groups
Groups are specified by a pair of parentheses around a subpattern in the regular expression. By placing
part of a regular expression inside parentheses, you group that part of the regular expression together.
This allows you to apply regex operators and quantifiers to the entire group at once. Besides grouping
part of a regular expression together, parentheses also create a capturing group. Capturing groups are
used to determine which matching values to save or return from your regular expression.
Group Numbering
A regular expression can have more than one group and the groups can be nested. The groups are
numbered 1-n from left to right, starting with the first opening parenthesis. There is always an implicit
group 0, which contains the entire match. For example, the pattern:
(a(b*))+(c)
contains three groups:
group 1: (a(b*))
group 2: (b*)
group 3: (c)
Capturing Groups
By default, a group captures the text that produces a match. Besides grouping part of a regular
expression together, parentheses also create a capturing group or a backreference. The portion of the
string matched by the grouped subexpression is captured in memory for later retrieval or use.
Capturing Groups and the Regex Line Parser
When you choose the Regex line parser during the Parse Data phase of the data ingest process,
Platfora uses capturing groups to determine what parts of the regular expression to return as columns.
The Regex line parser applies the user-supplied regular expression against each line in the source file,
and returns each capturing group in the regular expression as a column value.
For example, suppose you had user records in a file, and the lines were formatted like this:
Name: John Smith Address: 123 Main St. Age: 25 Comment: Active
Name: Sally R. Jones Address: 2 E. El Camino Real Age: 32
Name: Rod Rogers Address: 55 Elm Street Comment: Suspended
You could use the following regular expression to extract the Full Name, Last Name, Address, Age, and
Comment column values:
Name: (.*\s(\p{Alpha}+)) Address:\s+(.*) Age:\s+([0-9]+)(?:\s+Comment:\s+(.*))?
Capturing Groups and the REGEX Function
The REGEX function can be used to extract a portion of a string value. For the REGEX function, only the
value of the first capturing group is returned. For example, if you wanted to match all possible email
address strings of the form name@provider.tld, but only return the provider portion of the
email address from the email field:
REGEX(email,"^[a-zA-Z0-9._%+-]+@([a-zA-Z0-9._-]+)\.[a-zA-Z]{2,4}$")
Capturing Groups and the REGEX_REPLACE Function
The REGEX_REPLACE function is used to match a string value, and replace matched strings with
another value. The REGEX_REPLACE function takes three arguments: an input string, a matching regex,
and a replacement regex. Capturing groups can be used to capture backreferences (see Backreferences),
but do not control what portions of the match are returned (the entire match is always returned).
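For example, a minimal sketch (the full_name field is hypothetical) that collapses each run of
white-space characters into a single space, leaving the rest of the string unchanged:
REGEX_REPLACE(full_name, "\s+", " ")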
Backreferences
Backreferences allow you to capture and reuse a subexpression match inside the same regular
expression. You can reuse a capturing group as a backreference by referring to its group number
preceded by a backslash (for example, \1 refers to capturing group 1, \2 refers to capturing group 2,
and so on).
For example, if you wanted to match a pair of HTML tags and their enclosed text, you could capture the
opening tag into a backreference, and then reuse it to match the corresponding closing tag:
(<([A-Z][A-Z0-9]*)\b[^>]*>.*?</\2>)
This regular expression contains two capturing groups, the outermost capturing group (which captures
the entire string), and one which captures the string matched by [A-Z][A-Z0-9]* into backreference
number two. This backreference can then be reused with \2 (backslash two) to match the corresponding
closing HTML tag.
When referring to capturing groups in the previous regular expression, the backreference syntax is
slightly different. The backreference group number is preceded by a dollar sign instead of a backslash
(for example, $1 refers to capturing group 1 of the previous expression). An example of this would be
the REGEX_REPLACE function, which takes two regular expressions: one for the matching string, and
one for the replacement string.
The following example matches the values in a phone_number field where phone number values are
formatted as xxx.xxx.xxxx, and replaces them with phone number values formatted as (xxx) xxx-xxxx.
Notice the backreferences in the replacement expression; they refer to the capturing groups of the
previous matching expression:
REGEX_REPLACE(phone_number,"([0-9]{3})\.([0-9]{3})\.([0-9]{4})","\($1\) $2-$3")
Non-Capturing Groups
In some cases, you may want to use parentheses to group subpatterns, but not capture text. A
non-capturing group starts with (?: (a question mark and colon following the opening parenthesis). For
example, h(?:a|i|o)t matches hat or hit or hot, but does not capture the a, i, or o from the
subexpression.
Appendix B: Lens Query Language Reference
Platfora's lens query language is a SQL-like language for programmatically querying the prepared data in a lens.
This reference describes the query language syntax and usage.
Topics:
• SELECT Statement
SELECT Statement
Queries an aggregate lens. A SELECT statement is input to a programmatic lens query.
Syntax
[ DEFINE alias-name AS expression [ DEFINE ... ] ]
SELECT measure-field [ AS alias-name ] | measure-expression AS alias-name [ , {
dimension-field [ AS alias-name ] | row-expression AS alias-name } [ , ...] ]
FROM lens-name
[ WHERE filter-expression [ AND filter-expression ] ]
[ GROUP BY group-ordering [ , group-ordering ] ]
[ HAVING measure-filter-expression ]
Description
Use SELECT to query an aggregate lens. You cannot query an event series lens. The SELECT must
include at least one measure field (column) or expression. Once you've supplied a measure value, your
SELECT can contain additional measures or dimensions.
If you include non-measure columns in the SELECT, you must include those columns in a GROUP BY
clause. Use the DEFINE clause to add one or more computed fields to the lens.
Platfora always queries the current version of the lens-name. Keep in mind lens definitions can
change. If you write a query against a column that is later dropped from the lens, a previously working
query can fail as a result and return an error message.
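For example, a minimal query might select one measure and one dimension column, with the required
GROUP BY (the movie_view2G_PSM lens and its fields appear in the examples later in this reference):
SELECT [Num Views], device.manufacturer
FROM movie_view2G_PSM
GROUP BY device.manufacturer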
Querying via REST
A lens query is meant to support external applications that want to access Platfora lens data. For this
reason, you query a lens by making API calls to the query REST resource:
https://hostname:port/api/v1/query
The resource supports passing the statement as a GET or POST with application/x-www-form-urlencoded
URL/form parameters. The caller must authenticate as a user with the Analyst
(Limited) role or higher to execute a query. To query a specific lens, the caller must have Data
Access on all datasets the lens references.
A query returns comma-separated values (CSV) by default. You have the option of receiving the results
in a JSON body. For detailed information about using this REST API, see the Platfora API Reference.
Writing a SELECT Expression
The SELECT expression can contain multiple columns and expressions. You must specify at least one
measure column or measure expression. Once you meet this requirement, you can include additional
dimension columns or row expressions. Recall that a measure is an aggregate numeric value, whereas a
dimension is a numeric, text, or time-based value.
A measure expression supports addition, subtraction, multiplication, and division. Input values can be
column references (fields) or expressions that contain any of these supported functions:
• aggregate functions (that is, AVG(), COUNT(), SUM(), and so forth)
• ROLLUP()
• EXP()
• POW()
• SQRT()
Measure expressions can also include literal (integers or string) values. When constructing a measure
expression, make sure you understand the expression syntax rules and limitations of aggregate functions.
See the Expression and Query Language Reference for information on the aggregate function limitations
and expression syntax.
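For example, a sketch of a measure expression that divides one aggregate by another (the [Sales Lens]
lens and its sales field are hypothetical, and assume the lens supports these aggregates):
SELECT SUM(sales) / COUNT() AS [Avg Sale Amount]
FROM [Sales Lens]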
If the SELECT statement includes a dimension value, you must include the column in your GROUP BY
clause. A dimension or row expression supports addition, subtraction, multiplication, and division of
row values. Your SELECT can reference columns or supply row expressions that include the following
functions:
• data type conversion
• date and time
• general processing
• math
• string
• URL
An expression can include literal (integers or string) values or other expressions. Make sure you
understand the expression syntax rules. See the Expression and Query Language Reference for
information on the expression syntax.
When specifying an expression, supply an alias (AS clause) if you want to refer to the expression
elsewhere in other clauses. You cannot use an * (asterisk) to retrieve all of the rows in a lens.
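For example, the following sketch uses a string function in a row expression and reuses its alias in
the GROUP BY clause (the user.city and user.state fields are hypothetical):
SELECT [Num Views], CONCAT(user.city, user.state) AS [City State]
FROM movie_view2G_PSM
GROUP BY [City State]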
Specifying Lens and Column Names
When you specify the lens-name you use the name as it appears in the Data Catalog user interface.
Enclose the name in [ ] (brackets) if it contains spaces or special characters. For example, you would
refer to the Web Log-2014 lens as:
[Web Log-2014]
When specifying a column name, you should follow the expression language rules for field (column)
references. This means that for columns belonging to a reference dataset, you must qualify the name
using dot-notation as follows:
{ [ reference-dataset . [...] ] column-name | alias-name }
For example, use device.manufacturer to refer to the manufacturer column in the device
dataset. If you define an alias, use the alias to refer to the column in other parts of your query.
DEFINE Clause
Defines a computed field to include in a SELECT statement.
Syntax
DEFINE alias-name AS { expression }
Description
Use a DEFINE clause to include new computed fields that aren't in the original lens. Using the DEFINE
clause is optional. Platfora applies the DEFINE statement before the main SELECT clause. New
computed fields can only use fields already in the lens.
The expression you write must be a valid expression for a vizboard computed field. This means your
computed field is subject to the following restrictions:
• You can only define a computed field that operates on fields that exist in the lens.
• A vizboard computed field can break if it operates on fields that are later removed from the lens or a
focus or referenced dataset.
• You cannot use aggregate functions to add new measures from dimension data in the lens.
• You can compute new measures from existing measures already in the lens. For example, if
an AVG(sales) aggregate exists in the data, you can define a SUM(sales) field because
SUM(sales) is necessary to compute AVG(sales).
• You cannot use custom user-defined functions (UDFs) in vizboard computed field expressions.
If you specify multiple DEFINE clauses, separate each new DEFINE with a space.
A computed field can depend on any fields pre-existing in the lens or other fields created in the query's
scope. For example, a computed field you DEFINE can depend on fields also created through other
DEFINE statements.
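For example, a minimal sketch (the [Sales Lens] lens, its region column, and its sales field are
hypothetical; it assumes the lens already contains an AVG(sales) quick measure, so SUM(sales) is
available under the rules above):
DEFINE [Total Sales] AS SUM(sales)
SELECT [Total Sales], region
FROM [Sales Lens]
GROUP BY region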
WHERE Clause
Filters a lens query by one or more predicate expressions.
Syntax
WHERE predicate-expression [ AND predicate-expression ]
A predicate-expression can be a comparison:
column-name { = | < | > | <= | >= | != } literal
Or the predicate-expression can be a list-expression such as this:
column-name [ NOT ] { IN list | LIKE pattern | BETWEEN literal AND literal }
Description
Use the WHERE clause to filter a lens query by one or more predicate expressions. Use the AND keyword
to join multiple expressions. A WHERE clause can include expressions that make use of the comparison
operators or list expressions. For detailed information about expression syntax, see the Platfora
Expression and Query Language Reference.
You cannot use IS NULL or IS NOT NULL comparisons in the WHERE clause. You also cannot
use relative date filters (LAST integer DAYS). You can use the NOT keyword to negate any list
expression.
The following example illustrates several different permutations of expression structures you can use:
SELECT count()
FROM [View Summary]
WHERE prior_views NOT IN (3,5,7,11,13,17)
AND TO_LONG(prior_views) NOT IN (4294967296)
AND avebitrate_double NOT IN (3101.0, 2598.0, 804.0)
AND video.genre NOT IN ("Silent", "Exercise")
AND video.genre NOT LIKE ("*a*")
AND date.Date NOT IN (2011-08-04, 2011-06-04, 2011-07-04)
AND prior_views > 23 AND avebitrate_double < 3101.0
AND TO_FIXED(avebitrate_double) != 3101.0
AND TO_LONG(prior_views) != 4294967296
AND video.genre <= "Silent"
AND date.Date > 2011-08-04
AND date.Date NOT BETWEEN 2012-01-01 AND 2013-01-01
AND video.genre BETWEEN "Exercise" AND "Silent"
AND prior_views BETWEEN 0 AND 100
AND avebitrate_double NOT BETWEEN 1234.5678 AND 2345.6789
When comparing literal dates, make sure you use the format of yyyy-MM-dd without any enclosing
quotation marks or other punctuation.
GROUP BY Clause
Orders and optionally limits the results of a SELECT statement.
Syntax
GROUP BY group-ordering [ , group-ordering ]
The group-ordering clause has the following syntax:
column-name [ SORT [ BY measure-name ] [ { ASC | DESC } ] [ LIMIT integer [ WITH OTHERS ] ] ]
Description
Use a GROUP BY clause to order and optionally limit results of a SELECT. If the SELECT statement
includes a dimension column, you must supply a GROUP BY clause that includes the dimension column.
Otherwise, the GROUP BY clause is optional.
A GROUP BY can include more than one column. To do this, delimit each column with a , (comma) as
illustrated here:
GROUP BY col_A, col_B, col_c
You can GROUP BY a new computed field that is not defined in the lens. To do this, you add the field
using the DEFINE clause and then use the field in the GROUP BY clause. Alternatively, you can define
the computed field in the SELECT list, associate an alias with the field, and use the alias in the GROUP
BY clause.
A SORT specification is optional. If you do not specify SORT, the query returns in an unspecified order.
To sort the columns by their values ("natural sorting order"), simply specify ASC (ascending) or DESC
(descending). ASC is the default SORT order when sorting by natural values.
To SORT a particular column by another measure or measure expression, use the SORT BY phrase. You
can specify a measure-name in the SORT BY clause that need not be in the SELECT list. You can also
order the sort in either ASC or DESC order. Unlike natural value sorts, SORT BY defaults to the DESC
(descending) sorting order.
GROUP BY col_A SORT BY meas_1 ASC, col_B SORT DESC, col_c SORT BY
measure_expression ASC
Using GROUP BY with multiple SORT BY combinations allows you to group values with respect to
one another. Consider three potential grouping columns, say Fee, Fi, and Foe. Sorting on column Fee
sorts the records on the Fee value. Another SORT BY clause on column Fi sorts Fi values within the
existing Fee sort.
Use the LIMIT keyword to reduce the number of groups returned. For example, if you are sorting
airports by the number of departing flights in DESC order (most flights to least flights), you could
LIMIT the SORT to the 10 busiest airports.
GROUP BY airports SORT BY total_departures DESC LIMIT 10
The LIMIT restricts the results to the top 10 busiest departure airports. The LIMIT clause excludes
other airports. You can use the WITH OTHERS keyword to combine all the other airports not in the top
10 under a single Others group.
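For example, extending the previous query, the following sketch keeps the 10 busiest airports as
individual groups and combines the rest into one Others group:
GROUP BY airports SORT BY total_departures DESC LIMIT 10 WITH OTHERS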
HAVING Clause
Filters a SELECT statement by a measure expression.
Syntax
HAVING measure-predicate-expression [ AND measure-predicate-expression ]
A measure-predicate-expression has the following form:
{ measure-column | measure-expression } { = | < | > | <= | >= | != } literal
Description
The HAVING clause filters the result of the GROUP BY clause by a measure or measure expression. The
HAVING conditions apply to the GROUP BY clause.
SELECT device.manufacturer, [duration (Avg)]
FROM movie_view2G_PSM
GROUP BY device.manufacturer
HAVING [duration (Max)] = 10800
In the example above, you see a reference to two quick measure fields. Both the duration AVG() and
MAX() quick measures are already defined on the lens.
Examples of Lens Queries
This section provides some tips and examples for querying a lens.
Discovering Lens Fields
When querying a lens, you must use the sql REST API endpoint. Before constructing your query, it is a
good idea to list the lens fields with a REST call to the lens resource. One suggested method is to make
the following calls:
• List the lenses by calling GET on the http://hostname:port/api/v1/lenses resource.
• Locate the lens id value in the lens list.
• Get the lens by calling GET on the http://hostname:port/api/v1/lenses/id resource.
• Review the lens fields.
This is one way to discover existing aggregate expressions and quick measures in the lens. For example,
listing lens fields gives you examples such as the following:
...
"fields": {
  "Active Clusters (Total)": {
    "name": "Active Clusters (Total)",
    "expression": "DISTINCT([Temp Field for Count Active Clusters])",
    "lensExpression": false,
    "platforaManaged": false,
    "role": "MEASURE",
    "type": "LONG"
  },
  "Are there new Active Clusters since Yesterday?": {
    "name": "Are there new Active Clusters since Yesterday?",
    "expression": "[Active Clusters (Total)] - ROLLUP [Active Clusters (Total)] TO ([Log DateTime Date].Date) ORDER BY ([Log DateTime Date].Date) ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING",
    "lensExpression": false,
    "platforaManaged": false,
    "role": "MEASURE",
    "type": "LONG"
  },
  "Avg Page Views per Session": {
    "name": "Avg Page Views per Session",
    "expression": "[Total Records]/(DISTINCT(sessionId))",
    "lensExpression": false,
    "platforaManaged": false,
    "role": "MEASURE",
    "type": "DOUBLE"
  },
...
Using the JSON description of a lens, you can quickly see the measures used in your lens without
navigating the lens in Platfora's UI.
Complex DEFINE Clauses
This example illustrates the use of multiple DEFINE clauses. Notice the descriptive name for the
ROLLUP() computed field.
DEFINE Manu_Genre AS CONCAT([device].[manufacturer], [video].[genre])
DEFINE [ROLLUP num_views TO Manu] AS ROLLUP COUNT() TO (device.manufacturer)
DEFINE [ROLLUP num_views TO Manu_Genre] AS ROLLUP COUNT() TO ([Manu_Genre])
SELECT device.manufacturer, Manu_Genre, [Num Views],
  [ROLLUP num_views TO Manu], [ROLLUP num_views TO Manu_Genre]
FROM movie_view2G_PSM
WHERE Manu_Genre LIKE ("*Action/Comedy", "*Anime", "*Drama/Silent")
GROUP BY device.manufacturer SORT ASC, Manu_Genre SORT ASC
HAVING [ROLLUP num_views TO Manu] > 30000 AND [ROLLUP num_views TO Manu_Genre] > 1000
Build a WHERE Clause
The following example shows a WHERE clause using mixed predicates and row comparison. It also uses
the NOT keyword to negate list expressions.
SELECT count()
FROM [(test) View Summary]
WHERE prior_views NOT IN (3,5,7,11,13,17)
AND TO_LONG(prior_views) NOT IN (4294967296)
AND avebitrate_double NOT IN (3101.0, 2598.0, 804.0)
AND video.genre NOT IN ("Silent", "Exercise")
AND video.genre NOT LIKE ("*a*")
AND date.Date NOT IN (2011-08-04, 2011-06-04, 2011-07-04)
AND prior_views > 23
AND avebitrate_double < 3101.0
AND TO_FIXED(avebitrate_double) != 3101.0
AND TO_LONG(prior_views) != 4294967296
AND video.genre <= "Silent" AND date.Date > 2011-08-04
AND date.Date NOT BETWEEN 2012-01-01 AND 2013-01-01
AND video.genre BETWEEN "Exercise" AND "Silent"
AND prior_views BETWEEN 0 AND 100
AND avebitrate_double NOT BETWEEN 1234.5678 AND 2345.6789
You cannot use IS NULL or IS NOT NULL comparisons. You also cannot use relative date filters
(LAST integer DAYS).
Complex Measure Expression
The following example illustrates a measure expression that includes both a ROLLUP and the use of
aggregate functions.
SELECT device.manufacturer,
  CONCAT([device].[manufacturer], [video].[genre]) AS Manu_Genre,
  [Num Views],
  ROLLUP COUNT() TO (device.manufacturer) AS [ROLLUP num_views TO Manu],
  ROLLUP COUNT() TO ([Manu_Genre]) AS [ROLLUP num_views TO Manu_Genre]
FROM movie_view2G_PSM
WHERE Manu_Genre LIKE ("*Action/Comedy", "*Anime", "*Drama/Silent")
GROUP BY device.manufacturer SORT ASC, Manu_Genre SORT ASC
HAVING [ROLLUP num_views TO Manu] > 30000 AND [ROLLUP num_views TO Manu_Genre] > 1000
Complex Row Expressions
This row expression uses multiple row terms and factors:
SELECT duration + [days after release] + user.age +
user.location.estimatedpopulation AS [Row-Expression multi-factors],
[Num Views]
FROM movie_view2G_PSM
GROUP BY [Row-Expression multi-factors] SORT ASC
You'll notice that the Row-Expression multi-factors alias for the SELECT complex expression
is reused in the GROUP BY clause.