Skytree Big Data Reference Architecture

TECHNOLOGY BRIEF
Enterprise Machine Learning for Hadoop Scale
Data - The Driving Force Behind Analytics
Data is the lifeblood of analytics, and no analytics approach benefits more from bigger data than machine learning.
Machine learning can find subtle patterns in data to detect fraud, retain your customers, score your leads and predict
equipment failures, along with many other business use cases.
The drive is in full swing to transform huge quantities of data, including structured, unstructured and time series data,
into real, measurable business value. In the past, making predictions from large volumes and varieties of data was
cost-prohibitive, but advances in Hadoop and massively parallel analytics on commodity hardware provide an efficient and
cost-effective means to store, manage and analyze vast amounts of data. Skytree has been architected from the start
to take advantage of this transformation.
Architecture Overview
Skytree Infinity™ is an enterprise-grade machine learning platform built for your Hadoop infrastructure [1]. Our platform
is designed to fit naturally into your advanced analytics ecosystem. The Backend Core is a natively developed
high-performance computing engine installed on your Hadoop data nodes to perform machine learning at scale. The Frontend
component is an open HTTP/HTTPS REST API that sits on the edge nodes of your Hadoop cluster to allow both people and
machines to import, export and process data using Skytree. Data modelers and scientists interact with Skytree
Infinity using either our native web interface, which we call the Data Science Workspace, or our Python or Java
Software Development Kits (SDKs).
Skytree Infinity™ Enterprise Machine Learning Platform
RECOMMENDED INFRASTRUCTURE
FLEXIBLE DEPLOYMENT OPTIONS:
■■ Deploy in the cloud with Amazon, Rackspace or Google Compute
■■ Deploy in your data center
HARDWARE RECOMMENDATIONS:
■■ Intel® or AMD® 64-bit x86 processors
■■ 64 to 512 GB of RAM per machine or more
■■ 4 to 32 cores per machine or more
OPERATING SYSTEM:
■■ RHEL Linux 6+ and derivatives including CentOS
CERTIFIED HADOOP DISTRIBUTIONS:
■■ Cloudera 5.1 and above
■■ Hortonworks 2.1 and above
■■ MapR 4.0 and above
CERTIFIED ON CLOUDERA SPARK AND DATABRICKS SPARK FOR DATA PREPARATION:
■■ Infinity ships with Apache Spark
YARN SUPPORT
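As a rough illustration of how the hardware recommendations could be checked on a candidate node, the following minimal sketch treats the lower ends of the recommended ranges (4 cores, 64 GB of RAM) as minimums. That reading is an assumption, and the function names `check_node` and `detect_local_node` are hypothetical helpers, not part of any Skytree tooling.

```python
import os

# Lower bounds of the recommended ranges in this brief, treated here
# as minimums (an assumption made for illustration).
MIN_CORES = 4
MIN_RAM_GB = 64


def check_node(cores, ram_gb):
    """Return a list of shortfalls; an empty list means the node qualifies."""
    problems = []
    if cores < MIN_CORES:
        problems.append(f"{cores} cores < recommended minimum {MIN_CORES}")
    if ram_gb < MIN_RAM_GB:
        problems.append(f"{ram_gb:.0f} GB RAM < recommended minimum {MIN_RAM_GB}")
    return problems


def detect_local_node():
    """Read the core count and physical RAM of the local Linux machine."""
    cores = os.cpu_count() or 0
    ram_bytes = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
    return cores, ram_bytes / 2**30


# Example with explicit values: a node with 8 cores and 128 GB of RAM.
shortfalls = check_node(8, 128)
```

Calling `check_node(*detect_local_node())` on a Linux data node would apply the same check to the local machine.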
The Frontend Application Component
The frontend application component provides a single unified interface for modelers to interact with Skytree Infinity™.
The frontend runs in a Java Servlet container on your edge node. Users interact with the frontend using our web-based
Skytree Data Science Workspace™ or our open REST API over HTTP(S). Skytree provides SDKs for Python and Java that data
scientists can install on their local machines to communicate with Skytree’s REST API, providing programmatic access to
the data science workspace and to data preparation, machine learning, prediction and evaluation capabilities. Skytree
also provides a command-line interface on the edge node for power users who want to “get under the hood” and gain direct
access to Skytree’s machine learning methods, data preparation and compute engine. Metadata for the application
frontend, including the user store, project database and system monitoring, is stored in a relational database running
either on the edge node or on a separate server [2].
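To make the REST access path more concrete, here is a minimal sketch of how any HTTP client might prepare a JSON request for an edge-node API using only the Python standard library. The base URL, the `projects/churn/datasets` path, the payload fields and the bearer-token header are all hypothetical placeholders; this brief does not document the actual Skytree endpoints or authentication scheme.

```python
import json
import urllib.request

# Hypothetical edge-node address; the real API location is not given here.
BASE_URL = "https://edge-node.example.com:8443/api"


def build_request(path, payload, token):
    """Build (but do not send) a JSON POST request to a REST endpoint."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url=f"{BASE_URL}/{path.lstrip('/')}",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            # Assumed auth style, chosen only for illustration.
            "Authorization": f"Bearer {token}",
        },
    )


def send(request, timeout=30):
    """Send a prepared request and decode a JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Hypothetical payload: register a dataset stored in HDFS under a project.
req = build_request(
    "projects/churn/datasets",
    {"name": "customers", "hdfs_path": "/data/customers.csv"},
    token="example-token",
)
```

Separating request construction from sending keeps the sketch testable without a live server; an SDK would typically wrap both steps behind higher-level calls.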
[1] Skytree Infinity™ is still available on standalone x86 Linux servers or HPC clusters for non-Hadoop customers.
[2] Skytree Infinity™ provides Postgres out-of-the-box for storing application metadata, but customers can choose an alternative RDBMS.
The Backend Core Component
Skytree’s machine learning, data preparation and compute engine uses Hadoop and YARN to schedule,
manage and execute Skytree jobs in your distributed cluster. The Backend Core Component is installed on all
of your data nodes to take advantage of your cluster resources and, in the spirit of Hadoop, bring the compute
to your data. Skytree’s High Performance In-Memory Compute Engine with TrueScale™ technology is built with
faster, more accurate, linear-scaling algorithms natively designed for maximum performance on Intel and AMD
x86 processors using techniques drawn from high-performance computing. Our engine minimizes expensive
file I/O and network latency to achieve speed and scale without overusing your cluster resources.
Skytree Infinity Architecture
SUPPORTED METHODS:
■■ Linear Regression
■■ Support Vector Machines
■■ Gradient Boosted Trees
■■ Collaborative Filtering
■■ Random Decision Forests
■■ K-Means
■■ Nearest Neighbor
■■ Kernel Density Estimation
■■ Singular Value Decomposition
■■ Principal Component Analysis
■■ Two-Point Correlation
ENTERPRISE FEATURES:
■■ Kerberos Support
■■ Supported file formats: CSV, JSON, Pcap, Avro, Text
■■ Integration with Relational Databases using JDBC
■■ Support for HDFS and MapRFS
■■ Smarter modeling using AutoModel™ and patent-pending SmartSearch™
■■ Self-documenting models using AutoDoc™
Anatomy of the Skytree Data Science Workflow
Data science starts with a business question, such as how to reduce fraud, reduce customer churn, or select the next
logical product for a customer. Data scientists transform this question into a data science project. The data scientist
has three options to create and execute projects within Skytree:
■■ Running our Python or Java SDK client on their local machine, or calling Skytree from any client that supports REST over HTTP/HTTPS.
■■ Using our web-based Data Science Workspace™.
■■ Logging in to an edge node using SSH or another remote terminal client and calling Skytree from our power-user command-line interface.
Data scientists interact with Skytree on the edge node via REST or the command line. When a data scientist requests a
complex operation such as data preparation, feature engineering or machine learning, Skytree calculates the memory, CPU
and disk resource requirements and submits requests to YARN, so that the Hadoop infrastructure can be shared with other
Skytree and non-Skytree workloads.
YARN assigns cluster resources based on your cluster scheduling configuration and data locality rules. Skytree’s YARN
application activates the Skytree Backend Compute Engine, which reads data from HDFS and uses a high-performing and
efficient in-memory structure to perform scalable machine learning.
The end result of the project is a predictive model output into HDFS that can then be deployed to your operational environment.
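The resource-sizing step described in this workflow can be pictured with a small sketch. The heuristic below (working set equals dataset size times an overhead factor, spread over the fewest containers that fit in one node's memory) is purely illustrative and is not Skytree's actual calculation; `plan_containers`, `ContainerRequest` and the `overhead` default are hypothetical, and disk requirements are left out for brevity.

```python
import math
from dataclasses import dataclass


@dataclass
class ContainerRequest:
    containers: int   # number of containers to ask the scheduler for
    memory_mb: int    # memory per container, in MB
    vcores: int       # virtual cores per container


def plan_containers(dataset_bytes, node_memory_mb, node_vcores,
                    overhead=1.5, max_nodes=None):
    """Size a YARN-style request for an in-memory job (illustrative only)."""
    # Assumed working set: the raw data plus in-memory structure overhead.
    working_set_mb = math.ceil(dataset_bytes * overhead / 2**20)
    # Use the fewest containers such that each fits in one node's memory.
    containers = max(1, math.ceil(working_set_mb / node_memory_mb))
    if max_nodes is not None and containers > max_nodes:
        raise ValueError("job does not fit in memory on the available nodes")
    memory_mb = math.ceil(working_set_mb / containers)
    return ContainerRequest(containers, memory_mb, node_vcores)


# Example: a 200 GiB dataset on nodes offering 64 GiB and 8 vcores each.
plan = plan_containers(
    dataset_bytes=200 * 2**30,
    node_memory_mb=64 * 1024,
    node_vcores=8,
)
```

Declaring the requirement up front is what lets the scheduler place the job alongside other workloads instead of letting it grab resources opportunistically.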
Deployment
The end result of your data science process is the predictive model. This predictive model can be used to cluster, predict or score new data points
in your operations. The model can score new data using Skytree’s online scoring or batch scoring features, or it can be exported as PMML/XML for scoring
outside of Skytree. Skytree also provides a native Java Archive (JAR) library called the Skytree Evaluator that can score natively in Java environments.
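As an illustration of scoring outside the platform from an exported PMML file, the sketch below parses an abbreviated PMML regression model with the Python standard library and applies it to a small batch of records. The model fragment, its feature names and its coefficients are invented for the example (a complete PMML file also carries a Header, DataDictionary and MiningSchema), and the sketch handles only a plain linear regression table, not the full PMML specification or any Skytree-specific export.

```python
import xml.etree.ElementTree as ET

# Abbreviated, hand-written PMML regression model (illustrative values).
PMML_DOC = """
<PMML version="4.2" xmlns="http://www.dmg.org/PMML-4_2">
  <RegressionModel functionName="regression">
    <RegressionTable intercept="0.5">
      <NumericPredictor name="tenure_months" coefficient="-0.01"/>
      <NumericPredictor name="support_calls" coefficient="0.2"/>
    </RegressionTable>
  </RegressionModel>
</PMML>
"""

NS = {"pmml": "http://www.dmg.org/PMML-4_2"}


def load_linear_model(pmml_text):
    """Return (intercept, {feature: coefficient}) from a regression table."""
    root = ET.fromstring(pmml_text)
    table = root.find("pmml:RegressionModel/pmml:RegressionTable", NS)
    intercept = float(table.get("intercept"))
    coefficients = {
        p.get("name"): float(p.get("coefficient"))
        for p in table.findall("pmml:NumericPredictor", NS)
    }
    return intercept, coefficients


def score(model, record):
    """Score one record: intercept + sum(coefficient * feature value)."""
    intercept, coefficients = model
    return intercept + sum(c * record[name] for name, c in coefficients.items())


model = load_linear_model(PMML_DOC)
# Batch scoring is the same function applied to many records.
batch = [
    {"tenure_months": 10, "support_calls": 2},
    {"tenure_months": 50, "support_calls": 0},
]
scores = [score(model, r) for r in batch]
```

In production, a dedicated PMML evaluator library would normally replace this hand-rolled parser, since real exports include transformations and model types that this sketch ignores.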
About Skytree
Skytree®—The leader in enterprise machine learning on big data, is disrupting the advanced analytics market with a machine learning
platform that gives organizations the power to discover deep analytic insights, predict future trends, make recommendations and reveal
untapped markets and customers. Advanced analytics is quickly becoming a strategic technology in the age of Big Data. Skytree is at the
forefront with enterprise-grade machine learning. Skytree’s flagship product—Skytree Infinity™—is the only general purpose platform on
the market, built for the highest accuracy, speed and scalability.
www.skytree.net
© 2015 Skytree, Inc. All Rights Reserved. Skytree The Machine Learning Company is a
registered trademark of Skytree, Inc. All other trademarks are the property of their respective
owners.
Bigger Data.
Better Results.™