RapidMiner Orange Paper Big Data Security on Hadoop

by Tobias Malbrecht and Zoltan Prekopcsak
February 2015
As an increasing number of enterprises move towards production
deployments of Hadoop, security continues to be an important topic and an
integral implementation initiative – often coinciding with initial deployments
of analytics platforms that run on Hadoop.
As such, modern analytics platforms must comply with security standards
early on.
In this Orange Paper we show how RapidMiner Radoop complies with current
and future security implementation standards by providing authentication and
authorization and by integrating additional levels such as data encryption
support.
Challenge
Hadoop is now widely adopted. It has grown beyond a series of open source projects for
programmers, and organizations have matured in their understanding of Big Data
technologies and in their expectations of the benefits of Hadoop. Acknowledging the added
value that can be generated cost-effectively by applying analytics to Big Data in Hadoop,
many organizations have successfully passed the proof of concept stage and moved on to
setting up production clusters. With that, new aspects of deploying Hadoop come into focus.
Among these aspects, data security is the one we see coming up most often. Though
requirements differ depending on the type of organization and the level of regulation
typically applied within an industry sector, most organizations actively consider and
implement security as an integral part of a production Hadoop environment.
The challenge is to deploy solutions that bring analytics to Hadoop while seamlessly
integrating with data security policies and platforms that make security transparent and
easily applicable for users in order to facilitate frictionless building of modern analytics.
Analysis
Leading Hadoop vendors share a common understanding of the measures needed to implement
Hadoop security. The Hadoop distribution providers promote a four-layer security model for
Hadoop. They sometimes use different names for the security layers, but the underlying
concepts are typically similar.

Figure: Data Security Implementation Model
Perimeter Security: Authentication
Data Access Security: Authorization
Accountability: Auditing and Data Lineage
Data Protection: Encryption
Perimeter Security: The first level is responsible for authenticating a user, i.e. ensuring
that a user is who he or she claims to be. This is usually solved with MIT Kerberos, a
well-known system and de-facto standard for implementing authentication. Kerberos
integrates with LDAP or Active Directory to obtain user information. The Hadoop vendors
offer some tooling to manage Kerberos. As an alternative, Hortonworks also promotes
Apache Knox as a way of ensuring perimeter authentication.
Data Access: The second level is responsible for authorizing access to data, i.e. granting
users access only to the data, services and resources that they are specifically entitled
to use. Some Hadoop services like HDFS already have file permissions and other features to
ensure proper authorization, but sometimes users are looking for more fine-grained
authorization capabilities (e.g. on a column level or even on the data cell level). Cloudera
promotes Apache Sentry for this, while Hortonworks has acquired a company called XA
Secure to deliver data access security.
Accountability: The goal of this security level is to foster accountability by allowing
administrators to monitor and audit data access on Hadoop. Additional measures include
data lineage, which shows where data comes from and how different data sets depend on
each other. To support this level of security, Cloudera offers a dedicated product called
Navigator, while Hortonworks is again building on XA Secure technology.
Data Protection: The fourth and last aspect of security is also a large field, covering
data-at-rest encryption, on-the-wire encryption, data masking, and many more. Hadoop
vendors usually have some features for this, but they currently rely mostly on partners to
provide full-blown solutions.
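To make the data access level described above more concrete, the following minimal sketch shows how a POSIX-style permission check of the kind HDFS applies to files can be evaluated. It is an illustration under simplifying assumptions, not HDFS code: the function name `hdfs_style_check` and the sample users and groups are hypothetical, and the sketch ignores ACLs and the superuser.

```python
def hdfs_style_check(mode, owner, group, user, user_groups, want):
    """Return True if `user` may perform `want` ('r', 'w' or 'x') on a file.

    Hypothetical helper: only the owner/group/other permission bits are
    evaluated, and exactly one class of bits applies to a given user.
    """
    bit = {"r": 4, "w": 2, "x": 1}[want]
    if user == owner:
        shift = 6          # owner bits
    elif group in user_groups:
        shift = 3          # group bits
    else:
        shift = 0          # bits for everyone else
    return bool((mode >> shift) & bit)


# A file readable and writable by its owner and readable by its group.
mode = 0o640
print(hdfs_style_check(mode, "alice", "analysts", "alice", {"analysts"}, "w"))
print(hdfs_style_check(mode, "alice", "analysts", "bob", {"analysts"}, "r"))
print(hdfs_style_check(mode, "alice", "analysts", "carol", {"sales"}, "r"))
```

A check like this decides access per file or directory only, which is why finer-grained authorization on the column or cell level needs an additional component on top.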
As of today, many enterprise production deployments of Hadoop already include
implementations of perimeter security, with a few also securing data access through
authorization. As deployments mature, adoption of the security levels will increase:
perimeter security and data access security will become standard, and integrating with
them will become a necessity for analytics tools. The increasing adoption of cloud
infrastructures will also drive the implementation of data protection, whereas the audit
level will be particularly relevant for strongly regulated businesses such as financial
services.
Solution
Analytics tools integrating with Hadoop, in particular those pushing computation down
into Hadoop clusters, need to deal with security levels once they are implemented in
Hadoop. At the forefront of in-Hadoop analytics, RapidMiner Radoop brings ease of use
and visual analytics workflow development to Hadoop. Continuing to anticipate market
needs, RapidMiner Radoop now integrates with Hadoop security implementations to deliver
analytics seamlessly and without friction on secured Hadoop clusters as well.
RapidMiner Radoop pushes down visually designed workflows for analytics into Hadoop
environments for processing these workflows – integrating with core Hadoop technologies
HDFS, MapReduce/YARN and Hive among others to execute parts of the workflows.
“Kerberized” Hadoop clusters require authentication via Kerberos when connecting to and
accessing these services.
As of version 2.2, RapidMiner Radoop integrates with Kerberos authentication. When
accessing a Hadoop cluster, and any of the services listed above, RapidMiner Radoop
requests a ticket from Kerberos and – if authenticated – uses that ticket to gain access to
the services. To confirm user information, Kerberos itself typically integrates with an LDAP
(Lightweight Directory Access Protocol) or Active Directory server.

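Figure: Kerberos Authentication
1. RapidMiner Radoop requests authentication from the Kerberos Authentication Server.
2. The Kerberos Authentication Server grants a ‘Ticket-Granting’ Ticket.
3. RapidMiner Radoop requests a Service Ticket.
4. The Kerberos Authentication Server grants a ‘Service Session’ Ticket.
5. RapidMiner Radoop accesses the Hadoop Service (e.g. Hive).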
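The five steps of the ticket exchange can be sketched as a small simulation. This is a toy model under stated assumptions, not Kerberos itself and not RapidMiner Radoop code: the classes `ToyKdc` and `ToyHadoopService`, the keys and the sample user are hypothetical, an HMAC tag stands in for ticket encryption, and the password is passed directly, whereas real Kerberos never sends it over the network.

```python
import hashlib
import hmac
import time


def sign(key, body):
    """Return an HMAC-SHA256 tag that stands in for real ticket encryption."""
    return hmac.new(key, body.encode("utf-8"), hashlib.sha256).hexdigest()


def verify(key, ticket):
    """Check a ticket's tag and expiry; return [user, target, expires]."""
    body, _, tag = ticket.rpartition("|")
    if not hmac.compare_digest(tag, sign(key, body)):
        raise PermissionError("ticket rejected: bad signature")
    fields = body.split("|")
    if int(fields[2]) < time.time():
        raise PermissionError("ticket rejected: expired")
    return fields


class ToyKdc:
    """Hypothetical stand-in for the Kerberos authentication server."""

    def __init__(self, passwords, service_keys):
        self.passwords = passwords          # user name -> password
        self.service_keys = service_keys    # service name -> key shared with it
        self.tgs_key = b"toy-kdc-secret"    # known only to the toy KDC

    def grant_tgt(self, user, password, lifetime=3600):
        # Steps 1-2: authenticate the user, grant a ticket-granting ticket.
        if user not in self.passwords or self.passwords[user] != password:
            raise PermissionError("authentication failed")
        body = f"{user}|krbtgt|{int(time.time()) + lifetime}"
        return f"{body}|{sign(self.tgs_key, body)}"

    def grant_service_ticket(self, tgt, service, lifetime=600):
        # Steps 3-4: validate the ticket-granting ticket, grant a service ticket.
        user, _, tgt_expires = verify(self.tgs_key, tgt)
        expires = min(int(tgt_expires), int(time.time()) + lifetime)
        body = f"{user}|{service}|{expires}"
        return f"{body}|{sign(self.service_keys[service], body)}"


class ToyHadoopService:
    """Hypothetical service (e.g. Hive) that accepts only valid tickets."""

    def __init__(self, name, key):
        self.name = name
        self.key = key

    def access(self, ticket):
        # Step 5: the service checks the ticket itself, without asking the KDC.
        user, target, _ = verify(self.key, ticket)
        if target != self.name:
            raise PermissionError("ticket was issued for another service")
        return f"{user} connected to {self.name}"


hive_key = b"toy-hive-secret"
kdc = ToyKdc({"alice": "s3cret"}, {"hive": hive_key})
hive = ToyHadoopService("hive", hive_key)

tgt = kdc.grant_tgt("alice", "s3cret")            # steps 1-2
ticket = kdc.grant_service_ticket(tgt, "hive")    # steps 3-4
print(hive.access(ticket))                        # step 5
```

The point the sketch illustrates is that the service never sees the user's password: it trusts a time-limited ticket that only the authentication server could have issued.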
Beyond authentication, RapidMiner Radoop now also supports data access authorization
through Apache Sentry. In several distributions, Apache Sentry is used to control access
to objects such as tables in Hive.
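As a rough illustration of role-based authorization on Hive tables, the sketch below maps groups to roles and roles to privileges on tables or single columns. It is a simplified model under stated assumptions and does not use the Apache Sentry API: the names `Privilege` and `ToyPolicy`, the groups, the roles and the tables are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Privilege:
    """A permitted action on a table, or on one column if `column` is set."""
    action: str                     # e.g. "SELECT" or "INSERT"
    database: str
    table: str
    column: Optional[str] = None    # None means the whole table


class ToyPolicy:
    """Hypothetical policy store: groups hold roles, roles hold privileges."""

    def __init__(self):
        self.group_roles = {}       # group name -> set of role names
        self.role_privileges = {}   # role name -> set of Privilege

    def grant_role(self, group, role):
        self.group_roles.setdefault(group, set()).add(role)

    def grant_privilege(self, role, privilege):
        self.role_privileges.setdefault(role, set()).add(privilege)

    def is_allowed(self, groups, action, database, table, column=None):
        # A request is allowed if any role of any group holds a matching
        # privilege; a table-level privilege also covers each column.
        for group in groups:
            for role in self.group_roles.get(group, ()):
                for p in self.role_privileges.get(role, ()):
                    if (p.action, p.database, p.table) != (action, database, table):
                        continue
                    if p.column is None or p.column == column:
                        return True
        return False


policy = ToyPolicy()
policy.grant_role("analysts", "sales_reader")
policy.grant_privilege("sales_reader", Privilege("SELECT", "sales", "orders"))
policy.grant_role("interns", "region_only")
policy.grant_privilege("region_only", Privilege("SELECT", "sales", "orders", "region"))

print(policy.is_allowed({"analysts"}, "SELECT", "sales", "orders"))
print(policy.is_allowed({"interns"}, "SELECT", "sales", "orders", "region"))
print(policy.is_allowed({"interns"}, "SELECT", "sales", "orders", "amount"))
```

Because privileges attach to roles rather than to individual users, an analytics tool only needs the authenticated user's group membership for the cluster to decide which tables a pushed-down workflow may read.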
As with any other configuration requirement, configuring Kerberos authentication
support in RapidMiner Radoop is easy and frictionless. RapidMiner Radoop hides
administration and configuration complexity and exposes only the necessary settings to
the user. Effectively, configuration and administration requirements for IT concerning
RapidMiner Radoop as an in-Hadoop analytics solution are reduced to a minimum.
With perimeter security and data access security supported for most Hadoop clusters
(given the broad adoption of Kerberos and Sentry), RapidMiner Radoop already delivers
security for a large portion of production clusters deployed within organizations.
In upcoming platform releases, RapidMiner Radoop will be extended to support, early on,
emerging security measures that have the potential to be adopted as security standards
within enterprises. With that, RapidMiner Radoop is future-proof, delivering easy-to-use
in-Hadoop analytics on any Hadoop cluster, no matter which security implementations are
involved.
Conclusion
With the increased adoption of security implementations for Hadoop, organizations add
perimeter security through authentication, implement data access authorization, set up
auditing measures and encrypt data for better protection. RapidMiner Radoop complies
with the currently implemented security levels and seamlessly integrates analytics with
secured Hadoop clusters.
Furthermore, RapidMiner Radoop makes security configuration easy, providing hassle-free
connectivity and frictionless deployment of RapidMiner Radoop as an analytics platform
for Hadoop. In particular, RapidMiner Radoop integrates with Kerberos authentication and
data access authorization using Apache Sentry. Other security implementations, providing
data access authorization for all distributions and allowing for reading encrypted data,
are planned for integration, as we expect the importance of security for Hadoop to
strengthen further and security implementations to gain more traction in the market.
With that, RapidMiner Radoop is leading not only in the way it does analytics on Big
Data, offering the visual design of analytical workflows and facilitating pushdown
computation of these workflows on Hadoop. It is also leading in how it integrates with
heterogeneous Hadoop infrastructures and security implementations by anticipating the
trends in implementing security for Hadoop and complying with the standards of tomorrow,
today.
All content ©2015 RapidMiner
RapidMiner provides software, solutions, and services in the field of advanced analytics, including
predictive analytics, data mining, and text mining. Learn more at www.rapidminer.com
Tobias Malbrecht @TobiasMalbrecht
Tobias Malbrecht is Director of Product
Management and Product Marketing at RapidMiner.
Previously, Tobias headed the consulting services
unit of RapidMiner and also served as a consultant
and product engineer. Tobias holds master's degrees
in computer science, economics, and business
administration from the Technical University of
Dortmund, Germany.
Zoltan Prekopcsak @prekopcsak
Zoltan Prekopcsak is the V.P. of Big Data at
RapidMiner and has experience in data-driven
projects in industries including
telecommunications, financial services,
e-commerce, and neuroscience. Previously, he was
co-founder/CEO of Radoop before its acquisition by
RapidMiner, a data scientist at Secret Sauce
Partners, Inc., and a lecturer at Budapest
University of Technology and Economics.