Big Data Security on Hadoop

RapidMiner Orange Paper
by Tobias Malbrecht and Zoltan Prekopcsak
February 2015

As an increasing number of enterprises move towards production deployments of Hadoop, security continues to be an important topic and an integral implementation initiative, often coinciding with initial deployments of analytics platforms that run on Hadoop. As such, modern analytics platforms must comply with security standards early on. In this Orange Paper we show how RapidMiner Radoop complies with current and future security implementation standards by providing authentication and authorization and by integrating additional levels such as data encryption support.

Challenge

Hadoop is now widely adopted. It has grown beyond a series of open source projects for programmers, and organizations have matured in their understanding of Big Data technologies and in what they expect from Hadoop. Having recognized the value that analytics on Big Data in Hadoop can generate cost-effectively, many organizations have passed the proof-of-concept stage and moved on to setting up production clusters.

With that, new aspects of deploying Hadoop come into focus. Among these, data security is the one we see coming up most often. Though requirements differ depending on the type of organization and the level of regulation typical for an industry sector, most organizations actively consider and implement security as an integral part of a production Hadoop environment.

The challenge is to deploy solutions that bring analytics to Hadoop while integrating seamlessly with data security policies and platforms, so that security is transparent and easy for users to apply and does not get in the way of building modern analytics.
Analysis

Leading Hadoop vendors share a common understanding of the measures needed to implement Hadoop security. All Hadoop distribution providers promote a four-layer security model for Hadoop. They sometimes use different names for the security layers, but the underlying concepts are typically similar.

Data Security Implementation Model (security layer and its mechanism):
1. Perimeter Security: Authentication
2. Data Access Security: Authorization
3. Accountability: Auditing and Data Lineage
4. Data Protection: Encryption

Perimeter Security: The first level is responsible for authenticating a user, i.e. ensuring that a user is who he or she claims to be. This is usually handled with MIT Kerberos, a well-known system and de facto standard for implementing authentication. Kerberos integrates with LDAP or Active Directory to obtain user information. The Hadoop vendors offer some tooling to manage Kerberos. As an alternative, Hortonworks also promotes Apache Knox as a way of ensuring perimeter authentication.

Data Access: The second level is responsible for authorizing access to data, i.e. granting users access only to the data, services and resources that they are specifically entitled to use. Some Hadoop services such as HDFS already have file permissions and other features to ensure proper authorization, but users sometimes look for more fine-grained authorization capabilities (e.g. on the column level or even on the data cell level). Cloudera promotes Apache Sentry for this, while Hortonworks has acquired a company called XA Secure to deliver data access security.

Accountability: The common goal of this security level is to foster accountability by allowing administrators to monitor and audit data access on Hadoop. Additional measures include data lineage, which shows where data comes from and how different data sets depend on each other.
To support this level of security, Cloudera offers a dedicated product called Navigator, while Hortonworks is again building on XA Secure technology.

Data Protection: The fourth and last aspect of security is also a large field, covering data-at-rest encryption, on-the-wire encryption, data masking, and more. Hadoop vendors usually have some features for this, but they currently rely mostly on partners to provide full-blown solutions.

As of today, many enterprise production deployments of Hadoop already include implementations of perimeter security, with a few also securing data access through authorization. As deployments become more mature, adoption of the security levels will increase: perimeter security and data access security will become standard, and integrating with them will become a necessity for analytics tools. The increasing adoption of cloud infrastructures will also drive the implementation of data protection, whereas the accountability (auditing) level will be relevant in particular for strongly regulated businesses such as financial services.

Solution

Analytics tools that integrate with Hadoop, in particular those that push computation down into Hadoop clusters, need to deal with security levels once they are implemented in Hadoop. At the forefront of in-Hadoop analytics, RapidMiner Radoop brings ease of use and visual analytics workflow development to Hadoop. Continuing to anticipate market needs, RapidMiner Radoop now integrates with Hadoop security implementations to deliver analytics in Hadoop seamlessly and without friction, even on secured Hadoop clusters.

RapidMiner Radoop pushes visually designed analytics workflows down into Hadoop environments for processing, integrating with core Hadoop technologies such as HDFS, MapReduce/YARN and Hive, among others, to execute parts of the workflows. “Kerberized” Hadoop clusters require authentication via Kerberos when connecting to and accessing these services.
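To make the last point concrete for Hive: a JDBC client of HiveServer2 on a Kerberized cluster usually names the server's Kerberos principal in the connection URL, in the form `jdbc:hive2://host:port/db;principal=primary/instance@REALM`. The following is a minimal sketch of that convention only. The host name, realm and helper names are hypothetical, and it does not show how RapidMiner Radoop itself stores or builds its connection settings, which this paper does not describe.

```python
from typing import NamedTuple, Optional


class Principal(NamedTuple):
    """Parts of a Kerberos principal written as primary/instance@REALM."""
    primary: str
    instance: Optional[str]
    realm: str


def parse_principal(text: str) -> Principal:
    """Split 'primary/instance@REALM' into its parts; the instance is optional."""
    name, at, realm = text.rpartition("@")
    if not at or not name or not realm:
        raise ValueError(f"not a Kerberos principal: {text!r}")
    primary, slash, instance = name.partition("/")
    return Principal(primary, instance if slash else None, realm)


def hive_jdbc_url(host: str, port: int, database: str,
                  server_principal: Optional[str] = None) -> str:
    """Build a HiveServer2 JDBC URL, adding ';principal=' for Kerberized clusters."""
    url = f"jdbc:hive2://{host}:{port}/{database}"
    if server_principal is not None:
        parse_principal(server_principal)  # reject malformed principals early
        url += f";principal={server_principal}"
    return url


if __name__ == "__main__":
    # Hypothetical host and realm, for illustration only.
    print(hive_jdbc_url("hive.example.com", 10000, "default",
                        "hive/hive.example.com@EXAMPLE.COM"))
```

On an unsecured cluster the principal is simply left out and the URL carries only host, port and database.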
As of version 2.2, RapidMiner Radoop integrates with Kerberos authentication. When accessing a Hadoop cluster and any of the services listed above, RapidMiner Radoop requests a ticket from Kerberos and, if authenticated, uses that ticket to gain access to the services. To confirm user information, Kerberos itself typically integrates with an LDAP (Lightweight Directory Access Protocol) or Active Directory server.

Figure: Kerberos Authentication (participants: RapidMiner Radoop, the Kerberos Authentication Server, and a Hadoop service)
1. Request Authentication
2. Grant ‘Ticket-Granting’ Ticket
3. Request Service Ticket
4. Grant ‘Service Session’ Ticket
5. Access Hadoop Service (e.g. Hive)

Beyond authentication, RapidMiner Radoop now also supports data access authorization through Apache Sentry. In several distributions, Apache Sentry is used to control access, e.g. to tables in Hive.

As with any other configuration requirement, configuring Kerberos authentication support in RapidMiner Radoop is easy and frictionless. RapidMiner Radoop hides the administration and configuration complexity and exposes only the necessary settings to the user. As a result, the configuration and administration effort that RapidMiner Radoop requires from IT as an in-Hadoop analytics solution is reduced to a minimum.

With perimeter security and data access security supported for most Hadoop clusters (given the broad adoption of Kerberos and Sentry), RapidMiner Radoop already works with the security setup of a large share of the production clusters deployed within organizations. In upcoming platform releases, RapidMiner Radoop will be extended to support, early on, emerging security measures that have the potential to become security standards within enterprises. With that, RapidMiner Radoop is future-proof, delivering easy-to-use in-Hadoop analytics on any Hadoop cluster regardless of the security implementations involved.
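The five-step Kerberos exchange described above can be sketched as a toy simulation. This is an illustration under strong simplifying assumptions, not real Kerberos and not RapidMiner Radoop code: all class and function names are hypothetical, tickets are merely HMAC-signed rather than encrypted, there are no session keys or ticket lifetimes, and the password is checked directly, whereas real Kerberos never sends the password over the network.

```python
import hashlib
import hmac
import secrets
from dataclasses import dataclass
from typing import Dict

TGS = "ticket-granting-service"  # hypothetical label for the ticket-granting role


@dataclass(frozen=True)
class Ticket:
    """A signed claim that `user` may talk to `target` (toy model, no expiry)."""
    user: str
    target: str
    signature: str


def sign(key: bytes, user: str, target: str) -> str:
    """HMAC over the ticket fields; only holders of `key` can create or check it."""
    return hmac.new(key, f"{user}|{target}".encode(), hashlib.sha256).hexdigest()


def is_valid(ticket: Ticket, key: bytes, target: str) -> bool:
    """Check that the ticket names `target` and carries a matching signature."""
    expected = sign(key, ticket.user, target)
    return ticket.target == target and hmac.compare_digest(
        ticket.signature.encode(), expected.encode())


class ToyAuthServer:
    """Stands in for the Kerberos authentication server (steps 1 to 4)."""

    def __init__(self, passwords: Dict[str, str]) -> None:
        self._passwords = passwords  # real setups look users up in LDAP / Active Directory
        self._tgs_key = secrets.token_bytes(32)
        self._service_keys: Dict[str, bytes] = {}

    def register_service(self, name: str) -> bytes:
        """Create the secret key shared between this server and one service."""
        key = secrets.token_bytes(32)
        self._service_keys[name] = key
        return key

    def authenticate(self, user: str, password: str) -> Ticket:
        """Steps 1-2: verify the user and grant a ticket-granting ticket."""
        stored = self._passwords.get(user, "")
        if not stored or not hmac.compare_digest(stored.encode(), password.encode()):
            raise PermissionError("authentication failed")
        return Ticket(user, TGS, sign(self._tgs_key, user, TGS))

    def grant_service_ticket(self, tgt: Ticket, service: str) -> Ticket:
        """Steps 3-4: exchange a valid ticket-granting ticket for a service ticket."""
        if not is_valid(tgt, self._tgs_key, TGS):
            raise PermissionError("invalid ticket-granting ticket")
        if service not in self._service_keys:
            raise PermissionError(f"unknown service: {service}")
        return Ticket(tgt.user, service, sign(self._service_keys[service], tgt.user, service))


class ToyHadoopService:
    """Stands in for a Hadoop service such as Hive (step 5)."""

    def __init__(self, name: str, key: bytes) -> None:
        self.name = name
        self._key = key

    def access(self, ticket: Ticket) -> str:
        """Step 5: admit the caller only with a service ticket issued for this service."""
        if not is_valid(ticket, self._key, self.name):
            raise PermissionError("invalid service ticket")
        return f"{ticket.user} may use {self.name}"


if __name__ == "__main__":
    server = ToyAuthServer({"alice": "s3cret"})
    hive = ToyHadoopService("hive", server.register_service("hive"))
    tgt = server.authenticate("alice", "s3cret")             # steps 1-2
    service_ticket = server.grant_service_ticket(tgt, "hive")  # steps 3-4
    print(hive.access(service_ticket))                       # step 5
```

The point of the two-stage design is that the service never sees the user's credentials: it only checks a ticket that could have been issued solely by the authentication server it shares a key with.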
Conclusion

With the increased adoption of security implementations for Hadoop, organizations add perimeter security through authentication, implement data access authorization, set up auditing measures and encrypt data for better protection.

RapidMiner Radoop complies with the currently implemented security levels and seamlessly integrates analytics with secured Hadoop clusters. Furthermore, RapidMiner Radoop makes security configuration easy, providing hassle-free connectivity and frictionless deployment of RapidMiner Radoop as an analytics platform for Hadoop. In particular, RapidMiner Radoop integrates with Kerberos authentication and with data access authorization using Apache Sentry. Other security implementations, namely data access authorization for all distributions and reading of encrypted data, are planned for integration, as we expect the importance of security for Hadoop to grow further and security implementations to gain more traction in the market.

With that, RapidMiner Radoop leads not only in the way it does analytics on Big Data, offering visual design of analytical workflows and pushdown computation of these workflows on Hadoop. It also leads in how it integrates with heterogeneous Hadoop infrastructures and security implementations, by anticipating the trends in implementing security for Hadoop and complying with the standards of tomorrow, today.

All content ©2015 RapidMiner

RapidMiner provides software, solutions, and services in the field of advanced analytics, including predictive analytics, data mining, and text mining. Learn more at www.rapidminer.com

Tobias Malbrecht @TobiasMalbrecht
Tobias Malbrecht is Director of Product Management and Product Marketing at RapidMiner. Previously, Tobias headed the consulting services unit of RapidMiner and also served as a consultant and product engineer.
Tobias holds master's degrees in computer science, economics, and business administration from the Technical University of Dortmund, Germany.

Zoltan Prekopcsak @prekopcsak
Zoltan Prekopcsak is the V.P. of Big Data at RapidMiner and has experience in data-driven projects in industries including telecommunications, financial services, e-commerce, and neuroscience. Previously, he was co-founder and CEO of Radoop before its acquisition by RapidMiner and a data scientist at Secret Sauce Partners, Inc. He has also been a lecturer at Budapest University of Technology and Economics.