Databricks: A Primer
Who is Databricks?
Databricks was founded by the team behind Apache Spark, the most active
open source project in the big data ecosystem today. Our mission at Databricks
is to dramatically simplify big data processing and free users to focus on turning
their data into value.
We do this through our product, Databricks, which is powered by Spark.
For more information on Spark, download the Spark Primer.
Diagram: Data → Databricks → Value
“We’ve had great success using Apache Spark on Databricks to
compute the billions of data points behind our predictive models
guiding consumers to the right health insurance plan. The simplicity
and interactivity of Databricks makes it easy for developers and data
scientists new to Spark to get up to speed very quickly, and not have to
worry about the minutiae of managing clusters.”
— Ani Vemprala, CEO, Picwell
What is Databricks?
Databricks is a hosted end-to-end data platform powered by Spark. It enables
organizations to seamlessly transition from data ingest through exploration and
production.
Databricks comprises four foundational components:
Managed Spark Clusters
Exploration and Visualization
Production Pipelines
Third-Party Apps
The Foundational Components of Databricks
Managed Spark Clusters
Fully managed Spark clusters in the cloud
that help enterprises focus on their data,
not operations.
Easily Provision Clusters: Launch,
dynamically scale up or down, and
terminate clusters with just a few clicks.
We automate management so you can
focus on your data.
Harness the Power of Spark: Configured
and tuned by the people who built it.
Import Data Seamlessly: Import data
from S3, your local machine, or a wide
variety of data sources, including HDFS,
RDBMS, Cassandra, and MongoDB.
Exploration and Visualization
An interactive workspace for exploration
and visualization so users can learn,
work, and collaborate in a single,
easy-to-use environment.
Explore: Use interactive notebooks
to write Spark commands in R,
Python, Scala, or SQL and reuse
your favorite Python, Java, or Scala
libraries.
Collaborate: Work on the same
notebook in real time or send it
around for offline collaboration.
Visualize: Leverage a wide
assortment of point-and-click
visualizations. Or use powerful
scriptable options like matplotlib,
ggplot, and D3.
Publish: Build rich dashboards that
present key findings to share with
your colleagues and customers.
Production Pipelines
A production pipeline scheduler that
helps users get from prototype to
production without re-engineering.
Schedule Production Workflows:
Schedule any existing notebook or locally
developed Spark code to run periodically
using existing or newly-provisioned
clusters.
Implement Complete Pipelines: Build
production pipelines that span data
import and ETL, complex conditional
processing, and data export.
Monitor Progress and Results:
Set up custom alerts for job completion
and failure, and easily view historical
and in-progress results.
Third-Party Apps
A platform for powering Spark-based
applications that helps users leverage
a growing ecosystem of applications,
and reuse their favorite tools.
What are some of the technical and
operational bottlenecks faced by
data scientists, data engineers and
analysts with their data pipeline?
Over the last few years, Spark has made great strides in helping enterprises overcome
some of their big data processing challenges; however, many enterprises are still
struggling to extract value from their data pipelines. Capturing value from big data
requires capabilities beyond data processing, and enterprises are discovering
many challenges on their journey to operationalize their data pipelines:
1. Infrastructure issues that require data teams to pre-provision, set up, and
manage on-premise clusters, which is both costly and time-consuming.
2. Even once the infrastructure challenges have been addressed, data scientists
and engineers still have to contend with siloed workspaces, where
working with data, code, and visualizations requires switching between
different software, and sharing work among peers means manually
copying data.
3. Sharing insights with non-engineering stakeholders and handing work off
to the production team.
Problem: the journey is complex and costly.

Your Data Pipeline: the journey is complex and costly
• Get a cluster up and running: expensive to build and hard to manage
• Import and explore data: disparate and difficult tools
• Build a production pipeline: months of re-engineering to deploy
Throughout this journey, enterprises are required to cobble various components
together, which is not only highly inefficient but also makes it difficult to track
data lineage and usage patterns across the various components of the stack. Under
this model, enterprises are unable to implement complete pipelines, which severely
inhibits innovation and value creation.
Why Databricks?
Given the challenges faced by data
professionals and enterprises in managing
their data pipelines, we saw the need for a
single platform that enables customers
to easily deploy Spark as a service while
providing a rich set of tools out of the box.
Key attributes:
• Managed Spark Clusters
in the Cloud
• Notebook Environment
• Production Pipeline Scheduler
• 3rd Party Applications
Our key differentiators are:
Unified Platform

With Databricks, enterprises are able to go
from data ingest through exploration and
production on a single data platform. This
significantly minimizes the integration pains
they currently face when cobbling together
multiple tools and systems, and helps
streamline entire pipeline deployments.
With a unified platform, data professionals
are able to reuse their code base by utilizing
the same notebooks for exploration and
production, resulting in tremendous time
savings.

Zero Management

Databricks provides powerful cluster
management capabilities which allow
users to create new clusters in seconds,
dynamically scale them up and down, and
share them across users. This obviates the
need to set up and maintain clusters, so
organizations do not need dedicated
DevOps teams: their data teams can create
self-service Spark clusters and import their
data seamlessly. This allows them to focus
on their core mission of understanding and
gaining insights from their data, not
managing day-to-day operations.
Real-Time

Databricks provides real-time capabilities in
several dimensions:
1. The notebook feature allows users to
perform interactive queries and visualize
results in real time. This can dramatically
increase productivity during exploration
and yield additional insights.
2. The interactive workspace feature enables
real-time collaboration among multiple
users. Team members can seamlessly share
code, plots, and results, leveraging each
other's work far more effectively.
3. The streaming feature provides low-latency,
fault-tolerant processing of continuous
data streams. This enables organizations to
rapidly take action in response to live data
in real time.

Open Platform

Databricks is a platform for powering
Spark-based applications and comes with a
third-party API in addition to JDBC
connectivity. Each cluster includes a JDBC
server, so users can plug their favorite BI
tools directly into their Databricks clusters.
This enables users to reuse their favorite
tools and leverage our growing application
ecosystem, maximizing their investments
and knowledge base and improving time
to value and productivity.
How are enterprises typically
using Databricks?
Enterprises deploy Databricks to achieve a wide variety
of objectives, including:
Prepare Data
• Import data using APIs or connectors
• Clean malformed data
• Aggregate data to create a data warehouse

Perform Analytics
• Explore large data sets in real time
• Find hidden patterns with advanced analytics algorithms
• Publish customized dashboards
Databricks is powered by Spark, giving it
the ability to ingest data from a diverse set
of sources and perform simple yet scalable
data transformations. The real-time
interactive querying environment and data
visualization capabilities of Databricks make
this typically slow process much faster.
Build Data Products
• Rapid prototyping
• Implement advanced analytics algorithms
• Create and monitor robust production
pipelines
With Databricks, developers and data
scientists can work in SQL, Python,
Scala, Java, and R, with a wide range
of advanced analytics algorithms at
their disposal. Teams can be instantly
productive with real-time analysis of
large-scale datasets on topics ranging from
user behavior to the customer funnel.
With a few clicks, Databricks can publish
these results and complex visualizations as
part of notebooks, through integration with
third-party BI tools, or as customized
dashboards.
Databricks allows teams of developers and
data scientists to efficiently experiment with
new product ideas through the interactive
workspace. Advanced analytics libraries
such as MLlib and GraphX also provide an
easy way for teams to deploy sophisticated
algorithms in Spark. Once a prototype has
been built, one can seamlessly deploy it
in production — at scale — using the Jobs
feature.
How will Databricks benefit data
professionals and enterprises?
Databricks helps data
professionals and
enterprises focus on
finding answers in
their data, building data
products, and ultimately
capturing the value
promised by big data.
The platform delivers the following key benefits
to data professionals and enterprises:
Higher productivity
• Maintenance-free infrastructure
• Real-time processing
• Easy to use tools
Faster deployment of data pipelines
• Zero management Spark clusters
• Instant transition from prototype to production
Evaluate Databricks with a trial account now.
Data democratization within enterprises
• One shared repository
• Seamless collaboration
• Easy to build sophisticated dashboards and notebooks
databricks.com/registration
“The fact that explorations by our data science team now take less
than an hour, rather than days, has fundamentally changed how we
ask questions and visualize changes to the index.”
– Darian Shirazi, CEO, Radius Intelligence