Grids@Work V Oracle Coherence for Finance Applications Ewan Slater Senior Solution Specialist

Grids@Work V
Oracle Coherence for Finance Applications
Ewan Slater
Senior Solution Specialist
EMEA Technology Fusion Middleware
Topics
• Scalability – why do we care?
• Scalability – what’s the problem?
• Traditional approaches and their drawbacks
• The Coherence approach
• What is Coherence?
• Where does Coherence fit?
• How Coherence works
• Using Coherence
• Coherence in Action
• Conclusion
• Q&A
Scalability – why do we care?
IT Initiatives Driving Scalability Demand
• XTP
• Highest volume, low latency, absolute transactional integrity
• Virtualization
• Increased demand on Data Sources
• Application re-provisioning must occur transparently without interruption of data access
• Must handle multiple load increases at the same time
• SOA
• Increasing common access to resources
• Sharing access means continuous availability and absolute reliability
• EDA
• Event-driven transactions cause a massive increase in load
• Pervasiveness drives the need for data across all affected systems
[Chart: Demand and Supply of Resources over Time]
The more people have, the more they want!
Software Framework Pressures
• Service Oriented Architecture
• Web 2.0
• Event Driven Architecture
• Extreme Transaction Volumes
Hardware Capacity Impact
• Compute Power: SMP/Multicore
• Memory Arrives: “In Memory Option”
• Network Speed: GbE/10G/IB
• Storage: Flexibility
Enterprise Manageability Requirements
• Grid Automation
• Service Level Management
• Application Performance Management
• Provisioning
Enterprise Infrastructure Requirements
• Availability – Continuous
• Reliability – Transactional Integrity
• Scalability – Capacity on Demand
• Performance – Zero Latency
Scalability – what’s the problem?
In general, applications don’t scale well…
…what worked fine in development, or for 50 users…
…can’t cope with production demand…
…that increases over time…
Why don’t applications scale?
• Single points of failure (SPOF)
• Database failure or pause = application failure or pause
• One server fails, the entire system fails
• One application or JVM fails, the application fails
• Single points of bottleneck (SPOB)
• Shared resources
• The “hub” of Hub-and-spoke architectures
• Heavy database or disk I/O
• Applications are not designed to scale
• It works in single-user testing on a PC, but will it work in production?
• Scaling is often an afterthought – “it’s the DBA’s problem”
Scaling the Application Tier:
Traditional Approaches
Scale up (or even bigger boxes)
Approach: Scale-Up (“It’s an infrastructure problem”)
How:
• Buy Big Boxes
• Increase Resources (CPU, memory, HDD capacity, speed and network, etc.)
• By specialized hardware (Azul, InfiniBand…)
Advantages:
• Simple (overnight)
• No development
• No impact on internal design
Disadvantages:
• Expensive
• Will hit physical limits
• Will have to redesign at limit
• Non-graceful deterioration at limit
• Stop, Add, Restart required to scale
Bigger box = Much Bigger price tag!!!
• High incremental cost
• Wasted capacity
At some point, even the biggest box has its limits!
Stateless application tier
(or blame the DBA)
Approach: Stateless Scale-Out (“Push state scale-out into lower Data Source layer”, “It’s the DBA’s problem”)
How:
• Make application stateless (e.g. stateless sessions)
• Use lots of stateless servers
• Use load-balancing
• Use “big” and “scalable” Data Source to ensure application state scale-out
Advantages:
• Easy to develop (not overnight, but relatively simple as no state is managed)
• Scale-out is easy, just add more servers
Disadvantages:
• Only scales to match underlying Data Source performance
• When underlying limit is reached, have to redesign
• Network bottlenecks experienced as data is moved between layers
Performance Bottleneck Between Tiers
[Diagram: Application (Object, Java) ↔ Database (Relational, SQL)]
A HUGE performance bottleneck:
Volume / Complexity / Frequency of Data Access
Performance Bottleneck Between Tiers
Solution: Move relevant data to middle tier
[Diagram: Java Application → Application Servers, each holding Objects in a Memory Cache → Relational Database]
• One solution is to keep the object data in object form in a high-speed distributed memory cache
• Database remains the system of record (persistence)
Caching in the application
Approach: Caching (“Keep recent copies of state”, “We’ll save the DB and DBA by caching”)
How:
• Application keeps local copies (in memory or on local disk) of recently / commonly used state
Advantages:
• Seems simple
• Reduces Data Source and Network load
• Significant application performance improvements
Disadvantages:
• Maintaining consistency of data between Local and Data Source instances can be difficult
• Requires “messaging infrastructure” to ensure consistency across a cluster (and application development)
• Typically applicable to “read only” applications and not “write a lot” applications
• Easy to get wrong
Local Caching
Can be scaled out…
Farm Caching
Inconsistent Local Cache
Farm Caching
• Benefits:
• Same as Local Cache
• May now scale out
• Constraints:
• Same as Local Cache - but now worse - across Farm!
• Singularity broken between members (Incoherent)
• Members have own copies of Entries
• No cost savings in making copies to members
• Cache capacity doesn’t increase with Farm size
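The farm constraints can be made concrete with a small sketch. It assumes each farm member is just a private in-process map in front of a shared data source; the class and method names are illustrative and do not belong to any product API. An update made through one member leaves a stale copy on the other:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative model only: a "farm" of servers, each with its own private local cache.
class FarmCacheSketch {
    // Shared system of record (stands in for the database).
    static final Map<String, Integer> dataSource = new HashMap<>();

    // One farm member with an uncoordinated local cache.
    static class Member {
        final Map<String, Integer> localCache = new HashMap<>();

        Integer read(String key) {
            // Serve the local copy if present, otherwise load it and keep a copy.
            return localCache.computeIfAbsent(key, dataSource::get);
        }

        void write(String key, Integer value) {
            // Updates the data source and this member's copy only.
            dataSource.put(key, value);
            localCache.put(key, value);
        }
    }

    public static void main(String[] args) {
        dataSource.put("balance", 100);
        Member a = new Member();
        Member b = new Member();

        a.read("balance");       // both members now hold their own copy
        b.read("balance");
        a.write("balance", 50);  // member B is never told about the change

        System.out.println("member A sees " + a.read("balance"));
        System.out.println("member B sees " + b.read("balance"));
    }
}
```

Every member also holds its own copy of each entry it has read, which is why adding members in this model adds copies rather than capacity.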
Scale out the Container
(or blame the App Server)
Approach: Use an Application Container (“Our magical clustered container will scale our application infinitely”)
How:
• Believe the vendors & the marketing
• Follow a “scalability paradigm”
• Use a “Clustering Container” … It scaled the “Pet Store” linearly, therefore our X application will also scale linearly (where X ≠ “Pet Store”)
Advantages:
• Simple
• Well documented and communicable paradigm
• Easily scale development team
Disadvantages:
• Typically scales in-the-small
• Usually relies on “scale-up” rather than “scale-out”
• Requires specialized skills or products (outside of the standard paradigm) to really scale
• Clustering is primarily about High-Availability, not Scalability!
Traditional Scale-Out Approaches…
#1. Avoid the challenge of maintaining consensus
• Opt for the “single point of knowledge”
• Client + Server Model (Hub + Spoke)
• Active + Passive (High Availability)
• Master + Worker Model (Grid Agents)
#2. Have crude consensus mechanisms that typically fail and result in data integrity issues (including loss)
Traditional Scale-Out Consequences…
• Have unbalanced / unfair load and task management
• Some servers have greater system responsibility than others
• Have Single Points of Bottleneck (SPoB)
• Have Single Points of Failure (SPoF)
• “Micro outages” are magnified as you scale-out
• Exhibit Strong Coupling to Physical Resources
• Software completely dependent on individual physical servers
• Require specialized deployment and operation for individual Resources
• Some servers require “special attention” to operate
The Coherence Approach
So how does Coherence solve the problem?
Consensus is the key…
Imagine a team where some members…
• Have a different impression of the actual members of the team
• Allocate tasks and information to their members (from their perspective) but on behalf of the team
• Result?
• Inconsistent views of team information
• Without consensus some information will be inconsistent (at best) or be unavailable or lost (at worst / common)
Real Madrid before Capello
Membership Consensus
• Consensus between resources is fundamental to ensure integrity of information (and work) when scaling-out
Real Madrid after Capello
Coherence relies on Consensus
• Traditional scale-out approaches limit Scalability, Availability, Reliability and Performance
• In Coherence…
• Servers share responsibilities (health, services, data…)
• No SPoB
• No SPoF
• Massively scalable by design
• Logically servers form a “mesh”
• No Masters / Slaves etc.
• Members work together as a team
The result?
Oracle Coherence:
In Memory Data Grid
What is Coherence?
(c) Copyright 2007. Oracle Corporation
Oracle Coherence…
• Is an enabling technology that…
• Allows customers to build bulletproof applications…
• And achieve high performance and predictable scalability
Typical Coherence Customers
• Online gaming (e.g. trading system)
• Telcos (e.g. SMS backbone)
• Hospitality (e.g. flight reservation system)
• Insurance (e.g. user profile management)
• Financial Services (e.g. risk engine)
• Public sector (e.g. railway signalling)
Common theme:
Mission-critical, bulletproof solutions
• Reliability
• Availability
• Scalability
• Performance
Coherence doesn’t need an app server
There is a .NET client library…and this is pure .NET
…and…
There is a C++ client library…and this is pure C++
Where does Coherence fit?
Look at the shape of the data
Application Layers
• Web Server
• App Server
• DB Server
Data “Shape” across tiers
Web Tier: Network, Web Cache, Web Servers
• HTML Data Structures in Memory
• Web Cache offloads Web Servers and improves Network Performance via Compression
Application Tier: Application Servers, Coherence
• Java Data Structures in Memory
• Coherence caches Java Structures in Memory; Very Fast Access to Java Data in Memory across Mid-Tier Grid
Database Tier: TimesTen, RAC
• SQL Data Structures in Memory
• TimesTen & RAC provide Scalability to Database Data, improving Query & Transaction Write Performance
What is Coherence not?
• Plug and play - the application code will need to change.
• A database – persistent data will need to be written to a database (Oracle RAC is often an ideal fit).
• A Transaction Processing Monitor.
• A panacea for:
• Inadequate hardware
• Badly written applications
• Poor database design
How Coherence Works
Coherence Works by Consensus
• Consensus is key
• Communication is more efficient (peer-to-peer)
• No outages for voting (no need – everyone is a peer)
• No SPoF, SPoB
• No need for broadcast traffic (yelling at each other)
• You can do many things once you have “consensus”.
Made possible by TCMP (the “secret sauce”)
Tangosol Cluster Management Protocol (TCMP)
• Coherence’s own protocol between cluster members
• TCMP utilizes UDP
• Massively scalable
• Asynchronous
• Point-to-point
• UDP Multicast is used for:
• New JVMs to join the cluster automatically
• Maintaining cluster membership
• Multicast is not required; it may be disabled with Well Known Addresses (WKA)
• UDP Unicast is used for most communication
• Very fast and scalable
• TCMP guarantees packet order and delivery
• TCP/IP connections do not need to be maintained
Distributed caching for your data…
…and go faster stripes for your data
Hardware implications
(Blades not Bludgeons)
Big Iron
• Buy based on predicted growth
• High incremental cost
Low cost clusters
• Buy as you grow
• Small increments at present day prices & clock speeds
Using Coherence
Building an Application
• Developers use Coherence API to
• Access Data
• Listen for Events
• Query Data
• Process Data in the Grid
Setting up a grid
• Coherence clusters to form a grid OOTB
• A grid may contain many caches
• A cache structure is defined by a scheme
• Schemes are defined in config files
Distributed Data Management (access)
The Distributed Scheme (one of many)
In-Process Data Management
Distributed Data Management (update)
Distributed Data Management (failover)
Distributed Data Management
• Members have logical access to all Entries
• At most 2 network operations for Access
• At most 4 network operations for Update
• Regardless of Cluster Size
• Deterministic access and update behaviour (performance can be improved with local caching)
• Predictable Scalability
• Cache Capacity Increases with Cluster Size
• Coherence Load-Balances Partitions across Cluster
• Point-to-Point Communication (peer to peer)
• No multicast required (sometimes not allowed)
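As a rough illustration of why the operation counts do not depend on cluster size, the sketch below models a partitioned cache with one backup copy per entry. It is a toy model, not Coherence's implementation: the partition count, the modulo assignment of partitions to members and the way operations are counted (a request and a response per hop, worst case where the caller owns nothing) are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model of a partitioned cache with one backup copy; all names are hypothetical.
class PartitionedCacheSketch {
    static final int PARTITIONS = 257; // assumed fixed partition count

    final List<Map<String, String>> primaryStores = new ArrayList<>();
    final List<Map<String, String>> backupStores = new ArrayList<>();
    int networkOps = 0; // messages between members, counted for illustration

    PartitionedCacheSketch(int members) {
        for (int i = 0; i < members; i++) {
            primaryStores.add(new HashMap<>());
            backupStores.add(new HashMap<>());
        }
    }

    // The key decides the partition; the partition decides the owning member.
    int primaryOwner(String key) {
        int partition = Math.floorMod(key.hashCode(), PARTITIONS);
        return partition % primaryStores.size();
    }

    // The backup lives on a different member whenever there is more than one.
    int backupOwner(String key) {
        return (primaryOwner(key) + 1) % primaryStores.size();
    }

    // Access: request to the primary owner plus its response.
    String get(String key) {
        networkOps += 2;
        return primaryStores.get(primaryOwner(key)).get(key);
    }

    // Update: round trip to the primary plus the primary's round trip to the backup.
    void put(String key, String value) {
        networkOps += 4;
        primaryStores.get(primaryOwner(key)).put(key, value);
        backupStores.get(backupOwner(key)).put(key, value);
    }

    // Failover: if the primary owner is lost, the backup copy still answers.
    String getAfterLosing(int failedMember, String key) {
        if (primaryOwner(key) == failedMember) {
            return backupStores.get(backupOwner(key)).get(key);
        }
        return primaryStores.get(primaryOwner(key)).get(key);
    }

    public static void main(String[] args) {
        for (int members : new int[] {2, 16, 128}) {
            PartitionedCacheSketch grid = new PartitionedCacheSketch(members);
            grid.put("account-42", "balance=100");
            grid.get("account-42");
            System.out.println(members + " members: " + grid.networkOps + " network operations");
        }
    }
}
```

In this model each entry is stored exactly twice however many members join, so total capacity grows with the cluster while the cost of one access or update stays fixed.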
Data Distribution: Clients and Servers
“Clients” with storage disabled
“Servers” with storage enabled
Near Caching (L1 + L2) Topology
Observing Data Changes
Parallel Queries
Parallel Processing and Aggregation
Data Source Integration (read-through)
Data Source Integration (write-through)
Data Source Integration (write-behind)
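The three data source integration styles named on these slides are general caching patterns, and a generic sketch may help tell them apart. The code below is not Coherence's API; the class, its fields and the explicit flush() call are hypothetical stand-ins, with plain maps playing the roles of cache and database.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Generic illustration of read-through, write-through and write-behind; not Coherence's own API.
class DataSourceIntegrationSketch {
    final Map<String, String> database = new HashMap<>();            // stands in for the system of record
    final Map<String, String> cache = new HashMap<>();
    final Map<String, String> pendingWrites = new LinkedHashMap<>(); // write-behind queue
    int databaseWrites = 0;

    // Read-through: a cache miss is loaded from the database and kept in the cache.
    String get(String key) {
        return cache.computeIfAbsent(key, database::get);
    }

    // Write-through: the database is updated before the call returns.
    void putWriteThrough(String key, String value) {
        cache.put(key, value);
        database.put(key, value);
        databaseWrites++;
    }

    // Write-behind: only the cache is updated now; the database write is deferred.
    void putWriteBehind(String key, String value) {
        cache.put(key, value);
        pendingWrites.put(key, value); // a later update to the same key replaces the earlier one
    }

    // Flush deferred writes: one database write per distinct key.
    void flush() {
        for (Map.Entry<String, String> e : pendingWrites.entrySet()) {
            database.put(e.getKey(), e.getValue());
            databaseWrites++;
        }
        pendingWrites.clear();
    }

    public static void main(String[] args) {
        DataSourceIntegrationSketch s = new DataSourceIntegrationSketch();
        s.database.put("rate", "1.25");
        System.out.println("read-through: " + s.get("rate"));

        s.putWriteThrough("limit", "500");
        for (int i = 0; i < 10; i++) {
            s.putWriteBehind("position", "v" + i);
        }
        s.flush();
        System.out.println("database writes: " + s.databaseWrites);
        System.out.println("position in database: " + s.database.get("position"));
    }
}
```

The design point of write-behind in this sketch is that repeated updates to one key collapse into a single database write, at the price of the database lagging behind the cache until the flush.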
Coherence*Extend
WAN Topology
Oracle Coherence in Action
Example Use Cases
• Mainframe Cost Reduction – caching repeated queries
• Oracle Coherence with Compute Grid – intra-day risk calculation
• Oracle Coherence Cloud – message-based infrastructure replacement
• Eliminating SPoB – Trading Exchange redevelopment
Mainframe Cost Reduction
Taming the MIP Monster
• Retail banking IT provider
• Supports 400+ banks
• 4 key systems – repeated queries to mainframe
• 100,000 queries to mainframe each day
• Large recurring cost to the business
• Coherence deployed as distributed cache
• 100,000 queries → 1,600 queries
• Saving ~€1,000,000 in 1st year
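To put the reduction in proportion, here is the arithmetic as a small sketch. It assumes both figures are daily counts and that every query no longer sent to the mainframe is answered from the cache; the class and method names are made up for the example.

```java
// Arithmetic on the two query counts quoted on the slide.
class QueryReductionSketch {
    // Fraction of queries that no longer reach the mainframe.
    static double offloadRatio(long before, long after) {
        return (before - after) / (double) before;
    }

    public static void main(String[] args) {
        long before = 100_000; // mainframe queries per day without the cache
        long after = 1_600;    // mainframe queries per day with the cache
        System.out.printf("%.1f%% of queries no longer reach the mainframe%n",
                100 * offloadRatio(before, after));
    }
}
```

On these numbers roughly 98.4% of the original mainframe queries are avoided.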
Oracle Coherence with
Compute Grid
Compute Grid on Database
Traditional Compute Grid
• Emphasis on orchestrating tasks out to compute nodes in grid
• Data Set either loaded locally or pulled off of back end data source
• Applications Highly Customized for Grid Environment
[Diagram: Grid Applications, Grid Manager]
Great processing scalability with inevitable data bottlenecking
Orchestration can be point of bottleneck as well
Compute Grid on Data Grid
Traditional Compute Grid with Data Scale Out
High Performance Computing (HPC)
• Oracle Coherence Data Grid Overlay onto Compute Grid
• Compute Grid Scale Out with Data Fault Tolerance
• Massive Persistent Scale Out with Oracle RAC
[Diagram: Grid Applications, Grid Manager, Oracle Coherence, Oracle RAC]
Customer Story: Wachovia
Scenario
• Wachovia Investment Bank introducing “Service Oriented Infrastructure (SOI)”
• Requires absolute data availability for complex Grid Computations
Problem
• Existing Compute Grid infrastructure suffering from data latency and throughput problems
• Complex calculations so lengthy that the results are outdated
Solution
• Data Grid overlay on Compute Grid
• Enable risk calculations to fully utilize the grid hardware through real-time access to in-memory data as well as parallelization
• Reduced critical risk computation from 50 days to under 1 hour!
Over 300 CPUs in Production!
Oracle Coherence Cloud
The challenge:
Scale this...
• Domain: Retail Banking Infrastructure
• Over 500 Banks
• 100,000+ Teller Staff Desktop Applications
• 10,000+ Cash Machines (ATMs)
• 10,000,000’s of Internet Banking Transactions/day
• Current Infrastructure
• Java SE based (no J2EE – apart from Servlets)
• Oracle RAC (not an issue – scaling across a WAN)
• Messaging (serious challenges)
• Processing Business Tasks (challenges approaching)
• 30,000,000+ Business Tasks a day – minimum.
• must do 100,000,000 effortlessly per day before going live
The challenge continued:
Scale this...
• Execution of Business Tasks
• Account Balance, Credit/Debit, Funds Transfer, Statement Processing, Batch Processing, Payment Processing
• Tasks arrive from a variety of clients (thin, rich, cross-platform, mainframes...) – variety of languages
• Goal:
• Tasks are executed by the “cloud”
• Don’t want to build own “cloud” software
[Diagram: The Cloud]
• Their knowledge:
• Massive experience in scale-out. Could build it themselves, but budget (time/resources/money) will be saved by buying.
Architectural issue:
Performance Bottleneck Between Tiers
[Diagram: Application (Object, Java) ↔ Database (Relational, SQL)]
A HUGE performance bottleneck:
Volume / Complexity / Frequency of Data Access
(in some companies, this would be time to blame the DBA)
Constraints...
• No Single Points of Failure
• No Single Points of Bottleneck
• No Service Registries
• No Masters + Workers (already got one that is partitioned into over 200 separate clusters)
• No Manual Partitioning
• Keep everything in Memory
• Active + Active Sites across WAN
• Develop system on a notebook
• Scale to over 500 servers
• No reconfiguration outages
• No byte-code manipulation / proxies
• No Data or Task Loss during failure, server upgrade or scale out
• No Transactions (XA)
• Support multiple versions
• Predictable response times
• Predictable scale out costs
• Manage via JMX, from any point in the “Cloud”
• Pure Java Standard Edition
• Infrastructure adds a maximum of 3ms latency to tasks
• Integrate with existing applications (Java 1.4.2+)
Approach
• Business Tasks are regular Java objects (POJOs)
• Place Business Tasks into Coherence
• Coherence dynamically distributes Tasks across the Cluster
• Tasks are resilient in the Cluster
• May use “affinity” to ensure related Tasks are processed together
• Coherence triggers task processing
• Scaling out Coherence = Scaling out Task Processing
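The “affinity” idea can be sketched without any grid software. In the example below a task is a plain Java object, and the server is chosen from an affinity key (here an assumed accountId field) instead of the task's own id, so that related tasks land together. The Task class, its fields and the modulo routing are hypothetical and only illustrate the principle.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: routing plain Java task objects so that related tasks share a server.
class TaskAffinitySketch {
    // A business task as a plain Java object; accountId is the assumed affinity key.
    static final class Task {
        final String taskId;
        final String accountId;
        final String type;

        Task(String taskId, String accountId, String type) {
            this.taskId = taskId;
            this.accountId = accountId;
            this.type = type;
        }
    }

    // Choose a server from the affinity key rather than from the task's own id.
    static int serverFor(Task task, int servers) {
        return Math.floorMod(task.accountId.hashCode(), servers);
    }

    // Group task ids by the server that would process them.
    static Map<Integer, List<String>> distribute(List<Task> tasks, int servers) {
        Map<Integer, List<String>> byServer = new TreeMap<>();
        for (Task t : tasks) {
            byServer.computeIfAbsent(serverFor(t, servers), k -> new ArrayList<>()).add(t.taskId);
        }
        return byServer;
    }

    public static void main(String[] args) {
        List<Task> tasks = new ArrayList<>();
        tasks.add(new Task("t1", "ACC-1", "Debit"));
        tasks.add(new Task("t2", "ACC-2", "Account Balance"));
        tasks.add(new Task("t3", "ACC-1", "Credit"));
        System.out.println(distribute(tasks, 4));
    }
}
```

Whatever the number of servers, t1 and t3 always end up on the same one because they share an account.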
(c) Copyright 2008. Oracle Corporation
List of Performed Tests
• Scalability Test
• Guaranteed Delivery Test
• Failover Test
• Server Joining Test
• Unattended Long Term Test
Results
• While submitting Tasks (regular system load)
• Test 1: Scale from 1 server to over 400
• No reconfiguration
• Test 2: Randomly kill servers
• No reconfiguration
• Test 3: Kill 1, 2, 4, 8, 16, 32, 64, 128, 160 servers at once
• No data loss
• Possible 1,200,000,000 Tasks execution capacity per day
• Client may reduce current hardware costs by 75%
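The daily volumes in this case study are easier to compare as average per-second rates. The sketch below converts the three figures quoted in the deck (30,000,000 current, 100,000,000 go-live target, 1,200,000,000 tested capacity); it assumes load spread evenly over 24 hours, which real banking traffic will not be, so peak rates would be higher.

```java
// Converts the daily task volumes quoted in the deck into average per-second rates.
class ThroughputSketch {
    static final long SECONDS_PER_DAY = 24L * 60 * 60; // 86,400

    // Average tasks per second, rounded up so the rate is never understated.
    static long perSecond(long tasksPerDay) {
        return (tasksPerDay + SECONDS_PER_DAY - 1) / SECONDS_PER_DAY;
    }

    public static void main(String[] args) {
        long[] volumes = {30_000_000L, 100_000_000L, 1_200_000_000L};
        for (long v : volumes) {
            System.out.println(v + " tasks/day is about " + perSecond(v) + " tasks/second on average");
        }
    }
}
```

These work out to about 348, 1,158 and 13,889 tasks per second respectively, so the tested capacity is 12 times the go-live target.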
Eliminating Single Point of
Bottleneck
Trading Exchange
• Similar requirements and constraints
• Order processing (Foreign Exchange)
• 1,000’s per second (initial) per currency pair
• No manual partitioning
• No transactions
• 10ms max latency for full accept, validate, match, respond
• Achieved with Coherence using BMLs (< 3ms)
• 14 weeks development (start to go live)
Previous Approach (failed to meet SLAs)
Coherence-based Solution
Conclusion
Oracle Coherence…
• Is an in-memory object data grid, providing
• Scalability
• Availability
• Reliability
• Performance
• Supports many mission-critical apps, especially in Financial Services
• Integrates with and supports other technologies:
• Compute Grids
• Database Grids
• C++, .NET
• Is a key component of Oracle’s XTP platform
Q&A