Best Practices & How to Leverage Analytics Kishore Ramamurthy, BMC Software

Legal Notice
› The information contained in this presentation is the confidential
information of BMC, Inc. and is being provided to you with the
express understanding that without the prior written consent of BMC,
customers and partners may not discuss or otherwise disclose this
information to any third party or otherwise make use of this
information for any purpose other than for which BMC intended.
› All of the future product plans and releases described herein are at
the sole discretion of BMC and are subject to change and/or
cancellation, and in no way should these future product plans be
viewed as commitments on BMC’s part. In particular, screen design is
not finalized and may ultimately differ from the prototypes presented
here.
Core Building Blocks of Analytics
› Baselines
– More than a simple moving average
› Intelligent Events
– Leveraging short term trends AND baseline patterns
› Service Models
– Key for service context, but only a piece in the puzzle
› RCA
– Synergy of events + data + service model + configuration changes
› Data Visualization
– Enabling users to find correlations and patterns
Core Building Blocks of Analytics
[Architecture diagram. Labels: Analytics Engine (Root Cause, Service Model, Event Rule Engine, Traditional Events, Intelligent Events, Baselines); AlarmPoint Systems; BMC Remedy Problem, Incident & Change Mgmt; BMC Atrium Orchestrator; BMC Atrium CMDB; BMC BladeLogic Configuration Automation; Configuration Change; Events; End User Experience (Synthetic, Real); Performance Data, Agent/KM & Agentless (System, Application, DB, Network, Storage, MF); 3rd Party Monitoring/Events (HP, CA, IBM, VMware, MSFT, SAP, etc.); Business KPI]
Understanding and managing KPIs
› Key Performance Indicators (KPIs) are the metrics that customers identify
as indicating the overall health of the environment.
Example: the response time for a web transaction may be a KPI,
whereas the total number of bytes downloaded may not.
› Baselines are generated automatically for KPIs and used in RCA.
Understanding and using absolute threshold events
Absolute thresholds are the traditional approach to events.
To achieve event reduction, BMC ProactiveNet recommends using absolute
thresholds only in the following cases:
1. The metric is an availability-type metric, where the value of the metric
is either 0 or 100%. Example – a process or a device is either down or up.
2. The metric is business critical and it is essential to know when a
threshold is breached. Example – e-commerce search page response time
should be less than 5 sec.
Wherever applicable, BMC ProactiveNet recommends using “Intelligent” events
as a substitute for simple absolute threshold events.
Understanding and using Abnormality Events
› As data collection happens, BMC ProactiveNet learns the behavior of
metrics over a period of time and generates baselines. These baselines
are hourly, daily and weekly.
› BMC ProactiveNet will analyze new data against the established baselines
to determine if the metric is within normal range of operation or deviating
from the normal. If the behavior is outside the normal range, BMC
ProactiveNet generates “Abnormality” Events
› Abnormality events are informational events and are primarily used for
RCA purposes
› Out of the box (OOTB), no explicit thresholds need to be set
› OOTB, abnormalities are generated for KPI metrics
› It is not recommended to change the advanced settings unless there
is a specific need and the change has been reviewed with BMC support
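The learn-then-compare idea described above can be sketched roughly as follows. This is only an illustration: the hour-of-week slots, the 5th to 95th percentile band, the minimum-history rule and the function names are all assumptions, not BMC ProactiveNet's actual algorithm.

```python
from collections import defaultdict
from datetime import datetime
from statistics import quantiles


def build_weekly_baseline(samples, min_points=4):
    """Learn a (low, high) band per (weekday, hour) slot.

    samples: iterable of (datetime, value) pairs from past data collection.
    Assumption: the band is the 5th..95th percentile of the slot's history.
    """
    by_slot = defaultdict(list)
    for ts, value in samples:
        by_slot[(ts.weekday(), ts.hour)].append(value)

    baseline = {}
    for slot, values in by_slot.items():
        if len(values) >= min_points:  # skip slots without enough history
            # n=20 yields 19 cut points: first is the 5th percentile, last the 95th
            cuts = quantiles(values, n=20, method="inclusive")
            baseline[slot] = (cuts[0], cuts[-1])
    return baseline


def is_abnormal(baseline, ts: datetime, value: float) -> bool:
    """True if value falls outside the learned band for its slot.

    Slots with no learned band never report an abnormality.
    """
    band = baseline.get((ts.weekday(), ts.hour))
    if band is None:
        return False
    low, high = band
    return value < low or value > high
```

Because the band is learned per slot, a value that is normal on Monday at 10:00 can still be abnormal on Sunday at 03:00, which a single static threshold cannot express.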
Understanding and using Signature Events
› Signature events are based on the baselines and advanced settings, similar to
“Abnormality” events.
› Abnormality events are informational. If it is important to generate events
similar to those based on absolute thresholds, enable minor,
major or critical events based on deviations from baseline data.
› It is not recommended to change the advanced settings unless there
is a specific need and the change has been reviewed with BMC support.
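One way to picture severity based on deviation from the baseline is the sketch below. How the product actually measures deviation is not stated here, so the percentage-above-baseline measure, the cut-off values and the function name are hypothetical.

```python
def signature_severity(value, baseline_high,
                       minor_pct=5.0, major_pct=10.0, critical_pct=20.0):
    """Map how far a value exceeds the baseline high to a severity, or None.

    Assumption: deviation is the percentage by which value exceeds
    baseline_high; the three cut-offs are made-up defaults.
    """
    if baseline_high <= 0:
        raise ValueError("baseline_high must be positive")
    deviation_pct = (value - baseline_high) / baseline_high * 100.0
    if deviation_pct >= critical_pct:
        return "critical"
    if deviation_pct >= major_pct:
        return "major"
    if deviation_pct >= minor_pct:
        return "minor"
    return None  # within or close to the baseline: no signature event
```

Unlike an absolute threshold, the same cut-offs adapt automatically as the baseline for the metric moves.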
Predictive Events
› Abnormalities
– Work well as an early indicator without causing a fire drill for operators
› Predictive Alarms
– Important to avoid false alarms at all costs, or the benefit is lost
– Require a conservative algorithm
– Only possible with data analytics (leveraging both baseline and data
correlation as validation filters)
– Designed for short-term predictions
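A conservative short-term prediction might look like the sketch below. It is an assumption-laden illustration: it uses a least-squares straight-line trend and only the baseline as a validation filter (the data-correlation filter mentioned above is left out), and every name and parameter is hypothetical.

```python
def predict_breach(recent, threshold, horizon, baseline_high):
    """Return True only if a breach is predicted AND the baseline filter agrees.

    recent: equally spaced samples, oldest first.
    horizon: number of sample intervals to look ahead (keep it short).
    Assumptions: linear trend via least squares; the latest value must
    already be above baseline_high before any alarm is raised.
    """
    n = len(recent)
    if n < 3:
        return False  # too little data: stay silent rather than risk a false alarm

    # Least-squares slope of value against sample index 0..n-1.
    mean_x = (n - 1) / 2
    mean_y = sum(recent) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(recent))
    slope = sxy / sxx

    if slope <= 0:
        return False  # not trending upward
    if recent[-1] <= baseline_high:
        return False  # still within normal range: baseline filter rejects it

    projected = recent[-1] + slope * horizon
    return projected >= threshold
```

Every early exit returns False, which reflects the conservative bias: the function prefers a missed prediction over a false alarm.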
Understanding and using intelligent Events
› Intelligent events are based on absolute thresholds and baseline data.
Intelligent thresholds use both the absolute threshold and the baselines to
determine whether to generate an event.
› As a best practice, combine absolute thresholds with Auto baselines to
reduce events.
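The event-reduction effect of combining the two checks can be sketched as follows. The combination rule shown here (raise an event only when the value breaches the absolute threshold and is also above the baseline) is an assumption for illustration, as are the function and parameter names.

```python
def should_raise_event(value, absolute_threshold, baseline_high=None):
    """Decide whether a sample should generate an event.

    baseline_high=None models a plain absolute threshold.
    Assumption: with a baseline, both conditions must hold.
    """
    if value <= absolute_threshold:
        return False
    if baseline_high is None:
        return True  # plain absolute threshold: any breach is an event
    return value > baseline_high  # breach must also be outside normal behavior


# Example: CPU regularly peaks at 92% during a nightly batch (baseline high 95%).
# The plain threshold of 85% fires every night; the combined check stays quiet.
nightly_peak = 92
plain = should_raise_event(nightly_peak, 85)
combined = should_raise_event(nightly_peak, 85, baseline_high=95)
```

Here `plain` is True and `combined` is False, so a breach that is part of normal behavior no longer produces an event.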
Threshold Management and Event Reduction
Identify the Key Performance Indicators (KPIs) for the environment being monitored
Enable baseline generation for all the KPIs
Identify business critical metrics from the KPIs
Set and review absolute thresholds for KPIs periodically.
Enable use of “Auto Baselines” with absolute thresholds as required.
Review non-KPI metrics and convert them to KPIs as needed
Threshold Management and Event Reduction Contd…
Don’t enable any thresholds on non-KPI metrics
Review abnormalities for all KPIs. Don’t change the abnormality settings under
“Advanced” unless reviewed by BMC support
Convert the abnormalities to Signature thresholds based on the review of abnormalities
Don’t change the signature threshold settings under “Advanced” unless reviewed by BMC
support.
When using baselines, always use “Auto baselines” by default.
Root Cause Analysis
› Root Cause Analysis can be defined as the Correlation of Abnormalities to
a primary Alarm through “intelligent correlation”
› RCA = Intelligent Correlation + Events (Abnormalities, Intelligent Events,
External Events)
– Abnormalities = informational events generated using raw data & baseline (not
intended to be consumed by operators)
– Intelligent Events = actionable alerts
– External Events = events deterministically generated from 3rd party sources
(SNMP trap, Change events, etc)
– Intelligent Correlation = Knowledge Base + Service Models + Event Filter +
Data Correlation + Time Correlation
• Knowledge base = Global domain dependency rules
• Service Models = Relationship model for a service hierarchy
• Event Filter = Relationship model for event processing
• Data Correlation = Relationship factor between data objects
• Time Correlation = Time factor between related events
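Two of the ingredients above, time correlation and the service model, can be illustrated with the sketch below. It is deliberately partial: the knowledge base, event filter and data correlation are left out, and the `Event` type, the "impacts" graph layout, the 30-minute window and the ranking by time proximity are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Event:
    name: str
    component: str
    time: float  # seconds since some reference point


def impacts(service_model, source, target):
    """True if source can reach target by following 'impacts' edges.

    service_model: dict mapping a component to the components it impacts.
    """
    seen, stack = set(), [source]
    while stack:
        node = stack.pop()
        if node == target:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(service_model.get(node, ()))
    return False


def probable_causes(alarm, candidates, service_model, window=1800.0):
    """Rank candidate events that precede the alarm and impact its component."""
    related = [
        e for e in candidates
        if 0 <= alarm.time - e.time <= window                      # time correlation
        and impacts(service_model, e.component, alarm.component)   # service model
    ]
    # Assumed ranking: events closest in time to the alarm come first.
    return sorted(related, key=lambda e: alarm.time - e.time)
```

For example, with a model in which `db01` impacts `app01` and `app01` impacts `web-checkout`, an abnormality on `db01` shortly before a `web-checkout` alarm is kept as a candidate, while an event on an unrelated host is dropped.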
Root Cause Analysis
› Should leverage a broad scope of information
– Synergy of events + data + service model + configuration changes
› Performance RCA needs to do problem isolation with imprecise data
– There is never 100% accuracy in service models, events, monitoring,
configuration info, etc.
Best Practices to achieve effective RCA
Make sure data collection is happening end-to-end using the monitors
Make sure there are at least 6 data samples during a 30-minute interval
Create impact relationships using SIM
Create devices and assign them the correct device types
Make sure the monitors are associated with devices instead of the agents they are running on.
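The sampling guideline above (at least 6 data samples during a 30-minute interval) can be checked with a small helper such as the following sketch. The function name and the choice of a half-open window ending at `window_end` are assumptions.

```python
from datetime import datetime, timedelta


def has_enough_samples(timestamps, window_end: datetime,
                       min_samples=6, window=timedelta(minutes=30)):
    """True if at least min_samples timestamps fall in (window_end - window, window_end]."""
    window_start = window_end - window
    count = sum(1 for ts in timestamps if window_start < ts <= window_end)
    return count >= min_samples
```

A monitor polling every 5 minutes satisfies the check, while one polling every 10 minutes does not, so the polling interval is the first thing to review when the check fails.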
Best Practices to achieve effective RCA Contd…
Review events and build user-defined knowledge patterns based on previous RCA efforts.
Review KPIs and enable baseline generation on all KPI metrics
Review non-KPI metrics and convert them to KPIs as needed
Capture external events wherever possible if data collection is not available. Example –
SNMP traps
Use a top-down approach to RCA. In other words, do RCA for a business service instead of
doing RCA on a host failure.