Best Practices & How to Leverage Analytics
Kishore Ramamurthy, BMC Software
[email protected]

Legal Notice
› The information contained in this presentation is the confidential information of BMC, Inc. and is being provided to you with the express understanding that without the prior written consent of BMC, customers and partners may not discuss or otherwise disclose this information to any third party or otherwise make use of this information for any purpose other than for which BMC intended.
› All of the future product plans and releases described herein are at the sole discretion of BMC and are subject to change and/or cancellation, and in no way should these future product plans be viewed as commitments on BMC’s part. In particular, screen design is not finalized and may ultimately differ from the prototypes presented here.

Core Building Blocks of Analytics
› Baselines – More than a simple moving average
› Intelligent Events – Leveraging short-term trends AND baseline patterns
› Service Models – Key for service context, but only one piece of the puzzle
› RCA – Synergy of events + data + service model + configuration changes
› Data Visualization – Enabling users to find correlations and patterns

Core Building Blocks of Analytics
[Diagram; labels in extraction order: AlarmPoint Systems; Analytics Engine; Root Cause; BMC Remedy Problem, Incident & Change Mgmt; Service Model; Event Rule Engine; Traditional Events; Intelligent Events; BMC Atrium Orchestrator; Configuration Change; Baselines; BMC Atrium CMDB; Events; End User Experience (Synthetic, Real); BMC BladeLogic Configuration Automation; Performance Data; Agent/KM & Agentless (System, Application, DB, Network, Storage, MF); 3rd Party (HP, CA, IBM, VMware, MSFT, SAP, etc.) Monitoring/Events; Business KPI]

Understanding and managing KPIs
› Key Performance Indicators (KPIs) are the metrics that customers identify as important indicators of the overall health of the environment.
Example: the response time for a web transaction may be a KPI, whereas the total number of bytes downloaded may not.
› Baselines are generated automatically for KPIs and used in RCA.

Understanding and using absolute threshold events
Absolute thresholds are the traditional approach to events. To achieve event reduction, BMC ProactiveNet recommends using absolute thresholds only for the following requirements:
1. The metric is an availability-type metric whose value is either 0 or 100%. Example – a process or a device is either down or up.
2. The metric is business critical and it is essential to know when a threshold is breached. Example – e-commerce search page response time should be less than 5 seconds.
3. In all other cases, use “Intelligent” events as a substitute for simple absolute threshold events wherever applicable.

Understanding and using Abnormality Events
› As data is collected, BMC ProactiveNet learns the behavior of metrics over time and generates baselines. These baselines are hourly, daily, and weekly.
› BMC ProactiveNet analyzes new data against the established baselines to determine whether the metric is within its normal range of operation or deviating from it. If the behavior is outside the normal range, BMC ProactiveNet generates “Abnormality” events.
› Abnormality events are informational events and are primarily used for RCA purposes.
› No explicit thresholds need to be set (OOTB).
› OOTB, abnormalities are generated for KPI metrics.
› Changing the advanced settings is not recommended unless there is a specific need and the change has been reviewed with BMC support.

Understanding and using Signature Events
› Signature events are based on the baselines and advanced settings, similar to “Abnormality” events.
› Abnormality events are informational.
If it is important to generate events similar to those based on absolute thresholds, enable minor, major, or critical events based on deviations from baseline data.
› Changing the advanced settings is not recommended unless there is a specific need and the change has been reviewed with BMC support.

Predictive Events
› Abnormalities – Work well as an early indicator without causing a fire drill for operators
› Predictive Alarms
– Important to avoid false alarms at all costs, or the benefit is lost
– Requires a conservative algorithm
– Only possible with data analytics (leveraging both baseline and data correlation as validation filters)
– Designed for short-term predictions

Understanding and using Intelligent Events
› Intelligent events are based on absolute thresholds and baseline data. Intelligent thresholds use both the absolute threshold and the baselines to determine whether to generate events.
› As a best practice, combine absolute thresholds with Auto baselines to reduce events.

Threshold Management and Event Reduction
› Identify the Key Performance Indicators (KPIs) for the environment being monitored.
› Enable baseline generation for all the KPIs.
› Identify business critical metrics from the KPIs.
› Set absolute thresholds for KPIs and review them periodically.
› Enable use of “Auto Baselines” with absolute thresholds as required.
› Review non-KPIs and make them KPIs, if required.

Threshold Management and Event Reduction Contd…
› Don’t enable any thresholds on non-KPI metrics.
› Review abnormalities for all KPIs.
› Don’t attempt to change the abnormality settings under “Advanced”, unless reviewed by BMC support.
› Convert the abnormalities to Signature thresholds based on the review of abnormalities.
› Don’t change the signature threshold settings under “Advanced”, unless reviewed by BMC support.
› Always use “Auto baselines” by default when using baselines.
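
The interplay between baselines, abnormalities, and intelligent events described above can be illustrated with a minimal Python sketch. It is not BMC ProactiveNet's algorithm, which these slides do not describe. The percentile band, the function names `build_hourly_baseline` and `classify`, the "suppressed" label, and the fallback to the absolute threshold when no baseline exists are all assumptions made for illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import quantiles


def build_hourly_baseline(samples, low_pct=5, high_pct=95):
    """Return {hour_of_day: (low, high)} from (timestamp, value) samples.

    Assumption: the normal band is the 5th-95th percentile of the values
    previously seen in that hour of the day.
    """
    by_hour = defaultdict(list)
    for ts, value in samples:
        by_hour[ts.hour].append(value)
    baseline = {}
    for hour, values in by_hour.items():
        if len(values) < 2:
            continue  # too little history to form a band
        cuts = quantiles(values, n=100, method="inclusive")  # 99 cut points
        baseline[hour] = (cuts[low_pct - 1], cuts[high_pct - 1])
    return baseline


def classify(ts, value, baseline, absolute_threshold):
    """Classify one sample (hypothetical labels, not product event types)."""
    breaches_absolute = value > absolute_threshold
    band = baseline.get(ts.hour)
    if band is None:
        # Assumption: with no baseline yet, use the absolute threshold alone.
        return "intelligent_event" if breaches_absolute else "normal"
    low, high = band
    if breaches_absolute and value > high:
        return "intelligent_event"  # absolute threshold AND baseline breached
    if value < low or value > high:
        return "abnormality"        # informational only
    if breaches_absolute:
        return "suppressed"         # over the threshold, but normal for this hour
    return "normal"


# Ten days of history: response time is normally ~2 s at 09:00 and ~6 s at 14:00.
start = datetime(2024, 1, 1)
history = []
for day in range(10):
    history.append((start + timedelta(days=day, hours=9), 2.0 + 0.1 * day))
    history.append((start + timedelta(days=day, hours=14), 6.0 + 0.1 * day))
baseline = build_hourly_baseline(history)

print(classify(datetime(2024, 1, 11, 14), 6.5, baseline, absolute_threshold=5.0))
print(classify(datetime(2024, 1, 11, 14), 7.5, baseline, absolute_threshold=5.0))
```

In this sketch a 6.5 s reading at 14:00 exceeds the 5 s absolute threshold but falls inside the usual band for that hour, so no event is raised. That is the event-reduction effect of combining absolute thresholds with baselines.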
Root Cause Analysis
› Root Cause Analysis can be defined as the correlation of abnormalities to a primary alarm through “intelligent correlation”
› RCA = Intelligent Correlation + Events (Abnormalities, Intelligent Events, External Events)
– Abnormalities = informational events generated using raw data & baseline (not intended to be consumed by operators)
– Intelligent Events = actionable alerts
– External Events = events deterministically generated from 3rd party sources (SNMP traps, change events, etc.)
– Intelligent Correlation = Knowledge Base + Service Models + Event Filter + Data Correlation + Time Correlation
• Knowledge Base = Global domain dependency rules
• Service Models = Relationship model for a service hierarchy
• Event Filter = Relationship model for event processing
• Data Correlation = Relationship factor between data objects
• Time Correlation = Time factor between related events

Root Cause Analysis
› Should leverage a broad scope of information
– Synergy of events + data + service model + configuration changes
› Performance RCA needs to do problem isolation with imprecise data
– There is never 100% accuracy in service models, events, monitoring, configuration info, etc.

Best Practices to achieve effective RCA
› Make sure data collection is happening end-to-end using the monitors.
› Make sure there are at least 6 data samples during a 30-minute interval.
› Create impact relationships using SIM.
› Create devices and assign them the correct device types.
› Make sure the monitors are associated with devices instead of the agents they are running on.

Best Practices to achieve effective RCA Contd…
› Review events and build user-defined knowledge patterns based on previous RCA efforts.
› Review KPIs and enable baseline generation on all KPI metrics.
› Review non-KPI metrics and convert them to KPIs as needed.
› Capture external events wherever possible if data collection is not available. Example – SNMP traps.
› Use a top-down approach to RCA.
In other words, do RCA for a business service instead of doing RCA on a host failure.
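
To make the top-down idea concrete, here is a minimal Python sketch that starts from a primary alarm on a business service and ranks candidate events using only two of the ingredients listed earlier: the service model and time correlation. It is an illustration under stated assumptions, not the product's correlation engine. The `Event` class, the dictionary-based service model, the 30-minute look-back window, and the linear time score are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Event:
    device: str
    metric: str
    time: datetime


def dependencies(service_model, node):
    """Return every node reachable from `node` by following impact
    relationships downward (top-down walk of the service model)."""
    seen, stack = set(), [node]
    while stack:
        current = stack.pop()
        for child in service_model.get(current, ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def rank_probable_causes(primary, candidates, service_model,
                         window=timedelta(minutes=30)):
    """Keep candidates on devices the impacted service depends on that
    occurred within `window` before the primary alarm; rank the most
    recent first. The scoring rule is an assumption."""
    related = dependencies(service_model, primary.device)
    scored = []
    for event in candidates:
        lead = primary.time - event.time
        if event.device not in related:
            continue  # service-model filter: unrelated device
        if not timedelta(0) <= lead <= window:
            continue  # time filter: after the alarm or too old
        scored.append((1 - lead / window, event))  # timedelta / timedelta is a float
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored


# Hypothetical service model: business service -> components -> hosts.
service_model = {
    "web-shop": ["app-server", "db-server"],
    "app-server": ["host-a"],
    "db-server": ["host-b"],
}
t0 = datetime(2024, 1, 11, 14, 0)
primary = Event("web-shop", "response_time", t0)
candidates = [
    Event("host-a", "cpu", t0 - timedelta(minutes=20)),
    Event("host-b", "disk_io", t0 - timedelta(minutes=5)),
    Event("host-z", "cpu", t0 - timedelta(minutes=1)),   # not in the service model
    Event("host-a", "memory", t0 - timedelta(hours=2)),  # outside the window
]
for score, event in rank_probable_causes(primary, candidates, service_model):
    print(f"{score:.2f} {event.device} {event.metric}")
```

Starting from the business service rather than from an individual host lets the service model narrow the candidate list before any time-based ranking is applied.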