Visible Ops

“Infobox Book”
name	The Visible Ops Handbook: Implementing ITIL in 4 Practical and Auditable Steps
author	Kevin Behr, Gene Kim, George Spafford, Information Technology Process Institute
country	US
language	English
subject	ITIL
genre	Information Technology Management
publisher	Information Technology Process Institute
release_date	June 15, 2005 (Revised 1st Edition)
media_type	Print (paperback)
pages	112
isbn	978-0975568613

What is ITIL?

ITIL = Information Technology Infrastructure Library
A “drastically different approach to IT” (p79)
A “maturity path for IT that is not based on technology” (p79)
A “collection of best practices codified in seven books by the Office of Government Commerce in the U.K.” (p85)
A collection “without prioritization or any prescriptive structure” (p18)
Used by the Visible Ops authors as a framework “to normalize terminology” and to categorize traits shared across the high-performing organizations they studied (p18-20)

Introduction

(p10-24)
What is Visible Ops?

Highest ROI best practices divided into four prioritized and incremental Phases
All ideas are mapped to ITIL terminology
Intended to be an “on-ramp” to ITIL

Key premises of the Visible Ops rationale

80% of unplanned outages are due to ill-planned changes made by administrators (“operations staff”) or developers
80% of Mean Time To Repair (MTTR) is spent determining what changed
With the right processes in place, it is easier, better, and more predictable to rebuild infrastructure than to repair it
Concentrating staff time on pre-production efforts is more efficient and less expensive due to the high cost of repairing defects while in production
Without process controls, pieces of infrastructure often become like unique snowflakes or irreplaceable works of art … only understood by the “rocket scientist” creator whose time is tied to maintaining them (p41)
“You can not manage what you can not measure” (p59)

Phase One: Stabilize the Patient

(p25-40)
Goals

Identify most critical IT systems generating the most unplanned work
Stabilize infrastructure (prioritizing the most fragile components)
Create a “culture of causality” where all changes are viewed as key risks that need to be managed by facts rather than by beliefs
Reduce unplanned work to 25% or less (high performers achieve lower than 5%)
Maximize change success rates (high performers hit 98%)
Minimize Mean Time to Repair (MTTR)
Ensure security specialists become part of the decision process
Shift staff time from “perpetual firefighting to more proactive work that addresses the root causes of problems”
Minimize the IT failures that cause stress and damage IT’s reputation
Increase the overall level of confidence in IT
Collect data to affirm the new processes and foster an understanding that any previous perceptions of nimbleness and speed were not factoring in time spent troubleshooting and doing unplanned work

Recommended key steps (to be implemented on most fragile systems first)

Reduce or eliminate change privileges to fragile infrastructure
Why? Every time a change is made you risk breaking functionality
Create scheduled maintenance windows where all changes are made
Why? Scheduled changes are more visible, and are more likely to be planned and tested before going into production
Automate daily scans to detect and report changes
Why? To automatically verify and log that all scheduled changes were made … and that no other changes were made
Warning: Due to their collected data, the authors strongly recommend that even the most trusted administrators still work under automated detections
Disclosure: One of the authors is the CTO at Tripwire, Inc., the manufacturer of the software recommended for these automated scans.
When troubleshooting incidents, first analyze the recent changes (approved and detected) to isolate likely causes before recommending additional changes
Schedule a weekly Change Advisory Board (CAB) made up of representatives from operations, networking, security and the service desk
Why? To ensure key stakeholders collectively inform and influence change decisions
Create a Change Advisory Board – Emergency Committee (CAB/EC) who can assemble quickly to review emergency change requests
Why? “Emergency changes are the most critical to scrutinize”
Create a Change Request Tracking System to document and track requests for changes (RFCs) through authorization, verification, and implementation processes
Why? To facilitate the change approval process and to generate reports with metrics
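The daily scan and the review of approved versus detected changes can be sketched in a few lines. This is a minimal illustration only, not the commercial tooling the book recommends: it assumes changes are detected by comparing SHA-256 file hashes against a stored baseline, and that each approved change request names the file paths it is expected to touch. The function names (snapshot, detect_changes, reconcile) are hypothetical.

```python
import hashlib
from pathlib import Path


def snapshot(root: Path) -> dict[str, str]:
    """Map each file under root (by relative path) to the SHA-256 of its contents."""
    return {
        p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def detect_changes(baseline: dict[str, str], current: dict[str, str]) -> set[str]:
    """Return paths that were added, removed, or modified since the baseline."""
    paths = baseline.keys() | current.keys()
    return {p for p in paths if baseline.get(p) != current.get(p)}


def reconcile(detected: set[str], approved: set[str]) -> dict[str, set[str]]:
    """Compare detected changes with the paths named in approved change requests.

    Assumption: an approved request is represented only by the paths it
    should modify; a real tracking system would carry far more detail.
    """
    return {
        "verified": detected & approved,         # scheduled and observed
        "unauthorized": detected - approved,     # observed but never approved
        "not_implemented": approved - detected,  # approved but not observed
    }
```

A scheduled job could call snapshot on a configuration directory each day, pass yesterday's and today's results to detect_changes, and send the "unauthorized" set from reconcile to the Change Advisory Board. During an incident, the same report gives the list of recent changes to examine first.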

Phase Two: Catch & Release and Find Fragile Artifacts

(p41-46)
Goals

Prioritize IT’s most critical services
Identify critical pieces of production infrastructure (hardware and software)
Identify interdependencies between components of production infrastructure
Foster organizational learning
Identify the high-risk “fragile artifacts”

Recommended key steps

Create a prioritized service catalog that documents the most critical services
Create a Configuration Management Database (CMDB) that illustrates mappings between services and infrastructure, and shows the interdependencies between all configuration items (CI)
Freeze all related configurations for an agreed-upon change-free window
Why? To ensure an accurate inter-related configuration inventory (see below)
Inventory all equipment and software in the data center, recording the whos, whats, interdependencies and history for each item
Why? To facilitate faster problem management and to inform change decisions
Note: This inventory should be implemented by the most senior staff to ensure the most knowledgeable capturing of configuration details and histories
Identify the “fragile artifacts” that have the worst historical change success rates and/or the least technical mastery by the supporting technicians, and prioritize them by the criticality of the services they provide
Why? To create a prioritized list of servers to rebuild in Phase Three
To the extent possible, place fragile artifacts under a permanent configuration freeze until they can be replaced by complete rebuilds in Phase Three
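The service-to-infrastructure mapping and the fragile-artifact ranking described above can be sketched as a small in-memory model. Everything here is an assumption made for illustration: the ConfigItem fields, the 0.8 success-rate threshold, and the integer criticality scores are hypothetical, and "technical mastery" is not modeled at all.

```python
from dataclasses import dataclass, field


@dataclass
class ConfigItem:
    name: str
    depends_on: list[str] = field(default_factory=list)
    changes_attempted: int = 0
    changes_succeeded: int = 0

    @property
    def change_success_rate(self) -> float:
        # Assumption: an item with no change history counts as fully successful.
        if self.changes_attempted == 0:
            return 1.0
        return self.changes_succeeded / self.changes_attempted


def infrastructure_for(service: str, cmdb: dict[str, ConfigItem]) -> set[str]:
    """Return every item the service depends on, directly or transitively."""
    seen: set[str] = set()
    stack = list(cmdb[service].depends_on)
    while stack:
        name = stack.pop()
        if name not in seen:
            seen.add(name)
            stack.extend(cmdb[name].depends_on)
    return seen


def rank_fragile_artifacts(
    cmdb: dict[str, ConfigItem],
    service_criticality: dict[str, int],
    threshold: float = 0.8,
) -> list[str]:
    """List items below the success-rate threshold, most critical first."""
    # Each item inherits the highest criticality of any service relying on it.
    ci_criticality: dict[str, int] = {}
    for service, score in service_criticality.items():
        for ci in infrastructure_for(service, cmdb):
            ci_criticality[ci] = max(ci_criticality.get(ci, 0), score)
    fragile = [ci for ci in ci_criticality if cmdb[ci].change_success_rate < threshold]
    # Sort by criticality (descending), then success rate (ascending), then name.
    return sorted(
        fragile,
        key=lambda ci: (-ci_criticality[ci], cmdb[ci].change_success_rate, ci),
    )
```

The resulting list is one possible form of the prioritized rebuild list that feeds Phase Three. A real CMDB would also record owners and history for each item, as the inventory step calls for.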

Phase Three: Establish Repeatable Build Library

(p47-58)
Goals

Remove processes that encourage heroics by rewarding vigilant firefighters
Increase team-level technical mastery of production infrastructure
Shift senior staff from firefighting to fire prevention
Ensure that critical infrastructure can be easily rebuilt
Enable a new troubleshooting process with a short, predictable Mean Time To Repair (MTTR)
Ensure perfect configuration synchronization between pre-production and production servers
Ensure all configurations and build processes are completely documented

Recommended key steps (to be implemented on most fragile systems first)

Create and maintain a versioned, Definitive Software Library (DSL) for all acquired and custom developed software and patches
Note: additions must be approved by the Change Advisory Board (CAB)
Exception: at the time of initial creation, all currently used production software will be accepted into the DSL under a one year grace period
Create a team of release management engineers from your most senior operations staff. Only more junior staff will be on the production operations team.
Prevent developers and the release management engineers (previously the senior operations staff) from accessing production infrastructure
Reason 1: Policy encourages recommended changes to be error free with bullet-proof installation and back-out processes in place
Reason 2: Process verifies completeness and accuracy of documentation for installation and operations procedures
Release management engineers create automated, consolidated, integrated, patched, tested, security scanned, layer-able build packages which will then be provisioned onto production infrastructure by the more junior, production operations staff
Reason 1: Consolidates the number of unique configuration counts (and thus increases team mastery of those fewer configurations)
Reason 2: Ensures fully integrated quality assurance tests and security verifications
Updates and even non-emergency patches are then rolled into a new “golden build”, which is then applied to production hardware as a new build
Reason 1: Eliminates the risk of “patch and pray”
Reason 2: Otherwise, over time, break/fix cycles tend to encourage configuration variance between production and pre-production servers … and between similar servers that should be identical
Reason 3: Applying new builds allows for highly accurate predictions of downtime, reduces chances of human error, and is typically faster than applying numerous individual patches and updates
As a general rule, installing a build package is preceded by erasing the production hard drive (or partition) … the book calls this a “bare-metal build”
Why? This process ensures that production servers do not contain any hidden dependencies, and guarantees that the “golden builds” accurately reflect production systems, enabling perfect synchronization with pre-production servers
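Configuration variance between a server and its golden build can be checked mechanically. The sketch below assumes, purely for illustration, that a build is described by a manifest mapping package names to version strings. The book does not prescribe a manifest format, and the function names are hypothetical.

```python
from typing import Optional

# Each entry is (version expected by the golden build, version actually installed);
# None means the package is absent on that side.
Variance = dict[str, tuple[Optional[str], Optional[str]]]


def drift(golden: dict[str, str], installed: dict[str, str]) -> Variance:
    """Return every package whose installed version differs from the golden build."""
    names = golden.keys() | installed.keys()
    return {
        name: (golden.get(name), installed.get(name))
        for name in sorted(names)
        if golden.get(name) != installed.get(name)
    }


def matches_golden(golden: dict[str, str], installed: dict[str, str]) -> bool:
    """True when the server carries exactly the packages and versions of the build."""
    return not drift(golden, installed)


# Hypothetical manifests for one production server.
golden_build = {"openssl": "1.0.2", "httpd": "2.4.6"}
production = {"openssl": "1.0.1", "httpd": "2.4.6", "debug-tools": "0.3"}

for package, (expected, actual) in drift(golden_build, production).items():
    print(f"{package}: expected {expected}, found {actual}")
```

A server that reports any drift is a candidate for a fresh bare-metal build rather than an in-place patch, which is the approach this phase argues for.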

Phase Four: Enable Continuous Improvement

(p59-64)
Goals

Continuous increase in technical mastery of production infrastructure by reducing configuration variance
Continuous improvement of change success rates
Continuous increases in effective rate of change
Continuous monitoring to avoid slips in performance

Recommended key steps

Use recommended metrics to hone efforts from the first three Phases. A few selected examples:
Percent of systems that match known good builds (higher is better)
Time to provision known good builds (lower is better)
Percent of builds that have security sign off (higher is better)
Number of authorized changes per week (higher is better)
Change success rate (higher is better)
Strive to implement additional recommended improvement points. A few selected examples:
Segregate the development, test, and production systems to safeguard against any possible unintentional crossovers or hidden dependencies
Enforce a standard build across all similar devices
Define bullet-proof back out processes to recover from failed or unauthorized changes
Internalize the fundamental relationship between Mean Time to Repair (MTTR) and availability. By improving MTTR you also improve overall availability.
Track repeat offenders who circumvent change management policies.
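Two of the points above reduce to simple arithmetic. The sketch below computes a change success rate and shows why a lower MTTR raises availability. The availability formula used, MTBF / (MTBF + MTTR), is the standard reliability-engineering definition and is an assumption here: this summary does not quote a formula from the book, and Mean Time Between Failures (MTBF) is a term introduced for the illustration.

```python
def change_success_rate(successful_changes: int, total_changes: int) -> float:
    """Fraction of changes that succeeded (higher is better)."""
    if total_changes <= 0:
        raise ValueError("total_changes must be positive")
    return successful_changes / total_changes


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: uptime divided by uptime plus repair time."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("mtbf_hours must be positive and mttr_hours non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical figures: the failure frequency stays the same, only repair time changes.
slow_repair = availability(mtbf_hours=500, mttr_hours=4)
fast_repair = availability(mtbf_hours=500, mttr_hours=1)
print(f"MTTR 4h: {slow_repair:.4%}   MTTR 1h: {fast_repair:.4%}")
print(f"Change success rate: {change_success_rate(49, 50):.0%}")
```

With the failure rate held constant, cutting repair time is the only lever in the formula, which is why the rebuild-instead-of-repair approach of Phase Three shows up directly in availability figures.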

Selected thought-provoking quotes

(Numerous quotes are cited throughout the book)
“Controls don’t slow the business down. Like brakes on a car, controls actually allow you to go faster.” ~ Stephen Katz, former CISO of Citibank
“It is not the strongest of the species that survives, nor the most intelligent, but rather the one most responsive to change.” ~ Charles Darwin
“A vision without a task is but a dream. A task without a vision is but a drudgery; but, a vision and a task are the hope of the world.” ~ author unknown (found on wall of church outside Sussex England, circa 1700)
“If you can’t describe what you are doing as a process, you don’t know what you’re doing.” ~ W. Edwards Deming
