In a data center, 70% of outages result from human error, a costly weakness in an environment where failure is not an option. A well-designed facility operations and maintenance (O&M) program can minimize risk and reduce expenses. A successful O&M program rests on two pillars: instilling a “mission-critical mindset” and incorporating 12 core elements.
Step 1: Adopting a mission-critical mindset
The fundamental purpose of data center facilities is to provide uninterrupted power, cooling, network and space resources in the right amounts, at the right redundancy level, and at the right time to IT servers, storage and networking gear. Simply put: Failure is not an option.
Encouraging staff to internalize a “mission-critical mindset” is at the heart of an effective O&M program. That mindset is defined by these principles:
- Focusing on risk mitigation in all activities, work processes and procedures
- Acting with the confidence and patience that result from careful planning and preparation
- Taking an analytical, process-driven approach to risk avoidance and problem solving
- Understanding the functions of, and connections between, facility systems and components
- Committing to continuous learning and process improvement
Step 2: Incorporating the 12 core elements
Building upon the mission-critical mindset principles, facilities teams can implement and manage an effective program by factoring in the characteristics of these 12 core elements:
- 1. Environmental health and safety: safety plans/training; personal protective equipment; lockout/tagout procedures; hazard analysis, hazardous material handling, and hazard communications; and compliance with health and safety laws and regulations
- 2. Personnel management: expertise, training, role assignment and staffing level; and a mission-critical mentality
- 3. Emergency preparedness and response: emergency operating procedures for all high-risk failure scenarios; post-event analysis to reduce failures and improve future response
- 4. Maintenance management: asset, work order and spare parts management combined with a proactive preventative and predictive maintenance plan
- 5. Change management: operational procedure development and review, risk analysis and communication, structured work practices, and vendor/contractor supervision
- 6. Documentation management: accurate system of organizing critical documents for operational procedures, maintenance records and personnel
- 7. Training: personnel assessment and training to ensure capability in operational and maintenance tasks; includes certification and ongoing education
- 8. Infrastructure management: support for changing IT requirements through facility monitoring, capacity management, and IT/Facilities integration
- 9. Quality management: quality assurance to avoid errors through consistent process and procedures; quality control to detect human or system errors through checks, inspections and audits; continuous quality improvement to analyze errors and make modifications to prevent repetition
- 10. Energy management: use of data center infrastructure management (DCIM) software to improve energy efficiency, performance benchmarking, efficiency analysis and strategic energy sourcing
- 11. Financial management: close scrutiny of purchasing/delivery/invoicing to ensure critical maintenance and facility projects occur on schedule and avoid downtime
- 12. Performance monitoring and review: the collection and analysis of metrics and key performance indicators (KPIs) to ensure an O&M plan is productive, on target and aligned with business goals
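As an illustration of element 12, the kind of metrics a facilities team might collect can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not a prescribed method: the record types (`Incident`, `WorkOrder`), the cause labels and the three KPIs chosen here are hypothetical examples, and a real program would pull this data from its own work order and incident systems.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Incident:
    """Hypothetical incident record; the cause labels are illustrative."""
    cause: str               # e.g. "human_error", "equipment", "utility"
    downtime_minutes: float  # service interruption attributed to the incident


@dataclass
class WorkOrder:
    """Hypothetical preventive maintenance work order."""
    completed_on_time: bool


def human_error_share(incidents: List[Incident]) -> float:
    """Fraction of incidents whose recorded cause is human error."""
    if not incidents:
        return 0.0
    errors = sum(1 for i in incidents if i.cause == "human_error")
    return errors / len(incidents)


def on_time_maintenance_rate(orders: List[WorkOrder]) -> float:
    """Fraction of maintenance work orders completed on schedule."""
    if not orders:
        return 0.0
    return sum(1 for o in orders if o.completed_on_time) / len(orders)


def availability(incidents: List[Incident], period_minutes: float) -> float:
    """Fraction of the reporting period with no recorded downtime."""
    downtime = sum(i.downtime_minutes for i in incidents)
    return 1.0 - downtime / period_minutes
```

Tracked month over month, figures like these give the team a way to check whether training, change management and maintenance practices are moving in the intended direction.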
Mistakes in an operations program can undermine its effectiveness and lead to system interruptions, excess expense or staff injury. A maintenance program won’t be effective if it lacks metrics, for instance, and poor training or documentation will create gaps. Ineffective change management or a team that’s stuck in manual mode will hamper a program, and an overconfident team (“why change what’s been working?”) may resist new procedures. Finally, failing to test and evaluate, or failing to implement a quality system, will undermine a program’s success.
Human error and inattention can compromise the performance of any data center design, but creating and implementing an effective O&M program will reduce operating expenses while maintaining the desired high level of facility performance.