Please write a one-page summary of Chapter 2.
The Practice of System and Network Administration
Volume 1
Third Edition

Thomas A. Limoncelli
Christina J. Hogan
Strata R. Chalup

Boston • Columbus • Indianapolis • New York • San Francisco • Amsterdam • Cape Town • Dubai • London • Madrid • Milan • Munich • Paris • Montreal • Toronto • Delhi • Mexico City • São Paulo • Sydney • Hong Kong • Seoul • Singapore • Taipei • Tokyo

Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and the publisher was aware of a trademark claim, the designations have been printed with initial capital letters or in all capitals.

The authors and publisher have taken care in the preparation of this book, but make no expressed or implied warranty of any kind and assume no responsibility for errors or omissions. No liability is assumed for incidental or consequential damages in connection with or arising out of the use of the information or programs contained herein.

For information about buying this title in bulk quantities, or for special sales opportunities (which may include electronic versions; custom cover designs; and content particular to your business, training goals, marketing focus, or branding interests), please contact our corporate sales department at [email protected] or (800) 382-3419.

For government sales inquiries, please contact [email protected].

For questions about sales outside the United States, please contact [email protected].

Visit us on the Web: informit.com/aw

Library of Congress Catalog Number: 2016946362

Copyright © 2017 Thomas A. Limoncelli, Christina J. Lear née Hogan, Virtual.NET Inc., Lumeta Corporation

All rights reserved. Printed in the United States of America.
This publication is protected by copyright, and permission must be obtained from the publisher prior to any prohibited reproduction, storage in a retrieval system, or transmission in any form or by any means, electronic, mechanical, photocopying, recording, or likewise. For information regarding permissions, request forms and the appropriate contacts within the Pearson Education Global Rights & Permissions Department, please visit www.pearsoned.com/permissions/.

Page 4 excerpt: “Noël,” Season 2 Episode 10. The West Wing. Directed by Thomas Schlamme. Teleplay by Aaron Sorkin. Story by Peter Parnell. Scene performed by John Spencer and Bradley Whitford. Original broadcast December 20, 2000. Warner Brothers Burbank Studios, Burbank, CA. Aaron Sorkin, John Wells Production, Warner Brothers Television, NBC © 2000. Broadcast television.

Chapter 26 photos © 2017 Christina J. Lear née Hogan.

ISBN-13: 978-0-321-91916-8
ISBN-10: 0-321-91916-5

Text printed in the United States of America.
1 16

Contents at a Glance

Contents
Preface
Acknowledgments
About the Authors

Part I Game-Changing Strategies
  Chapter 1 Climbing Out of the Hole
  Chapter 2 The Small Batches Principle
  Chapter 3 Pets and Cattle
  Chapter 4 Infrastructure as Code

Part II Workstation Fleet Management
  Chapter 5 Workstation Architecture
  Chapter 6 Workstation Hardware Strategies
  Chapter 7 Workstation Software Life Cycle
  Chapter 8 OS Installation Strategies
  Chapter 9 Workstation Service Definition
  Chapter 10 Workstation Fleet Logistics
  Chapter 11 Workstation Standardization
  Chapter 12 Onboarding

Part III Servers
  Chapter 13 Server Hardware Strategies
  Chapter 14 Server Hardware Features
  Chapter 15 Server Hardware Specifications

Part IV Services
  Chapter 16 Service Requirements
  Chapter 17 Service Planning and Engineering
  Chapter 18 Service Resiliency and Performance Patterns
  Chapter 19 Service Launch: Fundamentals
  Chapter 20 Service Launch: DevOps
  Chapter 21 Service Conversions
  Chapter 22 Disaster Recovery and Data Integrity

Part V Infrastructure
  Chapter 23 Network Architecture
  Chapter 24 Network Operations
  Chapter 25 Datacenters Overview
  Chapter 26 Running a Datacenter

Part VI Helpdesks and Support
  Chapter 27 Customer Support
  Chapter 28 Handling an Incident Report
  Chapter 29 Debugging
  Chapter 30 Fixing Things Once
  Chapter 31 Documentation

Part VII Change Processes
  Chapter 32 Change Management
  Chapter 33 Server Upgrades
  Chapter 34 Maintenance Windows
  Chapter 35 Centralization Overview
  Chapter 36 Centralization Recommendations
  Chapter 37 Centralizing a Service

Part VIII Service Recommendations
  Chapter 38 Service Monitoring
  Chapter 39 Namespaces
  Chapter 40 Nameservices
  Chapter 41 Email Service
  Chapter 42 Print Service
  Chapter 43 Data Storage
  Chapter 44 Backup and Restore
  Chapter 45 Software Repositories
  Chapter 46 Web Services

Part IX Management Practices
  Chapter 47 Ethics
  Chapter 48 Organizational Structures
  Chapter 49 Perception and Visibility
  Chapter 50 Time Management
  Chapter 51 Communication and Negotiation
  Chapter 52 Being a Happy SA
  Chapter 53 Hiring System Administrators
  Chapter 54 Firing System Administrators

Part X Being More Awesome
  Chapter 55 Operational Excellence
  Chapter 56 Operational Assessments

Epilogue

Part XI Appendices
  Appendix A What to Do When . . .
  Appendix B The Many Roles of a System Administrator

Bibliography
Index

Contents

Preface
Acknowledgments
About the Authors

Part I Game-Changing Strategies

1 Climbing Out of the Hole
  1.1 Organizing WIP
    1.1.1 Ticket Systems
    1.1.2 Kanban
    1.1.3 Tickets and Kanban
  1.2 Eliminating Time Sinkholes
    1.2.1 OS Installation and Configuration
    1.2.2 Software Deployment
  1.3 DevOps
  1.4 DevOps Without Devs
  1.5 Bottlenecks
  1.6 Getting Started
  1.7 Summary
  Exercises

2 The Small Batches Principle
  2.1 The Carpenter Analogy
  2.2 Fixing Hell Month
  2.3 Improving Emergency Failovers
  2.4 Launching Early and Often
  2.5 Summary
  Exercises

3 Pets and Cattle
  3.1 The Pets and Cattle Analogy
  3.2 Scaling
  3.3 Desktops as Cattle
  3.4 Server Hardware as Cattle
  3.5 Pets Store State
  3.6 Isolating State
  3.7 Generic Processes
  3.8 Moving Variations to the End
  3.9 Automation
  3.10 Summary
  Exercises

4 Infrastructure as Code
  4.1 Programmable Infrastructure
  4.2 Tracking Changes
  4.3 Benefits of Infrastructure as Code
  4.4 Principles of Infrastructure as Code
  4.5 Configuration Management Tools
    4.5.1 Declarative Versus Imperative
    4.5.2 Idempotency
    4.5.3 Guards and Statements
  4.6 Example Infrastructure as Code Systems
    4.6.1 Configuring a DNS Client
    4.6.2 A Simple Web Server
    4.6.3 A Complex Web Application
  4.7 Bringing Infrastructure as Code to Your Organization
  4.8 Infrastructure as Code for Enhanced Collaboration
  4.9 Downsides to Infrastructure as Code
  4.10 Automation Myths
  4.11 Summary
  Exercises

Part II Workstation Fleet Management

5 Workstation Architecture
  5.1 Fungibility
  5.2 Hardware
  5.3 Operating System
  5.4 Network Configuration
    5.4.1 Dynamic Configuration
    5.4.2 Hardcoded Configuration
    5.4.3 Hybrid Configuration
    5.4.4 Applicability
  5.5 Accounts and Authorization
  5.6 Data Storage
  5.7 OS Updates
  5.8 Security
    5.8.1 Theft
    5.8.2 Malware
  5.9 Logging
  5.10 Summary
  Exercises

6 Workstation Hardware Strategies
  6.1 Physical Workstations
    6.1.1 Laptop Versus Desktop
    6.1.2 Vendor Selection
    6.1.3 Product Line Selection
  6.2 Virtual Desktop Infrastructure
    6.2.1 Reduced Costs
    6.2.2 Ease of Maintenance
    6.2.3 Persistent or Non-persistent?
  6.3 Bring Your Own Device
    6.3.1 Strategies
    6.3.2 Pros and Cons
    6.3.3 Security
    6.3.4 Additional Costs
    6.3.5 Usability
  6.4 Summary
  Exercises

7 Workstation Software Life Cycle
  7.1 Life of a Machine
  7.2 OS Installation
  7.3 OS Configuration
    7.3.1 Configuration Management Systems
    7.3.2 Microsoft Group Policy Objects
    7.3.3 DHCP Configuration
    7.3.4 Package Installation
  7.4 Updating the System Software and Applications
    7.4.1 Updates Versus Installations
    7.4.2 Update Methods
  7.5 Rolling Out Changes . . . Carefully
  7.6 Disposal
    7.6.1 Accounting
    7.6.2 Technical: Decommissioning
    7.6.3 Technical: Data Security
    7.6.4 Physical
  7.7 Summary
  Exercises

8 OS Installation Strategies
  8.1 Consistency Is More Important Than Perfection
  8.2 Installation Strategies
    8.2.1 Automation
    8.2.2 Cloning
    8.2.3 Manual
  8.3 Test-Driven Configuration Development
  8.4 Automating in Steps
  8.5 When Not to Automate
  8.6 Vendor Support of OS Installation
  8.7 Should You Trust the Vendor’s Installation?
  8.8 Summary
  Exercises

9 Workstation Service Definition
  9.1 Basic Service Definition
    9.1.1 Approaches to Platform Definition
    9.1.2 Application Selection
    9.1.3 Leveraging a CMDB
  9.2 Refresh Cycles
    9.2.1 Choosing an Approach
    9.2.2 Formalizing the Policy
    9.2.3 Aligning with Asset Depreciation
  9.3 Tiered Support Levels
  9.4 Workstations as a Managed Service
  9.5 Summary
  Exercises

10 Workstation Fleet Logistics
  10.1 What Employees See
  10.2 What Employees Don’t See
    10.2.1 Purchasing Team
    10.2.2 Prep Team
    10.2.3 Delivery Team
    10.2.4 Platform Team
    10.2.5 Network Team
    10.2.6 Tools Team
    10.2.7 Project Management
    10.2.8 Program Office
  10.3 Configuration Management Database
  10.4 Small-Scale Fleet Logistics
    10.4.1 Part-Time Fleet Management
    10.4.2 Full-Time Fleet Coordinators
  10.5 Summary
  Exercises

11 Workstation Standardization
  11.1 Involving Customers Early
  11.2 Releasing Early and Iterating
  11.3 Having a Transition Interval (Overlap)
  11.4 Ratcheting
  11.5 Setting a Cut-Off Date
  11.6 Adapting for Your Corporate Culture
  11.7 Leveraging the Path of Least Resistance
  11.8 Summary
  Exercises

12 Onboarding
  12.1 Making a Good First Impression
  12.2 IT Responsibilities
  12.3 Five Keys to Successful Onboarding
    12.3.1 Drive the Process with an Onboarding Timeline
    12.3.2 Determine Needs Ahead of Arrival
    12.3.3 Perform the Onboarding
    12.3.4 Communicate Across Teams
    12.3.5 Reflect On and Improve the Process
  12.4 Cadence Changes
  12.5 Case Studies
    12.5.1 Worst Onboarding Experience Ever
    12.5.2 Lumeta’s Onboarding Process
    12.5.3 Google’s Onboarding Process
  12.6 Summary
  Exercises

Part III Servers

13 Server Hardware Strategies
  13.1 All Eggs in One Basket
  13.2 Beautiful Snowflakes
    13.2.1 Asset Tracking
    13.2.2 Reducing Variations
    13.2.3 Global Optimization
  13.3 Buy in Bulk, Allocate Fractions
    13.3.1 VM Management
    13.3.2 Live Migration
    13.3.3 VM Packing
    13.3.4 Spare Capacity for Maintenance
    13.3.5 Unified VM/Non-VM Management
    13.3.6 Containers
  13.4 Grid Computing
  13.5 Blade Servers
  13.6 Cloud-Based Compute Services
    13.6.1 What Is the Cloud?
    13.6.2 Cloud Computing’s Cost Benefits
    13.6.3 Software as a Service
  13.7 Server Appliances
  13.8 Hybrid Strategies
  13.9 Summary
  Exercises

14 Server Hardware Features
  14.1 Workstations Versus Servers
    14.1.1 Server Hardware Design Differences
    14.1.2 Server OS and Management Differences
  14.2 Server Reliability
    14.2.1 Levels of Redundancy
    14.2.2 Data Integrity
    14.2.3 Hot-Swap Components
    14.2.4 Servers Should Be in Computer Rooms
  14.3 Remotely Managing Servers
    14.3.1 Integrated Out-of-Band Management
    14.3.2 Non-integrated Out-of-Band Management
  14.4 Separate Administrative Networks
  14.5 Maintenance Contracts and Spare Parts
    14.5.1 Vendor SLA
    14.5.2 Spare Parts
    14.5.3 Tracking Service Contracts
    14.5.4 Cross-Shipping
  14.6 Selecting Vendors with Server Experience
  14.7 Summary
  Exercises

15 Server Hardware Specifications
  15.1 Models and Product Lines
  15.2 Server Hardware Details
    15.2.1 CPUs
    15.2.2 Memory
    15.2.3 Network Interfaces
    15.2.4 Disks: Hardware Versus Software RAID
    15.2.5 Power Supplies
  15.3 Things to Leave Out
  15.4 Summary
  Exercises

Part IV Services

16 Service Requirements
  16.1 Services Make the Environment
  16.2 Starting with a Kick-Off Meeting
  16.3 Gathering Written Requirements
  16.4 Customer Requirements
    16.4.1 Describing Features
    16.4.2 Questions to Ask
    16.4.3 Service Level Agreements
    16.4.4 Handling Difficult Requests
  16.5 Scope, Schedule, and Resources
  16.6 Operational Requirements
    16.6.1 System Observability
    16.6.2 Remote and Central Management
    16.6.3 Scaling Up or Out
    16.6.4 Software Upgrades
    16.6.5 Environment Fit
    16.6.6 Support Model
    16.6.7 Service Requests
    16.6.8 Disaster Recovery
  16.7 Open Architecture
  16.8 Summary
  Exercises

17 Service Planning and Engineering
  17.1 General Engineering Basics
  17.2 Simplicity
  17.3 Vendor-Certified Designs
  17.4 Dependency Engineering
    17.4.1 Primary Dependencies
    17.4.2 External Dependencies
    17.4.3 Dependency Alignment
  17.5 Decoupling Hostname from Service Name
  17.6 Support
    17.6.1 Monitoring
    17.6.2 Support Model
    17.6.3 Service Request Model
    17.6.4 Documentation
  17.7 Summary
  Exercises

18 Service Resiliency and Performance Patterns
  18.1 Redundancy Design Patterns
    18.1.1 Masters and Slaves
    18.1.2 Load Balancers Plus Replicas
    18.1.3 Replicas and Shared State
    18.1.4 Performance or Resilience?
  18.2 Performance and Scaling
    18.2.1 Dataflow Analysis for Scaling
    18.2.2 Bandwidth Versus Latency
  18.3 Summary
  Exercises

19 Service Launch: Fundamentals
  19.1 Planning for Problems
  19.2 The Six-Step Launch Process
    19.2.1 Step 1: Define the Ready List
    19.2.2 Step 2: Work the List
    19.2.3 Step 3: Launch the Beta Service
    19.2.4 Step 4: Launch the Production Service
    19.2.5 Step 5: Capture the Lessons Learned
    19.2.6 Step 6: Repeat
  19.3 Launch Readiness Review
    19.3.1 Launch Readiness Criteria
    19.3.2 Sample Launch Criteria
    19.3.3 Organizational Learning
    19.3.4 LRC Maintenance
  19.4 Launch Calendar
  19.5 Common Launch Problems
    19.5.1 Processes Fail in Production
    19.5.2 Unexpected Access Methods
    19.5.3 Production Resources Unavailable
    19.5.4 New Technology Failures
    19.5.5 Lack of User Training
    19.5.6 No Backups
  19.6 Summary
  Exercises

20 Service Launch: DevOps
  20.1 Continuous Integration and Deployment
    20.1.1 Test Ordering
    20.1.2 Launch Categorizations
  20.2 Minimum Viable Product
  20.3 Rapid Release with Packaged Software
    20.3.1 Testing Before Deployment
    20.3.2 Time to Deployment Metrics
  20.4 Cloning the Production Environment
  20.5 Example: DNS/DHCP Infrastructure Software
    20.5.1 The Problem
    20.5.2 Desired End-State
    20.5.3 First Milestone
    20.5.4 Second Milestone
  20.6 Launch with Data Migration
  20.7 Controlling Self-Updating Software
  20.8 Summary
  Exercises

21 Service Conversions
  21.1 Minimizing Intrusiveness
  21.2 Layers Versus Pillars
  21.3 Vendor Support
  21.4 Communication
  21.5 Training
  21.6 Gradual Roll-Outs
  21.7 Flash-Cuts: Doing It All at Once
  21.8 Backout Plan
    21.8.1 Instant Roll-Back
    21.8.2 Decision Point
  21.9 Summary
  Exercises

22 Disaster Recovery and Data Integrity
  22.1 Risk Analysis
  22.2 Legal Obligations
  22.3 Damage Limitation
  22.4 Preparation
  22.5 Data Integrity
  22.6 Redundant Sites
  22.7 Security Disasters
  22.8 Media Relations
  22.9 Summary
  Exercises

Part V Infrastructure

23 Network Architecture
  23.1 Physical Versus Logical
  23.2 The OSI Model
  23.3 Wired Office Networks
    23.3.1 Physical Infrastructure
    23.3.2 Logical Design
    23.3.3 Network Access Control
    23.3.4 Location for Emergency Services
  23.4 Wireless Office Networks
    23.4.1 Physical Infrastructure
    23.4.2 Logical Design
  23.5 Datacenter Networks
    23.5.1 Physical Infrastructure
    23.5.2 Logical Design
  23.6 WAN Strategies
    23.6.1 Topology
    23.6.2 Technology
  23.7 Routing
    23.7.1 Static Routing
    23.7.2 Interior Routing Protocol
    23.7.3 Exterior Gateway Protocol
  23.8 Internet Access
    23.8.1 Outbound Connectivity
    23.8.2 Inbound Connectivity
  23.9 Corporate Standards
    23.9.1 Logical Design
    23.9.2 Physical Design
  23.10 Software-Defined Networks
  23.11 IPv6
    23.11.1 The Need for IPv6
    23.11.2 Deploying IPv6
  23.12 Summary
  Exercises

24 Network Operations
  24.1 Monitoring
  24.2 Management
    24.2.1 Access and Audit Trail
    24.2.2 Life Cycle
    24.2.3 Configuration Management
    24.2.4 Software Versions
    24.2.5 Deployment Process
  24.3 Documentation
    24.3.1 Network Design and Implementation
    24.3.2 DNS
    24.3.3 CMDB
    24.3.4 Labeling
  24.4 Support
    24.4.1 Tools
    24.4.2 Organizational Structure
    24.4.3 Network Services
  24.5 Summary
  Exercises

25 Datacenters Overview
  25.1 Build, Rent, or Outsource
    25.1.1 Building
    25.1.2 Renting
    25.1.3 Outsourcing
    25.1.4 No Datacenter
    25.1.5 Hybrid
  25.2 Requirements
    25.2.1 Business Requirements
    25.2.2 Technical Requirements
  25.3 Summary
  Exercises

26 Running a Datacenter
  26.1 Capacity Management
    26.1.1 Rack Space
    26.1.2 Power
    26.1.3 Wiring
    26.1.4 Network and Console
  26.2 Life-Cycle Management
    26.2.1 Installation
    26.2.2 Moves, Adds, and Changes
    26.2.3 Maintenance
    26.2.4 Decommission
  26.3 Patch Cables
  26.4 Labeling
    26.4.1 Labeling Rack Location
    26.4.2 Labeling Patch Cables
    26.4.3 Labeling Network Equipment
  26.5 Console Access
  26.6 Workbench
  26.7 Tools and Supplies
    26.7.1 Tools
    26.7.2 Spares and Supplies
    26.7.3 Parking Spaces
  26.8 Summary
  Exercises

Part VI Helpdesks and Support

27 Customer Support
  27.1 Having a Helpdesk
  27.2 Offering a Friendly Face
  27.3 Reflecting Corporate Culture
  27.4 Having Enough Staff
  27.5 Defining Scope of Support
  27.6 Specifying How to Get Help
  27.7 Defining Processes for Staff
  27.8 Establishing an Escalation Process
  27.9 Defining “Emergency” in Writing
  27.10 Supplying Request-Tracking Software
  27.11 Statistical Improvements
  27.12 After-Hours and 24/7 Coverage
  27.13 Better Advertising for the Helpdesk
  27.14 Different Helpdesks for Different Needs
  27.15 Summary
  Exercises

28 Handling an Incident Report
  28.1 Process Overview
  28.2 Phase A—Step 1: The Greeting
  28.3 Phase B: Problem Identification
    28.3.1 Step 2: Problem Classification
    28.3.2 Step 3: Problem Statement
    28.3.3 Step 4: Problem Verification
  28.4 Phase C: Planning and Execution
    28.4.1 Step 5: Solution Proposals
    28.4.2 Step 6: Solution Selection
    28.4.3 Step 7: Execution
  28.5 Phase D: Verification
    28.5.1 Step 8: Craft Verification
    28.5.2 Step 9: Customer Verification/Closing
  28.6 Perils of Skipping a Step
  28.7 Optimizing Customer Care
    28.7.1 Model-Based Training
    28.7.2 Holistic Improvement
    28.7.3 Increased Customer Familiarity
    28.7.4 Special Announcements for Major Outages
    28.7.5 Trend Analysis
    28.7.6 Customers Who Know the Process
    28.7.7 An Architecture That Reflects the Process
  28.8 Summary
  Exercises

29 Debugging
  29.1 Understanding the Customer’s Problem
  29.2 Fixing the Cause, Not the Symptom
  29.3 Being Systematic
  29.4 Having the Right Tools
    29.4.1 Training Is the Most Important Tool
    29.4.2 Understanding the Underlying Technology
    29.4.3 Choosing the Right Tools
    29.4.4 Evaluating Tools
  29.5 End-to-End Understanding of the System
  29.6 Summary
  Exercises

30 Fixing Things Once
  30.1 Story: The Misconfigured Servers
  30.2 Avoiding Temporary Fixes
  30.3 Learn from Carpenters
  30.4 Automation
  30.5 Summary
  Exercises

31 Documentation
  31.1 What to Document
  31.2 A Simple Template for Getting Started
  31.3 Easy Sources for Documentation
    31.3.1 Saving Screenshots
    31.3.2 Capturing the Command Line
    31.3.3 Leveraging Email
    31.3.4 Mining the Ticket System
  31.4 The Power of Checklists
  31.5 Wiki Systems
  31.6 Findability
  31.7 Roll-Out Issues
  31.8 A Content-Management System
  31.9 A Culture of Respect
  31.10 Taxonomy and Structure
  31.11 Additional Documentation Uses
  31.12 Off-Site Links
  31.13 Summary
  Exercises

Part VII Change Processes

32 Change Management
  32.1 Change Review Boards
  32.2 Process Overview
  32.3 Change Proposals
  32.4 Change Classifications
  32.5 Risk Discovery and Quantification
  32.6 Technical Planning
  32.7 Scheduling
  32.8 Communication
  32.9 Tiered Change Review Boards
  32.10 Change Freezes
  32.11 Team Change Management
    32.11.1 Changes Before Weekends
    32.11.2 Preventing Injured Toes
    32.11.3 Revision History
  32.12 Starting with Git
  32.13 Summary
  Exercises

33 Server Upgrades
  33.1 The Upgrade Process
  33.2 Step 1: Develop a Service Checklist
  33.3 Step 2: Verify Software Compatibility
    33.3.1 Upgrade the Software Before the OS
    33.3.2 Upgrade the Software After the OS
    33.3.3 Postpone the Upgrade or Change the Software
  33.4 Step 3: Develop Verification Tests
  33.5 Step 4: Choose an Upgrade Strategy
    33.5.1 Speed
    33.5.2 Risk
    33.5.3 End-User Disruption
    33.5.4 Effort
  33.6 Step 5: Write a Detailed Implementation Plan
    33.6.1 Adding Services During the Upgrade
    33.6.2 Removing Services During the Upgrade
    33.6.3 Old and New Versions on the Same Machine
    33.6.4 Performing a Dress Rehearsal
  33.7 Step 6: Write a Backout Plan
  33.8 Step 7: Select a Maintenance Window
  33.9 Step 8: Announce the Upgrade
  33.10 Step 9: Execute the Tests
  33.11 Step 10: Lock Out Customers
  33.12 Step 11: Do the Upgrade with Someone
  33.13 Step 12: Test Your Work
  33.14 Step 13: If All Else Fails, Back Out
  33.15 Step 14: Restore Access to Customers
  33.16 Step 15: Communicate Completion/Backout
  33.17 Summary
  Exercises

34 Maintenance Windows
  34.1 Process Overview
  34.2 Getting Management Buy-In
  34.3 Scheduling Maintenance Windows
  34.4 Planning Maintenance Tasks
  34.5 Selecting a Flight Director
  34.6 Managing Change Proposals
    34.6.1 Sample Change Proposal: SecurID Server Upgrade
    34.6.2 Sample Change Proposal: Storage Migration
  34.7 Developing the Master Plan
  34.8 Disabling Access
  34.9 Ensuring Mechanics and Coordination
    34.9.1 Shutdown/Boot Sequence
    34.9.2 KVM, Console Service, and LOM
    34.9.3 Communications
  34.10 Change Completion Deadlines
  34.11 Comprehensive System Testing
  34.12 Post-maintenance Communication
  34.13 Reenabling Remote Access
  34.14 Be Visible the Next Morning
  34.15 Postmortem
  34.16 Mentoring a New Flight Director
  34.17 Trending of Historical Data
  34.18 Providing Limited Availability
  34.19 High-Availability Sites
    34.19.1 The Similarities
    34.19.2 The Differences
  34.20 Summary
  Exercises

35 Centralization Overview
  35.1 Rationale for Reorganizing
    35.1.1 Rationale for Centralization
    35.1.2 Rationale for Decentralization
  35.2 Approaches and Hybrids
  35.3 Summary
  Exercises

36 Centralization Recommendations
  36.1 Architecture
  36.2 Security
    36.2.1 Authorization
    36.2.2 Extranet Connections
    36.2.3 Data Leakage Prevention
  36.3 Infrastructure
    36.3.1 Datacenters
    36.3.2 Networking
    36.3.3 IP Address Space Management
    36.3.4 Namespace Management
    36.3.5 Communications
    36.3.6 Data Management
    36.3.7 Monitoring
    36.3.8 Logging
  36.4 Support
    36.4.1 Helpdesk
    36.4.2 End-User Support
  36.5 Purchasing
  36.6 Lab Environments
  36.7 Summary
  Exercises

37 Centralizing a Service
  37.1 Understand the Current Solution
  37.2 Make a Detailed Plan
  37.3 Get Management Support
  37.4 Fix the Problems
  37.5 Provide an Excellent Service
  37.6 Start Slowly
  37.7 Look for Low-Hanging Fruit
  37.8 When to Decentralize
  37.9 Managing Decentralized Services
  37.10 Summary
  Exercises

Part VIII Service Recommendations

38 Service Monitoring
  38.1 Types of Monitoring
  38.2 Building a Monitoring System
  38.3 Historical Monitoring
    38.3.1 Gathering the Data
    38.3.2 Storing the Data
    38.3.3 Viewing the Data
  38.4 Real-Time Monitoring
    38.4.1 SNMP
    38.4.2 Log Processing
    38.4.3 Alerting Mechanism
    38.4.4 Escalation
    38.4.5 Active Monitoring Systems
  38.5 Scaling
    38.5.1 Prioritization
    38.5.2 Cascading Alerts
    38.5.3 Coordination
  38.6 Centralization and Accessibility
  38.7 Pervasive Monitoring
  38.8 End-to-End Tests
  38.9 Application Response Time Monitoring
  38.10 Compliance Monitoring
  38.11 Meta-monitoring
  38.12 Summary
  Exercises

39 Namespaces
  39.1 What Is a Namespace?
  39.2 Basic Rules of Namespaces
  39.3 Defining Names
  39.4 Merging Namespaces
  39.5 Life-Cycle Management
  39.6 Reuse
  39.7 Usage
    39.7.1 Scope
    39.7.2 Consistency
    39.7.3 Authority
  39.8 Federated Identity
  39.9 Summary
  Exercises

40 Nameservices
  40.1 Nameservice Data
    40.1.1 Data
    40.1.2 Consistency
    40.1.3 Authority
    40.1.4 Capacity and Scaling
  40.2 Reliability
    40.2.1 DNS
    40.2.2 DHCP
    40.2.3 LDAP
    40.2.4 Authentication
    40.2.5 Authentication, Authorization, and Accounting
    40.2.6 Databases
  40.3 Access Policy
  40.4 Change Policies
  40.5 Change Procedures
    40.5.1 Automation
    40.5.2 Self-Service Automation
  40.6 Centralized Management
  40.7 Summary
  Exercises

41 Email Service
  41.1 Privacy Policy
  41.2 Namespaces
  41.3 Reliability
  41.4 Simplicity
  41.5 Spam and Virus Blocking
  41.6 Generality
  41.7 Automation
  41.8 Monitoring
  41.9 Redundancy
  41.10 Scaling
  41.11 Security Issues
  41.12 Encryption
  41.13 Email Retention Policy
  41.14 Communication
  41.15 High-Volume List Processing
  41.16 Summary
  Exercises

42 Print Service
  42.1 Level of Centralization
  42.2 Print Architecture Policy
  42.3 Documentation
  42.4 Monitoring
  42.5 Environmental Issues
  42.6 Shredding
  42.7 Summary
  Exercises

43 Data Storage
  43.1 Terminology
    43.1.1 Key Individual Disk Components
    43.1.2 RAID
    43.1.3 Volumes and File Systems
    43.1.4 Directly Attached Storage
    43.1.5 Network-Attached Storage
    43.1.6 Storage-Area Networks
  43.2 Managing Storage
    43.2.1 Reframing Storage as a Community Resource
    43.2.2 Conducting a Storage-Needs Assessment
    43.2.3 Mapping Groups onto Storage Infrastructure
    43.2.4 Developing an Inventory and Spares Policy
    43.2.5 Planning for Future Storage
    43.2.6 Establishing Storage Standards
  43.3 Storage as a Service
    43.3.1 A Storage SLA
    43.3.2 Reliability
    43.3.3 Backups
    43.3.4 Monitoring
    43.3.5 SAN Caveats
  43.4 Performance
    43.4.1 RAID and Performance
    43.4.2 NAS and Performance
    43.4.3 SSDs and Performance
    43.4.4 SANs and Performance
    43.4.5 Pipeline Optimization
  43.5 Evaluating New Storage Solutions
    43.5.1 Drive Speed
    43.5.2 Fragmentation
    43.5.3 Storage Limits: Disk Access Density Gap
    43.5.4 Continuous Data Protection
  43.6 Common Data Storage Problems
    43.6.1 Large Physical Infrastructure
    43.6.2 Timeouts
    43.6.3 Saturation Behavior
  43.7 Summary
  Exercises

44 Backup and Restore
  44.1 Getting Started
  44.2 Reasons for Restores
    44.2.1 Accidental File Deletion
    44.2.2 Disk Failure
    44.2.3 Archival Purposes
    44.2.4 Perform Fire Drills
  44.3 Corporate Guidelines
  44.4 A Data-Recovery SLA and Policy
  44.5 The Backup Schedule
  44.6 Time and Capacity Planning
    44.6.1 Backup Speed
    44.6.2 Restore Speed
    44.6.3 High-Availability Databases
  44.7 Consumables Planning
    44.7.1 Tape Inventory
    44.7.2 Backup Media and Off-Site Storage
  44.8 Restore-Process Issues
  44.9 Backup Automation
  44.10 Centralization
  44.11 Technology Changes
  44.12 Summary
  Exercises

45 Software Repositories
  45.1 Types of Repositories
  45.2 Benefits of Repositories
  45.3 Package Management Systems
  45.4 Anatomy of a Package
    45.4.1 Metadata and Scripts
    45.4.2 Active Versus Dormant Installation
    45.4.3 Binary Packages
    45.4.4 Library Packages
    45.4.5 Super-Packages
    45.4.6 Source Packages
  45.5 Anatomy of a Repository
    45.5.1 Security
    45.5.2 Universal Access
    45.5.3 Release Process
    45.5.4 Multitiered Mirrors and Caches
  45.6 Managing a Repository
    45.6.1 Repackaging Public Packages
    45.6.2 Repackaging Third-Party Software
    45.6.3 Service and Support
    45.6.4 Repository as a Service
  45.7 Repository Client
    45.7.1 Version Management
    45.7.2 Tracking Conflicts
  45.8 Build Environment
    45.8.1 Continuous Integration
    45.8.2 Hermetic Build
  45.9 Repository Examples
    45.9.1 Staged Software Repository
    45.9.2 OS Mirror
    45.9.3 Controlled OS Mirror
  45.10 Summary
  Exercises

46 Web Services
  46.1 Simple Web Servers
  46.2 Multiple Web Servers on One Host
    46.2.1 Scalable Techniques
    46.2.2 HTTPS
  46.3 Service Level Agreements
  46.4 Monitoring
  46.5 Scaling for Web Services
    46.5.1 Horizontal Scaling
    46.5.2 Vertical Scaling
    46.5.3 Choosing a Scaling Method
  46.6 Web Service Security
    46.6.1 Secure Connections and Certificates
    46.6.2 Protecting the Web Server Application
    46.6.3 Protecting the Content
    46.6.4 Application Security
  46.7 Content Management
  46.8 Summary
  Exercises

Part IX Management Practices

47 Ethics
  47.1 Informed Consent
  47.2 Code of Ethics
  47.3 Customer Usage Guidelines
  47.4 Privileged-Access Code of Conduct
  47.5 Copyright Adherence
  47.6 Working with Law Enforcement
  47.7 Setting Expectations on Privacy and Monitoring
  47.8 Being Told to Do Something Illegal/Unethical
  47.9 Observing Illegal Activity
  47.10 Summary
  Exercises

48 Organizational Structures
  48.1 Sizing
  48.2 Funding Models
  48.3 Management Chain’s Influence
  48.4 Skill Selection
  48.5 Infrastructure Teams
  48.6 Customer Support
  48.7 Helpdesk
  48.8 Outsourcing
  48.9 Consultants and Contractors
  48.10 Sample Organizational Structures
    48.10.1 Small Company
    48.10.2 Medium-Size Company
    48.10.3 Large Company
    48.10.4 E-commerce Site
    48.10.5 Universities and Nonprofit Organizations
  48.11 Summary
  Exercises

49 Perception and Visibility
  49.1 Perception
    49.1.1 A Good First Impression
    49.1.2 Attitude, Perception, and Customers
    49.1.3 Aligning Priorities with Customer Expectations
    49.1.4 The System Advocate
  49.2 Visibility
    49.2.1 System Status Web Page
    49.2.2 Management Meetings
    49.2.3 Physical Visibility
    49.2.4 Town Hall Meetings
    49.2.5 Newsletters
    49.2.6 Mail to All Customers
    49.2.7 Lunch
  49.3 Summary
  Exercises

50 Time Management
  50.1 Interruptions
    50.1.1 Stay Focused
    50.1.2 Splitting Your Day
  50.2 Follow-Through
  50.3 Basic To-Do List Management
  50.4 Setting Goals
  50.5 Handling Email Once
  50.6 Precompiling Decisions
  50.7 Finding Free Time
  50.8 Dealing with Ineffective People
  50.9 Dealing with Slow Bureaucrats
  50.10 Summary
  Exercises

51 Communication and Negotiation
  51.1 Communication
  51.2 I Statements
  51.3 Active Listening
    51.3.1 Mirroring
    51.3.2 Summary Statements
    51.3.3 Reflection
  51.4 Negotiation
    51.4.1 Recognizing the Situation
    51.4.2 Format of a Negotiation Meeting
    51.4.3 Working Toward a Win-Win Outcome
    51.4.4 Planning Your Negotiations
  51.5 Additional Negotiation Tips
    51.5.1 Ask for What You Want
    51.5.2 Don’t Negotiate Against Yourself
    51.5.3 Don’t Reveal Your Strategy
    51.5.4 Refuse the First Offer
    51.5.5 Use Silence as a Negotiating Tool
  51.6 Further Reading
  51.7 Summary
  Exercises

52 Being a Happy SA
  52.1 Happiness
  52.2 Accepting Criticism
  52.3 Your Support Structure
  52.4 Balancing Work and Personal Life
  52.5 Professional Development
  52.6 Staying Technical
  52.7 Loving Your Job
  52.8 Motivation
  52.9 Managing Your Manager
  52.10 Self-Help Books
  52.11 Summary
  Exercises

53 Hiring System Administrators
  53.1 Job Description
  53.2 Skill Level
  53.3 Recruiting
  53.4 Timing
  53.5 Team Considerations
  53.6 The Interview Team
  53.7 Interview Process
  53.8 Technical Interviewing
  53.9 Nontechnical Interviewing
  53.10 Selling the Position
  53.11 Employee Retention
  53.12 Getting Noticed
  53.13 Summary
  Exercises

54 Firing System Administrators
  54.1 Cooperate with Corporate HR
  54.2 The Exit Checklist
  54.3 Removing Access
    54.3.1 Physical Access
    54.3.2 Remote Access
    54.3.3 Application Access
    54.3.4 Shared Passwords
    54.3.5 External Services
    54.3.6 Certificates and Other Secrets
  54.4 Logistics
  54.5 Examples
    54.5.1 Amicably Leaving a Company
    54.5.2 Firing the Boss
    54.5.3 Removal at an Academic Institution
  54.6 Supporting Infrastructure
  54.7 Summary
  Exercises

Part X Being More Awesome

55 Operational Excellence
  55.1 What Does Operational Excellence Look Like?
  55.2 How to Measure Greatness
  55.3 Assessment Methodology
    55.3.1 Operational Responsibilities
    55.3.2 Assessment Levels
    55.3.3 Assessment Questions and Look-For’s
  55.4 Service Assessments
    55.4.1 Identifying What to Assess
    55.4.2 Assessing Each Service
    55.4.3 Comparing Results Across Services
    55.4.4 Acting on the Results
    55.4.5 Assessment and Project Planning Frequencies
  55.5 Organizational Assessments
  55.6 Levels of Improvement
  55.7 Getting Started
  55.8 Summary
  Exercises

56 Operational Assessments
  56.1 Regular Tasks (RT)
  56.2 Emergency Response (ER)
  56.3 Monitoring and Metrics (MM)
  56.4 Capacity Planning (CP)
  56.5 Change Management (CM)
  56.6 New Product Introduction and Removal (NPI/NPR)
  56.7 Service Deployment and Decommissioning (SDD)
  56.8 Performance and Efficiency (PE)
  56.9 Service Delivery: The Build Phase
  56.10 Service Delivery: The Deployment Phase
  56.11 Toil Reduction
  56.12 Disaster Preparedness

Epilogue

Part XI Appendices

A What to Do When . . .
B The Many Roles of a System Administrator
B.1 Common Positive Roles
B.2 Negative Roles
B.3 Team Roles
B.4 Summary
Exercises
Bibliography
Index

Preface

This is an unusual book. This is not a technical book. It is a book of strategies and frameworks and anecdotes and tacit knowledge accumulated from decades of experience as system administrators.

Junior SAs focus on learning which commands to type and which buttons to click. As you get more advanced, you realize that the bigger challenge is understanding why we do these things and how to organize our work. That’s where strategy comes in.

This book gives you a framework—a way of thinking about system administration problems—rather than narrow how-to solutions to particular problems. Given a solid framework, you can solve problems every time they appear, regardless of the operating system (OS), brand of computer, or type of environment. This book is unique because it looks at system administration from this holistic point of view, whereas most other books for SAs focus on how to maintain one particular product. With experience, however, all SAs learn that the big-picture problems and solutions are largely independent of the platform. This book will change the way you approach your work as an SA.

This book is Volume 1 of a series. Volume 1 focuses on enterprise infrastructure, customer support, and management issues. Volume 2, The Practice of Cloud System Administration (ISBN: 9780321943187), focuses on web operations and distributed computing.

These books were born from our experiences as SAs in a variety of organizations. We have started new companies. We have helped sites to grow. We have worked at small start-ups and universities, where lack of funding was an issue. We have worked at midsize and large multinationals, where mergers and spinoffs gave rise to strange challenges.
We have worked at fast-paced companies that do business on the Internet and where high-availability, high-performance, and scaling issues were the norm. We have worked at slow-paced companies at which “high tech” meant cordless phones. On the surface, these are very different environments with diverse challenges; underneath, they have the same building blocks, and the same fundamental principles apply.

Who Should Read This Book

This book is written for system administrators at all levels who seek a deeper insight into the best practices and strategies available today. It is also useful for managers of system administrators who are trying to understand IT and operations.

Junior SAs will gain insight into the bigger picture of how sites work, what their roles are in the organizations, and how their careers can progress. Intermediate-level SAs will learn how to approach more complex problems, how to improve their sites, and how to make their jobs easier and their customers happier. Whatever level you are at, this book will help you understand what is behind your day-to-day work, learn the things that you can do now to save time in the future, decide policy, be architects and designers, plan far into the future, negotiate with vendors, and interface with management. These are the things that senior SAs know and your OS’s manual leaves out.

Basic Principles

In this book you will see a number of principles repeated throughout:

• Automation: Using software to replace human effort. Automation is critical. We should not be doing tasks; we should be maintaining the system that does tasks for us. Automation improves repeatability and scalability, is key to easing the system administration burden, and eliminates tedious repetitive tasks, giving SAs more time to improve services. Automation starts with getting the process well defined and repeatable, which means documenting it. Then it can be optimized by turning it into code.
• Small batches: Doing work in small increments rather than large hunks. Small batches permit us to deliver results faster, with higher quality, and with less stress.
• End-to-end integration: Working across teams to achieve the best total result rather than performing local optimizations that may not benefit the greater good. The opposite is to work within your own silo of control, ignoring the larger organization.
• Self-service systems: Tools that empower others to work independently, rather than centralizing control to yourself. Shared services should be an enablement platform, not a control structure.
• Communication: The right people can solve more problems than hardware or software can. You need to communicate well with other SAs and with your customers. It is your responsibility to initiate communication. Communication ensures that everyone is working toward the same goals. Lack of communication leaves people concerned and annoyed. Communication also includes documentation. Documentation makes systems easier to support, maintain, and upgrade. Good communication and proper documentation also make it easier to hand off projects and maintenance when you leave or take on a new role.

These principles are universal. They apply at all levels of the system. They apply to physical networks and to computer hardware. They apply to all operating systems running at a site, all protocols used, all software, and all services provided. They apply at universities, nonprofit institutions, government sites, businesses, and Internet service sites.

What Is an SA?

If you asked six system administrators to define their jobs, you would get seven different answers. The job is difficult to define because system administrators do so many things. An SA looks after computers, networks, and the people who use them. An SA may look after hardware, operating systems, software, configurations, applications, or security.
An SA influences how effectively other people can or do use their computers and networks.

A system administrator sometimes needs to be a business-process consultant, corporate visionary, janitor, software engineer, electrical engineer, economist, psychiatrist, mindreader, and, occasionally, bartender.

As a result, companies give SAs different titles. Sometimes, they are called network administrators, system architects, system engineers, system programmers, operators, and so on. This book is for “all of the above.”

We have a very general definition of system administrator: one who manages computer and network systems on behalf of another, such as an employer or a client. SAs are the people who make things work and keep it all running.

System Administration Matters

System administration matters because computers and networks matter. Computers are a lot more important than they were years ago. Software is eating the world. Industry after industry is being taken over by software. Our ability to make, transport, and sell real goods is more dependent on software than on any other single element. Companies that are good at software are beating competitors that aren’t. All this software requires operational expertise to deploy and keep it running. In turn, this expertise is what makes SAs special.

For example, not long ago, manual processes were batch oriented. Expense reports on paper forms were processed once a week. If the clerk who processed them was out for a day, nobody noticed. This arrangement has since been replaced by a computerized system, and employees file their expense reports online, 24/7.

Management now has a more realistic view of computers. Before they had PCs on their desktops, most people’s impressions of computers were based on how they were portrayed in films: big, all-knowing, self-sufficient, miracle machines. The more people had direct contact with computers, the more realistic people’s expectations became.
Now even system administration itself is portrayed in films. The 1993 classic Jurassic Park was the first mainstream movie to portray the key role that system administrators play in large systems. The movie also showed how depending on one person is a disaster waiting to happen. IT is a team sport. If only Dennis Nedry had read this book. In business, nothing is important unless the CEO feels that it is important. The CEO controls funding and sets priorities. CEOs now consider IT to be important. Email was previously for nerds; now CEOs depend on email and notice even brief outages. The massive preparations for Y2K also brought home to CEOs how dependent their organizations have become on computers, how expensive it can be to maintain them, and how quickly a purely technical issue can become a serious threat. Most people do not think that they simply “missed the bullet” during the Y2K change, but rather recognize that problems were avoided thanks to tireless efforts by many people. A CBS Poll shows 63 percent of Americans believe that the time and effort spent fixing potential problems was worth it. A look at the news lineups of all three major network news broadcasts from Monday, January 3, 2000, reflects the same feeling. Previously, people did not grow up with computers and had to cautiously learn about them and their uses. Now people grow up using computers. They consume social media from their phones (constantly). As a result they have higher expectations of computers when they reach positions of power. The CEOs who were impressed by automatic payroll processing are being replaced by people who grew up sending instant messages all day long. This new wave of management expects to do all business from their phones. Computers matter more than ever. If computers are to work, and work well, system administration matters. We matter. 
Organization of This Book

This book is divided into the following parts:

• Part I, “Game-Changing Strategies.” This part describes how to make the next big step, for both those who are struggling to keep up with a deluge of work, and those who have everything running smoothly.
• Part II, “Workstation Fleet Management.” This part covers all aspects of laptops and desktops. It focuses on how to optimize workstation support by treating these machines as mass-produced commodity items.
• Part III, “Servers.” This part covers server hardware management—from the server strategies you can choose, to what makes a machine a server and what to consider when selecting server hardware.
• Part IV, “Services.” This part covers designing, building, and launching services, converting users from one service to another, building resilient services, and planning for disaster recovery.
• Part V, “Infrastructure.” This part focuses on the underlying infrastructure. It covers network architectures and operations, an overview of datacenter strategies, and datacenter operations.
• Part VI, “Helpdesks and Support.” This part covers everything related to providing excellent customer service, including documentation, how to handle an incident report, and how to approach debugging.
• Part VII, “Change Processes.” This part covers change management processes and describes how best to manage big and small changes. It also covers optimizing support by centralizing services.
• Part VIII, “Service Recommendations.” This part takes an in-depth look at what you should consider when setting up some common services. It covers monitoring, nameservices, email, web, printing, storage, backups, and software repositories.
• Part IX, “Management Practices.” This part is for managers and nonmanagers. It includes such topics as ethics, organizational structures, perception, visibility, time management, communication, happiness, and hiring and firing SAs.
• Part X, “Being More Awesome.” This part is essential reading for all managers. It covers how to assess an SA team’s performance in a constructive manner, using the Capability Maturity Model to chart the way forward.
• Part XI, “Appendices.” This part contains two appendices. The first is a checklist of solutions to common situations, and the second is an overview of the positive and negative team roles.

What’s New in the Third Edition

The first two editions garnered a lot of positive reviews and buzz. We were honored by the response. However, the passing of time made certain chapters look passé. Most of our bold new ideas are now considered common-sense practices in the industry.

The first edition, which reached bookstores in August 2001, was written mostly in 2000 before Google was a household name and modern computing meant a big Sun multiuser system. Many people did not have Internet access, and the cloud was only in the sky. The second edition was released in July 2007. It smoothed the rough edges and filled some of the major holes, but it was written when DevOps was still in its embryonic form.

The third edition introduces two dozen entirely new chapters and many highly revised chapters; the rest of the chapters were cleaned up and modernized. Longer chapters were split into smaller chapters. All new material has been rewritten to be organized around choosing strategies, and DevOps and SRE practices were introduced where they seem to be the most useful.
If you’ve read the previous editions and want to focus on what is new or updated, here’s where you should look:

• Part I, “Game-Changing Strategies” (Chapters 1–4)
• Part II, “Workstation Fleet Management” (Chapters 5–12)
• Part III, “Servers” (Chapters 13–15)
• Part IV, “Services” (Chapters 16–20 and 22)
• Chapter 23, “Network Architecture,” and Chapter 24, “Network Operations”
• Chapter 32, “Change Management”
• Chapter 35, “Centralization Overview,” Chapter 36, “Centralization Recommendations,” and Chapter 37, “Centralizing a Service”
• Chapter 43, “Data Storage”
• Chapter 45, “Software Repositories,” and Chapter 46, “Web Services”
• Chapter 55, “Operational Excellence,” and Chapter 56, “Operational Assessments”

Books, like software, always have bugs. For a list of updates, along with news and notes, and even a mailing list you can join, visit our web site: www.EverythingSysAdmin.com

What’s Next

Each chapter is self-contained. Feel free to jump around. However, we have carefully ordered the chapters so that they make the most sense if you read the book from start to finish. Either way, we hope that you enjoy the book. We have learned a lot and had a lot of fun writing it. Let’s begin.

Thomas A. Limoncelli
Stack Overflow, Inc.
[email protected]

Christina J. Hogan
[email protected]

Strata R. Chalup
Virtual.Net, Inc.
[email protected]

Register your copy of The Practice of System and Network Administration, Volume 1, Third Edition, at informit.com for convenient access to downloads, updates, and corrections as they become available. To start the registration process, go to informit.com/register and log in or create an account. Enter the product ISBN (9780321919168) and click Submit. Once the process is complete, you will find any available bonus content under “Registered Products.”

Acknowledgments

For the Third Edition

Everyone was so generous with their help and support. We have so many people to thank!
Thanks to the people who were extremely generous with their time and gave us extensive feedback and suggestions: Derek J. Balling, Stacey Frye, Peter Grace, John Pellman, Iustin Pop, and John Willis.

Thanks to our friends, co-workers, and industry experts who gave us support, inspiration, and cool stories to use: George Beech, Steve Blair, Kyle Brandt, Greg Bray, Nick Craver, Geoff Dalgas, Michelle Fredette, David Fullerton, Dan Gilmartin, Trey Harris, Jason Harvey, Mark Henderson, Bryan Jen, Gene Kim, Thomas Linkin, Shane Madden, Jim Maurer, Kevin Montrose, Steve Murawski, Xavier Nicollet, Dan O’Boyle, Craig Peterson, Jason Punyon, Mike Rembetsy, Neil Ruston, Jason Shantz, Dagobert Soergel, Kara Sowles, Mike Stoppay, and Joe Youn.

Thanks to our team at Addison-Wesley: Debra Williams Cauley, for her guidance; Michael Thurston, our developmental editor who took this sow’s ear and made it into a silk purse; Kim Boedigheimer, who coordinated and kept us on schedule; Lori Hughes, our LaTeX wizard; Julie Nahil, our production editor; Jill Hobbs, our copy editor; and Ted Laux for making our beautiful index!

Last, but not least, thanks and love to our families who suffered for years as we ignored other responsibilities to work on this book. Thank you for understanding! We promise this is our last book. Really!

For the Second Edition

In addition to everyone who helped us with the first edition, the second edition could not have happened without the help and support of Lee Damon, Nathan Dietsch, Benjamin Feen, Stephen Harris, Christine E. Polk, Glenn E. Sieb, Juhani Tali, and many people at the League of Professional System Administrators (LOPSA). Special 73s and 88s to Mike Chalup for love, loyalty, and support, and especially for the mountains of laundry done and oceans of dishes washed so Strata could write. And many cuddles and kisses for baby Joanna Lear for her patience.
Thanks to Lumeta Corporation for giving us permission to publish a second edition. Thanks to Wingfoot for letting us use its server for our bug-tracking database. Thanks to Anne Marie Quint for data entry, copyediting, and a lot of great suggestions. And last, but not least, a big heaping bowl of “couldn’t have done it without you” to Mark Taub, Catherine Nolan, Raina Chrobak, and Lara Wysong at Addison-Wesley.

For the First Edition

We can’t possibly thank everyone who helped us in some way or another, but that isn’t going to stop us from trying. Much of this book was inspired by Kernighan and Pike’s The Practice of Programming and Jon Bentley’s second edition of Programming Pearls.

We are grateful to Global Networking and Computing (GNAC), Synopsys, and Eircom for permitting us to use photographs of their datacenter facilities to illustrate real-life examples of the good practices that we talk about.

We are indebted to the following people for their helpful editing: Valerie Natale, Anne Marie Quint, Josh Simon, and Amara Willey.

The people we have met through USENIX and SAGE and the LISA conferences have been major influences in our lives and careers. We would not be qualified to write this book if we hadn’t met the people we did and learned so much from them.

Dozens of people helped us as we wrote this book—some by supplying anecdotes, some by reviewing parts of or the entire book, others by mentoring us during our careers. The only fair way to thank them all is alphabetically and to apologize in advance to anyone whom we left out: Rajeev Agrawala, Al Aho, Jeff Allen, Eric Anderson, Ann Benninger, Eric Berglund, Melissa Binde, Steven Branigan, Sheila Brown-Klinger, Brent Chapman, Bill Cheswick, Lee Damon, Tina Darmohray, Bach Thuoc (Daisy) Davis, R.
Drew Davis, Ingo Dean, Arnold de Leon, Jim Dennis, Barbara Dijker, Viktor Dukhovni, Chelle-Marie Ehlers, Michael Erlinger, Paul Evans, Rémy Evard, Lookman Fazal, Robert Fulmer, Carson Gaspar, Paul Glick, David “Zonker” Harris, Katherine “Cappy” Harrison, Jim Hickstein, Sandra Henry-Stocker, Mark Horton, Bill “Whump” Humphries, Tim Hunter, Jeff Jensen, Jennifer Joy, Alan Judge, Christophe Kalt, Scott C. Kennedy, Brian Kernighan, Jim Lambert, Eliot Lear, Steven Levine, Les Lloyd, Ralph Loura, Bryan MacDonald, Sherry McBride, Mark Mellis, Cliff Miller, Hal Miller, Ruth Milner, D. Toby Morrill, Joe Morris, Timothy Murphy, Ravi Narayan, Nils-Peter Nelson, Evi Nemeth, William Ninke, Cat Okita, Jim Paradis, Pat Parseghian, David Parter, Rob Pike, Hal Pomeranz, David Presotto, Doug Reimer, Tommy Reingold, Mike Richichi, Matthew F. Ringel, Dennis Ritchie, Paul D. Rohrigstamper, Ben Rosengart, David Ross, Peter Salus, Scott Schultz, Darren Shaw, Glenn Sieb, Karl Siil, Cicely Smith, Bryan Stansell, Hal Stern, Jay Stiles, Kim Supsinkas, Ken Thompson, Greg Tusar, Kim Wallace, The Rabbit Warren, Dr. Geri Weitzman, Glen Wiley, Pat Wilson, Jim Witthoff, Frank Wojcik, Jay Yu, and Elizabeth Zwicky.

Thanks also to Lumeta Corporation and Lucent Technologies/Bell Labs for their support in writing this book.

Last, but not least, the people at Addison-Wesley made this a particularly great experience for us. In particular, our gratitude extends to Karen Gettman, Mary Hart, and Emily Frey.

About the Authors

Thomas A. Limoncelli is an internationally recognized author, speaker, and system administrator. During his seven years at Google NYC, he was an SRE for projects such as Blog Search, Ganeti, and internal enterprise IT services. He now works as an SRE at Stack Overflow.
His first paid system administration job was as a student at Drew University in 1987, and he has since worked at small and large companies, including AT&T/Lucent Bell Labs and Lumeta. In addition to this book series, he is known for his book Time Management for System Administrators (O’Reilly, 2005). His hobbies include grassroots activism, for which his work has been recognized at state and national levels. He lives in New Jersey. Christina J. Hogan has 20 years of experience in system administration and network engineering, from Silicon Valley to Italy and Switzerland. She has gained experience in small start-ups, midsize tech companies, and large global corporations. She worked as a security consultant for many years; in that role, her customers included eBay, Silicon Graphics, and SystemExperts. In 2005, she and Tom shared the USENIX LISA Outstanding Achievement Award for the first edition of this book. Christina has a bachelor’s degree in mathematics, a master’s degree in computer science, a doctorate in aeronautical engineering, and a diploma in law. She also worked for six years as an aerodynamicist in a Formula 1 racing team and represented Ireland in the 1988 Chess Olympiad. She lives in Switzerland. Strata R. Chalup has been leading and managing complex IT projects for many years, serving in roles ranging from project manager to director of operations. She started administering VAX Ultrix and Unisys Unix in 1983 at MIT and spent the dot-com years in Silicon Valley building Internet services for clients like iPlanet and Palm. She joined Google in 2015 as a technical project manager. She has served on the BayLISA and SAGE boards. Her hobbies include being a master gardener and working with new technologies such as Arduino and 2D CAD/CAM devices. She lives in Santa Clara County, California. 
Part I Game-Changing Strategies

Chapter 1 Climbing Out of the Hole

Some system administration teams are struggling. They’re down in a hole trying to climb out. If this sounds like your team, this chapter will help you. If this doesn’t sound like your team, you’ll still find a lot of useful information about how successful teams stay successful.

There are two things we see at all successful sites that you won’t see at sites that are struggling. Get those things right and everything else falls into place. These two things are pretty much the same for all organizations—and we really mean all. We visit sites around the world: big and small, for-profit and nonprofit, well funded and nearly broke. Some are so large they have thousands of system administrators (SAs); others are so tiny that they have a part-time information technology (IT) person who visits only once a week. No matter what, these two attributes are found at the successful sites and are missing at the ones that struggle.

Most of this book is fairly aspirational: It provides the best way to design, build, and run IT services and infrastructure. The feedback that we get online and at conferences is usually along the lines of “Sounds great, but how can I do any of that if we’re struggling with what we have?” If we don’t help you fix those problems first, the rest of the book isn’t so useful. This chapter should help you gain the time you need so you can do the other things in this book.

So what are these two things?

The first is that successful SA teams have a way to track and organize their work in progress (WIP). WIP is requests from customers¹ plus all the other tasks you need, want, or have to do, whether planned or unplanned. Successful teams have some kind of request-tracking system and a plan for how WIP flows through it. Struggling organizations don’t, and requests frequently get lost or forgotten.
1. We use the term customers rather than users. We find this brings about a positive attitude shift. IT is a service industry, supporting the needs of people and the business.

The second thing is that successful SA teams have eliminated the two biggest time sinkholes that all IT teams face. A time sinkhole is an issue that consumes large amounts of time and causes a domino effect of other problems. Eliminating a time sinkhole is a game-changer not just because it saves time, but because it also prevents other trouble.

In IT, those time sinkholes are operating system (OS) installation and the software deployment process. Struggling teams invariably are doing those two things manually. This approach not only unnecessarily saps their time but also creates new problems. Some problems are due to mistakes and variations between how different people do the same process. Some problems are due to how the same person does it while feeling refreshed on Monday versus tired on Friday afternoon.

Any struggling SA team can begin their turnaround by adopting better ways to

• Track and control WIP: Install some kind of helpdesk automation software, request-tracking system, ticket queue, or Kanban board.
• Automate OS installation: Start every machine in the same state by automating the installation and configuration of the OS and applications.
• Adopt CI/CD for software pushes: The process of pushing new software releases into production should be automated by using DevOps concepts such as continuous integration and delivery (CI/CD).

Obviously not all readers of this book are members of struggling organizations, but if you are, take care of those two things first. Everything else will become easier.

The remainder of this chapter is a quick-start guide for achieving those two goals. They are covered in more depth later in the book, but consider this chapter a way to crawl out of the hole you are in.
This guy’s walking down the street when he falls in a hole. The walls are so steep he can’t get out. A doctor passes by and the guy shouts up, “Hey! Can you help me out?” The doctor writes a prescription, throws it down in the hole, and moves on. Then a priest comes along and the guy shouts up, “Father, I’m down in this hole. Can you help me out?” The priest writes out a prayer, throws it down in the hole, and moves on. Then a friend walks by. “Hey, Joe, it’s me. Can you help me out?” And the friend jumps in the hole. Our guy says, “Are you stupid? Now we’re both down here.” The friend says, “Yeah, but I’ve been down here before and I know the way out!”
— Leo McGarry, The West Wing, “Noël,” Season 2, Episode 10

1.1 Organizing WIP

Successful organizations have some system for collecting requests, organizing them, tracking them, and seeing them through to completion. Nearly every organization does it differently, but the point is that they have a system and they stick with it.

Operations science uses the term “work in progress” to describe requests from customers plus all the other tasks you need, want, or have to do, whether planned or unplanned. WIP comes from direct customer requests as well as tasks that are part of projects, emergency requests, and incidents such as outages and alerts from your monitoring system.

Struggling organizations are unorganized at best and disorganized at worst. Requests get lost or forgotten. Customers have no way of knowing the status of their requests, so they spend a lot of time asking SAs, “Is it done yet?”—or, even worse, they don’t ask and assume incompetence. When the team is overloaded with too much WIP, they go into panic mode and deal with the loudest complainers instead of the most important WIP. Some teams have been in this overloaded state for so long they don’t know there is any other way.
Whatever strategy is employed to organize WIP, it must have the characteristic that quick requests get a fast response, longer requests get attention, and emergencies get the immediate attention and resources required to resolve them.

A typical quick request is a customer who needs a password reset. Performing a password reset takes very little SA time, and not doing it blocks all other work for the customer; a customer can’t do anything if he or she can’t log in. If you were to do all WIP in the order in which it arrives, the password reset might not happen for days. That is unreasonable. Therefore we need a mechanism for such requests to be treated with higher priority.

The two most popular strategies for managing WIP are ticket systems and Kanban boards. Either is better than not having any system. If you are having difficulty deciding, default to a simple ticket system. You will have an easier time getting people to adopt it because it is the more common solution. Later you can add a Kanban board for project-oriented work.

1.1.1 Ticket Systems

A ticket system permits customers to submit requests electronically. The request is stored in a database where it can be prioritized, assigned, and tracked. As the issue is worked on, the ticket is updated with notes and information. The system also records communication between the customer and the SA as the request is discussed or clarified.

Without such a system, WIP gets lost, forgotten, or confused. A customer is left wondering what the status is, and the only way to find out is to bother the SA, which distracts the SA from trying to get work done.

A ticket system permits a team to share work better. Team members can see the history of the ticket, making it easier to hand off a ticket from one SA to another. A solo system administrator benefits from using a ticket system because it is like having an assistant who ensures that requests stay organized.
This reduces the cognitive load involved in remembering the requests, their prioritization, and so on. Since the ticket system keeps the history of each WIP, you can return to an issue and more easily pick up where you left off. The manager of the SA team can examine the queue, or list tickets, and balance workloads among team members. Management can also intervene to reject requests that are out of scope for the team, and see tickets that have stalled or need attention. Thus the manager can proactively solve problems rather than waiting for them to blow up. A ticket system also introduces a degree of reality into the SA–customer relationship. Sometimes customers feel as if they’ve been waiting forever for a request to be completed and decide to complain to management. Without a ticket system, the SAs would bear the brunt of the customer’s anger whether or not the impatience was justified. In this scenario, it is the customer’s word against the SAs’. With a ticket system the accusation of slow support can be verified. If true, the ticket history can be used to review the situation and a constructive conversation can be had. If the customer was just being impatient the manager has the opportunity to explain the service level agreement (SLA) and appropriate expectations. That said, we’ve seen situations where the customer claimed that the problem has been around a long time but the customer only recently had time to open a ticket about it. The manager can then decide how to politely and gracefully explain that “We can’t help you, unless you help us help you.” System administrators should have foresight, but they can’t be mind-readers. Now we must find a way to assure that quick requests get fast responses. Imagine your list of tickets sorted with the fast/urgent items at one end and the big/far-reaching items at the other. 
If you work on only the items from the fast/urgent end, you will please your customers in the short term, but you’ll never have time for the big items that have far-reaching impact. If you work on only the items from the far-reaching end, your customers will never receive the direct support they need. You must find a way to “eat from both ends of the queue” by assigning different people to each, or alternating between the two.

Commonly, organizations create a tiered support structure, where one tier of IT support handles quick requests and forwards larger requests to a second tier of people.

A large organization may have a helpdesk team (Tier 1) that receives all new requests. They handle the quick requests themselves, but other requests are prioritized and assigned to the system administration team (Tier 2). Often Tier 1 is expected to handle 80 percent of all requests; the remaining 20 percent require Tier 2’s technically deeper and more specialized knowledge. This assures quick requests get the immediate response they deserve. In a medium-size team Tier 1 and Tier 2 are members of the same team, but Tier 1 requests are handled by the more junior SAs. In environments where WIP is mostly project oriented, and little arrives directly from customers, the team may have a single person who handles Tier 1 requests or team members may take turns in a weekly rotation.

A solo system administrator has a special challenge: There is never time for project work if you are always dealing with interruptions from customers. Project work takes sustained focus for hours at a time. Balancing this with the need to handle quick requests quickly requires strategy. In this case, use a strategy of dividing the day into segments: customer-focused hours where you work on quick requests and tickets, and project-focused hours where you focus on projects.
During project-focused hours anyone visiting with a quick request is politely asked to open a ticket, which will get attention later. Use polite phrases like, “I’m in the middle of an important project for my boss. Could you open a ticket and I’ll get to it this afternoon?”

Allocate the first two and last two hours of the day to be customer-focused work, leaving a full half-day of every day for project-focused work. This strategy works well because most quick requests happen early in the morning as people are starting their day.

Two things help make this strategy a success. The first is management support. Ideally your boss will see the value in a strategy that enables customers to receive attention when they typically need it most, yet enables you to complete important projects, too. Managers can make it policy that people should file tickets when possible and steer otherwise unavoidable personal visits to the first and last two hours of the day.

The second key to success is that no matter which type of hour you are in, you should make an exception for service outages and ultra-quick requests. It should be obvious that reacting to an outage or other emergency should take priority. Unfortunately, this may result in all requests being described as emergencies. Supportive management can prevent this by providing a written definition of an emergency.

Ultra-quick requests are ones that can be accomplished faster than it would take a person to file a ticket. This includes things like password resets and answering very simple questions. Also, if the issue is that the printer is broken, this can become an ultra-quick request by simply asking the customer to use a different printer for now.

This schedule should be the same every day. This consistency helps you stick to it, and is less confusing to customers.
Expecting them to remember your schedule is a stretch; expecting them to remember how it shifts and changes every day is unreasonable.

Sample Emergency Definitions

Here are some example definitions of “emergency” we’ve seen businesses use:

• A server or service being down
• A user-facing issue that affects ten or more people
• A technical problem causing the company to “lose money fast”
• Customer proprietary data actively being leaked
• (At a newspaper) anything that will directly prevent the newspapers from being printed and loaded onto trucks by 4:30 AM

There are many WIP-tracking systems to choose from. Open source software products include Request Tracker (from Best Practical), Trac, and OTRS. For Windows, SpiceWorks is a good option and is available at no cost.

This section is just an overview. Chapter 27, “Customer Support,” explains how to manage a helpdesk itself, with Section 27.10 focusing directly on how to get the most out of request-tracking software. Chapter 28, “Handling an Incident Report,” focuses on how to process a single request really well.

1.1.2 Kanban

Kanban is another way of organizing WIP. It is more appropriate for project-focused SA teams. Kanban has the benefit of providing better transparency to stakeholders outside the SA team while at the same time reducing the mad crush of work that some IT organizations live under. We’ll start with a simple example and work up to more sophisticated ones.

Figure 1.1 depicts a simple Kanban board example. Unlike a ticket system, which organizes WIP into a single long list, Kanban divides WIP into three lists, or columns. The columns are used as follows:

• Backlog: The waiting room where new WIP arrives
• Active: Items being worked on
• Completed: Items that are finished

Figure 1.1: A simple Kanban board (columns: Backlog, Active, Completed)

All new WIP is recorded and placed in the backlog.
The tasks are called cards because Kanban was originally done using index cards on a pin board. Each card represents one unit of work. Large projects are broken down into smaller cards, or steps, that can be reasonably achieved in one week.

Each Monday morning the team has a meeting where each member picks three cards from the backlog column to work on during that week. Those cards are moved to the Active column. By Friday all three tasks should be finished and the cards moved to the Completed column. On Monday another round of items is selected and the process begins again.

Monday’s meeting also includes a retrospective. This is a discussion of what went right and wrong during the previous week. Did everyone complete all their cards or was there more work to do? This review is not done to blame anyone for not finishing their items, but to troubleshoot the situation and improve. Does the team need to get better at estimating how much can be done in a week? Does the team need to break projects into smaller chunks? At the end of the retrospective, items in the Completed column are moved to an accomplishments box.

Picking three cards per week sets a cadence, or drumbeat, that assures progress happens at a smooth and predictable pace. Not all teams pick three cards each week. Some establish guidelines for what constitutes small, medium, and large tasks and then expect each person to pick perhaps one large and two small tasks, or two medium tasks, and so on. Keep it as simple as possible.

People should have more than one card per week so that when one task is blocked, they have other tasks they can work on. A person should not have too many cards at a time, as the cognitive load this creates reduces productivity. Allocating the same number of cards each week makes the system more predictable and less chaotic. Chaos creates stress in workers, which also reduces productivity.
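The weekly cycle described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Board` class, its method names, and the rule that unfinished cards stay with their owner (so the owner pulls fewer new cards the following Monday) are hypothetical and are not taken from any particular Kanban tool.

```python
from dataclasses import dataclass, field

CARDS_PER_PERSON = 3  # the cadence used in the text; a team may choose another number


@dataclass
class Board:
    """Hypothetical three-column Kanban board plus an accomplishments box."""
    backlog: list[str] = field(default_factory=list)
    active: dict[str, list[str]] = field(default_factory=dict)
    completed: list[str] = field(default_factory=list)
    accomplishments: list[str] = field(default_factory=list)

    def monday_meeting(self, team: list[str]) -> None:
        """Archive last week's completed cards, then let each member pull cards."""
        self.accomplishments.extend(self.completed)
        self.completed.clear()
        for person in team:
            held = self.active.setdefault(person, [])
            # Assumption: unfinished cards carry over, so a member pulls only
            # enough new cards to get back to the weekly limit.
            while len(held) < CARDS_PER_PERSON and self.backlog:
                held.append(self.backlog.pop(0))

    def finish(self, person: str, card: str) -> None:
        """Move one card from a member's Active list to Completed."""
        self.active[person].remove(card)
        self.completed.append(card)
```

Because each member pulls work up to a fixed limit, the amount of active WIP is bounded by the team size rather than by how fast requests arrive.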
It is better to adjust the amount of work a card represents than to change how many cards each person takes each week.

Some teams use more than three columns. For example, a Verifying column may be inserted after the Active column to hold items waiting to be verified. Some teams put new WIP in a New column, but move it to the backlog once it has been prioritized.

In addition to columns, rows can be added to further segregate the items. These are called swimlanes because they guide the item as it swims across the columns to the finish line. A team might have two swimlanes, one for development work and another for operational tasks. A large team might have a swimlane for each subteam, a configuration that is particularly useful if some people straddle many subteams. This is depicted in Figure 1.2.

As mentioned earlier, there must be a way for quick requests to get attention quickly. Kanban wouldn’t work very well if someone needing a password reset had to wait until Monday to have that task picked by someone. A simple way to deal with this is to have one card each week labeled Quick Request Card (QRC). Whoever takes this card handles all quick requests and other interruptions for the week. Depending on the volume of quick requests in your environment, this might be the only card the person takes. This role is often called the interrupt sponge, but we prefer to call it hero of the week, a moniker coined at New Relic (Goldfuss 2015).

Figure 1.2: A Kanban board with separate swimlanes for Linux, Windows, and network projects (columns: Backlog, Active, Verifying, Completed; swimlanes: Linux Team, Windows Team, Network Team)

Kanban works better than ticket systems for a number of reasons. First, the system becomes “pull” instead of “push.” A ticket system is a push system. Work gets accomplished at the rate that it can be pushed through the system. To get more work done, you push the employees harder.
If there is a large backlog of incomplete tickets, the employees are pushed to work harder to get all tickets done. In practice, this system usually fails. Tasks cannot be completed faster just because we want them to. In fact, people are less productive when they are overwhelmed. When people are pushed, they naturally reduce quality in an effort to get more done in the same amount of time. Quality and speed could be improved by automating more, but it is difficult to set time aside for such projects when there is an overflow of tickets. A Kanban system is a pull system: WIP is pulled through the system so that it gets completed by the weekly deadline. You select a fixed amount of work to be done each week, and that work is pulled through the system. Because there is a fixed amount of time to complete the task, people can focus on doing the best job in that amount of time. If management needs more WIP completed each week, it can make this possible by hiring more people, allocating resources to fix bottlenecks, providing the means to do the work faster, and so on. Second, Kanban improves transparency. When customers can see where their card is in relation to other cards, they get a better sense of company priorities. If they disagree with the priorities, they can raise the issue with management. This is better than having the debate brought to the SAs who have little control over priorities and become defensive. It is also better than having people just wonder why things are taking so long; in the absence of actual information they tend to assume the IT team is incompetent. One organization we interviewed hosted a monthly meeting where the managers of the departments they supported reviewed the Kanban board and discussed priorities. By involving all stakeholders, they created a more collaborative community. Getting started with Kanban requires very few resources. Many teams start with index cards on a wall or whiteboard. 
Electronic Kanban systems improve logging of history and metrics, as well as make it easier for remote workers to be involved. There are commercial, web-based systems that are free or inexpensive, such as Trello and LeanKit. Open source solutions exist as well, such as Taiga and KanBoard.

Customize the Kanban board to the needs of your team over time. There is no one perfect combination of columns and swimlanes. Start simple. Add and change them as the team grows, matures, and changes.

Our description of Kanban is very basic. There are ways to handle blocks, rate limits, and other project management concepts. A more complete explanation can be found in Personal Kanban: Mapping Work | Navigating Life by Benson & Barry (2011) and Kanban: Successful Evolutionary Change for Your Technology Business by Anderson (2010). Project managers may find Agile Project Management with Kanban by Brechner (2015) to be useful.

Meat Grinder

The best-quality ground beef can be made by turning the crank of the meat grinder at a certain speed and stopping to clean the device every so often.

A push-based management philosophy focuses on speed. More meat can be ground by pushing harder and forcing it through the system, bruising the meat in the process, but achieving the desired output quantity. Speed can be further improved by not pausing to clean the machine.

A pull-based management philosophy focuses on quality. A certain amount of meat is ground each day, not exceeding the rate that quality ground meat can be achieved. You can think of the meat as being pulled along at a certain pace or cadence. Management knows that to produce more ground meat, you need to use more grinding machines, or to develop new techniques for grinding and cleaning.

A helpdesk that is expected to complete all tickets by the end of each day is a push system.
A helpdesk is a pull system if it accepts the n highest-priority tickets each week, or a certain number of Kanban cards per person each week. Another method is to accept a certain number of 30-minute appointments each day, as they do at the Apple Genius Bar.

1.1.3 Tickets and Kanban

Ticket systems and Kanban boards are not mutually exclusive. Many organizations use a ticket system for requests from customers and Kanban boards for project-related work. Some helpdesk automation systems are able to visually display the same set of requests as tickets in a queue or Kanban boards in columns. Kanban is a way to optimize workflow that can be applied to any kind of process.

1.2 Eliminating Time Sinkholes

Struggling teams always have a major time sinkhole that prevents them from getting around to the big, impactful projects that will fix the problem at its source. They’re spending so much time mopping the floor that they don’t have time to fix the leaking pipe. Often this sinkhole has been around so long that the team has become acclimated to it. They have become blind to its existence or believe that there is no solution.

Successful IT organizations are operating on a different level. They’ve eliminated the major time sinkholes and now have only minor inefficiencies or bottlenecks. They regularly identify the bottlenecks and assign people to eliminate them.

We could spend pages discussing how to identify your team’s time sinkhole, but nine times out of ten it is OS installation and configuration. If it isn’t that, it is the process of pushing new software releases into production, often called the software delivery life cycle.

1.2.1 OS Installation and Configuration

We recently visited a site that was struggling. The site was a simple desktop environment with about 100 workstations (desktops and laptops), yet three SAs were running around 50 hours a week and still unable to keep up with the flow of WIP.
The source of the problem was that each workstation was configured slightly differently, causing a domino effect of other problems. The definition of a broken machine is one that is in an unknown or undesired state. If each machine starts in an unknown state, then it is starting out broken. It cannot be fixed because fixing it means bringing it back to the original state, and we don’t know what that original state is or should be. Maintaining a fleet of machines that are in unknown states is a fool’s errand. Each desktop was running the OS that had been preinstalled by the vendor. This was done, in theory, to save time. This made sense when the company had 3 desktops, but now it had 100. The SAs were now maintaining many different versions of Microsoft Windows, each installed slightly differently. This made each and every customer support request take twice as long, negating the time saved by using the vendor-installed OS. Any new software that was deployed would break in certain machines but not in others, making the SAs wary of deploying anything new. If a machine’s hard drive died, one SA would be occupied for an entire day replacing the disk, reloading the OS, and guessing which software packages needed to be installed for the customer. During one wave of malware attacks, the entire team was stuck wiping and reloading machines. All other work came to a halt. In the rush, no two machines turned out exactly the same, even if they were installed by the same person. The SA team was always in firefighting mode. The SAs were always stressed out and overworked, which made them unhappy. Their customers were unhappy because their requests were being ignored by the overloaded SAs. It became a vicious cycle: Nobody had time to solve the big problem because the big problem kept them from having time to work on it. 
This created stress and unhappiness, which made it more difficult to think about the big problems, which led to more stress and unhappiness.

Breaking the cycle was difficult. The company would have to assign one SA to focus exclusively on automating OS installations. This frightened the team because with one fewer person, more WIP would accumulate. The customers might be even less happy. It took courageous management to make this decision. It turns out that if customers are used to being ignored for days at a time, they can’t tell if they’re being ignored for slightly longer.

Soon OS installation was automated and a new batch of machines arrived. Their disks were wiped and OS reinstalled, starting all these machines in the same known state. They were more reliable because their hardware was newer but also because they could be supported better. Most importantly, setting up all these machines took hours instead of days.

Reinstalling a machine after a malware attack now took minutes. Moreover, because the new installation had properly installed anti-malware software, the machine stayed virus free.

Developers, being more technically inclined, were taught how to wipe and reinstall their own machines. Now, instead of waiting for an SA to do the task, developers were self-sufficient. This changed their workflows to include creating virtual machines on demand for testing purposes, which resulted in new testing methodologies that improved their ability to make quality software.

Soon the vicious cycle was broken and the SA team had more time for more important tasks. The fix didn’t solve all their problems, but most of their problems couldn’t be solved until they took care of this one major sinkhole.

Over time the old mix of configurations disappeared and all machines had the new, unified configuration. This uniformity made them, for the most part, interchangeable.
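One way to make "known state" concrete is to compare each machine against a single baseline. The following is a minimal sketch, assuming each machine's state can be summarized as a mapping from package name to version; the `drift` function and the sample data are hypothetical and do not represent the tooling used at the site in the story.

```python
from typing import Optional


def drift(baseline: dict[str, str],
          machine: dict[str, str]) -> dict[str, tuple[Optional[str], Optional[str]]]:
    """Return {package: (expected, actual)} for every difference from the baseline.

    None as the expected value means the package is not part of the baseline;
    None as the actual value means the package is missing from the machine.
    """
    report = {}
    for name in sorted(baseline.keys() | machine.keys()):
        expected, actual = baseline.get(name), machine.get(name)
        if expected != actual:
            report[name] = (expected, actual)
    return report


# Hypothetical baseline and one machine that has drifted from it.
baseline = {"os": "10.0.1", "antimalware": "4.2"}
desktop_17 = {"os": "10.0.1", "antimalware": "4.0", "toolbar": "1.0"}
print(drift(baseline, desktop_17))
```

An empty report means the machine matches the baseline. A non-empty one lists exactly what must change to bring the machine back to the known state, which is the information that was missing when every workstation was installed differently.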
In business terms the machines became fungible resources: Any one unit could be substituted for any other. Customers benefited from the ability to move between machines easily. Customer support improved as SAs focused all effort on a single configuration instead of having their attention diluted. Support requests were satisfied faster because less time was spent ascertaining what was unique about a machine. New applications could be introduced more rapidly because testing did not have to be repeated for each variation. The net effect was that all aspects of IT became higher quality and more efficient. There were fewer surprises.

If your team is suffering from this kind of time sinkhole and could benefit from better OS configuration and software installation methods, Chapter 8, “OS Installation Strategies,” provides more detail, especially Section 8.4. You may also find motivation in the success story described in Section 20.2.

1.2.2 Software Deployment

Once OS installation and configuration is conquered, the next most common time sinkhole is deploying new software releases. Deploying software updates is a task much like the one in the story of Sisyphus in Greek mythology: rolling an immense boulder up a hill, only to watch it roll back down, and repeating this cycle for eternity. Not long after a software update is deployed, it seems as if a new update is inevitably announced.

Desktop software usually has a relatively simple upgrade process, which becomes more complex and laborious the larger the number of workstations a company has. Luckily it can be automated easily. This is the subject of Chapter 7, “Workstation Software Life Cycle.” For Microsoft Windows environments this typically means installing Windows Server Update Services (WSUS).

Server software has a more complex upgrade process. The more people who rely on the service, the more critical the service, and the more pressure for each upgrade to be a success.
Extensive testing is required to detect and squash bugs before they reach production. More sophisticated upgrade processes are required to minimize or eliminate service disruptions. A service with ten users can be taken down for upgrades. A service with millions of users must be upgraded in place. This is like changing the tires on a car while it is barreling down the highway at full speed.

The phrase software delivery life cycle (SDLC) refers to the entire process of taking software from source code, through testing, packaging, and beta, and finally to installation in production. It requires a combined effort from developers and operations, which is where the term DevOps comes from. Each successful iteration through the SDLC results in a new software release running in production.

In some environments the SDLC process would be a hilarious month of missteps if it weren’t happening to you. It is more likely to be a hellish month of stress and pain. In many organizations Hell Month arrives every 30 days.

Automating SDLC Phases

SDLC has the following three phases:

1. Integration: Compiling the software, performing basic tests
2. Delivery: Producing installable packages that are fully tested and ready to push into production, if the organization chooses to do so
3. Deployment: Pushing the packages into a test or production environment

Each phase is predicated on the successful completion of the prior one. Each can be done manually or automated, or somewhere in between.

In continuous integration (CI), the automated process runs once for each change to the source code. In continuous delivery (CD), the delivery automation is triggered by any successful CI run. In continuous deployment, the deployment automation is triggered by any successful CD run. Continuous deployment is frequently used to push releases to a test environment, not all the way to production.
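The chaining of phases described above, where each phase runs only if the previous one succeeded, can be sketched as follows. This is an illustration only: `run_pipeline` and the placeholder steps are hypothetical names, and a real CI/CD system would invoke build, packaging, and deployment tools where this sketch uses simple functions that return True or False.

```python
from typing import Callable

# A phase is a name plus a step that reports success (True) or failure (False).
Phase = tuple[str, Callable[[], bool]]


def run_pipeline(phases: list[Phase]) -> list[str]:
    """Run phases in order, stopping at the first failure.

    Returns the names of the phases that completed successfully.
    """
    succeeded = []
    for name, step in phases:
        if not step():
            break  # later phases are predicated on this one, so do not run them
        succeeded.append(name)
    return succeeded


# Placeholder steps; each would normally call real tooling.
pipeline = [
    ("integration", lambda: True),   # compile and run basic tests
    ("delivery", lambda: True),      # build and fully test installable packages
    ("deployment", lambda: True),    # push packages to a test environment
]
print(run_pipeline(pipeline))
```

Triggering `run_pipeline` once per source code change corresponds to the continuous variants. Truncating the list after the delivery phase gives continuous delivery without continuous deployment.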
Many companies, however, have improved their SDLC processes to the point where they can confidently use continuous deployment all the way to production.

A complete example of eliminating Hell Month can be found in Section 2.2. This DevOps, or rapid-release, technique is covered in detail in Chapter 20, “Service Launch: DevOps.” Volume 2 of this book series is a deep dive into DevOps and service management in general. The definitive book on the topic is Continuous Delivery: Reliable Software Releases Through Build, Test, and Deployment Automation by Humble & Farley (2010).

1.3 DevOps

The DevOps movement is focused on making the SDLC as smooth and unhellish as possible. A highly automated SDLC process brings with it a new level of confidence in the company’s ability to successfully launch new software releases. This confidence means a company can push new releases into production more frequently, eliminating the friction that prevents companies from experimenting and trying new things. This ability to experiment then makes the company more competitive as new features appear more frequently, bugs disappear faster, and even minor annoyances are quickly alleviated. It enables a culture of innovation.

At the same time it makes life better for SAs. After working in an environment where such stress has been eliminated, people do not want to work anywhere else. It is as if a well-run DevOps environment is not just a business accelerator, but a fringe benefit for the employees. As author and DevOps advocate Gene Kim wrote, “The opposite of DevOps is despair.”

1.4 DevOps Without Devs

Organizations without developers will benefit from the DevOps principles, too. They apply to any complex process, and the system administration field is full of them. The DevOps principles are as follows:

• Turn chaotic processes into repeatable, measurable ones.
Processes should be documented until they are consistent, then automated to make them efficient and self-service. To make IT scale we must not do IT tasks but be the people who maintain the system that does IT tasks for us.
• Talk about problems and issues within and between teams. Don’t hide, obscure, muddle through, or silently suffer through problems. Have channels for communicating problems and issues. Processes should amplify feedback and encourage communication across silos.
• Try new things, measure the results, keep the successes, and learn from the failures. Create a culture of experimentation and learning by using technology that empowers creativity and management practices that are blameless and reward learning. This is also called “Start, Stop, Continue, Change” management. Start doing the things you need to do. Stop doing things that don’t work well. Continue doing things that do work well. Change the things that need improving.
• Do work in small batches so we can learn and pivot along the way. It is better to deliver some results each day than to hold everything back and deliver the entire result at the end. This strategy enables us to get feedback sooner, fix problems, and avoid wasted effort.
• Drive configuration and infrastructure from machine-readable sources kept under a source code control system. When we treat infrastructure as code (IaC), it becomes flexible and testable, and can benefit from techniques borrowed from software engineering.
• Always be improving. We are always identifying the next big bottleneck and experimenting to find ways to fix it. We do not wait for the next major disaster.
• Pull, don’t push. Build processes to meet demand, not accommodate supply. Determine the desired weekly output, allocate the required resources, and pull the work through the system to completion.
• Build community. The IT department is part of the larger organism that is our company and we must be an active participant in its success.
Likewise, we are part of the world’s IT community and have a duty to be involved in sharing knowledge and encouraging professional development.

You’ll see the DevOps principles demonstrated throughout this book, whether or not developers are involved. The best introduction to DevOps is The Phoenix Project: A Novel About IT, DevOps, and Helping Your Business Win by Kim, Behr & Spafford (2013). We also recommend The DevOps Handbook: How to Create World-Class Agility, Reliability, and Security in Technology Organizations by Kim, Humble, Debois & Willis (2016) as a way to turn these ideas into action.

1.5 Bottlenecks

If OS installation and software deployment are not your time sinkhole, how can you identify what is?

Bring up the subject at lunch or a team meeting, and usually just a short amount of discussion will identify the biggest pain points, time wasters, inefficiencies, and so on. Solicit opinions from the stakeholders and teams you partner with. They already know your biggest problem.

A more scientific approach is to identify the bottleneck. In operational theory, the term bottleneck refers to the point in a system where WIP accumulates. Following are some examples.

Suppose your team is involved in deploying new PCs. Customers request them, and the hardware is subsequently purchased. Once they arrive the OS and applications are installed, and finally the new machine is delivered to the customer. If employees are standing in line at someone’s desk waiting to talk to that person about ordering a machine, the bottleneck is the ordering process. If customers are waiting weeks for the hardware to arrive, the bottleneck is in the purchasing process. If purchased machines sit in their boxes waiting to be installed, the bottleneck is the installation step.
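If the stage each request is waiting at can be exported from a tracking system, the definition above (the bottleneck is where WIP accumulates) turns into a simple count. The sketch below assumes such an export exists as a mapping from request ID to stage name; the function name, stage names, and request IDs are hypothetical.

```python
from collections import Counter
from typing import Optional


def find_bottleneck(waiting_at: dict[str, str]) -> Optional[str]:
    """Return the stage where the most items are waiting, or None if nothing waits."""
    counts = Counter(waiting_at.values())
    if not counts:
        return None
    stage, _ = counts.most_common(1)[0]
    return stage


# Hypothetical snapshot of open PC requests and the step each one is waiting on.
snapshot = {
    "REQ-101": "purchasing",
    "REQ-102": "installation",
    "REQ-103": "installation",
    "REQ-104": "installation",
    "REQ-105": "delivery",
}
print(find_bottleneck(snapshot))
```

A single snapshot can mislead, so in practice it may be more informative to compare counts over several weeks and look for the stage whose queue keeps growing.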
A team that produces a web-based application has different members performing each task: writing the code, testing it, deploying it to a beta environment, and deploying it to the production environment. Watch what happens when a bug is reported. If it sits in the bug tracking system, untouched and unassigned to anyone, then the bottleneck is the triage process. If it gets assigned to someone but that person doesn’t work on the issue, the bottleneck is at the developer step. If the buggy code is fixed quickly but sits unused because it hasn’t been put into testing or production, then the bottleneck is in the testing or deployment phase.
Once the bottleneck is identified, it is important to focus on fixing the bottleneck itself. There may be problems and things we can improve all throughout the system, but directing effort toward anything other than optimizations at the bottleneck will not help the total throughput of the system. Optimizations prior to the bottleneck simply make WIP accumulate faster at the bottleneck. Optimizations after the bottleneck simply improve the part of the process that is starved for work.
Take the PC deployment example. Suppose the bottleneck is at the OS installation step. A technician can install and configure the OS on ten machines each week, but more than ten machines are requested each week. In this case, we will see the requests accumulate at the point the machines are waiting to have their OS installed. Writing a web-based application to manage the requests will not change the number of machines installed each week. It is an improvement that people will like, it superficially enhances the system, and it may even be fun to code, but, sadly, it will not change the fact that ten machines are being completed each week. Likewise, we could hire more people to deliver the machines to people’s offices, but we know that downstream steps are idle, starved for work. Therefore, those improvements would be silly.
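The argument above can be checked with a small simulation. The following is a minimal sketch, not a model from the book: it assumes a three-stage pipeline (ordering, installation, delivery), a hypothetical 15 requests per week, and made-up weekly capacities, with only the installation capacity of ten machines per week taken from the text.

```python
def simulate(capacities, arrivals_per_week, weeks):
    """Simulate a linear pipeline and return (completed, queue_lengths).

    capacities[i] is the number of items stage i can process per week.
    Items finished by one stage join the queue of the next stage.
    """
    queues = [0] * len(capacities)
    completed = 0
    for _ in range(weeks):
        queues[0] += arrivals_per_week
        for i, capacity in enumerate(capacities):
            moved = min(queues[i], capacity)
            queues[i] -= moved
            if i + 1 < len(capacities):
                queues[i + 1] += moved
            else:
                completed += moved
    return completed, queues


# Stages: [ordering, installation, delivery]; 15 requests per week (assumed).
baseline = simulate([20, 10, 20], arrivals_per_week=15, weeks=4)
faster_ordering = simulate([40, 10, 20], arrivals_per_week=15, weeks=4)
more_delivery = simulate([20, 10, 40], arrivals_per_week=15, weeks=4)
faster_install = simulate([20, 15, 20], arrivals_per_week=15, weeks=4)

print("baseline:       ", baseline)
print("faster ordering:", faster_ordering)
print("more delivery:  ", more_delivery)
print("faster install: ", faster_install)
```

Under these assumptions, speeding up ordering or adding delivery staff leaves the number of completed machines unchanged and the backlog still sits at the installation stage. Only raising the installation capacity increases the number of machines completed.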
Only by expending effort at the bottleneck will the system…
