The site reliability workbook : practical ways to implement SRE (eBook, 2018)

Author: Betsy Beyer; Niall Richard Murphy; David K Rensin; Kent Kawahara; Stephen Thorne
Publisher: Sebastopol, CA : O'Reilly Media, 2018.
Edition/Format: eBook : Document : English
Expands on Google's approach to SRE, providing worked examples for each essential facet of this area of IT, prepared in cooperation with Google Cloud customers and based on their experiences. Covers methodology for running services at scale and for starting SRE in a greenfield or brownfield setting.


Genre/Form: Electronic books
Additional Physical Format: Print version:
Material Type: Document, Internet resource
Document Type: Internet Resource, Computer File
All Authors / Contributors: Betsy Beyer; Niall Richard Murphy; David K Rensin; Kent Kawahara; Stephen Thorne
ISBN: 9781492029472 1492029475 9781492029458 1492029459
OCLC Number: 1046634047
Description: 1 online resource
Contents: How SRE relates to DevOps --
Foundations. Implementing SLOs --
SLO engineering case studies --
Alerting on SLOs --
Eliminating toil --
Simplicity --
Practices. On-call --
Incident response --
Postmortem culture: learning from failure --
Managing load --
Introducing non-abstract large system design --
Data processing pipelines --
Configuration design and best practices --
Configuration specifics --
Canarying releases --
Processes. Identifying and recovering from overload --
SRE engagement model --
SRE: reaching beyond your walls --
SRE team lifecycles --
Organizational change management in SRE --
A. Example SLO document --
B. Example error budget policy --
C. Results of postmortem analysis.
Intro; Copyright; Table of Contents; Foreword I; Foreword II; Preface; Conventions Used in This Book; Using Code Examples; O'Reilly Safari; How to Contact Us; Acknowledgments;
Chapter 1. How SRE Relates to DevOps; Background on DevOps; No More Silos; Accidents Are Normal; Change Should Be Gradual; Tooling and Culture Are Interrelated; Measurement Is Crucial; Background on SRE; Operations Is a Software Problem; Manage by Service Level Objectives (SLOs); Work to Minimize Toil; Automate This Year's Job Away; Move Fast by Reducing the Cost of Failure; Share Ownership with Developers; Use the Same Tooling, Regardless of Function or Job Title; Compare and Contrast; Organizational Context and Fostering Successful Adoption; Narrow, Rigid Incentives Narrow Your Success; It's Better to Fix It Yourself; Don't Blame Someone Else; Consider Reliability Work as a Specialized Role; When Can Substitute for Whether; Strive for Parity of Esteem: Career and Financial; Conclusion;
Part I. Foundations;
Chapter 2. Implementing SLOs; Why SREs Need SLOs; Getting Started; Reliability Targets and Error Budgets; What to Measure: Using SLIs; A Worked Example; Moving from SLI Specification to SLI Implementation; Measuring the SLIs; Using the SLIs to Calculate Starter SLOs; Choosing an Appropriate Time Window; Getting Stakeholder Agreement; Establishing an Error Budget Policy; Documenting the SLO and Error Budget Policy; Dashboards and Reports; Continuous Improvement of SLO Targets; Improving the Quality of Your SLO; Decision Making Using SLOs and Error Budgets; Advanced Topics; Modeling User Journeys; Grading Interaction Importance; Modeling Dependencies; Experimenting with Relaxing Your SLOs; Conclusion;
Chapter 3. SLO Engineering Case Studies; Evernote's SLO Story; Why Did Evernote Adopt the SRE Model?; Introduction of SLOs: A Journey in Progress; Breaking Down the SLO Wall Between Customer and Cloud Provider; Current State; The Home Depot's SLO Story; The SLO Culture Project; Our First Set of SLOs; Evangelizing SLOs; Automating VALET Data Collection; The Proliferation of SLOs; Applying VALET to Batch Applications; Using VALET in Testing; Future Aspirations; Summary; Conclusion;
Chapter 4. Monitoring; Desirable Features of a Monitoring Strategy; Speed; Calculations; Interfaces; Alerts; Sources of Monitoring Data; Examples; Managing Your Monitoring System; Treat Your Configuration as Code; Encourage Consistency; Prefer Loose Coupling; Metrics with Purpose; Intended Changes; Dependencies; Saturation; Status of Served Traffic; Implementing Purposeful Metrics; Testing Alerting Logic; Conclusion;
Chapter 5. Alerting on SLOs; Alerting Considerations; Ways to Alert on Significant Events; 1: Target Error Rate ≥ SLO Threshold; 2: Increased Alert Window; 3: Incrementing Alert Duration; 4: Alert on Burn Rate; 5: Multiple Burn Rate Alerts; 6: Multiwindow, Multi-Burn-Rate Alerts; Low-Traffic Services and Error Budget Alerting
Responsibility: edited by Betsy Beyer [and 4 others].


Google's Site Reliability Engineering book ignited an industry discussion on what it means to run production services today. Now, Google engineers who worked on that bestseller introduce The Site ...

