Sitecore CM Azure Scaling: Incident Analysis & Step-by-Step Runbook
Scaling CM horizontally without addressing session state can cause session timeouts and lost authoring state. This post explains the root cause, a step-by-step troubleshooting & remediation method, and recommended approaches for safely increasing CM availability.
Background
Sitecore's recommended approach for authoring environments is to run a single CM instance to avoid issues with session state, locking, and authoring workflows. Teams sometimes scale CM horizontally to increase availability or capacity. When doing that, you must ensure that session state and any local state are handled consistently across instances; otherwise, authors will experience session loss or timeouts.
This document summarizes a real incident (with dates and resource names redacted) and turns it into a practical runbook and guidance for scaling Sitecore CM safely.
Summary of what happened (high level)
- We scaled an environment from one CM instance to multiple instances to increase redundancy.
- After scaling, authors began experiencing session timeouts and lost editing sessions.
- Investigation showed sessions were stored locally and requests were routed to different instances; without sticky sessions or a centralized session store, this caused session loss.
- Temporary mitigation used session affinity (sticky sessions). The long-term solution is to centralize session state or design CM to be stateless.
Root Cause (concise)
- Immediate cause: Session state was kept in-process (or otherwise instance-bound) while load balancing distributed requests across instances. Requests for a single user hit different instances, causing session mismatch and timeouts.
- Underlying cause: The environment was scaled without configuring a shared session store or ensuring all temporary/stateful resources were centralized.
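The failure mode above can be reproduced with a toy model. The following is a minimal sketch, not Sitecore code: the Instance class, the routing functions, and the instance names are hypothetical, and a plain dictionary stands in for the session store. It compares in-process sessions under round-robin routing, in-process sessions with affinity, and a shared store under round-robin routing.

```python
class Instance:
    """A toy CM instance. `store` is its session store (hypothetical model)."""

    def __init__(self, name, store=None):
        self.name = name
        # Each instance gets a private dict unless a shared one is passed in.
        self.sessions = store if store is not None else {}

    def handle(self, session_id):
        """Return True if the session already exists here; otherwise create it."""
        if session_id in self.sessions:
            return True
        self.sessions[session_id] = {"created_on": self.name}
        return False


def make_instances(count, shared_store=None):
    """Build `count` instances, optionally all backed by one shared store."""
    return [Instance(f"cm-{i}", shared_store) for i in range(count)]


def round_robin(instances, session_id, request_number):
    """Load balancer without affinity: rotate through the instances."""
    return instances[request_number % len(instances)]


def sticky(instances, session_id, request_number):
    """Load balancer with affinity: the same session always maps to one instance."""
    return instances[sum(session_id.encode()) % len(instances)]


def count_lost_sessions(instances, route, requests=6, session_id="author-1"):
    """Send several requests for one author and count those that found no session.

    The first request is ignored because it always has to create the session.
    """
    lost = 0
    for request_number in range(requests):
        instance = route(instances, session_id, request_number)
        found = instance.handle(session_id)
        if request_number > 0 and not found:
            lost += 1
    return lost


in_process_no_affinity = count_lost_sessions(make_instances(3), round_robin)
in_process_with_affinity = count_lost_sessions(make_instances(3), sticky)
shared_store_no_affinity = count_lost_sessions(make_instances(3, shared_store={}), round_robin)
```

In this model only the first scenario loses sessions. Affinity and a shared store both avoid the loss, but affinity stops helping as soon as the pinned instance is recycled or removed, which is why it is treated as a temporary mitigation.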
Step-by-step troubleshooting & remediation (runbook)
Use this as a checklist when you encounter CM session issues after scaling.
1) Detect & confirm the problem
- Check application logs, web server logs, and Sitecore logs for session-related errors, exceptions, and frequent login or session renewal events.
- Observe symptom patterns: users being logged out between requests, lost editing state, or inconsistent behavior tied to particular instances.
- Correlate with infrastructure changes: identify whether instance counts, load balancer settings, or deployment changes were recently applied.
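As a rough aid for the log review in this step, the sketch below counts session-related log lines per instance. The keyword list and the sample log lines are illustrative assumptions, not real Sitecore log messages; replace them with the messages that actually appear in your logs.

```python
from collections import Counter

# Hypothetical keywords; tune these to the messages your own logs contain.
SESSION_KEYWORDS = ("session", "login", "logged out")


def count_session_events(log_lines_by_instance, keywords=SESSION_KEYWORDS):
    """Count log lines mentioning any keyword, grouped by instance name.

    `log_lines_by_instance` maps an instance name to a list of log lines.
    Matching is a case-insensitive substring test.
    """
    counts = Counter()
    for instance, lines in log_lines_by_instance.items():
        for line in lines:
            lowered = line.lower()
            if any(keyword in lowered for keyword in keywords):
                counts[instance] += 1
    return counts


# Made-up sample input, for illustration only.
sample_logs = {
    "cm-0": ["INFO Session expired for user x", "INFO Item saved"],
    "cm-1": ["WARN Login required", "INFO session started", "INFO publish done"],
}
sample_counts = count_session_events(sample_logs)
```

A sharp rise in these counts after a scaling change, spread across several instances, is consistent with the session-loss pattern described here, though it does not prove the cause on its own.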
2) Isolate environment differences
- Compare environment configurations (instance count, session affinity settings, cache/Redis availability, storage endpoints) between the environment showing issues and the stable environment.
- Note any divergence in instance count or affinity settings.
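One way to make the comparison in this step mechanical is to export the relevant settings of each environment into a flat dictionary and diff them. This is a sketch under that assumption; the setting names and values below are hypothetical placeholders.

```python
def diff_settings(stable, affected):
    """Return settings whose values differ, as {name: (stable_value, affected_value)}.

    A setting missing from one environment is reported with None on that side.
    """
    names = sorted(set(stable) | set(affected))
    return {
        name: (stable.get(name), affected.get(name))
        for name in names
        if stable.get(name) != affected.get(name)
    }


# Hypothetical settings, for illustration only.
stable_env = {"instance_count": 1, "session_affinity": True, "redis_configured": False}
affected_env = {"instance_count": 3, "session_affinity": False, "redis_configured": False}
differences = diff_settings(stable_env, affected_env)
```

Here the diff would surface the instance count and the affinity setting, which are the two divergences this incident turned on.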
3) Quick temporary mitigation (if you need immediate relief)
- Enable session affinity (ARR affinity) on your App Service to keep a user's requests bound to the same instance. This can immediately reduce session loss.
- Alternatively, temporarily revert to a single CM instance to restore parity with the stable environment.
4) Run targeted tests
- Reproduce the problem in a staging/perf environment: scale to multiple instances and run authoring workflows to confirm session churn.
- Use synthetic transactions that exercise the Sitecore authoring UI (login → edit item → save) and log whether the same instance handles subsequent requests.
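The results of such synthetic transactions can be checked with a small helper like the one below. It assumes each test step records which instance served it; how you obtain that identifier (for example, a diagnostic response header you add yourself) is environment-specific and not covered here. The record layout and names are hypothetical.

```python
from collections import defaultdict


def find_session_churn(observations):
    """Return sessions that were served by more than one instance.

    `observations` is an iterable of (session_id, step, instance_id) tuples,
    in the order the steps were executed. The result maps each affected
    session_id to its list of (step, instance_id) pairs.
    """
    steps_by_session = defaultdict(list)
    for session_id, step, instance_id in observations:
        steps_by_session[session_id].append((step, instance_id))

    churned = {}
    for session_id, steps in steps_by_session.items():
        instances = {instance_id for _, instance_id in steps}
        if len(instances) > 1:
            churned[session_id] = steps
    return churned


# Made-up observations for two synthetic authors.
sample_observations = [
    ("author-a", "login", "cm-0"),
    ("author-a", "edit", "cm-1"),
    ("author-b", "login", "cm-2"),
    ("author-b", "edit", "cm-2"),
]
churned_sessions = find_session_churn(sample_observations)
```

With in-process session state, any session that appears in the result is a candidate for the lost-state symptom. With a shared session store, instance changes within a session are expected and harmless.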
5) Implement long-term fixes
Choose one or a combination of the following approaches based on your availability and reliability requirements:
- Centralize session state: Use an out-of-process session store (Azure Cache for Redis or SQL-backed session state). Update Sitecore and app configuration to use this store for session and any session-backed components.
- Centralize shared files/state: Ensure media, temporary files, and any instance-local resources are stored in a common place (Azure Blob Storage, NAS, or other shared storage).
- Coordinate caching and locks: Make sure Sitecore caches and locking mechanisms are configured for a scaled environment (use out-of-process cache providers where required).
- Design for statelessness where possible: Minimize reliance on in-memory session for critical authoring flows.
6) Validate at scale
- After making the changes, run a full validation in staging/perf with autoscale and multiple instances. Test authoring workflows under expected load and during instance failover scenarios.
- Monitor session churn, login rates, and error rates to confirm the problem is resolved.
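The monitoring check in this step can be turned into a simple pass/fail gate by comparing metrics from the scaled environment against a baseline from the stable one. The sketch below assumes you can export both as dictionaries; the metric names and the 10% tolerance are illustrative assumptions, not recommended thresholds.

```python
def find_regressions(baseline, candidate, tolerance=0.10):
    """Return metrics where the candidate exceeds the baseline by more than `tolerance`.

    Both arguments map a metric name to a number where lower is better.
    The result maps each regressed metric to (baseline_value, candidate_value).
    """
    regressions = {}
    for metric, base_value in baseline.items():
        value = candidate[metric]
        if value > base_value * (1 + tolerance):
            regressions[metric] = (base_value, value)
    return regressions


# Hypothetical numbers, for illustration only.
baseline_metrics = {"logins_per_hour": 10, "errors_per_hour": 2}
scaled_metrics = {"logins_per_hour": 10.5, "errors_per_hour": 5}
regressions = find_regressions(baseline_metrics, scaled_metrics)
```

An empty result suggests the scaled configuration behaves like the baseline for the metrics you chose. It says nothing about metrics you did not collect, so keep the manual authoring tests as well.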
7) Roll out to production with monitoring & rollback plan
- Deploy changes during a maintenance window if possible.
- Keep the mitigation (session affinity) available to toggle rapidly if needed, and maintain a simple rollback plan (revert the instance count or configuration).
- Monitor closely after rollout and be ready to revert if a serious regression appears.
Trade-offs & considerations
Session Affinity (sticky sessions)
- Pros: Fast, low-effort mitigation to restore authoring experience.
- Cons: Reduces true fault tolerance and can mask the need for a proper shared-state architecture. Not ideal for autoscaling or load re-distribution.
Single CM instance (recommended default)
- Pros: Simplest and safest for authoring — avoids race conditions and session sync issues.
- Cons: Single point of failure; requires careful operational processes to minimize downtime.
Scale-out with shared state
- Pros: Higher availability and capacity if implemented correctly.
- Cons: Requires investment: centralized session store (Redis), centralized media/state, cache coordination, and thorough testing.
Recommended best practices
- Default to a single CM instance for authoring unless you have a clear business need for multiple instances.
- If you must scale CM: implement centralized session state (Redis or SQL), centralize media and temporary files, and ensure Sitecore cache/locking is configured for distributed operation.
- Test scale and failover in non-production before applying to production.
- Keep environment parity between production, perf, and staging to avoid environment-specific surprises.
- Document runbooks and quick rollbacks so operators can confidently toggle session affinity or revert instance count when needed.
Suggested runbook quick commands (examples)
These are high-level examples — adapt to your IaC, orchestration, or cloud provider tooling.
- Toggle session affinity in your load balancer or App Service settings (use the cloud provider console or IaC template to update the setting).
- Scale instance counts up/down using your cloud CLI (e.g., az appservice plan update --number-of-workers, or your autoscale rules).
- Configure an out-of-process session store (e.g., add the Redis connection string to the config and set the session provider in the web config).
Postmortem insights
- Environment parity matters. Small differences between environments (instance count, affinity) can hide or trigger issues.
- Scaling stateful apps requires deliberate design. Don’t assume scaling out is transparent for applications that use in-memory state.
- ARR/session affinity can help temporarily, but don't rely on it long-term. Use it only while you implement a shared session architecture.