Brad Dickinson | How Azure.com operates on Azure part 2: Technology and architecture

The content below is taken from the original (How Azure.com operates on Azure part 2: Technology and architecture). To continue reading, please visit the site. Remember to respect the Author & Copyright.

When you’re the company that builds the cloud platforms used by millions of people, your own cloud content needs to be served up fast. Azure.com, a complex, cloud-based application that serves millions of people every day, is built entirely from Azure components and runs on Azure.

Microsoft culture has always been about using our own tools to run our business. Azure.com serves as an example of the convenient platform-as-a-service (PaaS) option that Azure provides for agile web development. We trust Azure to run Azure.com with 99.99-percent availability across a global network capable of a round-trip time (RTT) of less than 100 milliseconds per request.

In part two of our two-part series, we share our blueprint, so you can learn from our experience building a website at planetary scale and move forward with your own website transformation.

This post will help you get a technical perspective on the infrastructure and resources that make up Azure.com. For details about our design principles, read Azure.com operates on Azure part 1: Design principles and best practices.

The architecture of a global footprint

With Azure.com, our goal is to run a world-class website in a cost-effective manner at planetary scale. To do this, we currently run more than 25 Azure services. (See Services in Azure.com below.)

This blog examines the role of the main services, such as Azure Front Door, which routes HTTP requests to the web front end, and Azure App Service, a fully managed platform for creating and deploying cloud applications.

The following diagram shows you a high-level view of the global Azure.com architecture.

On the left, networking services provide the secure endpoints and connectivity that give users instant access, no matter where they are in the world.
On the right, developers use Azure DevOps services to run a continuous integration (CI) and continuous deployment (CD) pipeline that delivers updates and features with zero downtime.
In between, a variety of PaaS options provide compute, storage, security, monitoring, and more.

Azure.com global architecture: A high-level look at the Azure services and dataflow.

Host globally, deliver regionally

The Azure.com architecture is hosted globally but runs locally in multiple regions for high availability. Azure App Service hosts Azure.com from the nearest global datacenter infrastructure, and its automatic scaling features ensure that Azure.com meets changing demands.

The diagram below shows a close-up of the regional architecture hosted in App Service. We use deployment slots to deploy to development, staging, and production environments. Deployment slots are live apps with their own host names. We can swap content and configurations between the slots while maintaining application availability.

Azure.com regional architecture: App Service hosts regional instances in slots.

A look at the key PaaS components behind Azure.com

Azure.com is a complex, multi-tier web application. We use PaaS options as much as possible because managed services save us time. Less time spent on infrastructure and operations means more time to create a world-class customer experience. The platform performs OS patching, capacity provisioning, and load balancing, so we’re free to focus elsewhere.

Azure DNS

Azure DNS enables self-service quick edits to DNS records, global nameservers with 100-percent availability, and blazing fast DNS response times via Anycast addressing. We use Azure DNS aliases for both CNAME and ANAME record types.

Azure Front Door Service

Azure Front Door Service enables low-latency TCP-splitting, HTTP/2 multiplexing and concurrency, and performance-based global routing. We saw a reduction in RTT to less than 100 milliseconds per request, as clients only need to connect to edge nodes, not directly to the origin.

For business continuity, Azure Front Door Service supports backend health probes, a resiliency pattern that in effect removes unhealthy regions from rotation while they are misbehaving. In addition, to enable a backup site, Azure.com uses priority-based traffic routing. If our primary service backend goes offline, this method enables Azure Front Door Service to support ringed failovers.
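To make the failover behavior concrete, here is a minimal Python sketch of priority-based selection over health-probed backends. It is a toy model, not the Azure Front Door implementation or API: the `Backend` type, the backend names, and the `pick_backends` helper are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Backend:
    name: str        # hypothetical backend label
    priority: int    # lower number = preferred tier
    healthy: bool    # result of the most recent health probe


def pick_backends(backends: list[Backend]) -> list[Backend]:
    """Return the healthy backends in the best (lowest-numbered) priority tier."""
    healthy = [b for b in backends if b.healthy]
    if not healthy:
        return []  # nothing healthy; a real service would apply its own fallback
    best = min(b.priority for b in healthy)
    return [b for b in healthy if b.priority == best]


pool = [
    Backend("primary-west", priority=1, healthy=True),
    Backend("primary-east", priority=1, healthy=True),
    Backend("backup-site", priority=2, healthy=True),
]
print([b.name for b in pick_backends(pool)])  # both primaries serve traffic

# Simulate failing health probes across the primary tier.
for b in pool:
    if b.priority == 1:
        b.healthy = False
print([b.name for b in pick_backends(pool)])  # traffic fails over to the backup
```

The backup tier only receives traffic once every backend in the preferred tier has failed its probe, which is the essence of a ringed failover.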

Azure Front Door Service also acts as a reverse proxy, enabling pattern-based URL rewriting or request forwarding to handle dynamic traffic changes.

Web Application Firewall

Web Application Firewall (WAF) helps improve the platform’s security posture by load shedding bad bots and protecting against OWASP Top 10 attacks at the application layer. WAF forces developers to pay more attention to their data payloads, such as cookies, request URLs, form post parameters, and request headers.

We use WAF custom rules to block traffic based on geography, IP address, URL, and other request properties. The rules stop that traffic at the network edge before it reaches your origin.
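As a rough illustration of how this kind of rule matching works, the Python sketch below checks a request against geography, IP range, and URL rules. It does not use the WAF rule syntax or any Azure API. The `Request` type, the rule values, and `is_blocked` are assumptions made up for this example, and the IP ranges come from the reserved documentation blocks.

```python
import ipaddress
from dataclasses import dataclass


@dataclass
class Request:
    client_ip: str
    country: str   # two-letter code, assumed to come from a geo lookup
    path: str


# Hypothetical rule values, for illustration only.
BLOCKED_COUNTRIES = {"ZZ"}
BLOCKED_NETWORKS = [ipaddress.ip_network("203.0.113.0/24")]
BLOCKED_PATH_PREFIXES = ("/internal/",)


def is_blocked(req: Request) -> bool:
    """Return True if any custom rule matches the request."""
    if req.country in BLOCKED_COUNTRIES:
        return True
    addr = ipaddress.ip_address(req.client_ip)
    if any(addr in net for net in BLOCKED_NETWORKS):
        return True
    return req.path.startswith(BLOCKED_PATH_PREFIXES)


print(is_blocked(Request("203.0.113.7", "US", "/")))           # True: IP range rule
print(is_blocked(Request("198.51.100.1", "US", "/pricing/")))  # False: no rule matches
```

Because the check runs before any request is forwarded, a blocked request never consumes origin capacity.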

Content Delivery Network

To reduce load times, Azure.com uses Content Delivery Network (CDN) to shed load from the origin. CDN helps us lower the consumed bandwidth and keep costs down. CDN also improves performance by caching static assets at the Point of Presence (POP) edge nodes and reducing RTT latency. Without CDN, our origin nodes would have to handle every request for static assets.

CDN also supports DDoS protection, improving app security. We enable CDN compression and HTTP/2 to optimize delivery for static payloads. Using CDN is also a sustainable approach to optimizing network traffic because it reduces the data movement across a network.
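The origin-offload effect of edge caching can be sketched in a few lines of Python. This is a deliberately simplified model that ignores TTLs, cache eviction, and multiple POPs. The `EdgeCache` class, the `origin` function, and the asset paths are hypothetical.

```python
class EdgeCache:
    """Toy edge node: serves cached assets and counts trips to the origin."""

    def __init__(self, fetch_from_origin):
        self._fetch = fetch_from_origin
        self._store = {}
        self.origin_requests = 0

    def get(self, path: str) -> bytes:
        if path not in self._store:
            # Cache miss: go to the origin once, then keep the asset at the edge.
            self.origin_requests += 1
            self._store[path] = self._fetch(path)
        return self._store[path]


def origin(path: str) -> bytes:
    """Stand-in for the origin web app."""
    return f"contents of {path}".encode()


edge = EdgeCache(origin)
for _ in range(1000):
    edge.get("/css/site.css")
    edge.get("/js/app.js")

print(edge.origin_requests)  # 2: one origin fetch per distinct asset
```

In this model, 2,000 client requests produce only two origin fetches, which is where the bandwidth and cost savings come from.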

Azure App Service

We use App Service horizontal autoscaling to handle burst traffic. The Autoscale feature is simple to use and is based on Azure Monitor metrics for requests per second (RPS) per node. We also reduced our Azure expenses by 50 percent by using elastic compute—a benefit that directly reduces our carbon consumption.
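The arithmetic behind scaling on RPS per node can be sketched as follows. This is a simplified target-based calculation and not how the Autoscale feature is configured (Autoscale is driven by rules on Azure Monitor metrics). The function name, the target of 100 RPS per node, and the node limits are assumptions chosen for illustration.

```python
import math


def desired_instances(total_rps: float, target_rps_per_node: float,
                      min_nodes: int = 2, max_nodes: int = 20) -> int:
    """Return the node count that keeps per-node RPS at or below the target."""
    if target_rps_per_node <= 0:
        raise ValueError("target_rps_per_node must be positive")
    needed = math.ceil(total_rps / target_rps_per_node)
    # Clamp to the configured floor and ceiling.
    return max(min_nodes, min(max_nodes, needed))


for rps in (150, 1200, 9000):
    print(rps, desired_instances(rps, target_rps_per_node=100))
# 150 -> 2, 1200 -> 12, 9000 -> 20 (capped at max_nodes)
```

The floor keeps some capacity warm for sudden bursts, and the ceiling caps spend. Scaling back in when traffic drops is what produces the cost reduction described above.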

Azure.com uses several other handy App Service features:

Always On means there’s no idle timeout.
Application initialization provides custom warmup and validation.
VIP swap blue-green deployment pattern supports zero-downtime deployments.
To reduce network latency to the edge, we run our app in 12 geographically separate datacenters. This practice supports geo-redundancy should one or more datacenters go dark.
To improve app performance, we use the App Service DaaS – .NET profiler. This feature identifies node bottlenecks and hotspots for weak performing code blocks or slow dependencies.
For disaster recovery and improved mean time to recovery (MTTR), we use slot swap. In the event that an app deployment exception is not caught by our pre-production environment (PPE) testing, we can quickly roll back to the last stable version.
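The swap-based release and rollback pattern from the list above can be modeled in a few lines of Python. This is a toy simulation, not the App Service API. The `App` class and the version labels are hypothetical.

```python
class App:
    """Toy model of an app with a production slot and a staging slot."""

    def __init__(self, production: str, staging: str) -> None:
        self.slots = {"production": production, "staging": staging}

    def swap(self) -> None:
        """Exchange the builds running in the two slots."""
        s = self.slots
        s["production"], s["staging"] = s["staging"], s["production"]


app = App(production="v1", staging="v2")

app.swap()                      # release: v2 goes live, v1 waits in staging
print(app.slots["production"])  # v2

app.swap()                      # rollback: swap again to restore v1
print(app.slots["production"])  # v1
```

Because the previous build stays deployed in the other slot after a release, rolling back is just a second swap rather than a fresh deployment, which is what shortens MTTR.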

App Service is also a PaaS service, which means we don’t have to worry about the virtual machine (VM) infrastructure, OS updates, app frameworks, and the downtime associated with managing these. We follow the paired region concept when choosing our datacenters to mitigate the impact of any rolling infrastructure updates and ensure improved isolation and resiliency.

As a final note, it’s important to choose the right App Service plan tier so that you can right-size your vertical scaling. The plan you choose also affects sustainable energy proportionality, which means running instances at a higher utilization rate to maximize carbon efficiency.

DaaS – .NET Profiler: identifying code bottlenecks and measuring improvements. In this case we found our HTML whitespace “minifier” was saturating our compute nodes. After disabling it, we verified that response times and CPU usage improved significantly.

Azure Monitor

Azure Monitor enables passive health monitoring over Application Insights, Log Analytics, and Azure Data Explorer data sources. We rely on these query monitor alerts to build configuration-based health models based on our telemetry logs so we know when our app is misbehaving before our customers tell us.

For example, we monitor CPU consumption by datacenter as the following screenshot shows. If we see sustained, high CPU usage for our app metrics, Monitor can trigger a notification to our response team, who can quickly respond, triage the problem, and help improve MTTR. We also receive proactive notifications if a client-browser is misbehaving or throwing console errors, such as when Safari changes a specific push and replace state pattern.

Performance counters: We are alerted if CPU spikes are sustained for more than five minutes.
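The "sustained for five minutes" condition can be expressed as a check over a sliding window of samples. The Python sketch below is only an illustration of that logic, not an Azure Monitor alert rule. The function name, the 90 percent threshold, and the sample values are made up.

```python
def sustained_breach(samples: list[float], threshold: float, window: int) -> bool:
    """Return True if the last `window` samples all exceed `threshold`."""
    if window <= 0 or len(samples) < window:
        return False
    return all(s > threshold for s in samples[-window:])


# One CPU percentage per minute (made-up values).
cpu = [40, 55, 91, 93, 95, 92, 94]

print(sustained_breach(cpu, threshold=90, window=5))       # True: last five all above 90
print(sustained_breach(cpu[:-1], threshold=90, window=5))  # False: window still includes 55
```

Requiring every sample in the window to breach the threshold filters out short spikes, so the response team is only paged for load that persists.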

Application Insights

Application Insights, a feature of Monitor, is used for client- and server-side Application Performance Management (APM) telemetry logging. It monitors page performance, exceptions, and slow dependencies, and it offers cross-platform profiling. Customers typically use Application Insights in break-fix scenarios to improve MTTR and to quickly triage failed requests and application exceptions.

We recommend enabling telemetry sampling so you don’t exhaust your data volume storage quota. We set up daily storage quota alerts to capture any telemetry saturation before it shuts off our logging pipeline.
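One common way to sample telemetry is to hash an operation identifier and keep only the operations that fall below a percentage cutoff, so that all items belonging to one operation are kept or dropped together. The Python sketch below illustrates that general idea. It is not the Application Insights sampling algorithm, and the `keep` function and the `op-N` identifiers are hypothetical.

```python
import hashlib


def keep(operation_id: str, sample_percent: float) -> bool:
    """Decide deterministically whether to keep telemetry for an operation."""
    digest = hashlib.sha256(operation_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    return bucket * 100 < sample_percent


ids = [f"op-{i}" for i in range(10_000)]
kept = sum(keep(op, sample_percent=10) for op in ids)
print(f"kept {kept} of {len(ids)} operations")  # roughly 10 percent
```

Because the decision depends only on the identifier, every service that sees the same operation makes the same choice. Traces stay complete while stored volume drops to roughly the chosen percentage.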

Application Insights also provides OpenTelemetry support for distributed tracing across app domain boundaries and dependencies. This feature enables traceability from the client side all the way to the backend data or service tier.

Data volume capacity alert: Example showing that the data storage threshold is exceeded, which is useful for tracking runaway telemetry logs.

Developing with Azure DevOps

A big team works on Azure.com, and we use Azure DevOps Services to coordinate our efforts. We create internal technical docs with Azure Wikis, track work items using Azure Boards, build CI/CD workflows using Azure Pipelines, and manage application packages using Azure Artifacts. For software configuration management and quality gates, we use GitHub, which works well with Azure Boards.

We submit hundreds of daily pull requests as part of our build process, and the CI/CD pipeline deploys multiple updates every day to the production site. Having a single tool to manage the entire software development life cycle (SDLC) simplifies the learning curve for the engineering team and our internal customers.

To stay on top of what’s coming, we do a lot of planning in Delivery Plans. It’s a great tool for viewing incremental tasks and creating forecasts for the major events that affect Azure.com traffic, such as Microsoft Build, Microsoft Ignite, and Microsoft Ready.

What’s next

As the Azure platform evolves, so does Azure.com. But some things stay the same—the need for a reliable, scalable, sustainable, and cost-effective platform. That’s why we trust Azure.

Microsoft offers many resources and best practices for cloud developers. Please see our additional resources below. To get started, create your Azure free account today.

Services in Azure.com

For more information about the services that make up Azure.com, check out the following resources.

Compute

Networking

Storage

Access provisioning

Application life cycle