Simplicity is the first casualty in the age of cloud

A couple of weeks ago, one of Microsoft's cloud data centers shut down and many clients had difficulty accessing its services.

The explanation given for the shutdown is that multiple lightning strikes hit the electrical grid, causing the cooling systems to fail, which in turn damaged many hardware devices.

Such disasters can be predicted, and Microsoft had predicted them: it built a redundant infrastructure.

So such a disaster should not have been a big problem, but it was.


A part of the explanation from Microsoft reads:

“Customer impact began at 4 September at 11:00 UTC. One of the AAD sites for North America is based in the South Central US region, specifically in the affected datacenter. The design for AAD includes globally distributed sites for high availability so, when infrastructure began to shutdown in the impacted datacenter, authentication traffic began automatically routing to other sites. Shortly thereafter, there was a significantly increased rate in authentication requests. Our automatic throttling mechanisms engaged, so some customers continued to experience high latencies and timeouts. Adjustments to traffic routing, limited IP address blocking, and increased capacity in alternate sites improved overall system responsiveness – restoring service levels. Customer impact was fully mitigated at 14:40 UTC on 4 September”

An unbelievable explanation! One of the aims of keeping multiple DCs and multiple sites is to route authentication requests to other DCs when nearby DCs fail.

And we learn that Microsoft throttles the processing of authentication requests for some obscure reason, which aggravated the problem.

In the cloud or on premises, we should remember not to apply such measures, and we should try to keep the infrastructure as simple as possible.

As for the Active Directory:

We should limit the number of DCs.

We should limit the number of sites: sites exist for slow links, and nowadays we have almost no slow communication lines.

We should decrease site connector replication intervals.

We should not create site link bridges.

We should not create artificial traffic routing rules, IP address blocks, etc.
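One way to check a forest against the recommendations above, as an added example that is not part of the original post. The function name `Test-AdTopologySimplicity` and its thresholds are assumptions, "site connector" is read here as site links, and the sample objects are constructed to mimic the property names returned by the ActiveDirectory module.

```powershell
# Added example: flags AD topology that departs from the post's recommendations.
# All thresholds are assumptions chosen for illustration, not values from the post.
function Test-AdTopologySimplicity {
    param(
        [object[]]$Sites,
        [object[]]$SiteLinks,
        [object[]]$Bridges = @(),
        [object[]]$DomainControllers,
        [int]$MaxSites = 3,
        [int]$MaxDcsPerSite = 2,
        [int]$MaxIntervalMinutes = 15
    )
    $findings = @()
    if ($Sites.Count -gt $MaxSites) {
        $findings += "Sites: $($Sites.Count) defined (limit $MaxSites)"
    }
    foreach ($g in ($DomainControllers | Group-Object Site)) {
        if ($g.Count -gt $MaxDcsPerSite) {
            $findings += "DCs: site '$($g.Name)' has $($g.Count) (limit $MaxDcsPerSite)"
        }
    }
    foreach ($l in $SiteLinks) {
        if ($l.ReplicationFrequencyInMinutes -gt $MaxIntervalMinutes) {
            $findings += "Link '$($l.Name)': replicates every $($l.ReplicationFrequencyInMinutes) min"
        }
    }
    foreach ($b in $Bridges) { $findings += "Bridge '$($b.Name)': explicit site link bridge" }
    $findings
}

# Constructed sample data (not from a real environment).
$sites = 'HQ','Branch1','Branch2','Branch3' | ForEach-Object { [pscustomobject]@{ Name = $_ } }
$dcs   = 'HQ','HQ','HQ','Branch1' | ForEach-Object { [pscustomobject]@{ Site = $_ } }
$links = @(
    [pscustomobject]@{ Name = 'HQ-Branch1'; ReplicationFrequencyInMinutes = 180 }
    [pscustomobject]@{ Name = 'HQ-Branch2'; ReplicationFrequencyInMinutes = 15 }
)
$bridges = @([pscustomobject]@{ Name = 'Branch1-Branch2-Bridge' })
Test-AdTopologySimplicity -Sites $sites -SiteLinks $links -Bridges $bridges -DomainControllers $dcs

# Against a live domain (requires the RSAT ActiveDirectory module):
# Test-AdTopologySimplicity -Sites (Get-ADReplicationSite -Filter *) `
#     -SiteLinks (Get-ADReplicationSiteLink -Filter *) `
#     -Bridges (Get-ADReplicationSiteLinkBridge -Filter *) `
#     -DomainControllers (Get-ADDomainController -Filter *)
```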

In short, our systems shouldn’t be so complex at all. Complexity kills availability.
