
AMD Cloud

Between 1996 and 2008 I was part of the team that designed and built the internal AMD cloud, which scaled from 40 Sun workstations to 70,000 cores of 64-bit AMD Opteron servers. This was unprecedented in size, and was driven by the ever-increasing demand for high-performance computing for microprocessor design and verification. We started with a home-grown job scheduler, and ultimately used Platform LSF, Cisco networking, NetApp filers, and many home-grown tools and services to scale to size.

I designed the cloud architecture that allowed AMD to leverage compute power across datacenters. This enabled relatively small engineering teams to execute at speed, competing with Intel.

I recall expanding the cloud infrastructure to use 64-bit x86 CPUs, at a time when no one else in the world was doing so.

These were truly the good old days - I was part of a creative and hard-working team that was tasked with doing whatever was needed to support AMD’s ravenous compute appetite. It is to Clive Dawson’s credit that we were able to experiment, learn, and build this infrastructure. Clive was my boss and my mentor, and we all learned a lot from him. And from each other. Also, AMD design engineering had a great culture of getting shit done. Mistakes were OK. Repeating the same mistakes was not.

Along the way we ‘invented’ infrastructure as code - first using bash, then Perl, and CFengine version 1, which did not work. Later we used CFengine version 2, which scaled as needed, with substantial care and feeding. We went from hand-assembling servers in racks, with bruised and bloody knuckles, to automated server rack deployments. Servers would roll in, already racked, and when booted on the network they would be automatically provisioned with the correct software and, following acceptance testing, would be ready for use.
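The boot-time flow described above - network boot, automatic provisioning, acceptance testing, then release into service - can be sketched roughly as follows. Every name here (provision_host, run_acceptance_tests, node001) is illustrative, standing in for the actual network install, CFengine policy run, and test suite; this is not the real AMD tooling.

```shell
#!/bin/sh
# Hypothetical sketch of the automated provisioning flow: a freshly racked
# server boots on the network, gets its software installed, and only
# becomes available for jobs after passing acceptance tests.
set -e

provision_host() {
    # stand-in for the network install plus config-management policy run
    echo "provisioned: $1"
}

run_acceptance_tests() {
    # stand-in for hardware, network, and filer-mount checks
    echo "acceptance tests passed: $1"
}

host="node001"
provision_host "$host"
run_acceptance_tests "$host"
echo "ready for use: $host"   # in practice: opened to the job scheduler
```

The key property is the gate: a host never reaches the scheduler pool until provisioning and acceptance both succeed, so a bad rack delivery fails loudly instead of silently running jobs.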