Zinc Whiskers and Their Role in Data Centre Failures
Steel is coated with a layer of zinc to stop it from corroding. Two methods are commonly used to apply the zinc: hot-dip galvanising and electroplating.
A zinc atom is heavier, and physically slightly larger, than an iron atom (the main constituent of mild steel). Both coating processes attempt to place one zinc atom on top of each iron atom and then build this structure up to form a continuous film.
In hot-dip galvanising, the perfect one-to-one matching is occasionally interrupted by a gap, allowing movement and releasing compression in the film. In electroplating, the surface zinc atoms are squeezed together as they are forced down onto the underlying array of iron atoms in a strict one-to-one pairing, forming a film under compressive stress. It is this compressive stress that drives whisker growth.
However, not all electroplated materials are prone to whiskers. Under some electroplating conditions, the packing of zinc atoms on top of the array of iron atoms is not perfect. This type of plating is less stressed, but is also more susceptible to corrosion under adverse environmental conditions such as high humidity. For this reason, the plating industry has regarded it as less satisfactory for corrosion protection.
In over 20 years' experience, we have seen only a single instance of numerous whiskers growing on hot-dip galvanised structures, and then only under unique circumstances.
Rates and conditions of whisker growth. Typical zinc whiskers are 1.2 microns in diameter and grow to millimetres in length. Above about half a millimetre, zinc whiskers are fragile and can readily break off and be carried in directed air flows. Even if they are brushed off at a shorter length, they can still cause problems.
The rate of growth appears to be solely a function of plating conditions: current density, nature of the solution and depletion of the bath. Indeed, the same component plated in the same bath at different times can show completely different characteristics of whisker growth, from rapid to absent, from vanishingly short to millimetres in length. There is no evidence of a significant environmental influence on rates and lengths of growth. We know that the rate of growth diminishes over time as the stress is released, but even this effect cannot be predicted.
In a data centre, cable trays are usually hot-dip coated. Stringers and pedestals that support data centre floors are frequently electroplated. The underside of floor tiles may have either type of zinc finish, and it is not uncommon to find plated zinc films used in the construction of power supply housings, server racks, support frames and so on: potential whisker sources right where they can do the most damage.
In the 1970s and 1980s, data centres were planned and built with electroplated sheets of mild steel on the underside of raised floor tiles. Frequently these tiles became rich sources of zinc whiskers, growing to millimetre lengths and waving in directed cooling air flows. When disturbed by even the smallest event, the whiskers became airborne and were carried in the cooling flow. This was not a problem in the 1980s but created a disaster in the 1990s as IT equipment advanced.
In the 1990s, more and more function was placed on a single chip, requiring more output connections to the circuit card. To communicate this function, denser packaging solutions were designed, which increased the amount of heat that had to be removed from these more complex chip packages. Airflows were therefore designed and directed to cool the higher-power chips, which dissipated more than half their heat through the legs connecting the chip package to the circuit card.
A second development was the wide adoption of CMOS technology, driven by shrinking chip design rules and lower power consumption. In surface-mount technology, the connections of chip packages were laid out to 250 micron ground rules (the spacing between adjacent conductors). The stage was set for a problem. Higher-power circuits were not affected: whiskers simply fused and did little damage. With CMOS, however, small voltage disturbances were translated into spurious commands in downstream circuitry. Power supplies were driven into overload, and actuators in disk files were driven into the end stops with resultant head crashes. After this brief interaction, little evidence of the offending whisker was ever found. The problem was further compounded as disk storage became ‘RAIDed’, with elements of a single data set spread across many individual disk files (DASD). When whiskers caused several such DASDs to fail simultaneously, serious problems with significant data loss ensued.
By the 1990s these factors (whiskers, CMOS, packaging ground rules and cooling airflow) had all converged. The whisker issue had arrived.
Vulnerability varies. The greater the requirement for high cooling air flow, the greater the risk. Legacy equipment working at higher power levels is frequently less vulnerable, as the whisker fuses before it can cause circuit damage. Most new IT equipment (post-2010) is protected by design and technology innovation. For example, flip-chip technology is displacing J-lead attachment and therefore reducing the potential for whisker-induced short circuits. Air flow labyrinths at entrances to servers impede the access of zinc whiskers to internal circuitry. The most damaging effects are seen in smart power supplies and DASD devices, as complete data loss frequently results. By contrast, tape libraries appear seldom affected by whiskers, but they are affected by dust, dirt and magnetic material such as fine rust.
Suspect a zinc whisker problem if you have an unusually high rate of power supply failures, often of recently replaced units, or multiple simultaneous DASD failures leading to data loss. To cause such effects, there has to be a source of whiskers of adequate length, with many airborne at the same instant. The distance between source and target has to be short, and the packaging technology has to allow whiskers to short closely spaced conductors. The most dangerous times for whisker-induced damage are when affected tiles are being moved around during installation activity, particularly near the front of servers and racks. With care, however, even this can be achieved without data loss.
Detection. The situation in any affected or suspected data centre needs to be monitored by experts. Zinc whiskers can be identified by careful inspection under bright light, but to be sure, samples need to be examined in an electron microscope to measure length. At the same time, the X-ray emission spectrum will verify that the sample is zinc and not, for instance, fibreglass strands, which may also be present in a data centre environment. Only with this firm information can the magnitude of the problem be gauged.
The best solution is to replace all affected components, an expensive and disruptive proposition. It is dangerous too, because contaminated tiles have to be lifted and removed, potentially releasing a shower of whiskers into nearby equipment. Changing pedestals is almost impossible in an active data centre.
Alternatively, you could consider a combined solution involving specialised cleaning, monitoring and tile replacement. Whiskers are a consequence of the compressive stress in the plated film. The higher the stress, the faster and longer the whiskers grow. But as they grow, the stress is released and the rate of growth falls. So although you cannot predict the number of cleans required, each clean will improve the long-term prognosis. In some instances, a single clean is enough. In other situations, two or three cleans may be required. Initial rates of growth vary considerably, from zero to as high as 1 mm a month. This means that whiskers of a damaging length can sometimes already be present on zinc-plated components before the item is delivered to its customer. Cleaning, like zinc whisker detection, is best carried out by experts experienced in selective decontamination, who have a record of successfully removing whiskers from live IT environments.
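As a rough illustration of why whiskers can already be of damaging length on delivery, the sketch below works out how long a whisker would take to reach a given length at a constant growth rate. This is a simplification, not a predictive model: growth slows as stress is released and cannot be predicted, so a constant rate gives only the shortest plausible time. The function name and constants are illustrative; the thresholds are the 250 micron conductor spacing and the half-millimetre fragility length quoted earlier.

```python
def months_to_reach(length_mm: float, rate_mm_per_month: float) -> float:
    """Shortest time in months for a whisker to reach length_mm.

    Assumes a constant growth rate (a simplification: real growth
    slows as the stress in the plated film is released).
    """
    if length_mm < 0:
        raise ValueError("length must be non-negative")
    if rate_mm_per_month <= 0:
        return float("inf")  # a film that grows no whiskers never gets there
    return length_mm / rate_mm_per_month


CONDUCTOR_SPACING_MM = 0.25   # 250 micron ground rule
FRAGILE_LENGTH_MM = 0.5       # above this, whiskers readily break off
MAX_RATE_MM_PER_MONTH = 1.0   # highest initial growth rate quoted in the text

for label, length in [("bridge a 250 micron gap", CONDUCTOR_SPACING_MM),
                      ("become fragile", FRAGILE_LENGTH_MM)]:
    months = months_to_reach(length, MAX_RATE_MM_PER_MONTH)
    print(f"Time to {label}: {months:.2f} months")
```

At the fastest quoted rate, a whisker could in principle span a 250 micron gap in roughly a week and reach the fragile half-millimetre length in about two weeks, well within typical storage and shipping times.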