Machine learning improves production test
For cost and capacity reasons, high-volume production often cannot afford to test difficult-to-measure parameters, so those tests are eliminated. In these situations, it is common to rely instead on R&D validation data captured from a set of samples. This can lead to quality issues if the sample set is not representative of the product, or to low yield if a large guard band is added to ensure the specification is met with high confidence.
To mitigate these risks, we’ve adopted a machine learning approach. We also propose an enhanced machine learning (EML) model to prevent quality escapes. We’ll use a case study to show how these methods work and the impact they can have on a production line.
A recent McKinsey study showed that only 20% to 30% of machine learning’s potential value has been captured in the manufacturing sector [1]. One of the primary barriers is skepticism toward applying machine learning in this sector. Here we show a case study where an enhanced machine learning model increases yield, reduces product cost, and improves product quality.
One of the main aspects of product cost optimization is reducing the number of parameters that must be measured during manufacturing test, especially those that are difficult to measure. The most common approach for eliminating a manufacturing test for a targeted parameter is to guarantee by design that it meets the specification; this is done using data collected during R&D design validation tests (DVT) on a sample of the product.
There are two concerns with this approach:
a) The derived limit must sit at least four standard deviations away from the mean value, since it normally has to meet a Cpk > 1.33 requirement; this guard band can negatively affect yield.
b) The derived value is based on a small number of early production samples, and if there are component or process changes down the road, those could possibly alter the value of the targeted parameter. This change may go unnoticed, affecting product quality.
To address both of these concerns, we describe an enhanced machine learning model that can reliably predict parameters that are difficult to measure yet have a significant effect on product yield and quality.
We will describe this methodology on a hypothetical Ethernet switch. Suppose this switch is built around a PHY chip that has no temperature sensor, but to guarantee reliability, the temperature of this chip must always stay below 85°C. Let’s assume that the factors affecting the PHY chip temperature are ambient temperature, fan speed, lot-to-lot variability, temperature of adjacent components (like other ICs and processors), PHY chip power consumption, and switch power consumption. In this case the only way to directly measure the PHY chip’s temperature is to place a thermocouple on the chip, which is difficult to do on an already-assembled unit during final manufacturing test. As such, the general approach is to directly measure the temperature of this chip on at least 30 DVT samples under the worst-case condition (maximum specified ambient temperature), calculate the mean and standard deviation of the temperature distribution, and from there calculate the Cpk using the following equation:

Cpk = (USL − mean) / (3σ) (1)

where USL is the upper specification limit, which in this case is 85°C.
Let’s assume that in the first attempt the mean and σ (standard deviation) of the 30 DVT samples are 70°C and 5°C respectively; this gives a Cpk of 1, which is below the target of 1.33. One way to resolve this issue is to increase the fan speed, which results in a lower chip temperature. Let’s say that by increasing the fan speed from 700 rpm to 1000 rpm, a Cpk of 1.33 is achieved, and therefore this higher fan speed becomes the default setting for all units going forward.
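The one-sided Cpk calculation above can be sketched in a few lines of Python. The sample values below are hypothetical, chosen only to reproduce the 70°C mean and 5°C sigma from the example:

```python
import numpy as np

def cpk_upper(samples, usl):
    """One-sided process capability against an upper spec limit:
    Cpk = (USL - mean) / (3 * sigma)."""
    samples = np.asarray(samples, dtype=float)
    return (usl - samples.mean()) / (3.0 * samples.std(ddof=1))

# Hypothetical DVT readings with mean 70 degC and sigma 5 degC, as in the example.
dvt_temps = [65.0, 70.0, 75.0]
print(cpk_upper(dvt_temps, usl=85.0))  # -> 1.0, below the 1.33 target
```

With the higher fan speed, the mean drops and the same formula yields the 1.33 target.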
Although this approach satisfies the Cpk requirement, the elevated fan speed is unnecessary for 99.994% of the population (the fraction implied by Cpk = 1.33). The end result is higher power consumption for all the switches, and some units can now exceed the maximum power consumption specification, causing yield loss. Traditional methods thus trade margin on one specification for yield loss on another, forcing sub-optimal solutions.
To reduce this yield loss, the unit power consumption should be lowered, which can be accomplished by lowering the default fan speed. The only way to reduce the fan speed is to be able to measure or predict the temperature of the PHY chip. Here we show how a methodology based on machine learning can be used to predict the PHY chip temperature using a few parameters that can be easily measured during final manufacturing test.
Methodology for creating the prediction model
To create the predictive model we used a three-step approach:
1. We made sure that manufacturing could reliably collect data on all the easy-to-measure parameters relevant to PHY chip temperature: ambient temperature, temperature readings from all devices with a temperature sensor, PHY chip power consumption, and switch power consumption.
2. We used principal component analysis (PCA) to find the parameters (measured during manufacturing test) best correlated with PHY chip temperature (directly measured during DVT). Table 1 shows a hypothetical example of multivariate results from the manufacturing data set and directly measured PHY chip temperature (the data is constructed by scaling the real data obtained for our product). We chose the three parameters most strongly correlated with PHY chip temperature: power consumption of the PHY chip, power consumption of the switch, and the temperature sensor reading of the board controller sitting close to the PHY chip. Domain expertise is required at this point to ensure that correlation results are supported by plausible causality rather than random chance.
Table 1 Multivariable analysis of input data
3. Machine learning algorithms were used to find the best model relating the three chosen parameters to PHY chip temperature. Among the many machine learning/deep learning algorithms evaluated, a neural network performed best, with the lowest root-mean-square error (RMSE). Thus, a traditional machine learning model creates a linkage between a difficult-to-measure parameter and readily available manufacturing data.
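The correlation-based screening in step 2 can be sketched as follows. The data here is synthetic and the coefficients are assumptions for illustration, not values from our product:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-ins for easy-to-measure manufacturing parameters.
ambient_temp    = rng.normal(25, 2, n)
phy_power       = rng.normal(12, 1, n)
switch_power    = rng.normal(150, 8, n)
board_ctrl_temp = rng.normal(55, 3, n)
fan_speed       = rng.normal(1000, 50, n)

# Assumed dependence of the DVT-measured PHY temperature on those inputs.
phy_temp = (20 + 1.5 * phy_power + 0.15 * switch_power
            + 0.4 * board_ctrl_temp - 0.01 * fan_speed
            + rng.normal(0, 0.5, n))

features = {
    "ambient_temp": ambient_temp,
    "phy_power": phy_power,
    "switch_power": switch_power,
    "board_ctrl_temp": board_ctrl_temp,
    "fan_speed": fan_speed,
}

# Rank candidates by absolute Pearson correlation with the target,
# then keep the three strongest for the prediction model.
corr = {name: abs(np.corrcoef(x, phy_temp)[0, 1]) for name, x in features.items()}
top3 = sorted(corr, key=corr.get, reverse=True)[:3]
```

On this synthetic data the three strongest correlates match the choices in the text (PHY power, switch power, board-controller temperature); a domain expert would still review the ranking before trusting it.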
Enhanced machine learning model
The neural network model outputs an estimated chip temperature (chip_temp_estimated), which equals the directly measured chip temperature plus an estimation error that can be either positive or negative. To maintain product reliability, however, we need to make sure the predicted values are always higher than the measured chip temperature. To achieve this we used the distribution of the temperature estimation error and calculated an upper-bound error with the following equation:
Upper bound chip temp error = upper 95% mean of estimation error + 3σ of estimation error (2)
The final chip temperature prediction is then obtained using the following equation:
chip_temp_predicted = chip_temp_estimated + upper bound chip temperature error (3)
The upper bound error factor in equation (3) adds a safety margin to the final predicted value, guaranteeing that the predicted value is higher than the measured value for 99.73% of the population. As more data becomes available, the EML model continuously updates the σ value and applies it in equation (2).
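Equations (2) and (3) can be sketched numerically as below. Interpreting the “upper 95% mean” as a one-sided 95% upper confidence limit on the mean error is our assumption, and the residuals are synthetic:

```python
import numpy as np

def upper_bound_error(errors, z95=1.645):
    """Equation (2): one-sided 95% upper confidence limit of the mean
    estimation error, plus three standard deviations of the error."""
    errors = np.asarray(errors, dtype=float)
    sigma = errors.std(ddof=1)
    mean_ucl = errors.mean() + z95 * sigma / np.sqrt(errors.size)
    return mean_ucl + 3.0 * sigma

def predict_chip_temp(chip_temp_estimated, errors):
    """Equation (3): shift every model estimate up by the margin."""
    return np.asarray(chip_temp_estimated, dtype=float) + upper_bound_error(errors)

# Hypothetical neural-network residuals on a holdout set (mean ~0, sigma ~1 degC).
rng = np.random.default_rng(1)
residuals = rng.normal(0.0, 1.0, 1000)
margin = upper_bound_error(residuals)
```

With zero-mean residuals the margin comes out close to 3σ, so every prediction sits comfortably above its estimate.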
Figure 1, a scaled version of the real data obtained for our product, compares chip_temp_estimated and chip_temp_predicted against the measured chip temperature for a sample of units. All the predicted values (shown by green circles) are higher than the measured values, confirming that our neural network model, together with the error margin calculated using equation (2), can reliably bound the actual PHY chip temperature. The figure also shows that the predicted values are on average substantially lower than the 85°C specification limit (shown as a red solid line). This allows manufacturing to lower the fan speed of switches that fail the power consumption specification, as long as the predicted PHY temperature stays below 85°C. All or a portion of the failed switches can then pass the specification, improving yield.
Figure 1 Estimated and predicted PHY chip temperature vs. directly measured values
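The retest decision just described reduces to a simple rule, sketched below. The function name, return labels, and power limit are hypothetical:

```python
TEMP_LIMIT_C = 85.0  # PHY chip reliability limit from the case study

def disposition(power_w, power_limit_w, predicted_phy_temp_at_low_fan_c):
    """Disposition a unit at final test: a unit that fails the power spec
    may be recovered by lowering its fan speed, provided the EML-predicted
    PHY temperature at the lower fan speed stays below the 85 degC limit."""
    if power_w <= power_limit_w:
        return "pass"
    if predicted_phy_temp_at_low_fan_c < TEMP_LIMIT_C:
        return "pass_with_reduced_fan"
    return "fail"

print(disposition(210.0, 200.0, 78.0))  # -> pass_with_reduced_fan
```

Only units whose predicted temperature leaves no headroom remain true failures; the rest are recovered, which is the yield improvement described above.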
We have described a methodology in which difficult-to-measure parameters can be reliably predicted from parameters routinely measured during manufacturing test, using a machine learning algorithm combined with an error-compensation margin. The predicted values can then be used to dynamically set other yield-affecting parameters during manufacturing test. This methodology also enables detection of quality excursions caused by component or process changes, as it does not rely on fixed specifications derived from early R&D DVT testing.
Although this workflow was discussed for a specific case, the concept can be applied to other areas of manufacturing, resulting in reduced cost, enhanced throughput, and prevention of quality excursions.
[1] The Age of Analytics: Competing in a Data-Driven World, McKinsey Global Institute, December 2016
—Rohit Mittal is a director of engineering at Intel.