Improving Reliability In Chips

Experts at the Table: Tracing device quality throughout its expected lifetime.

Semiconductor Engineering sat down to discuss changes in test that address tracing device quality throughout a product’s lifetime with Tom Katsioulas, CEO at Archon Design Solutions and U.S. Department of Commerce IoT advisory board member; Ming Zhang, vice president of R&D Acceleration at PDF Solutions; and Uzi Baruch, chief strategy officer at proteanTecs. What follows are excerpts of that conversation, which was held in front of a live audience at SEMICON West’s Test Vision Symposium.

[L-R] Ming Zhang, PDF Solutions; Uzi Baruch, proteanTecs; Tom Katsioulas, Archon Design Solutions. Source: Semiconductor Engineering/Susan Rambo

SE: How do you square the need to minimize test cost with the need to potentially perform testing from cradle to grave to ensure reliability? What areas are most in need of change?

Baruch: When you drive the car, you want to know that the device was tested. But you also want to know that it’s performing well over its lifetime, which requires a different approach. The idea is that embedding comprehensive monitoring features in a silicon device brings a new set of data into the world: monitors of different types, making high-resolution measurements, go into the device itself. Once the device comes back from the foundry, those monitors let you get telemetry on the actual device’s performance during test. You can literally ask a device how it’s doing and get a set of data, outside of the chip, that describes everything it was doing while an application was running. Back at test, engineers can leverage that data in conjunction with the standard test data that exists today. Companies like PDF Solutions are leveraging this data. And we work with customers to enable them at test to make decisions on a new set of data that complements the existing sets of information. With a test you get pass/fail information. But the really interesting thing is to understand, in a parametric way, ‘What’s your distance to failure?’ Monitors provide that visibility.
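As a rough illustration of that ‘distance to failure’ idea (a sketch, not proteanTecs’ actual scheme), the snippet below normalizes hypothetical monitor readings against a nominal value and a failure limit, so a die that passes its test limits can still be flagged when it sits close to the failure point. The monitor names, limits, and readings are invented for the example.

```python
# Hypothetical example: turning per-die monitor telemetry into a parametric
# "distance to failure" instead of a bare pass/fail bit.
from dataclasses import dataclass

@dataclass
class MonitorReading:
    name: str          # illustrative monitor name, e.g. "ring_osc_delay_ps"
    value: float       # telemetry reading for this die
    nominal: float     # typical reading for healthy, fresh silicon
    fail_limit: float  # reading at which the parameter is considered failing

def distance_to_failure(readings):
    """Per monitor: 1.0 means the die sits at the nominal point, 0.0 means it has
    reached the failure limit. Works whether the limit is above or below nominal."""
    margins = {}
    for r in readings:
        window = r.fail_limit - r.nominal
        if window == 0:
            continue  # degenerate limit definition; nothing to normalize against
        margins[r.name] = (r.fail_limit - r.value) / window
    return margins

# A die that passes test but is already close to failure on one monitor.
die = [
    MonitorReading("ring_osc_delay_ps", value=118.0, nominal=100.0, fail_limit=125.0),
    MonitorReading("vmin_headroom_mv",  value=35.0,  nominal=50.0,  fail_limit=0.0),
]
print(distance_to_failure(die))  # small values flag dies worth extra screening
```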

Zhang: I’m going to talk about three things. Number one, devices are trending toward domain-specific architectures, meaning very specific chips rather than very general-purpose chips. Trend number two is heterogeneous integration of chiplets, substrates, interconnects, and components. Devices are being vertically integrated into customized subsystems. Number three is how we are working with our partners to address the test challenges these trends create. Heterogeneous integration means more sources of defects and variation, coming from different components and different vendors. When you design domain-specific chips, there is no gen-minus-one product, so you lose the knowledge you would normally carry over from the previous design generation. Finally, with vertical integration, as companies try to take ownership of a really large subsystem, the line between component and system test is becoming a bit blurry. System-level test (SLT) becomes complicated. With some of our partners, we’re trying to bridge between design, manufacturing, and deployment. And one of the pieces is analytics. When you have that connection you get more data from more sources, whether it’s from design or test, and it helps analysis.

Katsioulas: Does your company sell your test data anywhere? [If not,] you are leaving a lot of money on the table. Take one of the major vendors. They’re going to buy the chips and they’re going to test them. But if the test data is available, they will pay for it instead of investing in engineers to regenerate it. It’s very important because the test data, whether it’s from probe test, final test, PCB test, or self-test anywhere in the supply chain, is gold. And as we move forward to a connected world, I contend the world will evolve into a world of data producers and data consumers. And when it comes to the product supply chain, test is the most valuable data-producing operation there is.

Now we run into a security problem. The problem with security is that everybody wants it and nobody wants to pay for it. The goal is basically to build security inside the chip for the classic things like secure boot, secure storage, secure access, and so on, although nobody was actually doing secure access. Combining that with a secure on-chip fingerprint creates a security subsystem. When you go to test, you can do an operation called zero-touch enrollment during the probe test, which ties data to a unique fingerprint. That’s extremely important, because once you create security and a fingerprint per chip, every time you turn the power on anywhere in the supply chain you get data, and now that data is secured. If you tie your test process to each and every die, you have the fundamental infrastructure for establishing a digital data thread. The digital data thread, starting from design and manufacturing, enables a marketplace of data consumers and producers. In addition, if you want to be very secure in that process, you add identifiers linked to the delivery of the asset. The asset is the chip, which is a physical asset, but the actual process of manufacturing that chip is a chain of digital identifiers. And if you can link all of that together, you can ship the part with a certificate.
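A minimal sketch of how probe-test data could be bound to a per-die fingerprint along the lines described above. It assumes a readable PUF-style fingerprint and a shared factory key; the field names, key handling, and use of an HMAC tag are illustrative simplifications, not any specific vendor’s zero-touch enrollment protocol.

```python
# Illustrative only: enroll a die at probe test by binding its test record to an
# identifier derived from an on-chip fingerprint, so later supply-chain steps can
# verify which physical asset the data belongs to.
import hashlib
import hmac
import json

def derive_device_id(fingerprint_bits: bytes) -> str:
    """Condense the raw on-chip fingerprint into a stable public identifier."""
    return hashlib.sha256(fingerprint_bits).hexdigest()

def enroll_at_probe(fingerprint_bits: bytes, test_record: dict, factory_key: bytes) -> dict:
    """Zero-touch-enrollment-style binding: attach the device ID and an integrity
    tag to the probe-test record (the start of a digital data thread)."""
    payload = {
        "device_id": derive_device_id(fingerprint_bits),
        "test_record": test_record,
    }
    blob = json.dumps(payload, sort_keys=True).encode()
    payload["tag"] = hmac.new(factory_key, blob, hashlib.sha256).hexdigest()
    return payload

# Later steps (final test, PCB test, in-field) can recompute the tag with the
# shared key and confirm the data traces back to this particular die.
record = enroll_at_probe(b"\x5a" * 32, {"stage": "wafer_probe", "bin": 1}, b"factory-secret")
print(record["device_id"][:16], record["tag"][:16])
```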

SE: This sounds fantastic if the chipmaker has the available die space, but not everything is a big digital core. Consider analog/mixed-signal devices with 70,000 chips per wafer. Can you securely track those chips?

Katsioulas: When you get down to dollar chips, there is no way to put that kind of sophisticated security inside the chip. You can only do traceability of the asset, without necessarily being able to securely get data in and out. Behind the die, you can potentially put another root of trust with a beacon that can trace the chip.

Zhang: As a circuit designer, I find those concerns really resonate. But it’s really all about value. Can I use it multiple times and in multiple places to essentially get more value? Analytics, for instance, can extract more information and draw more conclusions out of the same data.

SE: Are you working to design tools that could help determine if the information gained from on-chip monitors will pay off?

Baruch: That’s a great question. If you understand the value of monitoring at test, at system-level test, and over the lifetime, you understand there are several dynamics happening here from different angles. But to answer your question directly, we have invested in providing all the tools, seamlessly integrated into the standard flow that companies use for their design process. There are 12 different types of monitors. We’ve never increased the die size of any of the designs. And the most important question is, ‘What kind of problem do you want to solve?’ Is it to detect silent data errors in data centers, or to address automotive concerns with functional safety? One can argue that you’re already testing for it. But in practice, there are failures and problems in the field. Now, as to whether it’s worth the silicon space: if I had to expand the physical size of the die, I would be out of business before I even started. So you need sophistication in inserting the monitors in a smart way, with the right coverage, to actually solve the problems.
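One way to picture ‘inserting the monitors in a smart way’ is as a budgeting problem: choose the monitor instances that buy the most coverage without growing the die. The greedy selection below is only a sketch of that trade-off; the monitor names, areas, and coverage numbers are made up, and real insertion flows weigh many more factors.

```python
# Hypothetical sketch: pick monitors by coverage gained per unit area until a
# whitespace budget is used up, so the die outline does not have to grow.
def select_monitors(candidates, area_budget_um2):
    ranked = sorted(candidates, key=lambda c: c["coverage"] / c["area_um2"], reverse=True)
    chosen, used = [], 0.0
    for c in ranked:
        if used + c["area_um2"] <= area_budget_um2:
            chosen.append(c["name"])
            used += c["area_um2"]
    return chosen

candidates = [
    {"name": "path_margin_agent",   "area_um2": 120.0, "coverage": 0.35},
    {"name": "vmin_droop_monitor",  "area_um2": 200.0, "coverage": 0.40},
    {"name": "thermal_sensor_tile", "area_um2": 80.0,  "coverage": 0.10},
]
print(select_monitors(candidates, area_budget_um2=250.0))
```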

SE: Is the industry trading off here between testing over time, bit by bit, versus doing extra testing upfront? If it’s the chip in my car, it should be 150% better than it needs to be in order to last. Is this a ‘test earlier’ problem?

Zhang: It’s a ‘data early’ issue, but it may or may not be manufacturing test data. It could be data that I extract from the design during the ramp or at the steady state of manufacturing, or in the field. That will require in-circuit agents to extract data, and it will also require analyzing systematic defects from layouts. And more importantly, it’s not just data. It’s how you arrive at insight with all this data.

Katsioulas: You should do complete testing, no question about that. Whatever you do subsequently is not necessarily for the purpose of testing. It is for the purpose of using the test structures to get the data. That’s the distinction. And if you have a comprehensive test infrastructure, both inside the chip and outside the chip, then you actually can do predictive maintenance. If the performance of the chip deteriorates over time, and you can get that data from the chip while it’s in the field, you can potentially predict failures. One application is functional safety for cars. Another is a chip with memory and a processor where the bandwidth becomes a bottleneck in a heavy-traffic situation.
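A minimal sketch of that predictive-maintenance idea: fit a trend to periodic in-field readings from a degradation monitor and extrapolate to the point where the parameter would cross its failure threshold. The monitor, the sampling interval, the linear-aging assumption, and the threshold are all illustrative.

```python
# Illustrative only: estimate remaining time before an aging-sensitive parameter
# crosses its failure threshold, based on a linear fit to in-field telemetry.
def fit_linear_trend(times, values):
    """Ordinary least-squares slope and intercept."""
    n = len(times)
    mean_t = sum(times) / n
    mean_v = sum(values) / n
    slope = (sum((t - mean_t) * (v - mean_v) for t, v in zip(times, values))
             / sum((t - mean_t) ** 2 for t in times))
    return slope, mean_v - slope * mean_t

def hours_to_threshold(times, values, fail_threshold):
    """Extrapolated crossing time, or None if the parameter is not degrading."""
    slope, intercept = fit_linear_trend(times, values)
    if slope <= 0:
        return None
    return (fail_threshold - intercept) / slope

# Example: ring-oscillator delay (ps) sampled every 1,000 operating hours.
hours = [0, 1000, 2000, 3000, 4000]
delay = [100.0, 100.8, 101.7, 102.3, 103.2]
print(hours_to_threshold(hours, delay, fail_threshold=110.0))  # hours until the limit
```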

Baruch: The physics of those devices is that they age and they degrade. There is operational stress running on those devices that often cannot be accounted for. When you test at ATE, you have no clue what the final application or the workload will be. Because of that, you design with super-high margins to account for all possibilities, or there would be failures. Ask any automotive company about the mission profile (how much the device will actually be working, under what operating conditions, and with what degradation and aging from the operations it’s running), and let’s see how many of them will give you an answer. Yes, it should work for 10 to 15 years. So covering it with more tests doesn’t necessarily tell you what causes changes in the real world.

Katsioulas: We’re seeing that in data centers right now. If there’s a failure in the box, I want to get all the way down to the chip and the IP inside of the chip to read that value. And that’s the thread that the customer wants to create. Otherwise, they wouldn’t be able to quickly identify the fix.

To assure improving reliability, I would first want to use qual and production data from prior designs that represent similar design IP and packaging. Then I would invest in process development vehicles (test chips for the packaging and the design library) to establish a product recipe that falls within a qualified design and materials envelope. From there, I’d insist on a conservative design flow using margin from aging models. I would qualify the new product by stress testing to failure, accounting for end-use conditions, and have that data serve as the basis for ATE voltage and frequency (V/F) test margins and/or planned design revisions. I would include voltage, frequency, and temperature stress at ATE to activate early-life latent defects. I would perform stress-test-based reliability monitoring over the production lifetime to flag manufacturing variances. These steps are proven in our industry for reliability demonstration and management. It is of concern, then, that the panel appears to be advocating a methodology that does not include these methods. Adding test circuitry to a chip adds complexity, reduces yield, and only informs you, at some point in the future, that the product may or may not meet lifetime expectations.
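As a worked illustration of the guard-banding step in the comment above, the sketch below turns a distribution of end-of-life Vmin shifts, measured by stressing qual units to failure at end-use-like conditions, into a time-zero ATE screen voltage. The numbers and the 3-sigma allowance are assumptions for the example, not a qualified recipe.

```python
# Illustrative only: derive a time-zero low-voltage screen from stress-to-failure
# data, so that parts which pass at this voltage still meet the mission supply
# voltage after the expected end-of-life Vmin shift.
import statistics

def ate_vmin_screen_voltage(mission_vmin_v, eol_shift_samples_v, k_sigma=3.0):
    """Screen voltage = mission Vmin minus (mean shift + k * sigma of the shift)."""
    mu = statistics.mean(eol_shift_samples_v)
    sigma = statistics.stdev(eol_shift_samples_v)
    return mission_vmin_v - (mu + k_sigma * sigma)

# Vmin shifts (in volts) observed when stressing qual units to failure.
shifts = [0.018, 0.022, 0.025, 0.020, 0.027, 0.019, 0.024]
print(round(ate_vmin_screen_voltage(0.75, shifts), 3))  # parts must pass here at time zero
```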
