Bus contention and catastrophic errors
The data on the DDR bus is only present on the signal lines for a short period of time. The DDR data bus is shared among the different DIMMs/SODIMMs in a channel and among the DRAM parts on a DIMM/SODIMM. Once Read or Write data is on the bus, it is imperative that the next Read or Write data wait until the bus is clear before it is driven onto those same signal lines. It is like a traffic intersection: do not enter while other cars are still in it, because when the light changes you may be caught in a collision.
A collision of data on the DDR data bus leads to corruption. Some of this corruption is detectable and correctable, but some is catastrophic and will result in a system crash or, worse yet, undetectable data corruption. The JEDEC specification tells designers of both memory controllers and DRAM parts what the timing between events must be for correct operation. For the most part these are minimum timings; that is, they keep events from occurring so close together that the logic is not ready or bus contention can occur. The JEDEC standard for DDR3, for example, describes the separation that must occur on the bus between a Write and a Read command to the same rank.
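To make the Write-to-Read rule concrete, here is a minimal Python sketch of how such a check could be expressed over a captured command stream. It assumes the commonly cited DDR3 relationship that a Read to the same rank may not be issued until CWL + BL/2 + tWTR clock cycles after the Write command. The cycle counts, the Command record, and the function name are illustrative assumptions, not values measured on the board discussed here and not the tool's implementation.

```python
from dataclasses import dataclass

# Illustrative timing parameters in clock cycles (assumed values; real ones
# come from the DRAM data sheet and the configured speed grade).
CWL = 9        # CAS write latency
BURST = 4      # a BL8 burst occupies BL/2 = 4 clock cycles on the data bus
T_WTR = 7      # write-to-read turnaround

MIN_WR_TO_RD = CWL + BURST + T_WTR


@dataclass
class Command:
    cycle: int   # clock cycle at which the command was sampled
    rank: int
    kind: str    # "WR" or "RD" (other commands are ignored in this sketch)


def find_wr_to_rd_violations(trace, min_gap=MIN_WR_TO_RD):
    """Return (write, read) pairs where a Read follows a Write to the
    same rank by fewer than min_gap clock cycles.

    The trace is assumed to be ordered by cycle.
    """
    last_write = {}          # rank -> most recent Write command seen
    violations = []
    for cmd in trace:
        if cmd.kind == "WR":
            last_write[cmd.rank] = cmd
        elif cmd.kind == "RD":
            wr = last_write.get(cmd.rank)
            if wr is not None and cmd.cycle - wr.cycle < min_gap:
                violations.append((wr, cmd))
    return violations


trace = [
    Command(cycle=100, rank=0, kind="WR"),
    Command(cycle=112, rank=0, kind="RD"),   # 12 cycles after the Write
    Command(cycle=200, rank=1, kind="WR"),
    Command(cycle=230, rank=1, kind="RD"),   # 30 cycles after the Write
]
for wr, rd in find_wr_to_rd_violations(trace):
    print(f"rank {rd.rank}: Read at cycle {rd.cycle} only "
          f"{rd.cycle - wr.cycle} cycles after Write (minimum {MIN_WR_TO_RD})")
```

A hardware checker does the equivalent comparison on every clock edge rather than on a stored trace, which is what allows it to run continuously while the system executes real software.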
The JEDEC specification is detailed in its timing requirements to prevent data collision on the DDR data bus. Even so, we quickly found a violation of this Write to Read requirement on a brand-new motherboard using the real-time protocol violation and analysis tool (see figure 2).
Figure 2: DDR3 Detective violation screen shows error count and Rank for a new motherboard with the DDR3 running at 1867 MT/s.
Looking at the source of the violation
It can be very helpful to get a look at the traffic occurring around the violation. There are several approaches available with different levels of information. First, and simplest, is a real-time protocol violation and analysis tool with an internal state listing option where Address, Command and Control traffic can be captured and observed.
A second approach, especially helpful if using BGA probing, is the State Output Option of the protocol violation tool. This is where the Address/Command/Control signals are brought out to the external logic analyzer after they have been buffered.
A third and most powerful approach is to use a DDR3 interposer probe (DIMM and SODIMM versions are available) to provide simultaneous real-time protocol violation and analysis with logic analyzer capture of the entire DDR bus, including Address/Command/Control and Data. This is accomplished by double probing the Address/Command/Control signals: one copy of these signals goes to the tool for protocol violation testing, and a second copy goes to the external logic analyzer. Data signals can also be sent directly to the logic analyzer for analysis (see figure 3). The logic analyzer trace on the left side of Figure 3 shows a deep capture initiated by a cross trigger from the DDR3 Detective when it detected the Write to Read minimum time violation. The logic analyzer time markers show that the Read happened too soon after the Write, and the exact data values during the violation are visible. The designer can then explore whether data corruption is associated with this protocol violation.
Figure 3: Write to Read to the same Rank too close together, shown on the logic analyzer trace (left) and the real-time protocol violation tool (right).
So what is the possible effect on the system if a violation like this occurs? In this particular example, it could result in bus contention or the DRAM could deliver bad Read data as it is trying to react to a Read command while still recovering from the Write. The behavior should then be considered a statistical problem: If the memory controller does this hundreds of times per second, how long will it take before an error shows up? These types of protocol violations may not ever cause a problem with some DRAM parts or they may show up as a failure months or years after systems are deployed into the field.
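The "how long before an error shows up" question can be framed with a simple model. As a rough sketch, assume each violation independently causes an error with some small probability p. That probability is a hypothetical figure: none is given here, and the real value depends on the DRAM part and operating conditions. Under this assumption, at r violations per second the mean time to the first error is 1/(r p), and the chance of at least one error within t seconds is 1 - (1 - p)^(r t).

```python
import math


def mean_seconds_to_first_error(violations_per_sec, p_error):
    """Expected time until the first error, treating each violation as an
    independent trial that fails with probability p_error."""
    return 1.0 / (violations_per_sec * p_error)


def prob_error_within(seconds, violations_per_sec, p_error):
    """Probability of at least one error within the given time window."""
    trials = violations_per_sec * seconds
    # log1p/expm1 keep precision when p_error is very small
    return -math.expm1(trials * math.log1p(-p_error))


rate = 200.0      # violations per second ("hundreds of times per second")
p = 1e-9          # ASSUMED chance that a single violation corrupts data

days = mean_seconds_to_first_error(rate, p) / 86400
print(f"mean time to first error: {days:.0f} days")
print(f"chance of an error in the first year: "
      f"{prob_error_within(365 * 86400, rate, p):.3f}")
```

Changing p by a few orders of magnitude moves the mean time to failure from days to decades, which is why a violation that never shows up in lab testing can still appear in deployed systems.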
With the real-time protocol violation and analysis tool, a memory system can be checked for protocol violations while the system is running any type of software or application. If violations are detected, they can be counted and flagged as coming from particular ranks of the memory. The tool can be configured to test for over 45 possible protocol violations on all ranks and banks for a total of over 400 unique checks.
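The counting-by-rank idea can be illustrated with a few lines of Python. This is only a sketch of the bookkeeping, not the tool's design: the event format and function name are hypothetical, and the rule labels are standard JEDEC timing parameter names chosen for illustration.

```python
from collections import Counter


def tally_violations(events):
    """Count violations per (rule, rank) pair.

    events: iterable of (rule_name, rank) tuples, one per detected violation.
    """
    return Counter(events)


# Hypothetical event stream, one entry per detected violation.
events = [("tWTR", 0), ("tWTR", 0), ("tRCD", 1), ("tWTR", 1), ("tWTR", 0)]
counts = tally_violations(events)

for (rule, rank), n in counts.most_common():
    print(f"{rule:6s} rank {rank}: {n} violation(s)")
```

Keeping a separate count for each rule and rank is what lets a designer see at a glance whether a problem is tied to one rank or spread across the channel.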