Monday, December 10, 2007

Non-Posted Memory Read Transaction

Example of a Non-Posted Memory Read Transaction
Let us put our knowledge so far to describe the set of events that take place from the time a requester device initiates a memory read request, until it obtains the requested data from a completer device. Given that such a transaction is a non-posted transaction, there are two phases to the read process. The first phase is the transmission of a memory read request TLP from requester to completer. The second phase is the reception of a completion with data from the completer.

Memory Read Request Phase
Refer to Figure 2-31. The requester Device Core or Software Layer sends the following information to the Transaction Layer:

Figure 2-31. Memory Read Request Phase


32-bit or 64-bit memory address.

Transaction type of memory read request.

Amount of data to read, calculated in doublewords.

Traffic class, if other than TC0.

Byte enables.

Attributes to indicate whether the 'relaxed ordering' and 'no snoop' attribute bits should be set or clear.

The Transaction Layer uses this information to build an MRd TLP. The exact TLP packet format is described in a later chapter. A 3 DW or 4 DW header is created depending on address size (32-bit or 64-bit). In addition, the Transaction Layer adds its requester ID (bus#, device#, function#) and an 8-bit tag to the header. It sets the TD (transaction digest present) bit in the TLP header if a 32-bit End-to-End CRC (ECRC) is added to the tail portion of the TLP. The TLP does not have a data payload. The TLP is placed in the appropriate virtual channel buffer, ready for transmission. The flow control logic confirms there are sufficient "credits" available (obtained from the completer device) for the virtual channel associated with the traffic class used.
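The header assembly described above can be sketched as follows. This is a minimal illustration, not the spec's bit-level layout: the field names and the dict representation are assumptions for clarity.

```python
# Illustrative sketch of the Transaction Layer building an MRd header.
# Field names and the dict representation are assumptions for clarity,
# not the exact bit layout defined by the PCI Express specification.
def build_mrd_header(address, length_dw, requester_id, tag, tc=0, td=False):
    """Build a simplified memory read request (MRd) header."""
    # A 4 DW header is needed only when the address does not fit in 32 bits.
    header_size_dw = 4 if address > 0xFFFFFFFF else 3
    return {
        "type": "MRd",
        "header_size_dw": header_size_dw,
        "address": address,
        "length_dw": length_dw,        # amount of data to read, in doublewords
        "requester_id": requester_id,  # (bus#, device#, function#)
        "tag": tag,                    # 8-bit tag used to match the completion
        "tc": tc,                      # traffic class (TC0 by default)
        "td": td,                      # True when an ECRC digest is appended
    }

hdr = build_mrd_header(0x1_0000_0000, 16, (0, 1, 0), tag=5)
print(hdr["header_size_dw"])  # 4, since the address exceeds 32 bits
```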

Only then is the memory read request TLP sent to the Data Link Layer. The Data Link Layer adds a 12-bit sequence ID and a 32-bit LCRC, which is calculated over the entire packet. A copy of the TLP, with sequence ID and LCRC, is stored in the replay buffer.

This packet is forwarded to the Physical Layer, which appends a Start symbol and an End symbol to the packet. The packet is byte-striped across the available Lanes, scrambled, and 8b/10b encoded. Finally, the packet is converted to a serial bit stream on all Lanes and transmitted differentially across the Link to the neighboring completer device.
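The byte-striping step can be pictured with a small sketch, assuming a simple round-robin distribution of packet bytes across Lanes (lane 0 receives the first byte):

```python
# A minimal sketch of byte striping across Lanes: bytes are distributed
# round-robin, lane 0 first. Scrambling and 8b/10b encoding happen per lane
# after this step and are omitted here.
def stripe(packet_bytes, num_lanes):
    lanes = [[] for _ in range(num_lanes)]
    for i, b in enumerate(packet_bytes):
        lanes[i % num_lanes].append(b)
    return lanes

# On a x4 Link, byte 0 goes to lane 0, byte 1 to lane 1, ..., byte 4 wraps
# back to lane 0.
print(stripe(list(range(8)), 4))  # [[0, 4], [1, 5], [2, 6], [3, 7]]
```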

The completer converts the incoming serial bit stream back into 10b symbols while assembling the packet in an elastic buffer. The 10b symbols are decoded back into bytes, and the bytes from all Lanes are de-scrambled and un-striped. The Start and End symbols are detected and removed. The resultant TLP is sent to the Data Link Layer.

The completer Data Link Layer checks for LCRC errors in the received TLP and checks the sequence ID for missing or out-of-sequence TLPs. Assume no error occurs. The Data Link Layer creates an ACK DLLP that contains the same sequence ID as the memory read request TLP received. A 16-bit CRC is added to the ACK DLLP. The DLLP is sent back to the Physical Layer, which transmits the ACK DLLP to the requester.
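The receive-side decision above can be sketched as follows. Note that `zlib.crc32` merely stands in for the LCRC here; the actual PCI Express LCRC uses its own polynomial and covers the sequence ID as well.

```python
import zlib

# Sketch of the completer-side Data Link Layer check. zlib.crc32 is a
# stand-in for the LCRC; the real LCRC polynomial differs.
def receive_tlp(tlp, next_expected_seq):
    # Recompute the CRC over the packet body and compare with the received LCRC.
    if zlib.crc32(tlp["body"]) != tlp["lcrc"]:
        return ("NAK", next_expected_seq)
    # A missing or out-of-sequence TLP also triggers a NAK.
    if tlp["seq"] != next_expected_seq:
        return ("NAK", next_expected_seq)
    # No error: acknowledge with the same sequence ID the TLP carried.
    return ("ACK", tlp["seq"])

good = {"seq": 7, "body": b"MRd", "lcrc": zlib.crc32(b"MRd")}
print(receive_tlp(good, 7))  # ('ACK', 7)
```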

The requester Physical Layer reassembles the ACK DLLP and sends it up to the Data Link Layer, which evaluates the sequence ID and compares it with the TLPs stored in the replay buffer. The stored memory read request TLP associated with the received ACK is discarded from the replay buffer. If a NAK DLLP had been received by the requester instead, it would re-send a copy of the stored memory read request TLP.
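The requester-side replay behavior can be sketched with a small class. This is a simplification: real hardware retires all stored TLPs up to and including the ACKed sequence ID, while this sketch keys on one TLP at a time.

```python
# A minimal replay buffer sketch: a TLP is held until acknowledged, and a
# NAK returns the stored copy for retransmission.
class ReplayBuffer:
    def __init__(self):
        self.pending = {}            # sequence ID -> stored TLP copy

    def store(self, seq, tlp):
        self.pending[seq] = tlp      # saved when the TLP is transmitted

    def on_ack(self, seq):
        self.pending.pop(seq, None)  # discard the acknowledged TLP

    def on_nak(self, seq):
        return self.pending[seq]     # stored copy to re-send

buf = ReplayBuffer()
buf.store(7, "MRd TLP with seq ID and LCRC")
buf.on_ack(7)                        # ACK received: copy discarded
print(buf.pending)                   # {}
```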

In the meantime, the Data Link Layer of the completer strips the sequence ID and LCRC fields from the memory read request TLP and forwards it to the Transaction Layer.

The Transaction Layer receives the memory read request TLP in the appropriate virtual channel buffer associated with the TC of the TLP. The Transaction Layer checks for ECRC errors. It forwards the contents of the header (address, requester ID, memory read transaction type, amount of data requested, traffic class, etc.) to the completer Device Core/Software Layer.

Completion with Data Phase
Refer to Figure 2-32 on page 99 during the following discussion. To service the memory read request, the completer Device Core/Software Layer sends the following information to the Transaction Layer:

Figure 2-32. Completion with Data Phase


Requester ID and Tag, copied from the original memory read request.

Transaction type of completion with data (CplD).

The requested amount of data, with the data length field.

Traffic class, if other than TC0.

Attributes to indicate whether the 'relaxed ordering' and 'no snoop' bits should be set or clear (these bits are copied from the original memory read request).

Finally, a completion status of successful completion (SC).

The Transaction Layer uses this information to build a CplD TLP. The exact TLP packet format is described in a later chapter. A 3 DW header is created. In addition, the Transaction Layer adds its own completer ID to the header. The TD (transaction digest present) bit in the TLP header is set if a 32-bit End-to-End CRC (ECRC) is added to the tail portion of the TLP. The TLP includes the data payload. The flow control logic confirms sufficient "credits" are available (obtained from the requester device) for the virtual channel associated with the traffic class used.
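The completion header assembly mirrors the request side; a hedged sketch (field names are again illustrative, not spec bit fields) shows which fields are copied from the saved request:

```python
# Illustrative sketch of the completer building a CplD header. The dict
# fields are assumptions for clarity, not the spec's bit-level format.
def build_cpld_header(request_hdr, completer_id, byte_count):
    return {
        "type": "CplD",
        "header_size_dw": 3,                          # completions use a 3 DW header
        "completer_id": completer_id,
        "requester_id": request_hdr["requester_id"],  # copied from the request
        "tag": request_hdr["tag"],                    # copied from the request
        "tc": request_hdr["tc"],                      # copied from the request
        "status": "SC",                               # successful completion
        "byte_count": byte_count,
    }

req = {"requester_id": (0, 1, 0), "tag": 5, "tc": 0}
cpl = build_cpld_header(req, completer_id=(2, 0, 0), byte_count=64)
print(cpl["tag"], cpl["status"])  # 5 SC
```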

Only then is the CplD TLP sent to the Data Link Layer. The Data Link Layer adds a 12-bit sequence ID and a 32-bit LCRC, which is calculated over the entire packet. A copy of the TLP, with sequence ID and LCRC, is stored in the replay buffer.

This packet is forwarded to the Physical Layer, which appends a Start symbol and an End symbol to the packet. The packet is byte-striped across the available Lanes, scrambled, and 8b/10b encoded. Finally, the CplD packet is converted to a serial bit stream on all Lanes and transmitted differentially across the Link to the neighboring requester device.

The requester converts the incoming serial bit stream back into 10b symbols while assembling the packet in an elastic buffer. The 10b symbols are decoded back into bytes, and the bytes from all Lanes are de-scrambled and un-striped. The Start and End symbols are detected and removed. The resultant TLP is sent to the Data Link Layer.

The Data Link Layer checks for LCRC errors in the received CplD TLP and checks the sequence ID for missing or out-of-sequence TLPs. Assume no error occurs. The Data Link Layer creates an ACK DLLP that contains the same sequence ID as the CplD TLP received. A 16-bit CRC is added to the ACK DLLP. The DLLP is sent back to the Physical Layer, which transmits the ACK DLLP to the completer.

The completer Physical Layer reassembles the ACK DLLP and sends it up to the Data Link Layer, which evaluates the sequence ID and compares it with the TLPs stored in the replay buffer. The stored CplD TLP associated with the received ACK is discarded from the replay buffer. If a NAK DLLP had been received by the completer instead, it would re-send a copy of the stored CplD TLP.

In the meantime, the requester Transaction Layer receives the CplD TLP in the appropriate virtual channel buffer mapped to the TLP's TC. The Transaction Layer uses the tag in the header of the CplD TLP to associate the completion with the original request. The Transaction Layer checks for ECRC errors. It forwards the header contents and data payload, including the completion status, to the requester Device Core/Software Layer. The memory read transaction is done.
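The tag-matching step can be sketched as a small lookup keyed by the 8-bit tag. The context strings here are invented for illustration; a real requester would track buffer addresses, byte counts, and similar state per tag.

```python
# Sketch of matching a completion to its outstanding request via the tag.
outstanding = {}                       # tag -> context saved at request time

def send_read(tag, context):
    outstanding[tag] = context         # remember what this tag is for

def on_completion(tag, data):
    context = outstanding.pop(tag)     # the tag identifies the original request
    return context, data

send_read(5, "read of device buffer")
result = on_completion(5, b"\x00" * 64)
print(result[0])                       # read of device buffer
```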

Hot Plug
PCI Express supports native hot plug, though hot-plug support in a device is not mandatory. Some of the elements found in a PCI Express hot-plug system are:

Indicators which show the power and attention state of the slot.

Manually-operated Retention Latch (MRL) that holds add-in cards in place.

MRL Sensor that allows the port and system software to detect the MRL being opened.

Electromechanical Interlock that prevents removal of add-in cards while the slot is powered.

Attention Button that allows the user to request hot-plug operations.

Software User Interface that allows the user to request hot-plug operations.

Slot Numbering for visual identification of slots.

When a port has no connection or a removal event occurs, the port transmitter moves to the electrical high impedance detect state. The receiver remains in the electrical low impedance state.

PCI Express Performance and Data Transfer Efficiency
As of May 2003, no realistic performance and efficiency numbers were available. However, Table 2-3 shows aggregate bandwidth numbers for various Link widths after factoring in the overhead of 8b/10b encoding.

Table 2-3. PCI Express Aggregate Throughput for Various Link Widths

Link Width                         x1    x2    x4    x8    x12   x16   x32
Aggregate Bandwidth (GBytes/sec)   0.5   1     2     4     6     8     16

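The table's figures follow from a short calculation: each first-generation lane signals at 2.5 Gbits/s per direction, 8b/10b encoding delivers 8 data bits for every 10 line bits, and the aggregate figure counts both directions of the Link.

```python
# Reproducing the Table 2-3 numbers from first principles.
def aggregate_gbytes_per_sec(lanes, line_rate_gbps=2.5):
    effective_gbps = line_rate_gbps * 8 / 10   # after 8b/10b overhead
    per_direction = effective_gbps / 8         # Gbits/s -> GBytes/s
    return per_direction * 2 * lanes           # both directions of the Link

for width in (1, 2, 4, 8, 12, 16, 32):
    print(f"x{width}: {aggregate_gbytes_per_sec(width)} GBytes/sec")
# x1 yields 0.5 GBytes/sec, x32 yields 16.0 GBytes/sec, matching the table.
```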


DLLPs are 2 doublewords in size. The ACK/NAK and flow control protocols use DLLPs, but these DLLPs are not expected to consume a significant portion of the bandwidth.

The remainder of the bandwidth is available for TLPs. Between 6 and 7 doublewords of each TLP are overhead associated with the Start and End framing symbols, sequence ID, TLP header, ECRC, and LCRC fields. The remainder of the TLP contains between 0 and 1024 doublewords of data payload. It is apparent that bus efficiency is quite low when small packets are transmitted, while efficiency is very high when TLPs carry large data payloads.
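A rough efficiency estimate follows directly from these numbers, assuming roughly 7 doublewords of per-TLP overhead:

```python
# Fraction of transmitted doublewords that carry payload, assuming about
# 7 DW of framing, sequence ID, header, ECRC and LCRC overhead per TLP.
def tlp_efficiency(payload_dw, overhead_dw=7):
    return payload_dw / (payload_dw + overhead_dw)

print(round(tlp_efficiency(1), 3))     # 0.125: a 1 DW payload is mostly overhead
print(round(tlp_efficiency(1024), 3))  # 0.993: a large payload amortizes it
```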

Packets can be transmitted back-to-back without the Link going idle. Thus the Link can be 100% utilized.

The switch does not introduce any arbitration overhead when forwarding incoming packets from multiple ingress ports to one egress port. However, it remains to be seen what effect the Quality of Service protocol has on actual bandwidth numbers for given applications.

There is overhead associated with the split transaction protocol, especially for read transactions. For a read request TLP, the data payload is returned in a separate completion. This factor has to be accounted for when determining the effective performance of the bus. Posted write transactions improve the efficiency of the fabric.

Switches support cut-through mode; that is, an incoming packet can be forwarded immediately to an egress port for transmission without the switch having to buffer up the packet. The latency for packet forwarding through a switch can be very small, allowing packets to travel from one end of the PCI Express fabric to the other with very little latency.
