Monday, December 10, 2007

PCI Express Device Layers
Overview
The PCI Express specification defines a layered architecture for device design, as shown in Figure 2-10 on page 70. The layers consist of a Transaction Layer, a Data Link Layer and a Physical Layer. Each layer can be further divided vertically into two halves: a transmit portion that processes outbound traffic and a receive portion that processes inbound traffic. However, a device design does not have to implement a layered architecture as long as the functionality required by the specification is supported.

Figure 2-10. PCI Express Device Layers


The goal of this section is to describe the function of each layer and to describe the flow of events to accomplish a data transfer. Packet creation at a transmitting device and packet reception and decoding at a receiving device are also explained.

Transmit Portion of Device Layers
Consider the transmit portion of a device. Packet contents are formed in the Transaction Layer with information obtained from the device core and application. The packet is stored in buffers, ready for transmission to the lower layers. This packet is referred to as a Transaction Layer Packet (TLP), described in an earlier section of this chapter. The Data Link Layer appends additional information to the packet that is required for error checking at the receiver device. The packet is then encoded in the Physical Layer and transmitted differentially on the Link by the analog portion of this layer. The packet is transmitted over the available Lanes of the Link to the neighboring receiver device.

Receive Portion of Device Layers
The receiver device decodes the incoming packet contents in the Physical Layer and forwards the resulting contents to the upper layers. The Data Link Layer checks the incoming packet for errors and, if there are no errors, forwards the packet up to the Transaction Layer. The Transaction Layer buffers the incoming TLPs and converts the information in the packet to a representation that can be processed by the device core and application.

Device Layers and their Associated Packets
Three categories of packets are defined, each associated with one of the three device layers. Associated with the Transaction Layer is the Transaction Layer Packet (TLP). Associated with the Data Link Layer is the Data Link Layer Packet (DLLP). Associated with the Physical Layer is the Physical Layer Packet (PLP). These packets are introduced next.

Transaction Layer Packets (TLPs)
PCI Express transactions employ TLPs which originate at the Transaction Layer of a transmitter device and terminate at the Transaction Layer of a receiver device. This process is represented in Figure 2-11 on page 72. The Data Link Layer and Physical Layer also contribute to TLP assembly as the TLP moves through the layers of the transmitting device. At the other end of the Link where a neighbor receives the TLP, the Physical Layer, Data Link Layer and Transaction Layer disassemble the TLP.

Figure 2-11. TLP Origin and Destination


TLP Packet Assembly
A TLP that is transmitted on the Link appears as shown in Figure 2-12 on page 73.

Figure 2-12. TLP Assembly


The software layer/device core sends the Transaction Layer the information required to assemble the core section of the TLP, which is the header and data portion of the packet. Some TLPs do not contain a data section. An optional End-to-End CRC (ECRC) field is calculated and appended to the packet. The ECRC field is used by the device ultimately targeted by this packet to check for CRC errors in the header and data portion of the TLP.

The core section of the TLP is forwarded to the Data Link Layer, which then appends a sequence ID and a Link CRC (LCRC) field. The LCRC field is used by the neighboring receiver device at the other end of the Link to check for CRC errors in the core section of the TLP plus the sequence ID. The resultant TLP is forwarded to the Physical Layer, which concatenates a Start and End framing character of 1 byte each to the packet. The packet is encoded and differentially transmitted on the Link using the available number of Lanes.
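The assembly order just described can be pictured as simple byte-level concatenation. The following C sketch illustrates that ordering; the framing byte values, the CRC inputs and the byte layout are illustrative assumptions (real hardware works on encoded symbols, not C buffers), but the field widths follow the text: a 12-bit sequence ID, an optional 32-bit ECRC and a 32-bit LCRC.

    #include <stdint.h>
    #include <string.h>

    enum { START_CHAR = 0xFB, END_CHAR = 0xFD };  /* assumed framing values */

    /* Assemble an on-the-wire TLP image into 'out'; returns total length.
       Endianness and the exact bit layout are ignored for brevity. */
    size_t tlp_assemble(uint8_t *out,
                        const uint8_t *hdr, size_t hdr_len,   /* 12 or 16 bytes */
                        const uint8_t *data, size_t data_len, /* may be 0 */
                        uint32_t ecrc, int has_ecrc,          /* optional digest */
                        uint16_t seq_id, uint32_t lcrc)
    {
        size_t n = 0;
        out[n++] = START_CHAR;                 /* Physical Layer: Start frame */
        out[n++] = (seq_id >> 8) & 0x0F;       /* Data Link Layer: 12-bit seq ID */
        out[n++] = seq_id & 0xFF;
        memcpy(out + n, hdr, hdr_len);   n += hdr_len;   /* Transaction Layer */
        memcpy(out + n, data, data_len); n += data_len;
        if (has_ecrc) { memcpy(out + n, &ecrc, 4); n += 4; }
        memcpy(out + n, &lcrc, 4); n += 4;     /* Data Link Layer: LCRC */
        out[n++] = END_CHAR;                   /* Physical Layer: End frame */
        return n;
    }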

TLP Packet Disassembly
A neighboring receiver device receives the incoming TLP bit stream. As shown in Figure 2-13 on page 74, the received TLP is decoded by the Physical Layer and the Start and End frame fields are stripped. The resultant TLP is sent to the Data Link Layer. This layer checks for any errors in the TLP and strips the sequence ID and LCRC field. Assuming there are no LCRC errors, the TLP is forwarded up to the Transaction Layer. If the receiving device is a switch, the packet is routed from the ingress port of the switch to an egress port based on address information contained in the header portion of the TLP. Switches are allowed to check for ECRC errors and even to report any errors they find. However, a switch is not allowed to modify the ECRC; that way, the device ultimately targeted by the TLP will still detect an ECRC error if one exists.

Figure 2-13. TLP Disassembly


The ultimate targeted device of this TLP checks for ECRC errors in the header and data portion of the TLP. The ECRC field is stripped, leaving the header and data portion of the packet. It is this information that is finally forwarded to the Device Core/Software Layer.

Data Link Layer Packets (DLLPs)
Another PCI Express packet called DLLP originates at the Data Link Layer of a transmitter device and terminates at the Data Link Layer of a receiver device. This process is represented in Figure 2-14 on page 75. The Physical Layer also contributes to DLLP assembly and disassembly as the DLLP moves from one device to another via the PCI Express Link.

Figure 2-14. DLLP Origin and Destination


DLLPs are used for Link Management functions including TLP acknowledgement associated with the ACK/NAK protocol, power management, and exchange of Flow Control information.

DLLPs are transferred between the Data Link Layers of the two directly connected components on a Link. Unlike TLPs, which travel through the PCI Express fabric, DLLPs do not pass through switches. DLLPs do not contain routing information. These packets are smaller than TLPs: 8 bytes, to be precise.

DLLP Assembly
The DLLP shown in Figure 2-15 on page 76 originates at the Data Link Layer. There are various types of DLLPs, including Flow Control DLLPs (FCx), acknowledge/no-acknowledge DLLPs that confirm reception of TLPs (ACK and NAK), and power management DLLPs (PMx). A DLLP type field identifies the various types of DLLPs. The Data Link Layer appends a 16-bit CRC used by the receiver of the DLLP to check for CRC errors in the DLLP.

Figure 2-15. DLLP Assembly


The DLLP content along with the 16-bit CRC is forwarded to the Physical Layer, which concatenates a Start and End frame character of 1 byte each to the packet. The packet is encoded and differentially transmitted on the Link using the available number of Lanes.
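Counting the framing characters, the 8 bytes mentioned above break down as in the byte-for-byte C view below. The struct is purely illustrative: the framing values are assumptions, and the encoding applied later in the Physical Layer changes the actual on-wire size.

    #include <stdint.h>

    /* One DLLP as it leaves the Physical Layer: 8 bytes before encoding. */
    struct dllp_on_wire {
        uint8_t start;    /* Start framing character (assumed value) */
        uint8_t type;     /* identifies ACK, NAK, FCx, PMx, etc. */
        uint8_t misc[3];  /* type-specific: sequence ID, credit counts, ... */
        uint8_t crc[2];   /* 16-bit CRC over the 4 content bytes */
        uint8_t end;      /* End framing character (assumed value) */
    };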

DLLP Disassembly
The DLLP is received by the Physical Layer of a receiving device. The received bit stream is decoded and the Start and End frame fields are stripped, as depicted in Figure 2-16. The resultant packet is sent to the Data Link Layer. This layer checks for CRC errors and strips the CRC field. The Data Link Layer is the destination layer for DLLPs, so a DLLP is not forwarded up to the Transaction Layer.

Figure 2-16. DLLP Disassembly


Physical Layer Packets (PLPs)
Another PCI Express packet called the PLP originates at the Physical Layer of a transmitter device and terminates at the Physical Layer of a receiver device. This process is represented in Figure 2-17 on page 77. The PLP is a very simple packet that starts with a 1-byte COM character followed by 3 or more other characters that define the PLP type and carry other information. The PLP is a multiple of 4 bytes in size; an example is shown in Figure 2-18 on page 78. The specification refers to this packet as an Ordered-Set. PLPs do not contain any routing information. They are not routed through the fabric and do not propagate through a switch.

Figure 2-17. PLP Origin and Destination


Figure 2-18. PLP or Ordered-Set Structure


Some PLPs are used during the Link Training process described in "Ordered-Sets Used During Link Training and Initialization" on page 504. Another PLP is used for clock tolerance compensation. PLPs are also used to place a Link into the electrical idle low power state or to wake up a Link from this low power state.

Function of Each PCI Express Device Layer
Figure 2-19 on page 79 is a more detailed block diagram of a PCI Express device's layers. This block diagram is used to explain the key functions of each layer as they relate to the generation of outbound traffic and the response to inbound traffic. The layers consist of the Device Core/Software Layer, Transaction Layer, Data Link Layer and Physical Layer.

Figure 2-19. Detailed Block Diagram of PCI Express Device's Layers


Device Core / Software Layer
The Device Core consists of, for example, the root complex core logic or an endpoint core logic such as that of an Ethernet controller, SCSI controller, USB controller, etc. To design a PCI Express endpoint, a designer may reuse the Device Core logic from a PCI or PCI-X core logic design and wrap around it the PCI Express layered design described in this section.

Transmit Side
The Device Core logic, in conjunction with local software, provides the information required by the PCI Express device to generate TLPs. This information is sent via the Transmit interface to the Transaction Layer of the device. Examples of information transmitted to the Transaction Layer include: transaction type (to inform the Transaction Layer what type of TLP to generate), address, amount of data to transfer, data, traffic class, message index, etc.

Receive Side
The Device Core logic is also responsible for receiving information sent by the Transaction Layer via the Receive interface. This information includes: type of TLP received by the Transaction Layer, address, amount of data received, data, traffic class of the received TLP, message index, error conditions, etc.

Transaction Layer
The Transaction Layer shown in Figure 2-19 is responsible for the generation of outbound TLP traffic and the reception of inbound TLP traffic. The Transaction Layer supports the split transaction protocol for non-posted transactions. In other words, the Transaction Layer associates an inbound completion TLP of a given tag value with an outbound non-posted request TLP of the same tag value transmitted earlier.

The Transaction Layer contains virtual channel buffers (VC buffers) to store outbound TLPs that await transmission and to store inbound TLPs received from the Link. The flow control protocol associated with these virtual channel buffers ensures that a remote transmitter does not transmit too many TLPs and cause the receiver's virtual channel buffers to overflow. The Transaction Layer also orders TLPs according to ordering rules before transmission. It is this layer that supports the Quality of Service (QoS) protocol.

The Transaction Layer supports 4 address spaces: memory address space, IO address space, configuration address space and message space. Message packets carry a message in place of the sideband signals used in PCI.

Transmit Side
The Transaction Layer receives information from the Device Core and generates outbound request and completion TLPs which it stores in virtual channel buffers. This layer assembles Transaction Layer Packets (TLPs). The major components of a TLP are: Header, Data Payload and an optional ECRC (specification also uses the term Digest) field as shown in Figure 2-20.

Figure 2-20. TLP Structure at the Transaction Layer


The Header is 3 doublewords or 4 doublewords in size and may include information such as: address, TLP type, transfer size, requester ID/completer ID, tag, traffic class, byte enables, completion codes, and attributes (including the "no snoop" and "relaxed ordering" bits). The TLP types are defined in Table 2-2 on page 57.

The address is a 32-bit memory address or an extended 64-bit address for memory requests. It is a 32-bit address for IO requests. For configuration transactions, the address is an ID consisting of a Bus Number, Device Number and Function Number, plus the configuration register address of the targeted register. For completion TLPs, the address is the requester ID of the device that originally made the request. For message transactions, the address used for routing is the destination device's ID, consisting of the Bus Number, Device Number and Function Number of the device targeted by the message request. Message requests may also be broadcast or routed implicitly by targeting the root complex or an upstream port.

The transfer size or length field indicates the amount of data to transfer, measured in doublewords (DWs). The data transfer length can be between 1 and 1024 DWs. Write request TLPs include a data payload in the amount indicated by the length field of the header. For a read request TLP, the length field indicates the amount of data requested from a completer. This data is returned in one or more completion packets. Read request TLPs do not include a data payload field. Byte enables provide byte-level address resolution.
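As a concrete illustration of the length arithmetic, the helper below converts the header's length field into a byte count. The convention that an encoded field value of 0 denotes 1024 DW reflects the author's understanding of the 10-bit encoding (1024 cannot otherwise be represented in 10 bits) and should be checked against the specification.

    #include <stddef.h>
    #include <stdint.h>

    /* The Length field counts doublewords; 1 DW = 4 bytes. */
    static size_t payload_bytes(uint16_t length_field)
    {
        size_t dw = (length_field & 0x3FF) ? (length_field & 0x3FF) : 1024;
        return dw * 4;
    }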

Request packets contain the requester ID (Bus#, Device#, Function#) of the device transmitting the request. The tag field in the request is memorized by the completer, and the same tag is used in the completion.

A bit in the Header (TD = TLP Digest) indicates whether this packet contains an ECRC field, also referred to as the Digest. This field is 32 bits wide and contains an End-to-End CRC (ECRC). The ECRC field is generated by the Transaction Layer at the time the outbound TLP is created. It is calculated over the entire TLP, from the first byte of the header to the last byte of the data payload, with the exception of the EP bit and bit 0 of the Type field; these two bits are always treated as 1 for the ECRC calculation because they may legitimately change as the TLP traverses the fabric. With the possible exception of those two bits, the TLP never changes in flight. The receiver device uses the ECRC to check for errors that may occur as the packet moves through the fabric.
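The variant-bit rule can be made concrete with a short sketch. The bit positions are passed in as parameters because the real header layout is not reproduced here, and crc32_generic() is a generic MSB-first CRC-32 stand-in, not necessarily the spec's exact algorithm or bit ordering.

    #include <stddef.h>
    #include <stdint.h>

    /* Generic CRC-32 (polynomial 0x04C11DB7) used as a stand-in. */
    static uint32_t crc32_generic(const uint8_t *buf, size_t len)
    {
        uint32_t crc = 0xFFFFFFFFu;
        for (size_t i = 0; i < len; i++) {
            crc ^= (uint32_t)buf[i] << 24;
            for (int b = 0; b < 8; b++)
                crc = (crc & 0x80000000u) ? (crc << 1) ^ 0x04C11DB7u
                                          : (crc << 1);
        }
        return crc;
    }

    /* Compute the ECRC over a scratch copy of the TLP, forcing the two
       variant bits to 1 as the text describes. */
    static uint32_t ecrc_compute(uint8_t *tlp_copy, size_t len,
                                 size_t ep_byte, unsigned ep_bit,
                                 size_t type_byte, unsigned type_bit0)
    {
        tlp_copy[ep_byte]   |= (uint8_t)(1u << ep_bit);    /* EP -> 1 */
        tlp_copy[type_byte] |= (uint8_t)(1u << type_bit0); /* Type[0] -> 1 */
        return crc32_generic(tlp_copy, len);
    }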

Receive Side
The receiver side of the Transaction Layer stores inbound TLPs in receiver virtual channel buffers. The receiver checks for CRC errors based on the ECRC field in the TLP. If there are no errors, the ECRC field is stripped and the resultant information in the TLP header as well as the data payload is sent to the Device Core.

Flow Control
The Transaction Layer ensures that it does not transmit a TLP over the Link to a remote receiver device unless the receiver device has virtual channel buffer space (for a given traffic class) to accept the TLP. The protocol that guarantees this is referred to as the "flow control" protocol. If the transmitter device did not observe this protocol, a transmitted TLP could cause a receiver virtual channel buffer to overflow. Flow control is managed automatically at the hardware level and is transparent to software. Software is only involved to enable additional buffers beyond the default set of virtual channel buffers (referred to as VC0 buffers). The default buffers are enabled automatically after Link training, thus allowing TLP traffic to flow through the fabric immediately after Link training. Configuration transactions use the default virtual channel buffers and can begin immediately after the Link training process. The Link training process is described in Chapter 14, entitled "Link Initialization & Training," on page 499.

Refer to Figure 2-21 on page 82 for an overview of the flow control process. A receiver device transmits DLLPs called Flow Control Packets (FCx DLLPs) to the transmitter device on a periodic basis. The FCx DLLPs contain flow control credit information that updates the transmitter regarding how much buffer space is available in the receiver virtual channel buffer. The transmitter keeps track of this information and will only transmit TLPs out of its Transaction Layer if it knows that the remote receiver has buffer space to accept the transmitted TLP.
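In code form, the transmitter-side gate amounts to a simple credit comparison. The structure and names below are illustrative, not the spec's register definitions; the key idea from the text is that the transmitter compares credits advertised by inbound FCx DLLPs against credits it has already consumed, per VC and buffer type.

    #include <stdbool.h>
    #include <stdint.h>

    struct fc_counters {
        uint32_t credits_allocated; /* cumulative, updated by inbound FCx DLLPs */
        uint32_t credits_consumed;  /* cumulative, advanced on each TLP sent */
    };

    /* A TLP may be sent only if it fits in the receiver's remaining space.
       Unsigned subtraction handles counter wraparound naturally. */
    static bool may_transmit(const struct fc_counters *fc, uint32_t tlp_credits)
    {
        return (uint32_t)(fc->credits_allocated - fc->credits_consumed)
               >= tlp_credits;
    }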

Figure 2-21. Flow Control Process


Quality of Service (QoS)
Consider Figure 2-22 on page 83, in which the video camera and SCSI device shown need to transmit write request TLPs to system DRAM. The camera data is time-critical isochronous data that must reach memory with guaranteed bandwidth; otherwise, the displayed image will appear choppy or unclear. The SCSI data is not as time sensitive and only needs to reach system memory without errors. Clearly, the video data packets should have higher priority when routed through the PCI Express fabric, especially through switches. QoS refers to the capability of routing packets from different applications through the fabric with differentiated priorities and deterministic latencies and bandwidth. PCI and PCI-X systems do not support QoS capability.

Figure 2-22. Example Showing QoS Capability of PCI Express


Consider this example. Application driver software in conjunction with the OS assigns the video data packets a traffic class of 7 (TC7) and the SCSI data packet a traffic class of 0 (TC0). These TC numbers are embedded in the TLP header. Configuration software uses TC/VC mapping device configuration registers to map TC0 related TLPs to virtual channel 0 buffers (VC0) and TC7 related TLPs to virtual channel 7 buffers (VC7).

As TLPs from these two applications (video and SCSI) move through the fabric, the switches post incoming packets moving upstream into their respective VC buffers (VC0 and VC7). The switch uses a priority-based arbitration mechanism to determine which of two incoming packets to forward first to a common egress port. Assume the VC7 buffer is configured with higher priority than VC0. Whenever two incoming packets are to be forwarded to one upstream port, the switch will always pick the VC7 packet (the video data) over the VC0 packet (the SCSI data). This guarantees greater bandwidth and reduced latency for the video data compared to the SCSI data.

A PCI Express device that implements more than one set of virtual channel buffers has the ability to arbitrate between TLPs from different VC buffers. VC buffers have configurable priorities. Thus traffic flowing through the system in different VC buffers will observe differentiated performance. The arbitration mechanism between TLP traffic flowing through different VC buffers is referred to as VC arbitration.

Also, multi-port switches have the ability to arbitrate between traffic coming in on two ingress ports but using the same VC buffer resource on a common egress port. This configurable arbitration mechanism between ports supported by switches is referred to as Port arbitration.

Traffic Classes (TCs) and Virtual Channels (VCs)
TC is a TLP header field that is transmitted within the packet unmodified end-to-end through the fabric. Local application software and system software decide, based on performance requirements, which TC label a TLP uses. VCs are physical buffers that provide a means to support multiple independent logical data flows over the physical Link, via the use of transmit and receive virtual channel buffers.

PCI Express devices may implement up to 8 VC buffers (VC0-VC7). The TC field is a 3-bit field that allows differentiation of traffic into 8 traffic classes (TC0-TC7). Devices must implement VC0. Similarly, a device is required to support TC0 (a best-effort, general purpose service class). The other, optional TCs may be used to provide differentiated service through the fabric. For each implemented VC ID, a transmit device implements a transmit buffer and a receive device implements a receive buffer.

Devices and switches implement TC-to-VC mapping logic by which a TLP of a given TC number is forwarded through the Link using a particular numbered VC buffer. PCI Express provides the capability of mapping multiple TCs onto a single VC, thus reducing device cost by allowing a device to implement a limited number of VC buffers. TC/VC mapping is configured by system software through configuration registers. It is up to the device application software to determine the TC label for TLPs and the TC/VC mapping that meets performance requirements. In its simplest form, the TC/VC mapping registers can be configured with a one-to-one mapping of TC to VC.

Consider the example illustrated in Figure 2-23 on page 85. The TC/VC mapping registers in Device A are configured to map TLPs with TC[2:0] to VC0 and TLPs with TC[7:3] to VC1. The TC/VC mapping registers in receiver Device B must be configured identically to those in Device A. The same numbered VC buffers are enabled in both transmitter Device A and receiver Device B.

Figure 2-23. TC Numbers and VC Buffers


If Device A needs to transmit a TLP with a TC label of 7 and another packet with a TC label of 0, the two packets will be placed in the VC1 and VC0 buffers, respectively. The arbitration logic arbitrates between the two VC buffers. Assume the VC1 buffer is configured with higher priority than the VC0 buffer. Device A will then forward the TC7 TLPs in VC1 to the Link ahead of the TC0 TLPs in VC0.

When the TLPs arrive in Device B, the TC/VC mapping logic decodes the TC label in each TLP and places the TLPs in their associated VC buffers.

In this example, TLP traffic with TC[7:3] label will flow through the fabric with higher priority than TC[2:0] traffic. Within each TC group however, TLPs will flow with equal priority.
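A minimal sketch of the mapping logic in this example follows. The lookup table is an illustrative stand-in for the TC/VC mapping configuration registers, which Device A and Device B must program identically.

    #include <stdint.h>

    /* TC[2:0] -> VC0, TC[7:3] -> VC1, per the example above. */
    static const uint8_t tc_to_vc[8] = { 0, 0, 0, 1, 1, 1, 1, 1 };

    static uint8_t vc_for_tlp(uint8_t tc_label)  /* tc_label is 0..7 */
    {
        return tc_to_vc[tc_label & 0x7];
    }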

Port Arbitration and VC Arbitration
The goals of arbitration support in the Transaction Layer are:

To provide differentiated services between data flows within the fabric.

To provide guaranteed bandwidth with deterministic and smallest end-to-end transaction latency.

Packets of different TCs are routed through the fabric of switches with different priority based on arbitration policy implemented in switches. Packets coming in from ingress ports heading towards a particular egress port compete for use of that egress port.

Switches implement two types of arbitration for each egress port: Port Arbitration and VC Arbitration. Consider Figure 2-24 on page 86.

Figure 2-24. Switch Implements Port Arbitration and VC Arbitration Logic


Port arbitration is arbitration between packets arriving on different ingress ports that map to the same virtual channel (after going through TC-to-VC mapping) of a common egress port. The port arbiter implements round-robin, weighted round-robin or programmable time-based round-robin arbitration schemes, selectable through configuration registers.

VC arbitration takes place after port arbitration. For a given egress port, packets from all VCs compete to transmit on that port. VC arbitration resolves the order in which TLPs in different VC buffers are forwarded onto the Link. VC arbitration policies supported include strict priority, round-robin and weighted round-robin arbitration schemes, selectable through configuration registers.
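As an illustration, a strict-priority VC arbiter, one of the policies listed above, can be sketched as follows. Higher-numbered VCs are assumed to have higher priority, as in the earlier VC0/VC7 example, and the input array should reflect only TLPs that are actually eligible to transmit (see the ordering and flow control note that follows).

    #include <stdbool.h>

    #define NUM_VCS 8

    /* Returns the VC whose head-of-queue TLP transmits next, or -1 if all
       VC buffers are empty or ineligible. */
    static int vc_arbitrate_strict(const bool vc_has_eligible_tlp[NUM_VCS])
    {
        for (int vc = NUM_VCS - 1; vc >= 0; vc--)
            if (vc_has_eligible_tlp[vc])
                return vc;
        return -1;
    }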

Independent of arbitration, each VC must observe transaction ordering and flow control rules before it can make pending TLP traffic visible to the arbitration mechanism.

Endpoint devices and a root complex with only one port do not support port arbitration. They only support VC arbitration in the Transaction Layer.

Transaction Ordering
The PCI Express protocol implements the PCI/PCI-X-compliant producer-consumer ordering model for transaction ordering, with a provision to support relaxed ordering similar to PCI-X architecture. Transaction ordering rules guarantee that TLP traffic associated with a given traffic class is routed through the fabric in the correct order, to prevent potential deadlock or livelock conditions from occurring. Traffic associated with different TC labels has no ordering relationship. Chapter 8, entitled "Transaction Ordering," on page 315 describes these ordering rules.

The Transaction Layer ensures that TLPs for a given TC are ordered correctly with respect to other TLPs of the same TC label before forwarding to the Data Link Layer and Physical Layer for transmission.

Power Management
The Transaction Layer supports ACPI/PCI power management, as dictated by system software. In addition, hardware within the Transaction Layer autonomously power manages a device to minimize power even during full-on power states. This automatic power management is referred to as Active State Power Management and does not involve software. Power management software associated with the OS manages a device's power states through power management configuration registers. Power management is described in Chapter 16.

Configuration Registers
A device's configuration registers are associated with the Transaction Layer. The registers are configured during initialization and bus enumeration. They are also configured by device drivers and accessed by runtime software/OS. Additionally, the registers store negotiated Link capabilities, such as Link width and frequency. Configuration registers are described in Part 6 of the book.

Data Link Layer
Refer to Figure 2-19 on page 79 for a block diagram of a device's Data Link Layer. The primary function of the Data Link Layer is to ensure data integrity during packet transmission and reception on each Link. If a transmitter device sends a TLP to a remote receiver device at the other end of a Link and a CRC error is detected, the transmitter device is notified with a NAK DLLP. The transmitter device then automatically replays the TLP, this time hopefully without error. With error checking and automatic replay of packets received in error, PCI Express ensures a very high probability that a TLP transmitted by one device will make its way to its final destination with no errors. This makes PCI Express ideal for low error rate, high-availability systems such as servers.

Transmit Side
The Transaction Layer must observe the flow control mechanism before forwarding outbound TLPs to the Data Link Layer. If sufficient credits exist, a TLP stored within the virtual channel buffer is passed from the Transaction Layer to the Data Link Layer for transmission.

Consider Figure 2-25 on page 88, which shows the logic associated with the ACK/NAK mechanism of the Data Link Layer. The Data Link Layer is responsible for TLP CRC generation and TLP error checking. For outbound TLPs from transmit Device A, a Link CRC (LCRC) is generated and appended to the TLP. In addition, a sequence ID is appended to the TLP. Device A's Data Link Layer preserves a copy of the TLP in a replay buffer and transmits the TLP to Device B. The Data Link Layer of the remote Device B receives the TLP and checks for CRC errors.

Figure 2-25. Data Link Layer Replay Mechanism


If there is no error, the Data Link Layer of Device B returns an ACK DLLP with a sequence ID to Device A. Device A has confirmation that the TLP has reached Device B (not necessarily the final destination) successfully. Device A clears its replay buffer of the TLP associated with that sequence ID.

If on the other hand a CRC error is detected in the TLP received at the remote Device B, then a NAK DLLP with a sequence ID is returned to Device A. An error has occurred during TLP transmission. Device A's Data Link Layer replays associated TLPs from the replay buffer. The Data Link Layer generates error indications for error reporting and logging mechanisms.

In summary, the replay mechanism uses the sequence ID field within received ACK/NAK DLLPs to associate them with outbound TLPs stored in the replay buffer. Reception of an ACK DLLP causes the replay buffer to clear the associated TLPs from the buffer. Reception of a NAK DLLP causes the replay buffer to replay the associated TLPs.

For a given TLP in the replay buffer, if the transmitter device receives a NAK 4 times and the TLP has been replayed 3 additional times as a result, the Data Link Layer logs the error, reports a correctable error, and re-trains the Link.
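The transmitter-side bookkeeping just described can be sketched as below. The structures and the replay limit handling are simplified illustrations of the text, not the spec's state machine; the 12-bit modular sequence comparison is an assumption based on the 12-bit sequence ID field.

    #include <stdbool.h>
    #include <stdint.h>

    #define REPLAY_DEPTH 16

    struct replay_entry { uint16_t seq_id; bool valid; /* + stored TLP */ };

    struct replay_buffer {
        struct replay_entry slot[REPLAY_DEPTH];
        unsigned replay_count;     /* NAK-triggered replays so far */
    };

    /* 12-bit modular compare: true if a <= b in sequence space. */
    static bool seq_le(uint16_t a, uint16_t b)
    {
        return (uint16_t)((b - a) & 0xFFF) < 0x800;
    }

    static void retrain_link(void) { /* hook into the Physical Layer */ }
    static void retransmit(const struct replay_entry *e) { (void)e; /* resend */ }

    /* ACK: purge every stored TLP up to and including acked_seq. */
    static void on_ack(struct replay_buffer *rb, uint16_t acked_seq)
    {
        for (int i = 0; i < REPLAY_DEPTH; i++)
            if (rb->slot[i].valid && seq_le(rb->slot[i].seq_id, acked_seq))
                rb->slot[i].valid = false;
        rb->replay_count = 0;
    }

    /* NAK: TLPs up to nak_seq are treated as acknowledged; the rest are
       replayed. After repeated replays, the text says the device logs a
       correctable error and re-trains the Link. */
    static void on_nak(struct replay_buffer *rb, uint16_t nak_seq)
    {
        if (++rb->replay_count > 3) { retrain_link(); return; }
        for (int i = 0; i < REPLAY_DEPTH; i++) {
            if (!rb->slot[i].valid) continue;
            if (seq_le(rb->slot[i].seq_id, nak_seq))
                rb->slot[i].valid = false;
            else
                retransmit(&rb->slot[i]);
        }
    }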

Receive Side
The receive side of the Data Link Layer is responsible for LCRC error checking on inbound TLPs. If no error is detected, the device schedules an ACK DLLP for transmission back to the remote transmitter device. The receiver strips the TLP of the LCRC field and sequence ID.

If a CRC error is detected, the receiver schedules a NAK DLLP to be returned to the remote transmitter, and the TLP is discarded.

The receive side of the Data Link Layer also receives ACKs and NAKs from a remote device. If an ACK is received the receive side of the Data Link layer informs the transmit side to clear an associated TLP from the replay buffer. If a NAK is received, the receive side causes the replay buffer of the transmit side to replay associated TLPs.

The receive side is also responsible for checking the sequence ID of received TLPs to check for dropped or out-of-order TLPs.

Data Link Layer Contribution to TLPs and DLLPs
The Data Link Layer concatenates a 12-bit sequence ID and 32-bit LCRC field to an outbound TLP that arrives from the Transaction Layer. The resultant TLP is shown in Figure 2-26 on page 90. The sequence ID is used to associate a copy of the outbound TLP stored in the replay buffer with a received ACK/NAK DLLP inbound from a neighboring remote device. The ACK/NAK DLLP confirms arrival of the outbound TLP in the remote device.

Figure 2-26. TLP and DLLP Structure at the Data Link Layer


The 32-bit LCRC is calculated based on all bytes in the TLP including the sequence ID.

A DLLP, shown in Figure 2-26 on page 90, is a 4-byte packet with a 16-bit CRC field. The 8-bit DLLP Type field indicates various categories of DLLPs. These include: ACK, NAK, power management related DLLPs (PM_Enter_L1, PM_Enter_L23, PM_Active_State_Request_L1, PM_Request_Ack) and flow control related DLLPs (InitFC1-P, InitFC1-NP, InitFC1-Cpl, InitFC2-P, InitFC2-NP, InitFC2-Cpl, UpdateFC-P, UpdateFC-NP, UpdateFC-Cpl). The 16-bit CRC is calculated over all 4 bytes of the DLLP. Received DLLPs that fail the CRC check are discarded. The loss of information from a discarded DLLP is self-repairing: a subsequent DLLP supersedes any information lost. ACK and NAK DLLPs contain a sequence ID field (shown as the Misc. field in Figure 2-26) used by the device to associate an inbound ACK/NAK DLLP with a stored copy of a TLP in the replay buffer.
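The categories above can be captured symbolically, together with the discard-on-CRC-failure rule. The enum values and the CRC-16 below are placeholders (a generic CCITT-style CRC stands in for the spec's DLLP CRC algorithm); only the structure of the logic is meant to match the text.

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Symbolic DLLP types named in the text; encodings are placeholders. */
    enum dllp_type {
        DLLP_ACK, DLLP_NAK,
        DLLP_PM_ENTER_L1, DLLP_PM_ENTER_L23,
        DLLP_PM_ACTIVE_STATE_REQUEST_L1, DLLP_PM_REQUEST_ACK,
        DLLP_INITFC1_P, DLLP_INITFC1_NP, DLLP_INITFC1_CPL,
        DLLP_INITFC2_P, DLLP_INITFC2_NP, DLLP_INITFC2_CPL,
        DLLP_UPDATEFC_P, DLLP_UPDATEFC_NP, DLLP_UPDATEFC_CPL
    };

    /* Generic CRC-16 stand-in (polynomial 0x1021). */
    static uint16_t crc16_generic(const uint8_t *p, size_t n)
    {
        uint16_t crc = 0xFFFF;
        while (n--) {
            crc ^= (uint16_t)(*p++) << 8;
            for (int b = 0; b < 8; b++)
                crc = (crc & 0x8000) ? (uint16_t)((crc << 1) ^ 0x1021)
                                     : (uint16_t)(crc << 1);
        }
        return crc;
    }

    /* A DLLP failing its CRC check is silently discarded: a later DLLP
       supersedes whatever information was lost. */
    static bool dllp_accept(const uint8_t content[4], uint16_t received_crc)
    {
        return crc16_generic(content, 4) == received_crc;
    }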

Non-Posted Transaction Showing ACK-NAK Protocol
Next, the steps required to complete a memory read request between a requester and a completer on the far end of a switch are described. Figure 2-27 on page 91 shows the activity on the Link to complete this transaction:

Step 1a. Requester transmits a memory read request TLP (MRd). Switch receives the MRd TLP and checks for CRC error using the LCRC field in the MRd TLP.


Step 1b. If no error then switch returns ACK DLLP to requester. Requester discards copy of the TLP from its replay buffer.


Step 2a. Switch forwards the MRd TLP to the correct egress port using memory address for routing. Completer receives MRd TLP. Completer checks for CRC errors in received MRd TLP using LCRC.


Step 2b. If no error then completer returns ACK DLLP to switch. Switch discards copy of the MRd TLP from its replay buffer.


Step 3a. Completer checks for CRC errors using the optional ECRC field in the MRd TLP. Assume no End-to-End error. Completer returns a Completion with Data TLP (CplD) when it has the requested data. Switch receives the CplD TLP and checks for CRC errors using LCRC.


Step 3b. If no error then switch returns ACK DLLP to completer. Completer discards copy of the CplD TLP from its replay buffer.


Step 4a. Switch decodes Requester ID field in CplD TLP and routes the packet to the correct egress port. Requester receives CplD TLP. Requester checks for CRC errors in received CplD TLP using LCRC.


Step 4b. If no error, then requester returns ACK DLLP to switch. Switch discards copy of the CplD TLP from its replay buffer. Requester determines if there is an error in the CplD TLP using the optional ECRC field. Assume no End-to-End error. Requester checks the completion error code in the CplD. Assume a completion code of 'Successful Completion'. To associate the completion with the original request, the requester matches the tag in the CplD with the original tag of the MRd request. Requester accepts the data.


Figure 2-27. Non-Posted Transaction on Link


Posted Transaction Showing ACK-NAK Protocol
Below are the steps involved in completing a memory write request between a requester and a completer on the far end of a switch. Figure 2-28 on page 92 shows the activity on the Link to complete this transaction:

Step 1a. Requester transmits a memory write request TLP (MWr) with data. Switch receives MWr TLP and checks for CRC error with LCRC field in the TLP.


Step 1b. If no error then switch returns ACK DLLP to requester. Requester discards copy of the TLP from its replay buffer.


Step 2a. Switch forwards the MWr TLP to the correct egress port using the memory address for routing. Completer receives the MWr TLP. Completer checks for CRC errors in the received MWr TLP using LCRC.


Step 2b. If no error, then completer returns ACK DLLP to switch. Switch discards copy of the MWr TLP from its replay buffer. Completer checks for CRC errors using the optional digest (ECRC) field in the MWr TLP. Assume no End-to-End error. Completer accepts the data. There is no completion associated with this transaction.


Figure 2-28. Posted Transaction on Link


Other Functions of the Data Link Layer
Following power-up or Reset, the flow control mechanism described earlier is initialized by the Data Link Layer. This process is accomplished automatically at the hardware level and has no software involvement.

Flow control for the default virtual channel VC0 is initialized first. In addition, when additional VCs are enabled by software, the flow control initialization process is repeated for each newly enabled VC. Since VC0 is enabled before all other VCs, no TLP traffic will be active prior to initialization of VC0.

Physical Layer
Refer to Figure 2-19 on page 79 for a block diagram of a device's Physical Layer. Both TLP and DLLP type packets are sent from the Data Link Layer to the Physical Layer for transmission over the Link. Also, packets are received by the Physical Layer from the Link and sent to the Data Link Layer.

The Physical Layer is divided into two portions: the Logical Physical Layer and the Electrical Physical Layer. The Logical Physical Layer contains digital logic associated with processing packets before transmission on the Link, and with processing packets inbound from the Link before they are sent to the Data Link Layer. The Electrical Physical Layer is the analog interface of the Physical Layer that connects to the Link. It consists of differential drivers and receivers for each Lane.

Transmit Side
TLPs and DLLPs from the Data Link Layer are clocked into a buffer in the Logical Physical Layer. The Physical Layer frames the TLP or DLLP with Start and End characters. These characters are framing symbols that a receiver device uses to detect the start and end of a packet. The Start and End characters are shown appended to a TLP and a DLLP in Figure 2-29 on page 94. The diagram shows the size of each field in a TLP or DLLP.

Figure 2-29. TLP and DLLP Structure at the Physical Layer


The transmit logical sub-block conditions the packet received from the Data Link Layer into the correct format for transmission. Packets are byte-striped across the available Lanes on the Link.

Each byte of a packet is then scrambled with the aid of a Linear Feedback Shift Register (LFSR) based scrambler. Scrambling eliminates repeated bit patterns on the Link, thus reducing the average EMI noise generated.
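A conceptual sketch of such a scrambler follows. The 16-bit LFSR, the 0xFFFF seed and the polynomial X^16+X^5+X^4+X^3+1 match the author's understanding of the Gen1 scrambler, but the output bit ordering and the special rules for K symbols (which bypass scrambling) and the COM character (which reseeds the LFSR) are simplified away; treat this as an illustration of the principle, not the spec's exact algorithm. Because scrambling is a plain XOR, a receiver running an identical, synchronized LFSR de-scrambles by applying the same operation.

    #include <stdint.h>

    struct scrambler { uint16_t lfsr; };

    static void scrambler_reset(struct scrambler *s) { s->lfsr = 0xFFFF; }

    /* Advance the LFSR one bit (Galois form): the bit shifted out of
       X^16 feeds the X^5, X^4, X^3 and 1 taps. */
    static uint16_t lfsr_advance(uint16_t s)
    {
        uint16_t msb = s & 0x8000u;
        s = (uint16_t)(s << 1);
        if (msb) s ^= 0x0039u;
        return s;
    }

    /* Scramble one byte: XOR each data bit with one LFSR output bit. */
    static uint8_t scramble_byte(struct scrambler *s, uint8_t data)
    {
        uint8_t out = 0;
        for (int i = 0; i < 8; i++) {
            uint8_t bit = (uint8_t)((s->lfsr >> 15) & 1);
            out |= (uint8_t)((((data >> i) & 1) ^ bit) << i);
            s->lfsr = lfsr_advance(s->lfsr);
        }
        return out;
    }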

The resultant bytes are encoded into a 10b code by the 8b/10b encoding logic. The primary purpose of encoding 8b characters to 10b symbols is to create sufficient 1-to-0 and 0-to-1 transition density in the bit stream to facilitate recreation of a receive clock with the aid of a PLL at the remote receiver device. Note that data is not transmitted along with a clock. Instead, the bit stream contains sufficient transitions to allow the receiver device to recreate a receive clock.

The parallel-to-serial converter generates a serial bit stream of the packet on each Lane and transmits it differentially at 2.5 Gbit/s.

Receive Side
The receive side of the Electrical Physical Layer clocks in a packet arriving differentially on all Lanes. The serial bit stream of the packet is converted into a 10b parallel stream using the serial-to-parallel converter. The receiver logic also includes an elastic buffer that compensates for the clock frequency variation between the transmit clock with which the packet bit stream is clocked into the receiver and the receiver's local clock. The 10b symbol stream is decoded back to the 8b representation of each symbol by the 8b/10b decoder. The 8b characters are de-scrambled, and the byte un-striping logic re-creates the original packet stream transmitted by the remote device.

Link Training and Initialization
An additional function of the Physical Layer is Link initialization and training. Link initialization and training is a Physical Layer controlled process that configures and initializes each Link for normal operation. This process is automatic and does not involve software. The following are determined during the Link initialization and training process:

Link width

Link data rate

Lane reversal

Polarity inversion

Bit lock per Lane

Symbol lock per Lane

Lane-to-Lane de-skew within a multi-Lane Link

Link width. Two devices with a different number of Lanes per Link may be connected; e.g., one device has a x2 port and is connected to a device with a x4 port. After initialization, the Physical Layers of both devices determine and set the Link width to the minimum Lane width of x2. Other negotiated Link behaviors include Lane reversal and the splitting of ports into multiple Links.

Lane reversal is an optional feature. Lanes are numbered, and a board designer may not wire the Lanes of two ports to each other correctly. In that case, training allows the Lane numbers to be reversed so that the Lane numbers of adjacent ports on each end of the Link match up. Part of the same process may allow a multi-Lane Link to be split into multiple Links.

Polarity inversion. The D+ and D- terminals of a differential pair between two devices may not be connected correctly. In that case, the training sequence causes the receiver to reverse the polarity on its differential receiver.

Link data rate. Training is completed at a data rate of 2.5 Gbit/s. In the future, higher data rates of 5 Gbit/s and 10 Gbit/s will be supported. During training, each node advertises its highest data rate capability. The Link is initialized with the highest common data rate that the devices at opposite ends of the Link support.

Lane-to-Lane de-skew. Due to Link wire length variations and differing driver/receiver characteristics on a multi-Lane Link, the bit streams on each Lane will arrive at a receiver skewed with respect to other Lanes. The receiver circuit must compensate for this skew by adding or removing delays on each Lane. Relaxed routing rules allow Link wire lengths on the order of 20 to 30 inches.

Link Power Management
The normal, fully operational state of a Link is called the L0 state. Lower power states of the Link, in which no packets are transmitted or received, are the L0s, L1, L2 and L3 power states. The L0s power state is entered automatically after a period of inactivity on the Link. Entering and exiting this state does not involve software, and its exit latency is the shortest. L1 and L2 are lower power states than L0s, but their exit latencies are greater. The L3 power state is the full-off power state, from which a device cannot generate a wakeup event.

Reset
Two types of reset are supported:

Cold/warm reset, also called a Fundamental Reset, which occurs when a device is powered on (cold reset) or is reset without cycling power (warm reset).

Hot reset, sometimes referred to as protocol reset, is an in-band method of propagating reset. Transmission of an ordered-set is used to signal a hot reset. Software initiates hot reset generation.

On exit from reset (cold, warm, or hot), all state machines and configuration registers (hot reset does not reset sticky configuration registers) are initialized.

Electrical Physical Layer
The transmitter of one device is AC coupled to the receiver of the device at the opposite end of the Link, as shown in Figure 2-30. The AC coupling capacitor is between 75 and 200 nF. The transmitter's DC common mode voltage is established during Link training and initialization. The DC common mode impedance is typically 50 ohms, while the differential impedance is typically 100 ohms. This impedance matches a standard FR4 board.

Figure 2-30. Electrical Physical Layer Showing Differential Transmitter and Receiver
