This chapter describes the features and configuration of the SX-4 Series system and peripheral devices.
This section describes the central processing unit.
Functionally, the central processing unit consists of a scalar unit (SU) and a vector unit (VU). The scalar unit decodes and controls instructions executed by the central processing unit and executes scalar operations using a scalar operation pipeline. The vector unit performs vector operations using eight sets of vector operation pipelines and 144K bytes of vector registers.
The vector unit has eight sets of vector operation pipelines per central processing unit. A set of vector operation pipelines basically consists of an add and shift unit, a multiplication unit, a division unit, and a logical operations unit, which can all operate concurrently.
As shown in Figure 6-1, each set of operation pipelines is responsible for every eighth vector element. Thus, operation pipeline 0 is responsible for vector element i, operation pipeline 1 for vector element i+1, operation pipeline 2 for vector element i+2, and so on through operation pipeline 7, which processes vector element i+7. Processing by the eight sets of operation pipelines is concurrent. Because the hardware automatically allocates each vector element to an operation pipeline, the software need not be aware that there are eight sets of operation pipelines.
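The allocation described above amounts to taking the element index modulo eight. The following is a minimal Python sketch of that mapping, offered only as an illustration: the function and constant names are hypothetical, and on the SX-4 the hardware performs this allocation without software involvement.

```python
NUM_PIPELINE_SETS = 8  # eight sets of vector operation pipelines per CPU (from the text)


def pipeline_for_element(element_index):
    """Return the pipeline set assumed to handle a given vector element.

    Illustrative model only: element i goes to set 0, i+1 to set 1, ...,
    i+7 to set 7, and then the pattern repeats.
    """
    return element_index % NUM_PIPELINE_SETS


def elements_for_pipeline(pipeline, vector_length):
    """List the element indices that one pipeline set would process."""
    return list(range(pipeline, vector_length, NUM_PIPELINE_SETS))
```

For a 24-element vector, for instance, `elements_for_pipeline(7, 24)` lists the elements that pipeline set 7 would handle under this model.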
Figure 6-1 Vector Operation Pipelines
A set of vector operation pipelines consists of an adder/shifter, a multiplier, a divider, and a logical operations unit; each can operate independently in parallel. Therefore, when there are several consecutive vector instructions, they are executed in parallel as long as the operation units and vector registers that each instruction uses are not in contention.
For example, for mutually independent vector instructions such as V4 ← V0+V1 and V5 ← V2*V3, the addition and multiplication are executed concurrently.
When there are several consecutive vector instructions, those vector instructions can be processed both consecutively and in parallel even if there is vector register contention between them.
As an example, if the vector register that contains the result of a given vector instruction is used as an input operand register by the following vector instruction, execution of the following vector instruction can begin at any time after the first vector element is available. This feature is called automatic chaining.
By using automatic chaining, vector macro operations such as multiplying and then adding vectors can be processed at high speed as shown in Figure 6-2. In the example, processor internal parallel computation can be achieved by processing a multiplication and the subsequent addition concurrently.
Figure 6-2 Operation Pipelines and Automatic Chaining
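The benefit of automatic chaining can be estimated with a toy timing model. The sketch below is not an SX-4 timing specification: the pipeline startup latencies are hypothetical parameters, and the model simply assumes that each instruction streams eight elements per cycle once its pipeline has started.

```python
import math


def cycles_without_chaining(n_elements, lat_mul, lat_add, lanes=8):
    """Multiply finishes completely before the dependent add starts."""
    stream = math.ceil(n_elements / lanes)  # cycles to stream all elements
    return (lat_mul + stream) + (lat_add + stream)


def cycles_with_chaining(n_elements, lat_mul, lat_add, lanes=8):
    """Add starts as soon as the first products leave the multiplier."""
    stream = math.ceil(n_elements / lanes)
    return lat_mul + lat_add + stream


# Hypothetical latencies of 4 cycles each, 256-element vectors.
unchained = cycles_without_chaining(256, lat_mul=4, lat_add=4)
chained = cycles_with_chaining(256, lat_mul=4, lat_add=4)
```

Under these assumed numbers the chained multiply-add overlaps almost all of the second instruction's streaming time with the first, which is the effect the figure illustrates.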
In order to increase vector processing power, it is important to provide the vector data needed for a vector operation continuously and at high speed. If the ability to provide vector data is not sufficient for the operation, there is a data wait and processing power is lost.
Each central processing unit has sixteen paths, each transferring eight bytes per machine cycle, for moving vector data between main memory and the vector registers. Full memory bandwidth can be maintained for contiguous and equally spaced data. Further, the use of SSRAM provides a very short memory bank cycle time of only 2 clocks, which minimizes the delay from any bank conflicts that do occur.
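As a rough consistency check, sixteen eight-byte paths move 128 bytes per machine cycle. Combined with the per-CPU throughput of 16 gigabytes per second quoted in the main memory unit description, this implies a machine cycle of about 8 ns. The sketch assumes 1 gigabyte = 10^9 bytes; the variable names are illustrative.

```python
PATHS_PER_CPU = 16                    # from the text
BYTES_PER_PATH_PER_CYCLE = 8          # from the text
CPU_THROUGHPUT_BYTES_PER_S = 16e9     # 16 Gbytes/s per CPU, assuming 1 Gbyte = 10**9 bytes

bytes_per_cycle = PATHS_PER_CPU * BYTES_PER_PATH_PER_CYCLE
implied_clock_hz = CPU_THROUGHPUT_BYTES_PER_S / bytes_per_cycle
implied_cycle_ns = 1e9 / implied_clock_hz

print(bytes_per_cycle, implied_clock_hz, implied_cycle_ns)
```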
For example, there are various access patterns for accessing array data, such as by row, by column, or diagonally, as shown in Figure 6-3. These are not necessarily all accesses of consecutive addresses in main storage. Since the SX-4 can maintain the same data transfer rate for contiguous and equally spaced vectors (accessing an array by row or diagonally), it maintains a uniformly high vector processing performance level.
Figure 6-3 Array Data and its Placement in Main Storage
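The access patterns above can be made concrete by computing element offsets. The sketch assumes column-major (Fortran-style) storage, which the text does not state explicitly; under that assumption a column is contiguous, while a row and the main diagonal are equally spaced with constant strides. All names are illustrative.

```python
def element_offsets(n_rows, n_cols, pattern):
    """Offsets (in elements) touched by one access pattern of a column-major array."""
    def offset(i, j):
        return i + j * n_rows  # column-major address of element (i, j)

    if pattern == "column":
        return [offset(i, 0) for i in range(n_rows)]
    if pattern == "row":
        return [offset(0, j) for j in range(n_cols)]
    if pattern == "diagonal":
        return [offset(k, k) for k in range(min(n_rows, n_cols))]
    raise ValueError("unknown pattern: " + pattern)


def constant_stride(offsets):
    """Return the stride if the offsets are equally spaced, else None."""
    diffs = {b - a for a, b in zip(offsets, offsets[1:])}
    return diffs.pop() if len(diffs) == 1 else None
```

For a 4 x 4 array this gives a stride of 1 for a column, 4 for a row, and 5 for the diagonal, so all three patterns are either contiguous or equally spaced.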
Figure 6-4 shows the internal configuration of a scalar unit.
A scalar unit roughly consists of a cache memory unit, an instruction issue unit, and a scalar operation unit.
In order to access instructions and operands at high speeds, the cache memory unit provides one 64K byte cache memory for instructions and one for operands (a total of 128K bytes).
The instruction issue unit reads an instruction, decodes it, and performs controls such as computing operand addresses and distributing instructions to the scalar operation unit and the vector unit.
To accelerate instruction issue, there is an 8 Kbyte instruction buffer between the cache memory and the instruction issue unit. The speed of accessing instructions has been increased by a superscalar architecture that has two sets of instruction issue pipelines. This makes it possible to decode two instructions in one machine cycle.
The scalar operation unit provides 128 scalar registers and a pipelined operation unit consisting of a floating point multiplier, a floating point add/shift unit, a divider, and two fixed point adders.
Figure 6-4 Scalar Unit Configuration
In order to improve the effective performance of the system, it is particularly important to maximize the performance of the scalar processing unit. This is accomplished by adopting an operation pipeline format like that for vector operations as the operation format for scalar operations.
Previous generation computers that adopted a pipeline control format only pipelined operations such as instruction decoding, address computation, and data fetching. In many cases, addition, multiplication, and other operations were not pipelined. Thus, if several machine cycles were needed to execute an operation, execution of operations in the next instruction was delayed, as shown in Figure 6-5 (a).
In contrast, the SX-4 scalar unit implements a pipeline format even for scalar operations. Therefore, as long as there is no contention for registers, the execution of operations is processed in parallel, as shown in Figure 6-5 (b).
Figure 6-5 Effects of Scalar Operation Pipelining
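The difference between the two cases in Figure 6-5 can be approximated with a simple cycle count. This is a toy model, not SX-4 timing data: it assumes independent operations with no register contention, one operation issued per cycle, and a hypothetical fixed latency per operation.

```python
def cycles_unpipelined(num_ops, latency):
    """Each operation must finish before the next one starts."""
    return num_ops * latency


def cycles_pipelined(num_ops, latency):
    """A new operation enters the pipeline every cycle."""
    if num_ops == 0:
        return 0
    return latency + (num_ops - 1)


# Four independent operations with an assumed latency of three cycles each.
before = cycles_unpipelined(4, 3)
after = cycles_pipelined(4, 3)
```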
Executing branch instructions quickly is extremely difficult. For a conditional branch instruction, the branch destination instruction or the next instruction after the branch instruction cannot be executed until that branch instruction is executed. Therefore branch instructions cannot be preprocessed. Thus branch instructions cannot be executed at high speed even though branch instructions appear quite frequently.
However, examining the behavior of branch instructions reveals that most of them lean strongly toward either branching or not branching, as shown in Figure 6-6. This bias makes it possible to predict the direction of a branch instruction. The SX-4 system executes branch instructions at high speed by predicting the branch direction and decoding and executing the instructions in the predicted direction in advance.
Moreover, the central processing unit provides a high speed instruction buffer for storing instructions to be executed. If the branch destination instruction is already in the instruction buffer, it is fetched from the buffer without reading from cache or main storage. This minimizes wasted main storage accesses and at the same time makes it possible to access instructions at high speed, since a large portion of the instructions in iterated loops are fetched from the instruction buffer.
Figure 6-6 Character of Iteration Loops and Branch Instructions
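Why biased branches are predictable can be shown with a generic predictor. The text does not describe the actual SX-4 prediction mechanism, so the sketch below uses a simple "predict the same direction as last time" scheme purely as an illustration; all names are hypothetical.

```python
def last_outcome_accuracy(outcomes, initial_prediction=True):
    """Fraction of branches predicted correctly by a last-outcome predictor."""
    if not outcomes:
        return 0.0
    prediction = initial_prediction
    correct = 0
    for taken in outcomes:
        if prediction == taken:
            correct += 1
        prediction = taken  # remember the most recent direction
    return correct / len(outcomes)


# A loop-closing branch: taken 99 times, then falls through once on loop exit.
loop_branch = [True] * 99 + [False]
accuracy = last_outcome_accuracy(loop_branch)
```

Only the final loop exit is mispredicted in this example, which is the kind of bias that Figure 6-6 depicts for iteration loops.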
The main memory unit (MMU) is a large capacity, high speed unit for accessing large volumes of data. It has a maximum memory capacity of eight gigabytes per node and a maximum interlace of 1024 ways which provides a 512 Gbyte per second bandwidth.
In order to realize a large capacity high performance main memory unit, a 15-nanosecond access time 4 Mb Bi-CMOS synchronous SRAM is used. Throughput of 16 gigabytes per second per central processing unit is realized.
Each main memory unit card holds a maximum of 256 megabytes in 32 banks. With a maximum of 32 main memory unit cards per node, a maximum memory capacity of eight gigabytes is realized. In the maximum configuration of 16 nodes, a maximum of 512 main memory unit cards and a maximum memory capacity of 128 gigabytes are realized.
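The capacity and bandwidth figures quoted in this section follow from the per-card and per-CPU numbers. The arithmetic below is a sketch that assumes 1 gigabyte = 1024 megabytes and takes 32 CPUs as the largest single-node model listed in Table 6-1; the variable names are illustrative.

```python
# Figures taken from the text.
MBYTES_PER_CARD = 256
BANKS_PER_CARD = 32
CARDS_PER_NODE = 32
MAX_NODES = 16
CPUS_PER_NODE = 32             # largest single-node model in Table 6-1
GBYTES_PER_S_PER_CPU = 16

node_capacity_gbytes = MBYTES_PER_CARD * CARDS_PER_NODE // 1024       # per node
node_banks = BANKS_PER_CARD * CARDS_PER_NODE                          # matches the 1024-way interlace
node_bandwidth_gbytes_per_s = CPUS_PER_NODE * GBYTES_PER_S_PER_CPU    # per node
system_cards = CARDS_PER_NODE * MAX_NODES                             # 16-node system
system_capacity_gbytes = node_capacity_gbytes * MAX_NODES             # 16-node system
```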
The main memory unit also provides automatic correction of one block errors, detection of two block errors, and complete RAS facilities.
Table 6-1 shows the specifications of the main memory unit and Figure 6-7 its configuration.
| | Compact Models | Single-node Models |
|---|---|---|
| Model | Ce, 1C, 2C, 3C, 4C | 4, 8, 16, 24, 32 |
| Memory capacity | 256 M to 2 Gbytes | 512 M to 8 Gbytes |
| Interlace | 32 to 256 ways | 128 to 1,024 ways |
| Maximum transfer rate | 8 G to 64 Gbytes/second | 64 G to 512 Gbytes/second |

| | Multi-node Models | |
|---|---|---|
| Model | 16M2, 16H3 | 512M16, 512H16 |
| Memory capacity | 8 Gbytes | 64 G to 128 Gbytes |
| Interlace | 512 ways × 2 | 1,024 ways × 16 |
| Maximum transfer rate | 56 Gbytes/second | 8 Tbytes/second |
Figure 6-7 Single-node Main Memory Unit Configuration
The extended memory unit (XMU) is a large capacity semiconductor storage device with a maximum memory capacity of eight gigabytes that uses 16 Mb MOS dynamic RAMs as storage elements. It is possible to connect one XMU on compact models, four on single-node models, and up to 48 on multi-node models. Each XMU performs block data transfer operations at rates up to 4 gigabytes per second.
Using the extended memory unit as a logical disk can greatly reduce the time needed for input/output, making it possible to process large-scale computations at higher speeds.
The features of the extended memory unit are as follows.
As shown in Figure 6-8, the extended memory unit consists of a central controller (XMC) and a memory unit (MU). The central controller controls the entire extended memory unit and has an interface to the main memory.
The memory unit consists of a maximum of four memory modules and each memory module has a maximum capacity of two gigabytes.
| Memory capacity | 2G to 8G bytes |
|---|---|
| Maximum transfer rate | 4G byte/sec. |
| Data transfer block | 64 bytes |
| Error checking | ECC |
Figure 6-8 Configuration of Extended Memory Unit
Nodes are interconnected using HIPPI coupling or IXS coupling.
HIPPI coupling on a multi-node model uses HIPPI (100M byte per second ANSI X3T9.3 standard) as the internode connection interface.
See Figure 2-3 of Chapter 2, Overview of the SX-4 Series for the configuration of a HIPPI coupled multi-node model.
IXS coupling on a multi-node model uses an 8G byte per second optical interface and connects a maximum of sixteen nodes by inserting an internode crossbar switch (IXS) that realizes bisection bandwidth of up to 128G bytes per second.
Since the internode crossbar switch provides data mover facilities, data in the main memory of another node can be accessed without using the central processing unit of that other node.
See Figure 2-4 of Chapter 2, Overview of the SX-4 Series for the configuration of an IXS coupled multi-node model.
6.5 INPUT/OUTPUT PROCESSOR
An input/output processor (IOP) controls data transfer between the main memory unit and peripheral devices.
An input/output processor consists of four channel management units (CMU) that are controlled by the firmware and channel units (CHU) that control the interface with peripheral devices. It is connected to the main memory unit.
It is connected to an input/output multiplexer, peripheral devices, or network equipment via a HIPPI interface.
Figure 6-9 Input/Output Processor Configuration
The HIPPI interface consists of input channels that control the reading of data from peripheral devices and output channels that control the sending of data to peripheral devices.
There can be up to eight input channels and up to eight output channels per IOP. Each input channel and output channel can transfer data at a rate of 100 Mbytes per second.
Input channels and output channels in a HIPPI interface often are used in pairs. The usual pair consisting of one input channel and one output channel is called a "one channel pair."
An input/output processor can have a HIPPI interface with up to eight channel pairs. Depending on the system configuration, the number of channel pairs can be increased from a minimum of two to a maximum of 384, in increments of two channel pairs.
Figure 6-10 Input/Output Processor Configuration by Model Group (panels: compact models, single-node models, and multi-node models in maximum configuration)
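The channel figures above give a simple upper bound on HIPPI input/output bandwidth. The arithmetic below is only a sketch of peak rates in each direction: it ignores protocol and device overhead, and the count of input/output processors is inferred from the per-IOP limit rather than stated in the text.

```python
CHANNEL_MBYTES_PER_S = 100      # per input or output channel (from the text)
MAX_PAIRS_PER_IOP = 8           # channel pairs per input/output processor (from the text)
MAX_PAIRS_PER_SYSTEM = 384      # maximum configuration (from the text)

# Peak rate in each direction for one fully populated IOP.
iop_peak_mbytes_per_s = MAX_PAIRS_PER_IOP * CHANNEL_MBYTES_PER_S

# Inferred minimum number of IOPs needed to host 384 channel pairs.
min_iops_for_max_config = MAX_PAIRS_PER_SYSTEM // MAX_PAIRS_PER_IOP

# Peak rate in each direction for the maximum configuration.
system_peak_mbytes_per_s = MAX_PAIRS_PER_SYSTEM * CHANNEL_MBYTES_PER_S
```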
An input/output multiplexer (IOX) is connected to a HIPPI channel port to provide a SCSI channel or other general purpose interface. It can be used to connect various low cost peripheral devices as well as to provide an Ethernet or other network interface.
The automatic operation controller (AOC) is a device that provides facilities for reducing operator workload and automating system operation.
This device provides facilities for automatically powering on or off equipment incidental to the computer system (such as the power supply or air conditioning), and facilities for connecting various sensors for detecting problems in the environment external to the system (such as fire, earthquake, temperature, and humidity).
The automatic operation controller provides the following features.
Figure 6-11 shows the configuration of this device.
Figure 6-11 Automatic Operation Controller Configuration
The SX-4 Series supports a wide range of peripheral devices. Major categories include:
Availability of various devices is dependent on local certification and operational requirements. Consult your local NEC representative for complete information on appropriate SX-4 peripherals for your area.
Some of the major peripherals that are widely available are described below.
The NEC N7764 RAID Subsystem is a high performance device which supports RAID 3. It consists of a single or a dual controller and arrays of data and parity disk devices. The N7764 can sustain in excess of 75 Mbytes per second data transfer bandwidth. It attaches to the SX-4 Series via HIPPI channels or a HIPPI network.
This subsystem features hot swap capability for failed drives and includes redundant power supplies and fans. The data from failed drives can be recreated during continuous operation on the subsystem.
When dual controllers are installed, the second controller can be used concurrently with the primary controller; it can also continue operation if the primary controller (or its channel connections) fails.
The disk arrays are configured in groups of 8 data drives and 1 parity drive.
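Recreating the data of a failed drive from a parity group of this shape relies on XOR parity. The sketch below shows the general RAID 3 idea for eight data blocks and one parity block; it is a generic illustration, not the N7764 controller's implementation, and the function names are hypothetical.

```python
from functools import reduce

DATA_DRIVES = 8  # eight data drives plus one parity drive per group (from the text)


def parity_block(blocks):
    """Byte-wise XOR across equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_missing(surviving_blocks, parity):
    """Recover the one missing data block from the survivors and the parity block."""
    return parity_block(list(surviving_blocks) + [parity])


# Example: one small block per data drive, then drive 5 fails.
data = [bytes([i, (i * 3) % 256, 255 - i]) for i in range(DATA_DRIVES)]
parity = parity_block(data)
survivors = data[:5] + data[6:]
recovered = rebuild_missing(survivors, parity)
```

Because XOR is its own inverse, combining the seven surviving data blocks with the parity block yields exactly the contents of the failed drive.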
The configuration of the N7764 and specifications for the various models are shown in the following table.
| Item Model | N7764-21 | N7764-22 | N7764-23 |
|---|---|---|---|
| Storage capacity/subsystem (G bytes) | 33.6 | 67.2 | 134.4 |
| Controller (units) | 1 | 1 | 1 |
| Expansion controllers (units) | - | 1 | 1 |
| Channel interfaces/controller | 1 | 1 | 1 |
| Data transfer rate/channel* (M bytes/sec) | 35 or more | 75 or more | 75 or more |
| Number of logical drives (units) | 1 | 1 | 2 |
| Data drives/logical drive (units) | 8 | 16 | 16 |
| Parity drives/logical drive (units) | 1 | 2 | 2 |
* Average when accessing all disk drives simultaneously
Requirements for high on-line storage capacity without the requirements of HIPPI class performance may be satisfied by configuring SCSI disk subsystems through the IOX low speed channel multiplexer.
NEC supplies a wide range of SCSI disk devices. The following table presents main characteristics of a subset of the NEC SCSI disk products available.
| Item | N7736-81 | N7736-82 | N7759-69 |
|---|---|---|---|
| Capacity (formatted) | 8.3 to 33.2 Gbytes | 16.8 to 67.2 Gbytes | 8.4 Gbytes |
| Data Transfer Rate | 20 Mbytes per second | 20 Mbytes per second | 10 Mbytes per second |
| RAID Level | 5 | 5 | 3 |
| Hot Swap | Yes | Yes | Yes |
| Other | Battery Backup standard | Battery Backup standard | |
The SX-4 Series supports a full range of tape subsystems, including IBM 3480 compatible, 8 mm, and 4 mm, in single station, stacker, and automated library configurations, as well as high performance specialized subsystems such as D-2 and NDP.
Both on-line and network tape server/archive subsystems are supported. Consult your NEC representative to discuss the most appropriate configuration and products for your requirements.
The SX-4 is designed as a computational server in a networked environment. Networks supported include HIPPI (800 Mbits/sec) based switches and related hardware using TCP/IP protocol, IPI-3, or raw transport for custom devices.
High performance HIPPI gateways and routers provide access to lower speed ATM (622 Mbits/sec), FDDI (100 Mbits/sec), and Ethernet (10 Mbits/sec) networks. FDDI and Ethernet access can also be configured through an SX-4 IOX.