The batch processing environment for the individual system site can be constructed using various subcommands of the NQS administrator command qmgr(1M).
The functions offered include sending a signal to notify a request of an asynchronous event, modifying the nice value used in priority adjustment, setting and retrieving the timeslice, and setting and modifying the scheduling parameters for batch processing.
These functions are available through NQS commands and system calls.
For details on batch processing that includes NQS, refer to the SUPER-UX User's Guide, SUPER-UX NQS User's Guide, and SUPER-UX System Administrator's Guide.
To run jobs effectively, the various resources offered by the operating system must be used efficiently. Running batch and interactive processes efficiently is especially difficult on a system in which CPU or memory can be consumed without limit because of an operator error.
Performing resource management effectively makes it possible to construct an effective system.
Limit values are set for each NQS queue. A limit can also be specified as an option when a request is submitted. Furthermore, depending on the NQS command, every type of limit can be compared.
Some functions can also be used by programs outside batch processing. The limits of the calling process can be set with the system calls setrlimit(2) and ulimit(2). The limits of the specified process, the specified job, and all processes in the specified job can be set with setrlimitx(2), setrlimitj(2), and setrlimitjp(2), respectively.
You can also get the resource limits of the calling process with the system calls getrlimit(2) and ulimit(2). The resource limits of the specified process, the specified job, and the oldest process in the specified job can be obtained with getrlimitx(2), getrlimitj(2), and getrlimitjp(2), respectively.
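The following is a minimal sketch of reading and lowering a limit of the calling process. It uses Python's standard resource module, which wraps the POSIX getrlimit(2) and setrlimit(2) calls; it is an illustration on a generic POSIX system, not SUPER-UX code. The SUPER-UX specific calls (setrlimitx(2), setrlimitj(2), setrlimitjp(2) and their get counterparts) and symbols such as RLIMIT_UMEM are not available in this module, and the helper names describe_limit and lower_soft_limit are hypothetical.

```python
import resource


def describe_limit(name):
    """Return (soft, hard) for a resource name such as 'RLIMIT_CPU'.

    RLIM_INFINITY is reported as the string 'unlimited'.
    """
    soft, hard = resource.getrlimit(getattr(resource, name))

    def fmt(value):
        return "unlimited" if value == resource.RLIM_INFINITY else value

    return fmt(soft), fmt(hard)


def lower_soft_limit(name, new_soft):
    """Set only the soft limit of the calling process; keep the hard limit."""
    res = getattr(resource, name)
    _, hard = resource.getrlimit(res)
    if hard != resource.RLIM_INFINITY and new_soft > hard:
        raise ValueError("soft limit cannot exceed the hard limit")
    resource.setrlimit(res, (new_soft, hard))
    return resource.getrlimit(res)


if __name__ == "__main__":
    for limit_name in ("RLIMIT_CPU", "RLIMIT_FSIZE", "RLIMIT_NOFILE"):
        print(limit_name, describe_limit(limit_name))
```

An unprivileged process can lower its soft limit and raise it again up to the hard limit, which is why the sketch leaves the hard limit untouched.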
The types of resources are as follows:
In addition to the limitations of the whole memory size, you can also set the following types of limitations:
- Data segment size
- Stack segment size
A File System Resource Limitation Group (RLG) is a group of several file systems bundled together for per-job file size resource management. Refer to mount(1M) in the SUPER-UX System Administrator's Reference Manual for information about creating a file system RLG.
Every type of resource limit has a default value. Memory size is restricted only by the system-wide maximum memory size usable by one process (MAXUMEM).
There is no limit for the CPU consumption time.
There is no limit for the CPU resident time. However, the number of CPUs subject to the resource limitation as to the CPU resident time is restricted by the number of CPUs within a system.
The file size and core file size are restricted by the system-wide maximum file size available to one process for SFS (ULIMIT).
The number of open files for each process is restricted by the system-wide maximum number of open files available to one process (NOFILES).
There is no limit for the number of magnetic tape devices, the number of processes, or file capacity.
The number of tasks for each process is restricted by the maximum number of tasks per process (NTASKPP).
Table 3-1 shows the types of resources and the symbols to specify for the system calls setrlimit(2), setrlimitx(2), setrlimitj(2), and setrlimitjp(2).
| Resource | Symbol | Process | Job | Kernel Default |
|---|---|---|---|---|
| CPU | RLIMIT_CPU | Available | Available | Unlimited |
| File Size | RLIMIT_FSIZE, ulimit(2) | Available | Not supported | Maximum available file size (ULIMIT) |
| Total Memory Size | RLIMIT_UMEM | Available | Available | Limited only by maximum available size (MAXUMEM) |
| Data Segment Size | RLIMIT_DATA | Available | Not supported | Limited only by maximum available size (MAXUMEM) |
| Stack Segment Size | RLIMIT_STACK | Available | Not supported | Limited only by maximum available size (MAXUMEM) |
| Number of MT Devices | RLIMIT_MTDEV | Not supported | Available | Unlimited |
| Number of Processes | RLIMIT_PROC | Not supported | Available | Unlimited |
| Core File Size | RLIMIT_CORE | Available | Not supported | Limited only by maximum available file size (ULIMIT) |
| Total File Capacity | RLIMIT_FSPACE | Available | Available | Unlimited |
| XMU Capacity | RLIMIT_XMUSE | Not supported | Available | Unlimited |
| File System RLG | RLIMIT_RLGn (n = 0, 1, 2, 3) | Not supported | Available | Unlimited |
| Temporary File Capacity | RLIMIT_TMPF | Not supported | Available | Unlimited |
| Number of Open Files | RLIMIT_NOFILE | Available | Available | Each process: Maximum available number of open files (NOFILES). Each job: Unlimited |
| Number of Tasks | RLIMIT_NTASK | Available | Not supported | Limited only by maximum number of tasks (NTASKPP) |
| CPU Resident Time | RLIMIT_CPURESTM | Available | Available | Unlimited |
| Number of CPUs Subject to Resource Limitation as to CPU Resident Time | RLIMIT_NCPURESTM | Available | Available | Limited only by number of CPUs in the system |
The resource limit parameters determine what happens when a soft or hard limit is reached. Tables 3-2 and 3-3 list the per-process and per-job resource limit parameters, respectively, together with the result of reaching the soft limit and the hard limit.
| Resource | Soft Limit Reached | Hard Limit Reached |
|---|---|---|
| RLIMIT_CPU | Sends a process the SIGXCPU signal. | When a process receives the SIGXCPU signal, sets the signal catching function to SIG_DFL. |
| RLIMIT_UMEM | If the exceeding memory is the stack region, sends a process the SIGXMEM signal. Otherwise, returns an error. | If the exceeding memory is the stack region and a process receives the SIGXMEM signal, sets the signal catching function to SIG_DFL. Otherwise, returns an error. |
| RLIMIT_DATA | Returns an error. | |
| RLIMIT_STACK | Sends a process the SIGXMEM signal. | When a process receives the SIGXMEM signal, sets the signal catching function to SIG_DFL. |
| RLIMIT_CORE | Returns an error. | |
| RLIMIT_FSIZE | Sends a process the SIGXFSZ signal. | Sends a process the SIGXFSZ signal and returns an error. |
| RLIMIT_FSPACE | Sends a process the SIGXFSPACE signal. | Sends a process the SIGXFSPACE signal and returns an error. |
| RLIMIT_NOFILE | Returns an error. | |
| RLIMIT_NTASK | Returns an error. | |
| RLIMIT_CPURESTM | Sends a process the SIGXCPU signal. | When a process receives the SIGXCPU signal, sets the signal catching function to SIG_DFL. |
| Resource | Soft Limit Reached | Hard Limit Reached |
|---|---|---|
| RLIMIT_CPU | Sends the SIGXCPU signal to all processes in a job. | When all processes in a job receive the SIGXCPU signal, sets the signal catching function to SIG_DFL. |
| RLIMIT_UMEM | If the exceeding memory is the stack region, sends the SIGXMEM signal to all processes in a job. Otherwise, returns an error. | If the exceeding memory is the stack region and all processes in a job receive the SIGXMEM signal, sets the signal catching function to SIG_DFL. Otherwise, returns an error. |
| RLIMIT_MTDEV | Returns an error. | |
| RLIMIT_PROC | Returns an error. | |
| RLIMIT_XMUSZ | Sends the SIGXXMU signal to all processes in a job. | Sends the SIGXXMU signal to all processes in a job and returns an error. |
| RLIMIT_RLGn (n = 0...3) | Sends the SIGRLGn signal to all processes in a job. | Sends the SIGRLGn signal to all processes in a job and returns an error. |
| RLIMIT_FSPACE | Sends the SIGXFSPACE signal to all processes in a job. | Sends the SIGXFSPACE signal to all processes in a job and returns an error. |
| RLIMIT_TMPF | Sends the SIGXTMPF signal to all processes in a job. | Sends the SIGXTMPF signal to all processes in a job and returns an error. |
| RLIMIT_NOFILE | Returns an error. | |
| RLIMIT_CPURESTM | Sends a process the SIGXCPU signal. | When a process receives the SIGXCPU signal, sets the signal catching function to SIG_DFL. |
The resource usage conditions can be observed from the user site through the system call getresource(2). Using this system call, you can acquire the resource usage information of the process and the job.
The types of resources are as follows:
- Overall memory size (combination of text, data, stack, shared memory, shared library)
- Size of the data region
- Size of the stack region
- On-memory (core image) size
- System CPU consumption time
- User CPU consumption time
- CPU consumption time (summation of system and user CPU consumption times)
- Number of CPUs subject to resource limitation as to the CPU resident time
- CPU resident time
- Total file capacity
- XMU capacity
- File System RLG
- Temporary file capacity
CURR_CPURESTM is not included in CURR_ALL.
You can obtain the accumulated CPU usage time of the process in the job, including the terminated processes, by specifying CURR_ACMCPU resource type. Specifying CURR_CPU gives the total CPU usage time of only currently active processes in the job.
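The difference between CURR_CPU and CURR_ACMCPU for a job can be illustrated with a small model. This is only a sketch of the accounting rule stated in the text, not a call to getresource(2) or getresourcej(2); the Proc class and both function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Proc:
    """One process of a job in this toy model (hypothetical structure)."""
    cpu_seconds: float
    active: bool  # False once the process has terminated


def curr_cpu(job):
    """CPU time of the currently active processes only (CURR_CPU)."""
    return sum(p.cpu_seconds for p in job if p.active)


def curr_acmcpu(job):
    """Accumulated CPU time including terminated processes (CURR_ACMCPU)."""
    return sum(p.cpu_seconds for p in job)


if __name__ == "__main__":
    job = [Proc(10.0, True), Proc(5.0, False), Proc(2.5, True)]
    print("CURR_CPU   :", curr_cpu(job))
    print("CURR_ACMCPU:", curr_acmcpu(job))
```

In this model the two values differ by exactly the CPU time of the terminated processes.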
Table 3-4 shows the type of resource and the symbol to specify for the getresource(2) or getresourcej(2) system calls.
| Resource | Symbol | Process | Job |
|---|---|---|---|
| All Resources | CURR_ALL | Available | Available |
| CPU Time | CURR_CPU | Available | Available |
| Total Memory Size | CURR_UMEM | Available | Available |
| Data Segment Size | CURR_DATA | Available | Available |
| Stack Segment Size | CURR_STACK | Available | Available |
| Memory Size on Memory | CURR_ONMEM | Available | Available |
| Number of MT Devices | CURR_DEV | Not supported | Available |
| Accumulated CPU Time | CURR_ACMCPU | Not supported | Available |
| Number of Processes | CURR_PROC | Not supported | Available |
| File System RLG | CURR_RLG | Not supported | Available |
| Total File Capacity | CURR_FSPACE | Available | Available |
| Temporary File Capacity | CURR_TMPF | Not supported | Available |
| Number of Open Files | CURR_NOFILES | Available | Available |
| Number of Tasks | CURR_NTASK | Available | Not supported |
| CPU Resident Time | CURR_CPURESTM | Available | Available |
NOTICE:
The Resource Block Facility allows the system administrator to manage memory and CPU resources by dividing an entire single node system into multiple blocks. With this facility, groups of processes, each having different attributes such as batch jobs and interactive commands, can use memory and CPU resources separately with minimal mutual interference. The facility also makes it possible to support the scheduling of high-priority, or "urgent", jobs.
Terms:
- Small pages (mainly for system activities, e.g. kernel, daemons and commands)
- Large pages (mainly for supercomputing applications)
- CPUs
For each resource, up to 32 RBs (RB0 to RB31) can be created in the system.
Relationships between RSGs, RBs and processes:
An RSG number of a process is inherited from the parent process at a fork(2) system call. An RSG number of a batch job is set according to the NQS queue attribute. An interactive command may be executed in association with a specific RSG by using the chrsg(1) command.
Just after the system boots, all resources are allocated to RB0, the default RB of each resource type, and all processes belong to RSG0, the default RSG.
Memory RB facility:
If unused pages remain in memory (not in use by other RBs), a process can use pages beyond the amount allocated to its RB, up to the maximum allowed by the RB definition. But when the processes in other RBs reclaim the borrowed memory, pages of the process are swapped out and the memory is returned immediately. After that, swapping occurs within the RB so that the memory load does not influence (interfere with) the other RBs.
CPU RB facility:
In a fashion similar to memory management, as long as idle CPUs exist, processes can use CPUs beyond the number allocated to their RB. But when processes in other RBs reclaim those CPU resources, the CPUs are interrupted and returned immediately.
Create an RSG/RB configuration file to set up RSGs and RBs. In this file, describe the amount of memory and CPU resources to be divided and describe also the relations between RSGs and RBs.
First, use the following command to make a configuration file template in accordance with the current system configuration:
The current memory size, swap area size, and number of CPUs are output as comment lines, and serve as a guide for modifications. The command outputs a valid configuration file and therefore is useful as a template for modification.
Describe the size of memory resources for each RB. Indicate each size in pages. One large page (LP) is 1MB or 4MB, and one small page (SP) is 32KB.
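Because sizes are given in pages, a byte size has to be converted before it is written to the configuration file. The sketch below shows that arithmetic using the page sizes stated above (32KB small pages, 1MB or 4MB large pages); the function name to_pages and the constants are hypothetical helpers, not part of any SUPER-UX tool.

```python
KB = 1024
MB = 1024 * KB
GB = 1024 * MB

SMALL_PAGE = 32 * KB       # one small page (SP)
LARGE_PAGE_1MB = 1 * MB    # one large page (LP) on a 1MB-page system
LARGE_PAGE_4MB = 4 * MB    # one large page (LP) on a 4MB-page system


def to_pages(size_bytes, page_bytes):
    """Convert a byte size to a whole number of pages.

    Raises ValueError if the size is not a multiple of the page size,
    so that rounding is never applied silently.
    """
    if size_bytes % page_bytes != 0:
        raise ValueError("size is not a whole number of pages")
    return size_bytes // page_bytes


if __name__ == "__main__":
    print("256MB in small pages:", to_pages(256 * MB, SMALL_PAGE))
    print("2GB in small pages  :", to_pages(2 * GB, SMALL_PAGE))
    print("2.5GB in 1MB pages  :", to_pages(2560 * MB, LARGE_PAGE_1MB))
    print("6GB in 1MB pages    :", to_pages(6 * GB, LARGE_PAGE_1MB))
```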
Specify the following parameters for each RB.
The number of fixed (non-swappable) pages, such as shared text pages and shared memory pages, is also limited by Imem. When Imem is too small, system calls which make fixed pages, e.g. exec(2) or shmat(2), may fail due to this limitation.
In addition, the system uses several large pages. These pages are allocated from RB0 and are fixed in memory. A shortage in RB0's large pages (Imem) may result in various failures of system services. Thus RB0's Imem for large pages should be set such that the number of fixed pages never exceeds it.
In a fashion similar to configuring the memory resources, describe the CPU allocations for each RB.
Specify the following items separately for each RB.
Except for RB0, Icpu can be set to 0. Processes belonging to an RB for which Icpu is 0 can obtain a CPU only when another RB has an idle CPU which can be borrowed.
For all RSGs, specify the RBs to be used. For each type of resource (SP, LP and CPU) one RB must be specified. An RB can be shared by multiple RSGs.
Use the following command to check whether the created RSG/RB configuration file is correct:
Note that this command can detect a syntax error, but cannot predict a configuration failure due to dynamic memory or CPU loads.
To set RSGs and RBs, execute the following command:
This operation fails under the following conditions:
If /etc/rsg.conf exists, RSGs and RBs are set automatically according to this configuration file when the system enters multi-user mode. This procedure is carried out by the script /etc/rc2.d/S37rb.
Note that RSGs/RBs must be set up after the definition of the swap file, and before NQS starts.
The RSG to which a job belongs can be specified as an NQS parameter.
With the following qmgr(1M) subcommand, an NQS queue may be associated with a particular RSG:
Once a job starts running, the RSG number of the job cannot be changed.
By default, all interactive processes belong to RSG0, but the chrsg(1) command can execute an interactive command in a specified RSG.
When a shell is executed by the chrsg(1) command as in the following example, all programs executed from the new shell prompt will run in the specified RSG.
The chrsg(1) command must have write permission for the given RSG special file. Therefore, users of a given RSG can be limited by the mode and Access Control List of the RSG special files.
When the special file indicating a desired RSG is specified as an argument, the rsginfo(1M) command outputs the size of memory and the number of CPUs allocated to the RSG, as well as its current memory and CPU usage. With options, the rsginfo(1M) command can also display the detailed status of RBs.
For details of the rsginfo(1M) command, refer to the SUPER-UX System Administrator's Reference Manual.
The attribute of an RB can be specified by describing either of the following two character strings in the Attr field of the RSG/RB configuration file.
When a CPU failure occurs, RBs without the fixed attribute become candidates for adjustment first. RBs with the fixed attribute become candidates only when the RBs without the fixed attribute cannot be adjusted completely.
When a process is restart(1)ed, the RSG of the process is set to the RSG to which the restart(1) command belongs. For a batch job, the RSG of a restarted job is set to the RSG defined with the NQS parameter at the restart.
Memory and CPU information which the rsgconf(1M) command outputs reflects the current system information. The values of the memory information may change as the kernel is reconfigured, the swap file setting changed, or the partitioning of small and large pages is dynamically changed.
Therefore, the rsgconf(1M) command should be executed under the same environment as planned for actual operation. Should the configuration file not correspond to the actual memory size and number of CPUs, rsgconf(1M) will try to adjust the resource quantity of the RBs which do not have the fixed attribute. RBs with lower numbers are adjusted preferentially.
When the number of CPUs decreases due to a CPU failure, the setting of the RBs becomes inconsistent with the actual number of CPUs. In this case, the system will try to adjust the resource quantity. If the adjustment does not complete, the RBs with the fixed attribute also become candidates for adjustment. In either case, RBs with lower numbers are adjusted preferentially. The Max and Min values of each RB can be decreased if necessary, according to the actual number of CPUs. If the failed CPU is reconnected, the RB settings return to those in effect before the CPU failure.
While the system is trying to reconnect a CPU, the rsgconf(1M) command cannot change the RSG/RB settings. Retry a few minutes later.
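The order in which RBs become candidates for adjustment can be sketched as a sort: RBs without the fixed attribute come first, and within each group the lower RB number comes first. The sketch only models this ordering as described in the text. How much is taken from each RB is not specified here and is not modeled, and the RB class and adjustment_order function are hypothetical names.

```python
from typing import NamedTuple


class RB(NamedTuple):
    """A resource block in this toy model (hypothetical structure)."""
    number: int
    fixed: bool  # True when the Attr field gives the RB the fixed attribute


def adjustment_order(rbs):
    """Return RBs in the order they become candidates for adjustment.

    Non-fixed RBs sort before fixed ones (False < True), and lower
    RB numbers sort first within each group.
    """
    return sorted(rbs, key=lambda rb: (rb.fixed, rb.number))


if __name__ == "__main__":
    blocks = [RB(1, True), RB(2, False), RB(0, False)]
    for rb in adjustment_order(blocks):
        print(f"RB{rb.number}", "fixed" if rb.fixed else "not fixed")
```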
When the memory RBs are used, it is possible that a given process must borrow one or more memory pages from other RBs in order to be swapped in. The execution of such a process might be interrupted for an arbitrarily long time (until pages become available for borrowing).
Be careful that such a process does not hold a system-wide resource (e.g. lock-file). This situation might arise, for example, when urgent jobs get started using almost all memory resources, swapping out other processes.
Aborting with SIGKILL and checkpointing are the only valid operations for such an interrupted process. When the process flag shown by ps(1) -l has bit 0400 set, the process has been interrupted for a long time.
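Testing for bit 0400 is a simple mask operation. The sketch below assumes that the flag field of ps(1) -l is printed as an octal number, which is an assumption and should be checked against the ps(1) manual page; the function name is hypothetical.

```python
LONG_INTERRUPT_BIT = 0o400  # bit 0400, written as an octal literal


def is_long_interrupted(flag_field):
    """Return True if the flag field has bit 0400 set.

    flag_field is the text of the flag column, assumed here to be octal.
    """
    return bool(int(flag_field, 8) & LONG_INTERRUPT_BIT)


if __name__ == "__main__":
    for field in ("0400", "401", "10", "1000"):
        print(field, is_long_interrupted(field))
```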
Even when the relationships between RSGs and RBs are changed during system operation, relationships between running processes and RBs will not be changed. In other words, a process belonging to the redefined RSG continues to use former RBs.
When a process fork(2)s, the RSG number of the created process or the child process is inherited from the parent process, but RBs are determined based on the current definition of the RSG.
The situation is similar for exec(2): the RBs of the exec(2)ed process are redefined according to the current definition of the RSG.
When the Icpu value is set to 0 in a CPU resource block, the Max value can also be set to 0. However, such a setting makes no sense because the processes will never be dispatched. The only exceptions are aborting with SIGKILL and checkpointing.
An example for setting RSGs and RBs is presented here. A system with 4GB main memory, 10GB of swap files, and 16 processors is assumed.
In this example we will show two RSGs; one for interactive and one for batch jobs. 256MB of main memory and 2GB of swap space are allocated to small pages. For batch jobs, 2.5GB of main memory and 6GB of swap space are used for large pages, while the remaining large pages are allocated to interactive processes. Batch jobs can borrow memory pages from interactive processes, but not vice-versa.
The small pages are shared between batch jobs and interactive processes.
Six of the sixteen CPUs are allocated for batch jobs and the remaining CPUs are allocated to interactive processes. Batch jobs and interactive processes are completely separated. Neither RSG can borrow CPUs from the other.
In the Attr field, the memory RB and the CPU RB for batch jobs are given the fixed attribute so that they are not affected by changes in the system resources.
First, create an RSG/RB configuration template file by using the rsgconf(1M) command as follows:
The contents of tmpfile are as follows:
#
# RSG/RB Configuration File
#
# Available small pages (unit:32KB)
# SP Imem:8236 Iswap:65536
# Available large pages (unit:1MB)
# LP Imem:3790 Iswap:8192
# Available CPUs
# CPU Icpu:16
# Attribute: FIX
# Resource Block configuration.
SP:RB0 Imem:8236 Iswap:65536 Min:0 Max:8236 Attr:NONE
LP:RB0 Imem:3790 Iswap:8192 Min:0 Max:3790 Attr:NONE
CPU:RB0 Icpu:16 Gcpu:0 Min:0 Max:16 Attr:NONE
# Resource Sharing Group configuration.
RSG0 SP:RB0 LP:RB0 CPU:RB0
Note that the actual values vary depending on the configuration of the kernel, so you may not see the same results in your own case.
Next, adjust the file contents as required.
For this example, set them as follows:
#
# RSG/RB Configuration File
#
# Available small pages (unit:32KB)
# SP Imem:8236 Iswap:65536
# Available large pages (unit:1MB)
# LP Imem:3790 Iswap:8192
# Available CPUs
# CPU Icpu:16
# Resource Block configuration.
SP:RB0 Imem:8236 Iswap:65536 Min:0 Max:8236 Attr:NONE
LP:RB0 Imem:1230 Iswap:2048 Min:0 Max:1230 Attr:NONE
LP:RB1 Imem:2560 Iswap:6144 Min:2560 Max:3790 Attr:FIX
CPU:RB0 Icpu:10 Gcpu:0 Min:10 Max:10 Attr:NONE
CPU:RB1 Icpu:6 Gcpu:0 Min:6 Max:6 Attr:FIX
# Resource Sharing Group configuration.
RSG0 SP:RB0 LP:RB0 CPU:RB0
RSG1 SP:RB0 LP:RB1 CPU:RB1
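Before the file is checked on the system, the totals in a hand-edited configuration can be compared with the available resources reported in the template. The sketch below parses lines in the layout shown in this example and sums one field per resource type. The field meanings are inferred only from the sample above, the parser is not a substitute for the rsgconf(1M) check, and all function names are hypothetical.

```python
SAMPLE = """\
# Resource Block configuration.
SP:RB0 Imem:8236 Iswap:65536 Min:0 Max:8236 Attr:NONE
LP:RB0 Imem:1230 Iswap:2048 Min:0 Max:1230 Attr:NONE
LP:RB1 Imem:2560 Iswap:6144 Min:2560 Max:3790 Attr:FIX
CPU:RB0 Icpu:10 Gcpu:0 Min:10 Max:10 Attr:NONE
CPU:RB1 Icpu:6 Gcpu:0 Min:6 Max:6 Attr:FIX
# Resource Sharing Group configuration.
RSG0 SP:RB0 LP:RB0 CPU:RB0
RSG1 SP:RB0 LP:RB1 CPU:RB1
"""


def parse_rsg_conf(text):
    """Split the file into RB definitions and RSG-to-RB mappings."""
    rbs = {}   # (resource type, RB name) -> {field: value}
    rsgs = {}  # RSG name -> {resource type: RB name}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        head, *fields = line.split()
        pairs = dict(field.split(":", 1) for field in fields)
        if head.startswith("RSG"):
            rsgs[head] = pairs
        else:
            kind, name = head.split(":", 1)
            rbs[(kind, name)] = {
                key: (value if key == "Attr" else int(value))
                for key, value in pairs.items()
            }
    return rbs, rsgs


def total(rbs, kind, field):
    """Sum one numeric field over all RBs of a resource type."""
    return sum(v[field] for (k, _), v in rbs.items() if k == kind)


def undefined_references(rbs, rsgs):
    """List (RSG, resource type, RB) triples that name an undefined RB."""
    return [(rsg, kind, rb)
            for rsg, mapping in rsgs.items()
            for kind, rb in mapping.items()
            if (kind, rb) not in rbs]


if __name__ == "__main__":
    rbs, rsgs = parse_rsg_conf(SAMPLE)
    print("LP Imem total :", total(rbs, "LP", "Imem"))
    print("LP Iswap total:", total(rbs, "LP", "Iswap"))
    print("CPU Icpu total:", total(rbs, "CPU", "Icpu"))
    print("undefined RBs :", undefined_references(rbs, rsgs))
```

For the example file, the totals can be compared by eye with the "Available" comment lines that the template prints.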
Check the file for integrity as follows:
If tmpfile has no error, set up the RSGs and RBs as follows:
Rename tmpfile to /etc/rsg.conf to set up RSGs and RBs automatically at system startup.
Next, set NQS queue parameters. For all queues, change the RSG setting to /dev/rsg/1 as follows:
Now all batch jobs will belong to RSG1.
The usage of the rsgconf(1M) command, the output format of the rsginfo(1M) command and the format of the RSG/RB configuration file have been changed as of R6.2. When RSG/RB configuration files of versions prior to R6.2 are used, the rsgconf(1M) command fails with a syntax error.
SUPER-UX has two kinds of Multitask Resource Management facilities: the Number of physical task limit, which restricts the number of physical tasks, and the Number of concurrent processors control, which controls the number of processors that can be used simultaneously.
Number of physical task limit
This facility restricts the number of physical tasks a process can have. If the process attempts to make physical tasks exceeding this limit, the task scheduler cannot create any more physical tasks.
(Note that the task scheduler does not create any more physical tasks if the current number of physical tasks exceeds the value set with the PSTUNE(maxcpu) intrinsic subroutine in FORTRAN. The default value for PSTUNE(maxcpu) is 32.)
Set the Limit Number of physical tasks to each queue by the following qmgr(1M) function:
This operation sets the value as the limit value of the specified queue. The NQS gives this limit value to the job in this queue at job submission.
The default Limit Number of physical tasks for batch jobs is defined by the NTASKPP (=maximum number of tasks per process) system parameter. (The default NTASKPP value is 128.) When the Limit Number of physical tasks is set to the default value, the Number of physical task limit function is not effective because the limit is determined only by NTASKPP.
The Limit Number of physical tasks is extracted and set with the system calls getrlimit(2) and setrlimit(2), respectively in a process. The default value is defined by the NTASKPP (=maximum number of tasks per process) system parameter.
Number of concurrent processors control
This facility controls, through process scheduling, the number of processors which can be used simultaneously. If a task attempts to get a processor when the specified number of processors is already in use, the task is not dispatched until another task releases a processor.
Set a Control Number of concurrent processors to each queue by the following qmgr(1M) function:
This operation sets the value as the Control Number of concurrent processors of the specified queue. The NQS gives this control value to the job in this queue at job submission.
The default Control Number of concurrent processors is defined by the DSGCONCPU (=number of concurrent processors for DSG processes) system parameter. (The default DSGCONCPU value is the number of actual processors.) When the Control Number of concurrent processors is set to the default value, the Number of concurrent processors control function is not effective because the control is determined only by DSGCONCPU.
Set a Control Number of concurrent processors to the system constant DSGCONCPU with the config(1M) command when building a kernel. The default DSGCONCPU value is the number of actual processors.
The DSGCONCPU system parameter and Control Number of concurrent processors in a process can be changed with the dispcntl(2) system call or dcntl(1) command during system operation. However, only the superuser can increase this value.
Additionally, the dispcntl(2) system call and dcntl(1) command can be used for the purpose of changing the Control Number of concurrent processors of a running job. However, only the superuser can increase this value. The NQS manager can also change the Control Number of concurrent processors of a running request by the following qmgr(1M) function:
This operation changes the Control Number of concurrent processors of every process for the request specified as the requestid to the value.
From R7.1, programs (such as MPI, HPF, Pthread, etc.) that are explicitly linked with the POSIX threads library contain a signal task as an additional physical task. The signal task shows up in the concurrent CPU time in the account information or program information (PROGINF), although the user does not normally need to be aware of it. The signal task runs on a processor only when it handles signals for POSIX thread functionality such as pthread_cond_timedwait(3T).

Note that both the Number of physical task limit and the Number of concurrent processors control functions take the signal task into account together with other user tasks. This is because the kernel is not aware of the scheduling done by the threads library and because even the signal task consumes user CPU time. Therefore, for the Number of physical task limit, if a limit value is set in consideration of only user tasks, the number of physical tasks actually created may be less than expected. Similarly, for the Number of concurrent processors control, one user task may not be given a processor while the signal task is running on a processor. On the other hand, for both Resource Management functions, if the value is set to the number of user tasks "plus one", multitasked programs using only macrotasks/microtasks may not be controlled to the expected number of tasks.

The system administrator should first determine whether the signal task (additional physical task) residing in multitasked programs is to be taken into account in the Resource Management functionality. The PSTUNE parameters in FORTRAN are entirely separate from the signal task, so PSTUNE programming does not depend on the existence of the signal task.
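The trade-off described above comes down to simple counting, sketched below. The model assumes that the signal task occupies exactly one slot of the limit, as the text implies; the function names are hypothetical and nothing here queries a real system.

```python
def user_task_budget(limit, has_signal_task):
    """Physical tasks left for user tasks under a given limit.

    A program with a signal task spends one slot of the limit on it.
    """
    return max(limit - 1, 0) if has_signal_task else limit


def limit_for(user_tasks, has_signal_task):
    """Limit value needed so that user_tasks user tasks can exist."""
    return user_tasks + 1 if has_signal_task else user_tasks


if __name__ == "__main__":
    # A limit of 4 chosen with only user tasks in mind:
    print("with signal task   :", user_task_budget(4, True))
    print("without signal task:", user_task_budget(4, False))
    # A limit chosen as "user tasks plus one" for 4 user tasks:
    print("plus-one limit     :", limit_for(4, True))
```

With the "plus one" value, a program that has no signal task can create one more user task than intended, which is the second case the text warns about.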
This system enables scheduling algorithms suitable for batch jobs executed by the NQS to coexist with scheduling algorithms suitable for interactive processing supported by the original UNIX®.
This section first explains the concept of a scheduling group, which allows scheduling algorithms to be assigned to each process. Next, it explains the processing domain, which is introduced so that batch processing can coexist with interactive processing, and the use of a scheduling group in each processing domain. Last, this section explains how to balance the scheduling of these two processing domains.
The CPU scheduling of processes in this system has the following features:
The processes created at system set-up belong to Default Scheduling Group (DSG). They have default scheduling parameters and do not belong to any special scheduling group. In general, swapper, init, daemons, or most interactive processes belong to DSG.
The scheduling algorithms assigned to each scheduling group and the methods of assigning these scheduling groups are explained in the following sections.
This system schedules CPU processes using the CPU priority formula that includes variable parameters.
where the constant is 40.
These values prevent processes running in user mode from using more CPU time than expected.
- Once the job is running in the CPU, (1) removes it from the priority list. The job continues running but does not regain its priority level.
- (3) reduces the job priority by the amount of CPU time required when the job is timesliced by task.
The scheduling parameters that can be changed by a system call are the nice value, base priority, modification value, tick quantum, decay interval, and decay factor. The nice value can be changed with the system call nice(2); the other parameters can be changed with the system call dispcntl(2).

These formulas change the execution priorities depending on the CPU usage status. At each process switch, the system selects the process with the smallest execution priority value among the processes that are ready to run and waiting for the CPU. The smaller this value, the higher the execution priority becomes. While a process is being executed on the CPU, the CPU counter is incremented by the tick quantum every tick, which reduces the execution priority. Since the CPU counter is shifted by the decay factor every decay interval, the longer the waiting time for execution, the higher the execution priority becomes.
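The behavior of the CPU counter described above can be sketched as a small simulation. Only the counter is modeled: the complete priority formula, including the base priority, nice value, and modification value, is not reproduced here. The sketch assumes that the shift is a right shift by the decay factor and that the decay interval is expressed in ticks; both are assumptions, and the function name is hypothetical.

```python
def simulate_cpu_counter(running, tick_quantum=1, decay_factor=1,
                         decay_interval=100):
    """Return the CPU counter value after each tick.

    running        -- iterable of booleans, one per tick; True means the
                      process was executing on a CPU during that tick
    tick_quantum   -- amount added to the counter per tick on the CPU
    decay_factor   -- number of bits the counter is shifted right
    decay_interval -- number of ticks between shifts (assumed unit)
    """
    counter = 0
    history = []
    for tick, on_cpu in enumerate(running, start=1):
        if on_cpu:
            counter += tick_quantum   # running lowers the priority
        if tick % decay_interval == 0:
            counter >>= decay_factor  # waiting lets the counter decay
        history.append(counter)
    return history


if __name__ == "__main__":
    # 100 ticks on the CPU, then 200 ticks waiting.
    trace = simulate_cpu_counter([True] * 100 + [False] * 200)
    print(trace[98], trace[99], trace[199], trace[299])
```

A larger tick quantum makes the counter grow faster while running, and a larger decay factor makes it shrink faster while waiting.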
Variable DSG scheduling parameters are explained below:
The graph in the following figure shows the priority transition process. The preceding scheduling parameters change the shape of the graph.
Figure 3-1 Priority Transition Process
The effects of these parameters are as follows:
By adjusting these parameters, scheduling groups can be created whose scheduling algorithms suit the characteristics of the processes assigned to them.
A group of processes having the same algorithms and sharing a set of parameters included in the priority formula of the CPU scheduling algorithm is called a scheduling group.
The following processes and sets of processes can constitute a scheduling group. The identifier of a scheduling group must be also specified when it is created.
The init process and the various daemons activated at system start-up have scheduling algorithms specified by the system default scheduling parameters. These processes belong to the default scheduling group. The system call dispcntl(2) or the command dcntl(1) is used to create a new scheduling group from the default scheduling group. Refer to the SUPER-UX Programmer's Reference Manual for the use of dispcntl(2) and refer to the SUPER-UX User's Reference Manual for the use of dcntl(1). The scheduling groups should be designed with the operation of the system site in mind. Use scheduling groups to classify NQS batch queues or to distinguish special commands or special users from other processes belonging to the default scheduling group.
The constant parameters of the default scheduling group are shown in Table 3-5 and can be specified with the config(1M) command or dynamically changed with the dispcntl(2) system call during the system operation.
| Parameter | Constant Name | Default Value |
|---|---|---|
| base priority | DSGBASEPRI | 0 |
| modification value of CPU counter | DSGMODCPU | 2 |
| tick quantum | DSGTICKCNT | 1 |
| decay factor | DSGDCYFCTR | 1 |
| decay interval | DSGDCYINTVL | 1 (sec) |
| memory scheduling priority | DSGMEMPRI | 20 |
| timeslice | DSGTMSLICE | 200 (tick) |
| aging range | DSGAGRANGE | 160 |
3.3.1.4 NQS DEFAULT SCHEDULING
The following constant parameters of NQS default scheduling can be specified with the set subcommands of qmgr(1M) for batch queues or can be dynamically changed with the modify subcommands of qmgr(1M) for running requests.
Table 3-6 shows the set subcommands of qmgr(1M). To modify a running request, replace set in the command name with modify request. For convenience, the NQS base priority is defined as the kernel base priority + 60. Similarly, the NQS memory scheduling priority is defined as the kernel memory scheduling priority - 20.
| Parameter | Command Name | Default Value |
|---|---|---|
| base priority | set base_priority | 80 (kernel value 20) |
| modification value of CPU counter | set modcpu | 2 |
| tick quantum | set tickcnt | 0 |
| decay factor | set dcyfctr | 1 |
| decay interval | set dcyintvl | 1 (sec) |
| memory scheduling priority | set memory_priority | 0 (kernel value 20) |
| timeslice | set timeslice | 1000 (msec) |
| aging range | set aging_range | 160 |
For example, you can obtain the same scheduling algorithm as the default scheduling group with a base priority of 60 and tick quantum of 1.
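The offsets between NQS values and kernel values can be checked with a small calculation. The following Python sketch is illustrative only: the function names are hypothetical and are not part of NQS or SUPER-UX. It applies the two offsets stated above (+60 for the base priority, -20 for the memory scheduling priority).

```python
# Offsets stated in the text. The function names are hypothetical helpers,
# not part of qmgr(1M) or any SUPER-UX interface.
BASE_PRIORITY_OFFSET = 60      # NQS base priority = kernel base priority + 60
MEMORY_PRIORITY_OFFSET = -20   # NQS memory priority = kernel memory priority - 20


def kernel_base_priority(nqs_base_priority: int) -> int:
    """Convert an NQS base priority to the kernel base priority."""
    return nqs_base_priority - BASE_PRIORITY_OFFSET


def kernel_memory_priority(nqs_memory_priority: int) -> int:
    """Convert an NQS memory scheduling priority to the kernel value."""
    return nqs_memory_priority - MEMORY_PRIORITY_OFFSET


# The NQS defaults (80 and 0) map to the kernel values listed beside them.
print(kernel_base_priority(80))     # 20
print(kernel_memory_priority(0))    # 20
# An NQS base priority of 60 corresponds to DSGBASEPRI (0).
print(kernel_base_priority(60))     # 0
```

With a base priority of 60 the kernel value is 0, which matches DSGBASEPRI and is why that setting reproduces the default scheduling group's base priority.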
The system may have scheduling algorithms that are advantageous to batch processing. If the system daemons follow the default scheduling group, the daemons may hold resources and wait for execution. System trouble may occur because of the timing of the batch processing.
To avoid this, give these daemons advantageous scheduling parameters. Because some daemons (inetd, init, etc.) have child user processes that must belong to the default scheduling group, these user processes must not inherit the advantageous scheduling parameters.
With dispcntl(2) or dcntl(1), you can set the inheritance times of the scheduling parameters for a process, which determines how many generations can inherit these parameters. Using the inheritance times control function, you can ensure that only the daemons follow the advantageous scheduling algorithm and that user processes generated as their offspring follow the default scheduling group algorithm.
The meaning of the inheritance times of the scheduling parameters is as follows.
A value of n means that n generations (including the process that sets the scheduling parameters for itself) follow the scheduling parameters. A value of 0 is a special case, which means that there is no inheritance limit and the parameters are inherited indefinitely. The value of the default scheduling group is 0.
Like the scheduling parameters mentioned above, the inheritance times can be set with dispcntl(2) or dcntl(1).
The inheritance times value is decremented and passed to the child at the time of fork. If it becomes 0, the child process joins the default scheduling group and takes on its parameters. For example, set inetd's inheritance times to 2 and give it advantageous parameters.
| | inetd | telnetd | sh |
|---|---|---|---|
| inheritance times | 2 | 1 | 0 |
| parameters | [parameters] | [parameters] | [DSGparameters] |
In the pattern of inheritance shown above, inetd and telnetd follow the advantageous scheduling algorithm and can be run quickly. sh follows the scheduling algorithm of the default scheduling group.
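The inheritance pattern above can be modeled in a few lines. This Python sketch is a simplified model and not the kernel implementation: the `Process` class, the `fork` function, and the group name `FAST` are all hypothetical. It assumes only what the text states: the count is decremented at fork, a count that reaches 0 sends the child to the default scheduling group, and a count of 0 on the parent means no limit.

```python
from dataclasses import dataclass

DEFAULT_GROUP = "DSG"  # default scheduling group


@dataclass(frozen=True)
class Process:
    name: str
    group: str           # scheduling group the process follows
    inherit_times: int   # 0 means no inheritance limit


def fork(parent: Process, child_name: str) -> Process:
    """Model how a child receives scheduling parameters at fork time."""
    if parent.inherit_times == 0:
        # No limit: the child stays in the parent's group.
        return Process(child_name, parent.group, 0)
    remaining = parent.inherit_times - 1
    if remaining == 0:
        # The count is used up: the child falls back to the default group.
        return Process(child_name, DEFAULT_GROUP, 0)
    return Process(child_name, parent.group, remaining)


inetd = Process("inetd", "FAST", 2)   # "FAST" is a made-up group name
telnetd = fork(inetd, "telnetd")
sh = fork(telnetd, "sh")
for p in (inetd, telnetd, sh):
    print(p.name, p.group, p.inherit_times)
```

The three printed rows correspond to the columns of the table: inetd and telnetd keep the advantageous group with counts 2 and 1, and sh lands in the default scheduling group with count 0.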
A processing domain distinguishes between processes executed from a terminal and batch jobs executed by the NQS. This system has the following two processing domains:
- Interactive processing domain (IPD): Processes executed from a terminal and various daemon groups belong to this domain. This processing domain is used in interactive processing.
- Batch processing domain (BPD): Batch jobs executed by the NQS belong to this domain. This processing domain is used in batch processing.
The system impartially executes the ready-to-run process with the highest execution priority, whichever of these two domains it belongs to. A domain can include scheduling groups having a variety of scheduling algorithms.
Therefore, tuning facilities for balancing scheduling are provided to distribute CPU time between the two independent domains.
In IPD scheduling, normal processes belong to the default scheduling group having the same algorithms as the original UNIX system. First, the algorithms of the default scheduling group are explained, then the IPD memory scheduling is explained. Last, the creation of a scheduling group in IPD is explained.
Generally, IPD processes belong to the default scheduling group (DSG). The scheduling algorithm of the DSG selects the process to be executed next by evaluating the execution priority value, which is calculated from the CPU counter of the process and its waiting time for execution.
The execution priority is evaluated by the following formula:
where the constant is 40. The base priority contributes nothing because it is DSGBASEPRI (=0) in the default scheduling group. The default modification value of the DSG is specified with DSGMODCPU (=2). In the formula, the smaller the execution priority value, the higher the execution priority of the process.
While an IPD process executes on a CPU, its CPU counter is incremented by the tick quantum with each tick. The default DSG tick quantum is specified with DSGTICKCNT (=1). In addition, the counter is adjusted by the following decay function at each decay interval (in seconds):
The default DSG decay interval is specified with DSGDCYINTVL (=1 sec). The default DSG decay factor is specified with DSGDCYFCTR (=1).
The CPU counter is incremented up to the upper limit given by the aging range. The default DSG aging range is specified with DSGAGRANGE (=160, which is 80% of HZ).
The nice value shifts the base of an IPD process's execution priority. With this value, the CPU allocation of a process can be enhanced or reduced relative to other processes.
The nice value must be an integer in the range 0 through 39. It is changed with the system call nice(2) by giving the deviation value from the present nice value. Only a super-user can decrease the nice value.
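As a rough illustration of how these quantities could combine, the Python sketch below assumes that the execution priority is the CPU counter shifted right by the modification value, plus the nice value, the base priority, and the constant 40. That form is an assumption inferred from the parameter descriptions, not a statement of the kernel's actual code. The decay step is omitted, and the function names are hypothetical.

```python
CONSTANT = 40  # constant term stated in the text


def tick(cpu_counter: int, tick_quantum: int = 1, aging_range: int = 160) -> int:
    """Advance the CPU counter by one tick.

    Assumption: the counter stops growing once it has reached the aging range.
    """
    if cpu_counter < aging_range:
        cpu_counter += tick_quantum
    return cpu_counter


def execution_priority(cpu_counter: int, nice: int,
                       base_priority: int = 0, modcpu: int = 2) -> int:
    """Assumed form: (counter >> modcpu) + nice + base priority + constant."""
    return (cpu_counter >> modcpu) + nice + base_priority + CONSTANT


# A process with nice value 20 that has not yet consumed CPU time.
print(execution_priority(0, 20))        # 60

# After running for a long time the counter saturates at the aging range.
counter = 0
for _ in range(1000):
    counter = tick(counter)
print(counter)                          # 160
print(execution_priority(counter, 20))  # 100
```

Under these assumptions, a process that keeps the CPU drifts toward a larger (weaker) priority value, while a smaller nice value shifts the whole range toward stronger priority.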
A process can run while it is loaded in memory. The memory scheduler swaps processes out and in by evaluating the memory scheduling priority.
A process with a smaller memory scheduling priority value is resident in memory more often. This value must be an integer in the range 0 through 39. For an IPD process, the same value as its nice value should be used. The system call setmempri(2) changes this value by specifying a deviation value; dispcntl(2) changes it by specifying an absolute value. The default memory scheduling priority is specified with DSGMEMPRI (=20).
For example, this priority can be manipulated to give a higher memory scheduling priority to a process with a smaller memory image in order to execute it prior to other processes having a larger memory image.
The scheduling algorithms of the default scheduling group give an impartial opportunity for CPU allocation to all processes. Since IPD processes are executed by a terminal, these processes belong to the default scheduling group. Therefore, a scheduling group should be created to discriminate the CPU scheduling for a process or a set of processes from the CPU scheduling for others.
Jobs enter the BPD through the NQS. The NQS holds scheduling parameters for each queue, and each job to be executed forms a scheduling group with those parameters.
First, CPU scheduling is explained, then other scheduling facilities are explained.
BPD processes form a job in units of an NQS batch request. The processes of a job share the same execution priority parameters and constitute one scheduling group. The BPD job execution priority is determined by the following formula:
A process with a small execution priority value has a high execution priority. In addition, there are other scheduling parameters, such as tick quantum, decay interval, and modification value, which are related to the time change of the execution priority.
The NQS specifies these parameters in the NQS batch queue with the qmgr(1M) command. The NQS administrator should define scheduling parameters in the NQS batch queue depending on the characteristics of the job to be executed in the queue.
These values are set in the NQS batch queue, and each value of the batch queue is specified for the job to be executed. The batch queue can be characterized by these parameters. Scheduling parameters for the batch queue should be determined in consideration of system scheduling balance.
The parameters that a request takes from the NQS batch queue can be changed for that request with the following qmgr(1M) subcommands.
Only the NQS administrator is permitted to perform these operations.
The user can determine the execution priority for a request by adding this value to the base priority that is an attribute of the queue.
Generally, the nice execution value must be an integer in the range -20 through 19. The NQS administrator can extend the minimum value beyond -20 with the qmgr(1M) subcommand set nice_limit. In consideration of system scheduling balance, restrict the priority range selected by the user.
The nice execution value of a request queued in the NQS batch queue can be changed with the user command qalter(1) or the qmgr(1M) subcommand modify request nice_value.
Because each job forms its own scheduling group, the NQS can set the timeslice value for each job.
Each NQS batch queue has its own timeslice value. A job executed through this queue is given this value. The timeslice value of each queue should be determined in consideration of the batch queue configuration defined by the base priorities. It is recommended that queues with higher base priority have shorter timeslice values.
The NQS administrator sets this value to the NQS batch queue with the qmgr(1M) subcommand set timeslice. It is changed for a request with the qmgr(1M) subcommand modify request timeslice.
BPD memory scheduling is also controlled by the memory scheduling priority. The values of the BPD job also determine the priorities with which processes are loaded on the memory.
The NQS administrator sets these values to the NQS batch queue with the qmgr(1M) subcommand set memory_priority. A job executed through this queue is given this value. In consideration of the configuration of the batch queues made by the base priorities, the memory scheduling priority of each queue should be determined. It is recommended that queues with higher base priority have lower memory scheduling priorities. The NQS administrator can change the memory scheduling priority of a request with the qmgr(1M) subcommand modify request memory_priority after the request is queued.
The NQS queue run-limit specifies the maximum number of jobs that can be executed simultaneously through a batch queue. The default value is 1 for each queue. The NQS administrator can specify the limit with a parameter of the qmgr(1M) subcommand create batch_queue when the queue is created, and change it with the subcommand set run_limit. The limit specified for batch queues should take into consideration the configuration of queues by base priority.
The presence of scheduling groups enables IPD and BPD to have individual scheduling algorithms. However, at the time of process switching, the system selects the process having the highest execution priority regardless of its processing domain.
To balance process dispatching in the system, such as the assignment of CPU time, scheduling tuning facilities are supported.
At the time of process switching, when it evaluates the execution priorities of all ready-to-run processes, the system adds the IPD PDD priority or the BPD PDD priority. In this mechanism, the PDD priorities change the values of the execution priorities and consequently enhance or reduce CPU allocation.
During system operation, these values can be changed with the system call setdispval(2). The PDD priority values must be integers of 0 or greater.
At system start-up, the IPD PDD priority is set to INTPDDPRI (see Table 2-1) and the BPD PDD priority is set to BATPDDPRI (see Table 2-4). The default values of INTPDDPRI and BATPDDPRI are 0. These values can be changed with the command config(1M) in consideration of system operation.
The system records the time in which IPD and BPD processes run in the user mode. These time values can be obtained with the system call getdispinfo(2). The system administrator can dynamically tune the balance of system scheduling. For example, the dynamic use of facilities such as the domain dispatching priority or the swap base priority makes it possible to control the ratio of CPU time assigned to each domain.
To reduce the waiting time at task synchronization points, SUPER-UX provides multitask family scheduling facilities for microtasking groups. This section explains the concept of family scheduling, the relation between tasks and microtasking groups, and the tuning method of family scheduling.
In multitasking, spin-waiting is mainly used for task synchronization. Accordingly, if a task is not given a CPU, all the other tasks waiting for that task waste CPU time. Therefore, it is desirable that all tasks in a multitasked program run as nearly simultaneously as possible. Family scheduling is one mechanism for this purpose.
Family scheduling is the following execution priority control mechanism for microtasking groups:
This mechanism is valid only under the following condition: the execution priority of the first dispatched task in a microtasking group at the moment is greater than or equal to 40. (This range implies a user priority level.) Namely, the family scheduling excludes a case in which the first dispatched task is waiting for completion of I/O or other resources.
Furthermore, the family scheduling priority is given only to the tasks ready to run such as those suspended by timeslice.
If family scheduling priorities were fixed to a single high value of 39 for all multitasked programs regardless of NQS queue priority schemes, multitasked programs would always defeat high priority jobs in the high priority queues. As a result, the multitasked programs would upset the queue priority schemes.
To avoid this situation, a slave priority has been introduced for each process as a tunable parameter that enables family scheduling priorities to vary. A family scheduling priority value is calculated by subtracting the slave priority value of the process from the execution priority of the first dispatched task in a microtasking group at the moment. A slave priority must be an integer of 0 or greater.
The slave priority of 0 does not mean the absence of family scheduling control. This is because task execution priorities in a multitasked program generally vary from task to task at a given moment. The slave priority of 0 brings about the weakest family scheduling effect.
Family scheduling priorities are not allowed to be a value less than 39 (that is, higher priority than 39). Therefore, a slave priority greater than a certain value always results in the family scheduling priority of 39. In such a case, the slave priority does not make sense as a parameter to differentiate processes. See Section 3.3.6.4 for more about the range of meaningful slave priorities.
Family scheduling priorities are determined by the following algorithm:
FSpri = max{FDpri - SLpri, 39}
where:
| FSpri | : | family scheduling priority |
| FDpri | : | priority of the first dispatched task in a microtasking group at the moment |
| SLpri | : | slave priority of the process |
max{a, b} = a (if a >= b), = b (if a < b)
For example, suppose that a process has the slave priority of 10, and that the process has four tasks ready to run in a microtasking group. Let the four tasks be T1, T2, T3, and T4 with the execution priority T1pri=60, T2pri=65, T3pri=70 and T4pri=75 respectively. At this time, once T1 is dispatched, the family scheduling mechanism sets the family scheduling priority of 50 to T2, T3 and T4, because FDpri=T1pri=60, SLpri=10 and then FSpri=50. Consequently, T1 immediately starts running, and T2, T3 and T4 wait for CPUs without aging until the next dispatch opportunity.
Figure 3-2 Task Execution Priority Transition by Family Scheduling
If the slave priority of this process is 0, it follows that T2pri=60, T3pri=60 and T4pri=60. Thus, even the slave priority of 0 raises the execution priority of T2, T3 and T4.
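The worked example can be reproduced with a short calculation. The Python sketch below is illustrative: the function name is hypothetical, and it implements only the rule described in the text (subtract the slave priority from the priority of the first dispatched task, and never go below 39).

```python
FS_PRIORITY_FLOOR = 39  # family scheduling priorities never go below 39


def family_scheduling_priority(fd_pri: int, sl_pri: int) -> int:
    """FSpri = max{FDpri - SLpri, 39}."""
    return max(fd_pri - sl_pri, FS_PRIORITY_FLOOR)


# T1 (priority 60) is dispatched first; T2, T3 and T4 receive FSpri.
print(family_scheduling_priority(60, 10))    # 50
# A slave priority of 0 still raises T2..T4 to T1's priority.
print(family_scheduling_priority(60, 0))     # 60
# A large slave priority saturates at the floor.
print(family_scheduling_priority(60, 120))   # 39
```

The last call shows why slave priorities above a certain value no longer differentiate processes: every result collapses to 39.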
Family scheduling is a mechanism for microtasking groups. There are two cases of relations between tasks and microtasking groups.
When a root task (root thread) starts at the beginning of the program, a microtasking group is generated and the root task belongs to the group. Additionally, all macrotasks (threads) created later and all microtasks reserved by the root task also belong to the group.
For microtasks reserved by a macrotask (thread) other than the root task, a separate microtasking group is generated at every reservation of the microtasks. In this case, a master microtask and the created slave microtasks together belong to the microtasking group generated at their reservation.
A separate microtasking group is generated at every reservation of microtasks. A master microtask and the created slave microtasks together belong to the microtasking group generated at their reservation.
Macrotasks (threads) do not belong to any microtasking group.
All microtasks are connected with the family scheduling. However, macrotasks (threads) are not connected with the family scheduling in the second case (when using ANALYZER).
Slave priorities can be set to a set of processes in the interactive processing domain (IPD) and also to each batch queue.
If a large slave priority value is set, the tasks having family scheduling priorities are likely to get CPUs as soon as possible. The waiting time at task synchronization points can then be reduced, and the program can run efficiently with a shorter spin-waiting time. However, the multitasked program interferes with other running programs during family scheduling control.
If a small slave priority value is set, the waiting time at task synchronization points may become larger due to weak family scheduling.
For batch jobs, if a slave priority according to the queue base priority is set, multitasked jobs in the queue can run with the family scheduling priority based on the queue priority scheme.
For example, suppose that the possible range of priorities for queue Q1 is 60 to 70 and the range for queue Q2 is 80 to 90. If you now set a slave priority of 10 or less to Q2, multitasked jobs in Q2 will never interfere with the jobs in Q1.
Figure 3-3 Family Scheduling for Batch Jobs
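The Q1/Q2 example can be checked numerically. The Python sketch below is one reading of the example, with hypothetical function names: it treats a queue as protected when the strongest family scheduling priority reachable in the other queue is never numerically smaller than the protected queue's weakest priority.

```python
FS_PRIORITY_FLOOR = 39  # family scheduling priorities never go below 39


def strongest_family_priority(queue_best_pri: int, slave_pri: int) -> int:
    """Strongest (smallest) family scheduling priority a queue can reach."""
    return max(queue_best_pri - slave_pri, FS_PRIORITY_FLOOR)


def interferes(protected_worst_pri: int, queue_best_pri: int, slave_pri: int) -> bool:
    """True if family scheduling can lift a job above the protected queue."""
    return strongest_family_priority(queue_best_pri, slave_pri) < protected_worst_pri


def max_safe_slave_priority(protected_worst_pri: int, queue_best_pri: int) -> int:
    """Largest slave priority that keeps the protected queue undisturbed."""
    return max(queue_best_pri - protected_worst_pri, 0)


# Q1 spans 60..70 (weakest value 70); Q2 spans 80..90 (strongest value 80).
print(max_safe_slave_priority(70, 80))   # 10
print(interferes(70, 80, 10))            # False
print(interferes(70, 80, 11))            # True
```

The result agrees with the text: a slave priority of 10 or less on Q2 keeps its multitasked jobs from overtaking jobs in Q1.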
Set a slave priority to each queue by the following qmgr(1M) functions:
This operation sets the priority value as the slave priority of the specified queue. The NQS gives this slave priority to the job in this queue at the job submission. The default slave priority value for batch jobs is 0, which means the weakest family scheduling.
Set a slave priority to system constant SLAVEPRI with the config(1M) command when building a kernel. The default SLAVEPRI value is 120, which assures a family scheduling priority is always fixed to a single high value of 39 regardless of the 'nice' value under the following condition: the scheduling parameters (base priority, modification value of CPU counter, aging range, tick quantum, and processing domain dispatching priority) of the scheduling group to which the multitasked process belongs are respectively equal to those of the default scheduling group.
The super-user can change the SLAVEPRI value with the setdispval(2) system call during system operation.
Additionally, the setslavepri(2) system call is provided for changing the slave priority of a running process or job. However, only the super-user can increase the priority. The NQS manager can also change the slave priority of a running request with the following qmgr(1M) function.
This operation changes the slave priority of every process for the request specified as the requestid to the priority value.
In order to fix the family scheduling priority to 39 regardless of the 'nice' value, slave priorities satisfying the following condition must be set individually:
(slave priority) >= (((aging range) + (tick quantum) - 1) >> (modification value of CPU counter))
+ (base priority (kernel value))
+ (processing domain dispatching priority)
+ 40
where the scheduling parameters that appear are those of the scheduling group to which the multitasked program in question belongs.
For the default scheduling group, the default SLAVEPRI value of 120 satisfies the above condition.
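The threshold can be evaluated for a given parameter set. The Python sketch below reads the condition above as: slave priority >= ((aging range + tick quantum - 1) >> modification value) + base priority (kernel value) + processing domain dispatching priority + 40. The function name and argument names are hypothetical.

```python
def min_slave_priority_for_fixed_fs(aging_range: int, tick_quantum: int,
                                    modcpu: int, kernel_base_pri: int,
                                    pdd_pri: int) -> int:
    """Smallest slave priority that pins the family scheduling priority at 39."""
    return (((aging_range + tick_quantum - 1) >> modcpu)
            + kernel_base_pri + pdd_pri + 40)


# Default scheduling group: aging range 160, tick quantum 1, modification
# value 2, base priority 0, processing domain dispatching priority 0.
threshold = min_slave_priority_for_fixed_fs(160, 1, 2, 0, 0)
print(threshold)           # 80
print(120 >= threshold)    # True: the default SLAVEPRI satisfies the condition

# NQS default queue values: tick quantum 0 and kernel base priority 20.
print(min_slave_priority_for_fixed_fs(160, 0, 2, 20, 0))   # 99
```

For a batch queue using the NQS defaults, this reading suggests a slave priority of at least 99 would be needed to pin the family scheduling priority at 39.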
The scheduling function that simultaneously assigns CPUs to a parallel-processing program, such as a microtask or MPI program, is called the gang scheduling function. This function makes it possible to execute multiple parallel-processing programs in a system with almost the same efficiency as when a single program is executed.
In SUPER-UX, the gang scheduling function is implemented by allowing or prohibiting the execution of parallel-processing programs at fixed intervals. This reduces system overhead. However, the gang scheduling function must be used with care. Use this function only after understanding the following descriptions.
The gang scheduling function is a program product and corresponds to the following program product number.
The SUPER-UX gang scheduling function allocates N CPUs to a parallel-processing program that is an object of scheduling so that the program can use those CPUs at any time. The allocation is changed by rescheduling at a fixed interval (the scheduling interval).
A parallel-processing program to which CPUs are allocated can use the allocated CPUs at any time. However, a CPU that is not being used by the program is used by another process that is not a parallel-processing program. (Note 1)
In SUPER-UX, CPUs are reserved for a parallel-processing program. Therefore, it is necessary to define how many CPUs a parallel-processing program uses. Also, the system must not allocate more CPUs than the reserved number to a parallel-processing program during execution.
In the gang scheduling function, a parallel-processing program and the number of CPUs used by a parallel-processing program are defined as follows.
A program that is created as a multitasking program. When the Control Number of concurrent processors exceeds 1, a microtask or macrotask program is scheduled as a parallel-processing program. The number of CPUs to be used equals the Control Number of concurrent processors.
Supplement
In NQS, the SEt CPu Count subcommand of qmgr sets the Control Number of concurrent processors in units of queues. The Control Number of concurrent processors can also be set by qsub assuming that the value set to the queue is the upper limit.
Gang scheduling is performed for the program using MPPG in MPI/SX. Even if MPPG extends over nodes, gang scheduling is performed for the program in multinode by synchronizing nodes.
In MPI/SX, a logical node is created in each node. MPPG consists of multiple logical nodes. The upper limit (hereafter, referred to as the Control Number of concurrent processors of a logical node) is defined for the number of CPUs allocated to the whole process group included in a logical process.
This value is defined as the number of CPUs to be used in each logical node. The number of CPUs to be used in the entire MPI program is the sum of the Control Numbers of concurrent processors of all the logical nodes.
The Control Number of concurrent processors of a logical node equals the Control Number of concurrent processors (of a process) set by NQS. (Note 2)
In gang scheduling, it is assumed that the processes included in one logical node belong to the same CPU resource block (hereafter, CPU RB). If a logical node includes processes belonging to multiple CPU RBs, those processes are not treated as objects of gang scheduling.
Only the above mentioned parallel-processing programs that are executed on a CPU RB to which a gang scheduling attribute is set are an object of gang scheduling.
The following CPU RB definitions are also applied to the gang scheduling function.
Especially note the following.
You can specify the upper limit of the total number of CPUs used by parallel-processing programs. This is the Gcpu value.
You can also specify for each CPU RB whether it is an object of gang scheduling. If you set the gang scheduling attribute for an RB, its Gcpu value must be at least 2 and at most the Max value. If you do not set it, the Gcpu value must be 0.
The upper limit of the total number of CPUs used by parallel-processing programs in a system is the sum of the Icpu values of the CPU RBs having the gang scheduling attribute. (Note 3)
Besides the above mentioned conditions, the following conditions must also be satisfied to perform scheduling.
The default parallelizing degree when executing a conventional microtask is the Max value of CPU RB. The following condition must be satisfied so as to execute the microtask efficiently using the default parallelizing degree.
Max value = Gcpu value = Control Number of concurrent processors (Number of CPUs to be used).
If the number of CPUs to be used exceeds the Icpu value of CPU RB, CPUs are not allocated to the microtask primarily. In this case, the following condition must be satisfied in order to increase the RB throughput.
Icpu value = Control Number of concurrent processors (Number of CPUs to be used)
It is recommended that the CPU RB and NQS queue be set as follows.
Max value = Gcpu value = Icpu value = Control Number of concurrent processors (Number of CPUs being used)
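The constraints on a CPU RB can be expressed as a simple consistency check. The Python sketch below is hypothetical throughout: `CpuRB`, `check_rb`, and `is_recommended` are illustrative names, not SUPER-UX interfaces. It encodes only the Gcpu rule for RBs with and without the gang scheduling attribute and the recommended equality of the Max value, Gcpu value, Icpu value, and Control Number of concurrent processors.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CpuRB:
    max_cpu: int   # Max value of the CPU RB
    gcpu: int      # Gcpu value
    icpu: int      # Icpu value
    gang: bool     # True if the gang scheduling attribute is set


def check_rb(rb: CpuRB) -> List[str]:
    """Return the Gcpu rule violations for one CPU RB (empty if none)."""
    problems = []
    if rb.gang:
        if not 2 <= rb.gcpu <= rb.max_cpu:
            problems.append("Gcpu must be between 2 and the Max value")
    elif rb.gcpu != 0:
        problems.append("Gcpu must be 0 without the gang scheduling attribute")
    return problems


def is_recommended(rb: CpuRB, control_number: int) -> bool:
    """True if Max = Gcpu = Icpu = Control Number of concurrent processors."""
    return (rb.gang and not check_rb(rb)
            and rb.max_cpu == rb.gcpu == rb.icpu == control_number)


rb = CpuRB(max_cpu=8, gcpu=8, icpu=8, gang=True)
print(check_rb(rb))            # []
print(is_recommended(rb, 8))   # True
```

A check like this could be run over a planned RB and queue layout before applying it, to catch a gang scheduling RB whose Gcpu value falls outside the allowed range.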
If the gang scheduling attribute is set for the default CPU RB, its interactive response may degrade severely. Similarly, the response of daemons, including the NQS daemon, degrades. It is therefore better not to set the gang scheduling attribute for the default CPU RB.
If forced to set the gang scheduling attribute to the default CPU RB, the Gcpu value and the Control Number of concurrent processors must be less than the Icpu value, and the parallelizing degree of the microtask must be equal to the Control Number of concurrent processors.
Note 1: A parallel-processing program is scheduled by the gang scheduler, while CPUs that it does not use are scheduled by the CPU scheduler. Those CPUs are therefore used by other processes that are not parallel-processing programs.
Note 2: Strictly speaking, the Control Number of concurrent processors of a logical node equals the Control Number of concurrent processors of the process that generated the logical node. When using NQS, the Control Number of concurrent processors is set for a job, so the Control Number of concurrent processors of a logical node is equal to that of the job.
Note 3: If all CPUs are used for gang scheduling, the following Icpu definition for a CPU RB that does not have the gang scheduling attribute cannot be honored during a given allocation interval:
It is guaranteed that CPUs can be used at any time up to the Icpu value.
Therefore, CPUs that are allocated to a CPU RB that does not have the gang scheduling attribute cannot be used for gang scheduling.
This section describes the memory control functions specific to the SUPER-UX system, as well as how to specify the system parameters related to memory control.
The SX series high-speed calculation capabilities can be attributed to its use of a real memory method. The real memory method is notable in that a process (all images existing in a virtual process space) to be executed must be placed in main memory in its entirety. The memory control of the SUPER-UX system is based on a swapping method, which transfers an entire process between main memory and swap area, thus reducing the system overhead. (Segments shared by multiple processes, such as texts and shared memory, however, are not subject to swapping.) If swapping is set up incorrectly, it may adversely affect system performance when the processing load is heavy.
To enable both the execution of large-scale programs and the efficient use of main memory, the SX series can use four page sizes: 32KB, 1MB, 4MB, and 16MB. A 32-KB page is called a small-page, and 1-MB, 4-MB, and 16-MB pages are called large-pages. Small-pages are used for the U block (four or nine pages in length), which is required for all processes, and for executing small-scale programs such as commands. Large-pages are used for executing large-scale programs, notably those coded in FORTRAN.
On the SX-4, the size of a large-page is set to 1MB on a system which has a main memory capacity of 8GB or less, and is set to 4MB on a system which has a main memory capacity of more than 8GB.
On the SX-5, 4-MB and 16-MB pages coexist in a system. A program using 8G layout or 32G layout uses 4-MB pages and a program using 512G layout uses 16-MB pages.
The system management such as the resource block configuration treats large-pages in 1MB or 4MB units. Therefore, the system administrator does not need to take 16-MB pages into consideration at system design.
For details about the memory layout, refer to Section 1.2 in the User's Guide.
| A 16-MB page is sometimes specifically called a huge-page. However, the SUPER-UX memory management functions treat it as a large-page. |
How to set the system constants related to the basic functions of memory control, as well as related notes, is described next.
Figure 3-4 SPMEM
The default value of SHARETEXT is 1, which specifies the mode in which text segments are shared. When a program that uses large pages is executed several times in this mode, the amount of main storage used is reduced because the text segments are shared. However, since text segments are fixed in main memory and not swapped out, the maximum size of an executable process is reduced by the size of those segments. To make the maximum size of an executable process large, set this parameter to 0.
Since pages (such as those dynamically acquired by the system and those having a segment fixed by plock(2)) are resident in main memory, the maximum size of an executable process cannot be completely assured. Text segments used in a small-page process are shared irrespective of the setting of SHARETEXT.
Figure 3-5 Fixed Pages and Maximum Process Size
The SUPER-UX system allows the process size to be limited for all processes by specifying system constants, or for each process or job. Process size limitation can avoid unnecessary swapping and the exclusive use of system resources by a certain process, thus improving the system operation efficiency.
The SUPER-UX system allows the maximum process size to be limited by using the following parameters.
| The default value virtually specifies no upper limit for a small-page process. |
To limit the size of an interactive process, specify the limitation for each user in the /etc/userlim file, which is referenced whenever a user logs in to the system. For details, see userlim(4) and logindefs(4) in the SUPER-UX Programmer's Reference Manual.
The amount of memory used for a batch job can be limited by using the NQS function for each queue, each job, or each process. A limitation specified by using the resource limit function cannot exceed that specified by using the system constants described previously. For details, see the SUPER-UX NQS User's Guide.
The following devices can be assigned to the swap area.
- Extended memory unit (XMU) -- SX-4 only
- Magnetic disk unit connected via HIPPI (RAID)
- Magnetic disk unit under the CHE
- Magnetic disk unit under the IOX
It is also possible to assign the large-page swap area to the array disk while assigning the small-page swap area to the XMU. To assign the small-page swap area to an N7763 magnetic disk, set the block size to 32KB or 16KB. An N7763 magnetic disk unit having a block size of 64KB can be used for large pages only. For an explanation of how to specify the block size of the array disk, see the IBM 9570 Disk Array Subsystem Operator's Guide.
When a magnetic disk unit under the IOX is used as a swap area, response time may degrade severely. This is because the transfer rate of a magnetic disk unit is too slow to transfer the large memory contents of the SX series efficiently.
Swap-in process priority:
Swap-out process priority:
For details about the MRT, refer to Section 3.4.4.
The Memory Resident Time (MRT) is the minimum time that a process stays in memory, measured from when it is swapped in or allocated memory until it can be swapped out. Setting an MRT value prevents swapping from occurring too frequently. MRT is expressed in seconds and is determined by the following formula.
where,
| A | Process size effect coefficient |
| B | MEMPRI effect coefficient |
| C | Minimum value of MRT |
| SIZE | Process size (unit: MB) |
| MEMPRI | Memory scheduling priority of process |
Adjust these values according to the performance of the device used for the swap file, the size of the jobs to be submitted, and their priority.
The following describes how to set each parameter.
See the SUPER-UX Programmer's Reference Manual for system call details and the SUPER-UX System Administrator's Guide for NQS details.
Figures 3-6 and 3-7 show that MRT is effective in reducing swapping.
Figure 3-6 Swapping with Small MRT
Figure 3-7 Swapping with Big MRT
Normally, MRT is not applied to a process waiting for I/O termination or resources. However, if the MEMPRI value is smaller than the value obtained by subtracting 20 from the value of sprocmrth, MRT is applied even if the process is in sleep mode.
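The sleep-mode rule above can be written as a one-line check. This is only a sketch of the stated condition; the function name and the sample values are hypothetical and are not taken from SUPER-UX itself.

```python
def mrt_applies_while_sleeping(mempri: int, sprocmrth: int) -> bool:
    """Return True when MRT is still applied to a sleeping process.

    Rule from the text: MRT applies in sleep mode only if MEMPRI is
    smaller than (sprocmrth - 20).
    """
    return mempri < sprocmrth - 20


# Hypothetical values chosen only to exercise the rule.
print(mrt_applies_while_sleeping(mempri=85, sprocmrth=120))   # 85 < 100
print(mrt_applies_while_sleeping(mempri=100, sprocmrth=120))  # 100 < 100 is false
```

Note that the comparison is strict: a MEMPRI equal to sprocmrth minus 20 does not satisfy the condition.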
This section gives a tuning example that uses the function explained in Section 3.4.4.
| Queue Name | CPU Limit | Memory Limit | Base Priority | Run Limit |
|---|---|---|---|---|
| q1 | 600 sec | 40 MB | 85 | 2 |
| q2 | 1800 sec | 200 MB | 85 | 2 |
| q3 | 7200 sec | 500 MB | 85 | 1 |
| q4 | 18000 sec | 700 MB | 85 | 1 |
The following batch jobs are placed in the queues, and throughput is measured. These programs compute direct products of large arrays and write the results to files in binary form. This operation is repeated until a specified CPU time has elapsed.
| Queue Name | CPU | Memory | Number of Jobs |
|---|---|---|---|
| q1 | 180 sec | 38 MB | 3 |
| q2 | 300 sec | 186 MB | 2 |
| q3 | 300 sec | 491 MB | 1 |
| q4 | 300 sec | 689 MB | 1 |
Since the total memory size used by these batch jobs (sum of memory size used times multijob level) is approximately 1.6GB maximum, swapping takes place constantly.
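The 1.6GB figure can be reproduced from the two tables above. The sketch below assumes that the multijob level of each queue corresponds to its Run Limit, which the text does not state explicitly; the variable names are illustrative.

```python
# Per queue: (memory used by one job in MB, run limit taken as the multijob level).
# Values come from the queue and job tables in this section.
queues = {
    "q1": (38, 2),
    "q2": (186, 2),
    "q3": (491, 1),
    "q4": (689, 1),
}

# Sum of (memory size used x multijob level) over all queues.
total_mb = sum(mem_mb * run_limit for mem_mb, run_limit in queues.values())

print(total_mb)                   # total in MB
print(round(total_mb / 1000, 1))  # rough total in GB
```

Under this assumption the total is 1628 MB, which matches the approximate 1.6GB quoted in the text.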
Note that running jobs often sleep while waiting for an I/O operation to complete. With the default settings, MRT becomes ineffective when a process sleeps. In general, MRT should remain effective even while a batch job is sleeping; for jobs that perform frequent I/O, such as those used here, the difference is more pronounced. To keep MRT effective while a process is sleeping, set system constant SPROCMRTH and queue attribute Memorypriority to satisfy the following relationship:
| Queue Name | A | B | C | D | E |
|---|---|---|---|---|---|
| q1 | 2 | 5 | 15 | 20 | 40 |
| q2 | 2 | 10 | 20 | 30 | 60 |
| q3 | 2 | 20 | 35 | 50 | 100 |
| q4 | 2 | 40 | 65 | 90 | 180 |
Among the MRT-related parameters, only the minimum value of MRT is adjusted. This is because a high-speed device, the XMU, is used as the swap device, so the process size has only a slight effect on the swap I/O time. In a system using a swap device with low I/O performance, the time required for swapping depends largely on the process size. In such a case, the process size effect coefficient for MRT should also be adjusted.
| | A | B | C | D | E |
|---|---|---|---|---|---|
| Run time | 1:11:17 | 1:01:49 | 0:53:43 | 1:00:44 | 1:06:23 |
As the results show, pattern C is optimum for the system configuration model given in this example. Although larger MRT values are set in patterns D and E, these patterns show poorer throughput. As described earlier, the reason is that the CPU is often idle because no runnable process exists in memory.
This example shows that simply setting a large MRT value is not enough; select the value most appropriate for the system.
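The run times in the results table can be compared numerically by converting them to seconds. The following is a minimal sketch; the helper name `to_seconds` is illustrative, and the data is copied from the table.

```python
def to_seconds(hms: str) -> int:
    """Convert an 'H:MM:SS' string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


# Run time for each MRT pattern, as listed in the results table.
run_times = {
    "A": "1:11:17",
    "B": "1:01:49",
    "C": "0:53:43",
    "D": "1:00:44",
    "E": "1:06:23",
}

seconds = {pattern: to_seconds(t) for pattern, t in run_times.items()}
best = min(seconds, key=seconds.get)

print(best, seconds[best])
# Saving of the best pattern relative to pattern A, as a fraction of A's run time.
print((seconds["A"] - seconds[best]) / seconds["A"])
```

Both the smallest MRT setting (pattern A) and the largest (pattern E) run longer than pattern C, which is why the choice is treated as a tuning problem rather than a matter of maximizing MRT.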
Before attempting to suppress swapping by using MRT, try to design queues so as to minimize swapping.
If a program aborts because of memory shortage, the following message is output to the console screen:
If this message is output frequently, the system design likely has a problem. Check whether the following problems exist:
In this case, adjust the memory resource limit properly. For details, see Sections 3.2 and 3.4.2.
For details of the swap(1M) command, see the SUPER-UX System Administrator's Reference Manual. For details of the sar(1M) command, see Chapter 5, System Activities.
Swap files should not be added just because the swap area is insufficient. Consider also reducing the multijob level to minimize the frequency of swapping. As a guideline, the swap area should be about two to three times as large as main memory.
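The sizing guideline can be expressed as a simple range. This is a sketch only; the function name and the 8 GB main memory size are hypothetical and are not taken from any particular SX configuration.

```python
from typing import Tuple


def swap_area_guideline(main_memory_gb: float) -> Tuple[float, float]:
    """Return the (low, high) swap area size suggested by the guideline:
    about two to three times the size of main memory."""
    return (2 * main_memory_gb, 3 * main_memory_gb)


# Hypothetical machine with 8 GB of main memory.
low, high = swap_area_guideline(8)
print(low, high)
```

The result is a starting point for sizing, to be weighed against the multijob level as described above.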
When many large pages are resident in memory, it is recommended that system constant SHARETEXT be set to 0 to reduce resident pages in memory.
For details on SHARETEXT, see Section 3.4.1.
3.4.6.2 THROUGHPUT DETERIORATED BY FREQUENT SWAPPING
If swapping occurs frequently, the CPU time used by the system increases, which decreases throughput. To check how often swapping occurs, use the sar(1M) command. The following explains how to examine swapping status from sar information.
For details of the sar command, see Chapter 5, System Activities.
If these checks show that swapping occurs frequently, tuning should be performed as follows.