Chapter 3


Tuning Guide

3.1 BATCH PROCESSING

3.1.1 NQS

The batch processing environment for the individual system site can be constructed using various subcommands of the NQS administrator command qmgr(1M).

3.1.2 Job Management

3.2 RESOURCE MANAGEMENT

To run jobs effectively, the resources provided by the operating system must be used efficiently. This is especially difficult on a system where an operator error can allow CPU time or memory to be consumed without limit, because both batch processing and interactive processing then suffer.

Effective resource management makes it possible to construct an efficient system.

3.2.1 Resource Limitations

Limit values are set for each NQS queue. They can also be specified as options when a request is submitted. In addition, every type of limit can be checked with the appropriate NQS command.

Some functions can also be used for programs outside batch processing. The limits of the calling process can be set with the system calls setrlimit(2) and ulimit(2). The limits of a specified process, a specified job, and all processes in a specified job can be set with setrlimitx(2), setrlimitj(2), and setrlimitjp(2), respectively.

The resource limits of the calling process can be obtained with the system calls getrlimit(2) and ulimit(2). The resource limits of a specified process, a specified job, and the oldest process in a specified job can be obtained with getrlimitx(2), getrlimitj(2), and getrlimitjp(2), respectively.
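
As a rough illustration of the getrlimit(2) and setrlimit(2) pair, the following Python sketch uses the standard resource module, which wraps the POSIX calls. It reads the limits of the calling process and changes only the soft limit. The helper name set_soft_limit is an assumption made for this example; the SUPER-UX-specific calls (setrlimitx(2), setrlimitj(2), setrlimitjp(2) and their get counterparts) are not exposed by this module and are not shown.

```python
import resource


def set_soft_limit(resource_id, new_soft):
    """Set the soft limit of the calling process and return the new pair.

    The hard limit is passed back unchanged, so the soft limit can later
    be raised again (up to the hard limit) without special privileges.
    set_soft_limit is a hypothetical helper, not part of any system API.
    """
    _old_soft, hard = resource.getrlimit(resource_id)
    resource.setrlimit(resource_id, (new_soft, hard))
    return resource.getrlimit(resource_id)


if __name__ == "__main__":
    # Example: suppress core files for this process (soft limit 0).
    soft, hard = set_soft_limit(resource.RLIMIT_CORE, 0)
    print("core file size limit: soft =", soft, "hard =", hard)
```

Only the calling process and the children it creates afterwards are affected, which corresponds to the per-process case described above.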

The types of resources are as follows:

Every type of resource limit has a default value. Memory size is restricted by the single system-wide maximum memory size usable by one process (MAXUMEM).

There is no limit for the CPU consumption time.

There is no limit for the CPU resident time. However, the number of CPUs subject to the CPU resident time limitation is restricted by the number of CPUs in the system.

The file size and core file size are restricted by the single system-wide maximum file size defined for SFS (ULIMIT).

The number of open files for each process is restricted by the single system-wide maximum number of open files per process (NOFILES).

There is no limit for the number of magnetic tape devices, the number of processes, or file capacity.

The number of tasks for each process is restricted by the maximum number of tasks per process (NTASKPP).

Table 3-1 shows the types of resources and the symbols to specify for the system calls setrlimit(2), setrlimitx(2), setrlimitj(2), and setrlimitjp(2).

Table 3-1 Set Limit Resource Types

Resource                  Symbol             Process        Job            Kernel Default
CPU                       RLIMIT_CPU         Available      Available      Unlimited
File Size                 RLIMIT_FSIZE,      Available      Not supported  Maximum available file size (ULIMIT)
                          ulimit(2)
Total Memory Size         RLIMIT_UMEM        Available      Available      Limited only by maximum available size (MAXUMEM)
Data Segment Size         RLIMIT_DATA        Available      Not supported  Limited only by maximum available size (MAXUMEM)
Stack Segment Size        RLIMIT_STACK       Available      Not supported  Limited only by maximum available size (MAXUMEM)
Number of MT Devices      RLIMIT_MTDEV       Not supported  Available      Unlimited
Number of Processes       RLIMIT_PROC        Not supported  Available      Unlimited
Core File Size            RLIMIT_CORE        Available      Not supported  Limited only by maximum available file size (ULIMIT)
Total File Capacity       RLIMIT_FSPACE      Available      Available      Unlimited
XMU Capacity              RLIMIT_XMUSE       Not supported  Available      Unlimited
File System RLG           RLIMIT_RLGn        Not supported  Available      Unlimited
                          (n = 0, 1, 2, 3)
Temporary File Capacity   RLIMIT_TMPF        Not supported  Available      Unlimited
Number of Open Files      RLIMIT_NOFILE      Available      Available      Each process: maximum available number of open files (NOFILES)
                                                                           Each job: Unlimited
Number of Tasks           RLIMIT_NTASK       Available      Not supported  Limited only by maximum number of tasks (NTASKPP)
CPU Resident Time         RLIMIT_CPURESTM    Available      Available      Unlimited
Number of CPUs Subject    RLIMIT_NCPURESTM   Available      Available      Limited only by number of CPUs in the system
to Resource Limitation
as to CPU Resident Time

3.2.2 Watching the Resource Limit

The action taken when a soft or hard limit is reached depends on the resource. Tables 3-2 and 3-3 list, for per-process and per-job resource limits respectively, what happens when the soft limit is reached and when the hard limit is reached.

Table 3-2 Per-Process Resource Limits

RLIMIT_CPU
    Soft limit reached: Sends the process the SIGXCPU signal.
    Hard limit reached: When the process receives the SIGXCPU signal, sets the signal catching function to SIG_DFL.

RLIMIT_UMEM
    Soft limit reached: If the excess memory is in the stack region, sends the process the SIGXMEM signal. Otherwise, returns an error.
    Hard limit reached: If the excess memory is in the stack region and the process receives the SIGXMEM signal, sets the signal catching function to SIG_DFL. Otherwise, returns an error.

RLIMIT_DATA
    Returns an error.

RLIMIT_STACK
    Soft limit reached: Sends the process the SIGXMEM signal.
    Hard limit reached: When the process receives the SIGXMEM signal, sets the signal catching function to SIG_DFL.

RLIMIT_CORE
    Returns an error.

RLIMIT_FSIZE
    Soft limit reached: Sends the process the SIGXFSZ signal.
    Hard limit reached: Sends the process the SIGXFSZ signal and returns an error.

RLIMIT_FSPACE
    Soft limit reached: Sends the process the SIGXFSPACE signal.
    Hard limit reached: Sends the process the SIGXFSPACE signal and returns an error.

RLIMIT_NOFILE
    Returns an error.

RLIMIT_NTASK
    Returns an error.

RLIMIT_CPURESTM
    Soft limit reached: Sends the process the SIGXCPU signal.
    Hard limit reached: When the process receives the SIGXCPU signal, sets the signal catching function to SIG_DFL.
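
The soft-limit behavior of RLIMIT_CPU can be tried out with standard POSIX facilities. The sketch below, written with Python's resource and signal modules, installs a SIGXCPU handler, sets a soft CPU-time limit, and consumes CPU time until the signal arrives. The function name wait_for_soft_cpu_limit is an assumption made for this example, and the sketch covers only the soft limit; the SUPER-UX-specific signals (SIGXMEM, SIGXFSPACE, and so on) and the hard-limit handling described above are not reproduced.

```python
import resource
import signal
import time


def wait_for_soft_cpu_limit(extra_seconds=1):
    """Consume CPU time until the soft CPU limit delivers SIGXCPU.

    Returns the signal number received by the handler. Hypothetical
    helper for illustration; it must be called from the main thread.
    """
    received = []

    def on_xcpu(signum, frame):
        # Runs when the soft limit is reached; the process keeps running.
        received.append(signum)

    old_handler = signal.signal(signal.SIGXCPU, on_xcpu)
    old_soft, hard = resource.getrlimit(resource.RLIMIT_CPU)
    # The limit counts CPU time already consumed, so add it to the budget.
    soft = int(time.process_time()) + extra_seconds
    resource.setrlimit(resource.RLIMIT_CPU, (soft, hard))
    try:
        while not received:
            pass  # busy loop: uses CPU time until the handler fires
    finally:
        # Restore the previous soft limit and signal disposition.
        resource.setrlimit(resource.RLIMIT_CPU, (old_soft, hard))
        signal.signal(signal.SIGXCPU, old_handler)
    return received[0]


if __name__ == "__main__":
    print("received signal", wait_for_soft_cpu_limit(1))
```

Because the handler only records the signal, the process continues after the soft limit is reached, which is the opportunity a program has to save its state before the hard limit applies.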

Table 3-3 Per-Job Resource Limits

RLIMIT_CPU
    Soft limit reached: Sends the SIGXCPU signal to all processes in the job.
    Hard limit reached: When all processes in the job receive the SIGXCPU signal, sets the signal catching function to SIG_DFL.

RLIMIT_UMEM
    Soft limit reached: If the excess memory is in the stack region, sends the SIGXMEM signal to all processes in the job. Otherwise, returns an error.
    Hard limit reached: If the excess memory is in the stack region and all processes in the job receive the SIGXMEM signal, sets the signal catching function to SIG_DFL. Otherwise, returns an error.

RLIMIT_MTDEV
    Returns an error.

RLIMIT_PROC
    Returns an error.

RLIMIT_XMUSZ
    Soft limit reached: Sends the SIGXXMU signal to all processes in the job.
    Hard limit reached: Sends the SIGXXMU signal to all processes in the job and returns an error.

RLIMIT_RLGn (n = 0...3)
    Soft limit reached: Sends the SIGRLGn signal to all processes in the job.
    Hard limit reached: Sends the SIGRLGn signal to all processes in the job and returns an error.

RLIMIT_FSPACE
    Soft limit reached: Sends the SIGXFSPACE signal to all processes in the job.
    Hard limit reached: Sends the SIGXFSPACE signal to all processes in the job and returns an error.

RLIMIT_TMPF
    Soft limit reached: Sends the SIGXTMPF signal to all processes in the job.
    Hard limit reached: Sends the SIGXTMPF signal to all processes in the job and returns an error.

RLIMIT_NOFILE
    Returns an error.

RLIMIT_CPURESTM
    Soft limit reached: Sends a process the SIGXCPU signal.
    Hard limit reached: When a process receives the SIGXCPU signal, sets the signal catching function to SIG_DFL.

3.2.3 Acquiring Resource Usage Information

Resource usage can be examined from user programs through the system call getresource(2). With this system call, you can acquire the resource usage information of a process or a job.

The types of resources are as follows:

You can obtain the accumulated CPU usage time of the processes in a job, including terminated processes, by specifying the CURR_ACMCPU resource type. Specifying CURR_CPU gives the total CPU usage time of only the currently active processes in the job.

Table 3-4 shows the type of resource and the symbol to specify for the getresource(2) or getresourcej(2) system calls.

Table 3-4 Get Resource Types

Resource                  Symbol          Process        Job
All Resources             CURR_ALL        Available      Available
CPU Time                  CURR_CPU        Available      Available
Total Memory Size         CURR_UMEM       Available      Available
Data Segment Size         CURR_DATA       Available      Available
Stack Segment Size        CURR_STACK      Available      Available
Memory Size on Memory     CURR_ONMEM      Available      Available
Number of MT Devices      CURR_DEV        Not supported  Available
Accumulated CPU Time      CURR_ACMCPU     Not supported  Available
Number of Processes       CURR_PROC       Not supported  Available
File System RLG           CURR_RLG        Not supported  Available
Total File Capacity       CURR_FSPACE     Available      Available
Temporary File Capacity   CURR_TMPF       Not supported  Available
Number of Open Files      CURR_NOFILES    Available      Available
Number of Tasks           CURR_NTASK      Available      Not supported
CPU Resident Time         CURR_CPURESTM   Available      Available

NOTICE:

3.2.4 Resource Block Facility

The Resource Block Facility allows the system administrator to manage memory and CPU resources by dividing an entire single-node system into multiple blocks. With this facility, groups of processes with different attributes, such as batch jobs and interactive commands, can use memory and CPU resources separately with minimal mutual interference. The facility also makes it possible to support the scheduling of high-priority, or "urgent", jobs.

3.2.4.1 RESOURCE BLOCK FACILITY CONTENTS

Terms:

Relationships between RSGs, RBs and processes:

Memory RB facility:

CPU RB facility:

3.2.4.2 DIRECTIONS FOR USE
  1. Creating an RSG/RB configuration file

    Create an RSG/RB configuration file to set up RSGs and RBs. In this file, describe the amount of memory and CPU resources to be divided and describe also the relations between RSGs and RBs.

The number of fixed (non-swappable) pages, such as shared text pages and shared memory pages, is also limited by Imem. When Imem is too small, system calls which make fixed pages, e.g. exec(2) or shmat(2), may fail due to this limitation.

In addition, the system uses several large pages. These pages are allocated from RB0 and are fixed in memory. A shortage in RB0's large pages (Imem) may result in various failures of system services. Thus RB0's Imem for large pages should be set such that the number of fixed pages never exceeds it.

  • Setting CPU RB definition

    In a fashion similar to configuring the memory resources, describe the CPU allocations for each RB.

    Specify the following items separately for each RB.

    Icpu:
    Number of CPUs to be allocated. Processes using this RB have priority for the use of up to this number of CPUs. In other words, this number of CPUs is guaranteed to be available to this RB. The sum of these values for all RBs must be equal to the number of CPUs available in the system.

    Gcpu:
    Upper limit of available CPUs for gang scheduling. This value is the number of CPUs used for gang scheduling in the RB. The Gcpu value must be equal to or less than the Max value. For details of the Gcpu value, see 3.3.7 Gang Scheduling Function.

    Min:
    Number of CPUs for exclusive use. This number of reserved CPUs cannot be used for other purposes even if they are idle. They will never be lent to other RBs. Min must be equal to or smaller than Icpu. 0 is usually recommended.

    Max:
    Upper limit of available CPUs for use. The processes using this RB cannot use more CPUs than this number of CPUs at any one time. Max must be equal to or greater than Icpu.

    Attr:
    Attribute value of the RB. The Attr field specifies the attribute of the RB. The default value is NONE (written as "Attr:NONE"). For details of the attribute value, see 3.2.4.3, Attribute of RB.

    Except for RB0, Icpu can be set to 0. Processes belonging to an RB for which Icpu is 0 can obtain a CPU only when another RB has an idle CPU that can be borrowed.

  • Setting RSG definition

    For all RSGs, specify the RBs to be used. For each type of resource (SP, LP and CPU) one RB must be specified. An RB can be shared by multiple RSGs.

    2. Checking the RSG configuration file

      Use the following command to check whether the created RSG/RB configuration file is correct:

        /usr/sbin/rsgconf -c [config-file]

      Note that this command can detect a syntax error, but cannot predict a configuration failure due to dynamic memory or CPU loads.

    3. Setting RSGs and RBs

      To set RSGs and RBs, execute the following command:

        /usr/sbin/rsgconf -s [config-file]

      RSG and RB definitions can be changed during system operation by issuing this command at any time. For details of the rsgconf(1M) command, refer to the SUPER-UX System Administrator's Reference Manual.

      This operation fails under the following conditions:

    4. Setting RSG at startup

      If /etc/rsg.conf exists, RSGs and RBs are set automatically according to this configuration file when the system enters multi-user mode. This procedure is carried out by the script /etc/rc2.d/S37rb.

      Note that RSGs/RBs must be set up after the definition of the swap file, and before NQS starts.

    5. NQS queue setting

      The RSG to which a job belongs can be specified as an NQS parameter.

      With the following qmgr(1M) subcommand, an NQS queue may be associated with a particular RSG:

        Set resource_sharing_group=(/dev/rsg/x) queue

      Once a job starts running, the RSG number of the job cannot be changed.

    6. Specifying an RSG for an interactive command

      By default, all interactive processes belong to RSG0, but the chrsg(1) command can execute an interactive command in a specified RSG.

      When a shell is executed by the chrsg(1) command as in the following example, all programs executed from the new shell prompt will run in the specified RSG.

        chrsg /dev/rsg/2 /bin/csh

      The chrsg(1) command requires write permission for the given RSG special file. The users of a given RSG can therefore be restricted through the mode and Access Control List of the RSG special file.

    7. Monitoring resource status

      When the special file of a desired RSG is given as an argument, the rsginfo(1M) command outputs the size of memory and the number of CPUs allocated to the RSG, as well as its current memory and CPU usage. With options, the rsginfo(1M) command can also display the detailed status of RBs.

      For details of the rsginfo(1M) command, refer to the SUPER-UX System Administrator's Reference Manual.

    3.2.4.3 ATTRIBUTE OF RB

    The attribute of an RB is specified by writing either of the following two character strings in the Attr field of the RSG/RB configuration file.

    NONE
    No attribute is specified. (Default)
    FIX
    Each value set for the RB is treated as a fixed value.
    When FIX is specified, the RB is not a candidate for automatic adjustment by rsgconf(1M) or after a CPU failure.
    When the RSG/RB configuration file is inconsistent with the system resources, rsgconf(1M) adjusts the values of the RSG/RB configuration file automatically. However, when the fixed attribute is specified for an RB, the values set for that RB are not candidates for adjustment. When the adjustment cannot be completed, rsgconf(1M) fails.

    When a CPU failure occurs, RBs without the fixed attribute are adjusted first. RBs with the fixed attribute become candidates for adjustment only when the adjustment cannot be completed using the RBs without it.

    3.2.4.4 NOTES

    When a process is restart(1)ed, the RSG of the process is set to the RSG to which the restart(1) command belongs. For a batch job, the RSG of a restarted job is set to the RSG defined with the NQS parameter at the restart.

    The memory and CPU information that the rsgconf(1M) command outputs reflects the current system information. The values of the memory information may change when the kernel is reconfigured, the swap file setting is changed, or the partitioning of small and large pages is dynamically changed.

    Therefore, the rsgconf(1M) command should be executed in the same environment as planned for actual operation. If the configuration file does not correspond to the actual memory size and number of CPUs, rsgconf(1M) tries to adjust the resource quantities of the RBs that do not have the fixed attribute. RBs with lower numbers are adjusted first.

    When the number of CPUs decreases due to a CPU failure, the RB settings become inconsistent with the actual number of CPUs. In this case, the system tries to adjust the resource quantities. If the adjustment cannot be completed, RBs with the fixed attribute also become candidates for adjustment. In either case, RBs with lower numbers are adjusted first. The Max and Min values of each RB can be decreased if necessary, according to the actual number of CPUs. When the failed CPU is reconnected, the RB settings return to those in effect before the CPU failure.

    While a CPU is being reconnected, the rsgconf(1M) command cannot change the RSG/RB settings. Retry a few minutes later.

    When the memory RBs are used, it is possible that a given process must borrow one or more memory pages from other RBs in order to be swapped in. The execution of such a process might be interrupted for an arbitrarily long time (until pages become available for borrowing).

    Be careful that such a process does not hold a system-wide resource (e.g. lock-file). This situation might arise, for example, when urgent jobs get started using almost all memory resources, swapping out other processes.

    Aborting with SIGKILL and checkpointing are the only valid operations for such an interrupted process. When bit 0400 is set in the process flags shown by ps(1) with the -l option, the process has been interrupted for a long time.

    Even when the relationships between RSGs and RBs are changed during system operation, relationships between running processes and RBs will not be changed. In other words, a process belonging to the redefined RSG continues to use former RBs.

    When a process calls fork(2), the child process inherits the RSG number of the parent process, but its RBs are determined from the current definition of the RSG.

    The situation is similar for exec(2): the RBs of the process that calls exec(2) are redefined according to the current definition of the RSG.

    When the Icpu value of a CPU resource block is set to 0, the Max value can also be set to 0. However, such a setting makes no sense because the processes in that RB will never be dispatched. As exceptions, only aborting with SIGKILL and checkpointing are performed.

    3.2.4.5 EXAMPLE FOR SETTING RSGS AND RBS

    An example for setting RSGs and RBs is presented here. A system with 4GB main memory, 10GB of swap files, and 16 processors is assumed.

    In this example we will show two RSGs; one for interactive and one for batch jobs. 256MB of main memory and 2GB of swap space are allocated to small pages. For batch jobs, 2.5GB of main memory and 6GB of swap space are used for large pages, while the remaining large pages are allocated to interactive processes. Batch jobs can borrow memory pages from interactive processes, but not vice-versa.

    The small pages are shared between batch jobs and interactive processes.

    Six of the sixteen CPUs are allocated for batch jobs and the remaining CPUs are allocated to interactive processes. Batch jobs and interactive processes are completely separated. Neither RSG can borrow CPUs from the other.

    In the Attr field, the memory RBs and the CPU RB for batch jobs are given the fixed attribute so that they are not affected by changes in the system resources.
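
The CPU part of this example can be checked against the rules given for Icpu, Gcpu, Min, and Max in 3.2.4.2. The following Python sketch is only an illustration: the CpuRb class, its field names, and the check_cpu_rbs function are assumptions made here and are unrelated to the real RSG/RB configuration file syntax, and the assignment of RB0 to interactive processes and RB1 to batch jobs is likewise assumed. Setting Max equal to Icpu is one way to express that neither group can use the other's CPUs.

```python
from dataclasses import dataclass


@dataclass
class CpuRb:
    """One CPU resource block (hypothetical in-memory representation)."""
    number: int
    icpu: int        # CPUs guaranteed to this RB
    min_cpus: int    # CPUs reserved for exclusive use
    max_cpus: int    # upper limit of CPUs usable at one time
    gcpu: int = 0    # upper limit of CPUs for gang scheduling


def check_cpu_rbs(rbs, cpus_in_system):
    """Return a list of rule violations; an empty list means consistent."""
    errors = []
    if sum(rb.icpu for rb in rbs) != cpus_in_system:
        errors.append("sum of Icpu must equal the number of CPUs")
    for rb in rbs:
        if rb.number == 0 and rb.icpu == 0:
            errors.append("RB0: Icpu must not be 0")
        if rb.min_cpus > rb.icpu:
            errors.append(f"RB{rb.number}: Min must not exceed Icpu")
        if rb.max_cpus < rb.icpu:
            errors.append(f"RB{rb.number}: Max must not be less than Icpu")
        if rb.gcpu > rb.max_cpus:
            errors.append(f"RB{rb.number}: Gcpu must not exceed Max")
    return errors


# 16 CPUs: 10 for interactive processes (RB0, assumed), 6 for batch (RB1, assumed).
# Max == Icpu keeps each block from borrowing CPUs from the other.
example = [
    CpuRb(number=0, icpu=10, min_cpus=0, max_cpus=10),
    CpuRb(number=1, icpu=6, min_cpus=0, max_cpus=6),
]
print(check_cpu_rbs(example, cpus_in_system=16))  # prints []
```

A check of this kind covers only the static rules. As with rsgconf -c, it cannot predict a failure caused by the memory or CPU load at the time the configuration is applied.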

    First, create an RSG/RB configuration template file by using the rsgconf(1M) command as follows:

    The contents of tmpfile are as follows:

    Note that the actual values vary depending on the configuration of the kernel, so you may not see the same results in your own case.

    Next, adjust the file contents as required.

    For this example, set them as follows:

    Check the file for integrity as follows:

        /usr/sbin/rsgconf -c tmpfile

    If tmpfile has no error, set up the RSGs and RBs as follows:

        /usr/sbin/rsgconf -s tmpfile

    Rename tmpfile to /etc/rsg.conf to set up RSGs and RBs automatically at system startup.

    Next, set NQS queue parameters. For all queues, change the RSG setting to /dev/rsg/1 as follows:

    Now all batch jobs will belong to RSG1.

    3.2.4.6 COMPATIBILITY

    The usage of the rsgconf(1M) command, the output format of the rsginfo(1M) command and the format of the RSG/RB configuration file have been changed as of R6.2. When RSG/RB configuration files of versions prior to R6.2 are used, the rsgconf(1M) command fails with a syntax error.

    3.2.5 Multitask Resource Management

    SUPER-UX has two kinds of Multitask Resource Management facilities: the Number of physical task limit, which restricts the number of physical tasks, and the Number of concurrent processors control, which controls the number of processors that can be used simultaneously.

    Number of physical task limit

    This facility restricts the number of physical tasks a process can have. If the process attempts to make physical tasks exceeding this limit, the task scheduler cannot create any more physical tasks.

    (Note that the task scheduler does not create any more physical tasks if the current number of physical tasks exceeds the value set with the PSTUNE(maxcpu) intrinsic subroutine in FORTRAN. The default value for PSTUNE(maxcpu) is 32.)

    Number of concurrent processors control

    This facility controls, through process scheduling, the number of processors that can be used simultaneously. If a task attempts to obtain a processor beyond the specified number, the task is not dispatched until another task releases a processor.

    Additionally, the dispcntl(2) system call and dcntl(1) command can be used for the purpose of changing the Control Number of concurrent processors of a running job. However, only the superuser can increase this value. The NQS manager can also change the Control Number of concurrent processors of a running request by the following qmgr(1M) function:

    This operation changes the Control Number of concurrent processors of every process for the request specified as the requestid to the value.

    NOTE

    From R7.1, in programs (such as MPI, HPF, and Pthread programs) explicitly linked with the POSIX threads library by specifying -lpthread or -lpthread_s, the threads library always creates an additional physical task, which exists outside the user's scope until the program exits. This physical task mainly handles signals in the process and is called the "signal task". When the user sees 4 tasks (i.e., logical task number = physical task number = 4), the program internally has 5 physical tasks. For strict macrotask/microtask programs, which are implicitly linked with the POSIX threads library, the signal task is never created. This is compatible with releases before R7.1.

    The user does not need to be aware of the signal task, but its existence can be seen in the concurrent CPU time in the accounting information or program information (PROGINF). The signal task runs on a processor only when signals are handled in POSIX thread functionality such as pthread_cond_timedwait(3T).

    Note that both the Number of physical task limit and the Number of concurrent processors control functions count the signal task together with the user tasks. This is because the kernel is not aware of the scheduling done by the threads library and because even the signal task consumes user CPU time.

    Therefore, if a Number of physical task limit value is set in consideration of only user tasks, fewer physical tasks may be created than expected. Similarly, under the Number of concurrent processors control, a user task may not be given a processor while the signal task is running on one. Conversely, if either value is set to the number of user tasks plus one, multitasked programs that use only macrotasks/microtasks may not be held to the expected number of tasks.

    The system administrator should first determine whether the signal task (additional physical task) residing in multitasked programs is taken into account in the Resource Management functionality.

    The PSTUNE parameters in FORTRAN are totally separated from the signal task, so the PSTUNE programming does not depend on the existence of the signal task.
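
The counting rule in this note can be written down as a small calculation. The Python sketch below is illustrative only; the function name physical_tasks and its parameters are assumptions made here, and the rule applies as described to R7.1 and later.

```python
def physical_tasks(user_tasks, explicit_pthread_link):
    """Number of physical tasks the kernel counts for one program.

    explicit_pthread_link is True when the program is explicitly linked
    with -lpthread or -lpthread_s, in which case the threads library
    creates one additional "signal task". Hypothetical helper.
    """
    signal_task = 1 if explicit_pthread_link else 0
    return user_tasks + signal_task


# A program explicitly linked with the threads library, 4 user tasks:
print(physical_tasks(4, True))   # prints 5
# A strict macrotask/microtask program with 4 user tasks:
print(physical_tasks(4, False))  # prints 4
```

A limit of 4 therefore admits all user tasks of the second program but not of the first, which is the trade-off the administrator has to decide on.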

    3.3 SCHEDULING

    This system enables scheduling algorithms suitable for batch jobs executed by the NQS to coexist with scheduling algorithms suitable for interactive processing supported by the original UNIX®.

    This section first explains the concept of a scheduling group, which allows scheduling algorithms to be assigned to each process. Next, it explains the processing domains, which are introduced so that batch processing can coexist with interactive processing, and the use of scheduling groups in each processing domain. Last, it explains how to balance the scheduling of these two processing domains.

    3.3.1 Scheduling Group

    The CPU scheduling of processes in this system has the following features:

    The processes created at system set-up belong to the Default Scheduling Group (DSG). They have default scheduling parameters and do not belong to any special scheduling group. In general, the swapper, init, daemons, and most interactive processes belong to the DSG.

    The scheduling algorithms assigned to each scheduling group and the methods of assigning scheduling groups are explained in the following sections.

    3.3.1.1 CPU SCHEDULING ALGORITHMS

    This system schedules processes on the CPUs using a CPU priority formula that includes variable parameters.

    The graph in the following figure shows how the priority changes over time. The preceding scheduling parameters change the shape of the graph.

    Figure 3-1 Priority Transition Process

    The effects of these parameters are as follows:

    By adjusting these parameters, scheduling groups can be created whose scheduling algorithms suit the characteristics of the processes they contain.

    3.3.1.2 ASSIGNMENT OF SCHEDULING GROUPS

    A group of processes having the same algorithms and sharing a set of parameters included in the priority formula of the CPU scheduling algorithm is called a scheduling group.

    The following processes and sets of processes can constitute a scheduling group. The identifier of a scheduling group must also be specified when it is created.

    init and the various daemons activated at system start-up have scheduling algorithms specified by the system default scheduling parameters. These processes belong to the default scheduling group. The system call dispcntl(2) or the command dcntl(1) is used to create a new scheduling group from the default scheduling group. Refer to the SUPER-UX Programmer's Reference Manual for the use of dispcntl(2) and to the SUPER-UX User's Reference Manual for the use of dcntl(1). The design should take the operation of the individual system site into consideration. Use scheduling groups to classify NQS batch queues or to distinguish special commands or special users from the other processes belonging to the default scheduling group.

    3.3.1.3 DEFAULT SCHEDULING GROUP

    The constant parameters of the default scheduling group are shown in Table 3-5 and can be specified with the config(1M) command or dynamically changed with the dispcntl(2) system call during the system operation.

    Table 3-5 Default Scheduling Group Constant Parameters

    Parameter                          Constant Name  Default Value
    base priority                      DSGBASEPRI     0
    modification value of CPU counter  DSGMODCPU      2
    tick quantum                       DSGTICKCNT     1
    decay factor                       DSGDCYFCTR     1
    decay interval                     DSGDCYINTVL    1 (sec)
    memory scheduling priority         DSGMEMPRI      20
    timeslice                          DSGTMSLICE     200 (tick)
    aging range                        DSGAGRANGE     160

    3.3.1.4 NQS DEFAULT SCHEDULING

    The following constant parameters of NQS default scheduling can be specified with the set subcommands of qmgr(1M) for batch queues or can be dynamically changed with the modify subcommands of qmgr(1M) for running requests.

    Table 3-6 shows the set subcommands of qmgr(1M). For running requests, use modify request in place of set in the command name. For convenience, the NQS base priority is defined as the kernel base priority + 60. Similarly, the NQS memory scheduling priority is defined as the kernel memory scheduling priority - 20.

    Table 3-6 Default Scheduling NQS Constant Parameters

    Parameter                          Command Name         Default Value
    base priority                      set base_priority    80 (kernel value 20)
    modification value of CPU counter  set modcpu           2
    tick quantum                       set tickcnt          0
    decay factor                       set dcyfctr          1
    decay interval                     set dcyintvl         1 (sec)
    memory scheduling priority         set memory_priority  0 (kernel value 20)
    timeslice                          set timeslice        1000 (msec)
    aging range                        set aging_range      160

    For example, you can obtain the same scheduling algorithm as the default scheduling group by setting a base priority of 60 and a tick quantum of 1.
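
The relation between NQS values and kernel values can be checked with a short calculation. The sketch below assumes the offsets implied by Table 3-6 (an NQS base priority of 80 corresponds to a kernel value of 20, and an NQS memory scheduling priority of 0 corresponds to a kernel value of 20); the function and constant names are assumptions made for this illustration.

```python
# Offsets implied by Table 3-6 (assumed here, see the lead-in):
#   NQS base priority              = kernel base priority + 60
#   NQS memory scheduling priority = kernel memory scheduling priority - 20
BASE_PRIORITY_OFFSET = 60
MEMORY_PRIORITY_OFFSET = -20


def kernel_base_priority(nqs_value):
    """Convert an NQS base priority to the kernel base priority."""
    return nqs_value - BASE_PRIORITY_OFFSET


def kernel_memory_priority(nqs_value):
    """Convert an NQS memory scheduling priority to the kernel value."""
    return nqs_value - MEMORY_PRIORITY_OFFSET


print(kernel_base_priority(80))    # prints 20 (NQS default)
print(kernel_memory_priority(0))   # prints 20 (NQS default)
print(kernel_base_priority(60))    # prints 0, the value of DSGBASEPRI
```

The last line shows why an NQS base priority of 60 reproduces the base priority of the default scheduling group.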

    3.3.1.5 INHERITANCE OF SCHEDULING PARAMETER

    The system may use scheduling algorithms that favor batch processing. If the system daemons then follow the default scheduling group, a daemon may be left waiting for execution while it holds resources, and depending on the timing of the batch processing this can cause system trouble.

    To avoid this, set these daemons to advantageous scheduling parameters. Because some daemons (inetd, init, etc.) have user processes that must belong to the default scheduling group as their children, their advantageous scheduling parameters must not be inherited by these user processes.

    With dispcntl(2) or dcntl(1), a process can be given an inheritance times value for its scheduling parameters, which determines for how many generations these parameters are inherited. Using this inheritance times control function, you can ensure that only the daemons follow the advantageous scheduling algorithm and that the user processes generated as their offspring follow the algorithm of the default scheduling group.

    The meaning of the inheritance times of the scheduling parameters is as follows:

    A value of n means that n generations (including the process that sets the scheduling parameters for itself) follow the scheduling parameters. The value 0 is a special case, which means that there is no inheritance limit and the parameters are inherited indefinitely. The value for the default scheduling group is 0.

    Along with the scheduling parameters described above, this value can be set with dispcntl(2) or dcntl(1).

    The value of the inheritance times is decremented and passed to the child at the time of fork. If it becomes 0, the child process joins the default scheduling group and receives that group's parameters. For example, set the inheritance times of inetd to 2 and give inetd advantageous parameters.

    In the pattern of inheritance shown above, inetd and telnetd follow the advantageous scheduling algorithm and can be run quickly. sh follows the scheduling algorithm of the default scheduling group.
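
The inheritance rule can be modeled in a few lines. The following Python sketch is a simulation only and calls no system interface; the names fork_child and lineage and the group label "fast" are assumptions made for this illustration.

```python
DSG = "DSG"  # label for the default scheduling group


def fork_child(group, times):
    """Return (group, inheritance_times) for a child created by fork.

    times == 0 means there is no inheritance limit.
    """
    if times == 0:
        return group, 0
    remaining = times - 1
    if remaining == 0:
        # The count is used up: the child joins the default scheduling
        # group, whose own inheritance times value is 0.
        return DSG, 0
    return group, remaining


def lineage(group, times, names):
    """Pair each process in a parent-to-child chain with its group."""
    result = []
    for name in names:
        result.append((name, group))
        group, times = fork_child(group, times)
    return result


# inetd is given advantageous parameters ("fast") with inheritance times 2.
print(lineage("fast", 2, ["inetd", "telnetd", "sh"]))
# prints [('inetd', 'fast'), ('telnetd', 'fast'), ('sh', 'DSG')]
```

With an inheritance times value of 1, only the process that set the parameters would keep them, and its first child would already fall back to the default scheduling group.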

    3.3.2 Processing Domain

    A processing domain distinguishes between processes executed from a terminal and batch jobs executed by the NQS. This system has the following two processing domains:

    The system executes the process ready to run with the highest execution priority in either of these two domains impartially. A domain can include scheduling groups having a variety of scheduling algorithms.

    Therefore, tuning facilities for balancing scheduling are provided to distribute the CPU consumption time into two independent domains.

    3.3.3 IPD Scheduling

    In IPD scheduling, normal processes belong to the default scheduling group having the same algorithms as the original UNIX system. First, the algorithms of the default scheduling group are explained, then the IPD memory scheduling is explained. Last, the creation of a scheduling group in IPD is explained.

    3.3.3.1 CPU SCHEDULING

    Generally, IPD processes belong to the default scheduling group (DSG). The scheduling algorithm of the DSG selects the process to be executed next by evaluating an execution priority value calculated from the CPU counter of each process and its waiting time for execution.

    The execution priority is evaluated by the following formula:

    where the constant is 40. The base priority adds nothing because DSGBASEPRI is 0 for the default scheduling group. The default modification value of the DSG is specified with DSGMODCPU (=2). In the formula, the smaller the execution priority value, the higher the execution priority of the process.

    3.3.3.2 MEMORY SCHEDULING

    While a process is loaded in memory, it is ready to run. The memory scheduler swaps processes out and in by evaluating the memory scheduling priority.

    A process with a smaller memory scheduling priority value stays resident in memory more often. This value must be an integer in the range 0 through 39. For an IPD process, the same value as its nice value should be used. The system call setmempri(2) changes this value by a specified deviation; dispcntl(2) sets it to a specified absolute value. The default memory scheduling priority is specified with DSGMEMPRI (=20).

    For example, this priority can be manipulated to give a higher memory scheduling priority to a process with a smaller memory image in order to execute it prior to other processes having a larger memory image.

    3.3.3.3 CREATING A SCHEDULING GROUP

    The scheduling algorithms of the default scheduling group give an impartial opportunity for CPU allocation to all processes. Since IPD processes are executed from a terminal, these processes belong to the default scheduling group. Therefore, a scheduling group should be created to differentiate the CPU scheduling of a process or a set of processes from that of others.

    3.3.4 BPD Scheduling

    Jobs enter the BPD from the NQS. The NQS holds scheduling parameters for each queue, and each job that is executed forms a scheduling group with those parameters.

    First, CPU scheduling is explained, then other scheduling facilities are explained.

    3.3.4.1 CPU SCHEDULING

    BPD processes are organized into jobs, one job per NQS batch request. The processes of a job share the same execution priority parameters and constitute one scheduling group. The BPD job execution priority is determined by the following formula:

    A process with a small execution priority value has a high execution priority. In addition, there are other scheduling parameters, such as tick quantum, decay interval, and modification value, which are related to the time change of the execution priority.

    The NQS specifies these parameters in the NQS batch queue with the qmgr(1M) command. The NQS administrator should define scheduling parameters in the NQS batch queue depending on the characteristics of the job to be executed in the queue.

    3.3.4.2 MEMORY SCHEDULING

    BPD memory scheduling is also controlled by the memory scheduling priority. The values of the BPD job also determine the priorities with which processes are loaded on the memory.

    The NQS administrator sets this value for an NQS batch queue with the qmgr(1M) subcommand set memory_priority. A job executed through the queue is given this value. The memory scheduling priority of each queue should be chosen with the base-priority layout of the batch queues in mind. It is recommended that queues with higher base priority have lower memory scheduling priorities. After a request is queued, the NQS administrator can change its memory scheduling priority with the qmgr(1M) subcommand modify request memory_priority.

    3.3.4.3 NQS QUEUE RUN-LIMIT

    The NQS queue run-limit limits the number of jobs that can be executing through a batch queue at one time. The default value is 1 for each queue. The NQS administrator can specify the limit with a parameter of the qmgr(1M) subcommand create batch_queue when the queue is created, and change it with the subcommand set run_limit. The limits set for batch queues should take the base-priority layout of the queues into consideration.

    3.3.5 Tuning Facilities Scheduling

    The presence of scheduling groups enables IPD and BPD to have individual scheduling algorithms. However, at the time of process switching, the system selects the process with the highest execution priority regardless of its processing domain.

    To balance the process dispatching of the system such as for CPU time assignment, scheduling tuning facilities are supported.

    3.3.5.1 PROCESSING DOMAIN DISPATCHING (PDD) PRIORITY

    At process switching, when the system evaluates the execution priorities of all ready-to-run processes, it also applies the IPD PDD priority and the BPD PDD priority. In this way the PDD priorities change the execution priority values and consequently raise or lower the priority of a domain for CPU allocation.

    During system operation, these values can be changed with the system call setdispval(2). Each PDD priority value must be an integer greater than 0.

    The initial system start-up values of IPD PDD priority are set to INTPDDPRI (see Table 2-1). The initial values at system start-up of BPD PDD priority are set to BATPDDPRI (see Table 2-4). The default values of INTPDDPRI and BATPDDPRI are 0. These values can be changed with the command config(1M) in consideration of system operation.
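
    The effect of a PDD priority on process selection can be illustrated as follows. This is only a sketch under a stated assumption: the text does not give the exact arithmetic, so the sketch assumes that the PDD priority of a process's domain is simply added to its execution priority value before the comparison. The Proc type, pick_next(), and the process names are hypothetical.

```python
from typing import NamedTuple


class Proc(NamedTuple):
    name: str
    domain: str     # "IPD" or "BPD"
    exec_pri: int   # smaller value = higher priority


def pick_next(ready: list, pdd: dict) -> Proc:
    """Pick the ready process with the smallest domain-adjusted priority value.

    ASSUMPTION: the PDD priority of the process's domain is added to its
    execution priority value before the comparison.
    """
    return min(ready, key=lambda p: p.exec_pri + pdd[p.domain])


ready = [Proc("editor", "IPD", 62), Proc("batch_job", "BPD", 60)]
print(pick_next(ready, {"IPD": 0, "BPD": 0}).name)  # batch_job
print(pick_next(ready, {"IPD": 0, "BPD": 5}).name)  # editor
```

    Under this assumption, raising the PDD priority value of one domain makes every process in that domain less likely to be selected, without touching the scheduling groups inside the domain.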

    3.3.5.2 PROCESSING DOMAIN CPU-TIME ALLOCATION FACILITIES

    The system records the time in which IPD and BPD processes run in the user mode. These time values can be obtained with the system call getdispinfo(2). The system administrator can dynamically tune the balance of system scheduling. For example, the dynamic use of facilities, such as domain dispatching priority or swap base priority, realizes the control of the domain CPU time assignment ratio.
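
    As a starting point for such tuning, the CPU-time ratio between the two domains can be derived from two successive readings of the recorded times. The sketch below assumes the readings are already available as cumulative user-mode seconds per domain; the format returned by getdispinfo(2) is not shown here, and domain_shares() and the sample values are hypothetical.

```python
def domain_shares(prev: dict, curr: dict) -> dict:
    """Return each domain's fraction of user-mode CPU time between two samples."""
    delta = {d: curr[d] - prev[d] for d in curr}
    total = sum(delta.values())
    if total <= 0:
        # No CPU time was consumed in the interval.
        return {d: 0.0 for d in curr}
    return {d: delta[d] / total for d in curr}


# Two hypothetical samples of cumulative user-mode seconds per domain.
before = {"IPD": 100.0, "BPD": 400.0}
after = {"IPD": 130.0, "BPD": 470.0}
shares = domain_shares(before, after)
print(shares)  # {'IPD': 0.3, 'BPD': 0.7}
```

    An administrator could compare these shares with a target ratio and then adjust a facility such as the domain dispatching priority.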

    3.3.6 Multitask Family Scheduling

    To reduce the waiting time at task synchronization points, SUPER-UX provides multitask family scheduling facilities for microtasking groups. This section explains the concept of family scheduling, the relation between tasks and microtasking groups, and the tuning method of family scheduling.

    3.3.6.1 FAMILY SCHEDULING

    In multitasking, spin-waiting is mainly used for task synchronization. Accordingly, if a task is not given a CPU, all the other tasks waiting for it waste CPU time. Therefore, it is desirable that all tasks in a multitasked program run as simultaneously as possible. Family scheduling is one mechanism for this purpose.

    Family scheduling is the following execution priority control mechanism for microtasking groups:

    This mechanism is valid only under the following condition: the execution priority of the first dispatched task in a microtasking group at the moment is greater than or equal to 40. (This range implies a user priority level.) Namely, the family scheduling excludes a case in which the first dispatched task is waiting for completion of I/O or other resources.

    Furthermore, the family scheduling priority is given only to the tasks ready to run such as those suspended by timeslice.

    If family scheduling priorities were fixed to a single high value of 39 for all multitasked programs regardless of the NQS queue priority scheme, multitasked programs would always beat high-priority jobs in the high-priority queues, and so would upset the queue priority scheme.

    To avoid this situation, a slave priority has been introduced for each process as a tunable parameter that lets family scheduling priorities vary. A family scheduling priority value is calculated by subtracting the slave priority value of the process from the execution priority of the first dispatched task in a microtasking group at the moment. A slave priority must be an integer of 0 or greater.

    The slave priority of 0 does not mean the absence of family scheduling control. This is because task execution priorities in a multitasked program generally vary from task to task at a given moment. The slave priority of 0 brings about the weakest family scheduling effect.

    Family scheduling priorities are not allowed to be a value less than 39 (that is, higher priority than 39). Therefore, a slave priority greater than a certain value always results in the family scheduling priority of 39. In such a case, the slave priority does not make sense as a parameter to differentiate processes. See Section 3.3.6.4 for more about the range of meaningful slave priorities.

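    Family scheduling priorities are determined by the following algorithm:

      FSpri = max(FDpri - SLpri, 39)    (applied only when FDpri >= 40)

    where:

      FSpri  is the family scheduling priority given to the ready-to-run tasks in the microtasking group,
      FDpri  is the execution priority of the first dispatched task in the group at the moment, and
      SLpri  is the slave priority of the process.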

    For example, suppose that a process has the slave priority of 10, and that the process has four tasks ready to run in a microtasking group. Let the four tasks be T1, T2, T3, and T4 with the execution priority T1pri=60, T2pri=65, T3pri=70 and T4pri=75 respectively. At this time, once T1 is dispatched, the family scheduling mechanism sets the family scheduling priority of 50 to T2, T3 and T4, because FDpri=T1pri=60, SLpri=10 and then FSpri=50. Consequently, T1 immediately starts running, and T2, T3 and T4 wait for CPUs without aging until the next dispatch opportunity.

    Figure 3-2 Task Execution Priority Transition by Family Scheduling

    If the slave priority of this process is 0, it follows that T2pri=60, T3pri=60 and T4pri=60. Thus, even the slave priority of 0 raises the execution priority of T2, T3 and T4.
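
    The calculation in this example can be reproduced with a short sketch. It follows the rules stated in this section (subtract the slave priority from the priority of the first dispatched task, never go below 39, apply only when the first dispatched task has a priority of 40 or greater). The function name family_priority() and the constant names are hypothetical.

```python
from typing import Optional

FS_FLOOR = 39       # a family scheduling priority may not be less than 39
USER_PRI_MIN = 40   # mechanism applies only when FDpri is 40 or greater


def family_priority(fd_pri: int, slave_pri: int) -> Optional[int]:
    """Return FSpri for a microtasking group, or None when it does not apply."""
    if slave_pri < 0:
        raise ValueError("slave priority must be 0 or greater")
    if fd_pri < USER_PRI_MIN:
        return None
    return max(fd_pri - slave_pri, FS_FLOOR)


tasks = {"T1": 60, "T2": 65, "T3": 70, "T4": 75}
print(family_priority(tasks["T1"], 10))   # 50
print(family_priority(tasks["T1"], 0))    # 60
print(family_priority(tasks["T1"], 120))  # 39
```

    The last call shows why a very large slave priority no longer differentiates processes: the result is pinned at 39.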

    3.3.6.2 FAMILY SCHEDULING AND MICROTASKING GROUPS

    Family scheduling is a mechanism for microtasking groups. There are two cases of relations between tasks and microtasking groups.

    1. Normal case

      When a root task (root thread) starts at the beginning of the program, a microtasking group is generated and the root task belongs to the group. Additionally, all macrotasks (threads) created later and all microtasks reserved by the root task also belong to the group.

      For microtasks reserved by a macrotask (thread) other than the root task, a separate microtasking group is generated at every reservation of the microtasks. In this case, a master microtask and the slave microtasks it creates together belong to the microtasking group generated at their reservation.

    2. When using ANALYZER

      A separate microtasking group is generated at every reservation of microtasks. A master microtask and the slave microtasks it creates together belong to the microtasking group generated at their reservation.

      Macrotasks (threads) do not belong to any microtasking group.

    All microtasks are connected with the family scheduling. However, macrotasks (threads) are not connected with the family scheduling in the second case (when using ANALYZER).

    3.3.6.3 TUNING OF FAMILY SCHEDULING

    Slave priorities can be set to a set of processes in the interactive processing domain (IPD) and also to each batch queue.

    If a large slave priority value is set, the tasks given family scheduling priorities are likely to get CPUs quickly. The waiting time at task synchronization points is then reduced, and the program runs efficiently with less spin-waiting. However, the multitasked program interferes with other running programs while family scheduling control is in effect.

    If a small slave priority value is set, the waiting time at task synchronization points may become larger due to weak family scheduling.

    For batch jobs, if a slave priority according to the queue base priority is set, multitasked jobs in the queue can run with the family scheduling priority based on the queue priority scheme.

    For example, suppose that the possible range of priorities for queue Q1 is 60 to 70 and the range for queue Q2 is 80 to 90. If you now set a slave priority of 10 or less to Q2, multitasked jobs in Q2 will never interfere with the jobs in Q1.

    Figure 3-3 Family Scheduling for Batch Jobs
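
    The bound of 10 in the Q1/Q2 example follows from simple arithmetic, sketched below. The helper max_safe_slave_pri() is a hypothetical name; it assumes that "never interfere" means the family scheduling priority of a Q2 job never becomes a smaller value than the worst priority in Q1.

```python
def max_safe_slave_pri(protected_worst: int, queue_best: int) -> int:
    """Largest slave priority that keeps FSpri from beating the protected queue.

    protected_worst: largest (worst) priority value in the queue to protect.
    queue_best: smallest (best) priority value in the queue being tuned.
    """
    return max(queue_best - protected_worst, 0)


q1 = range(60, 71)  # Q1 priorities 60..70
q2 = range(80, 91)  # Q2 priorities 80..90
limit = max_safe_slave_pri(max(q1), min(q2))
print(limit)            # 10
print(min(q2) - limit)  # 70
```

    With a slave priority of 10, the best family scheduling priority a Q2 job can reach is 70, which is no better than the worst priority in Q1.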

    Additionally, the setslavepri(2) system call has been prepared for the purpose of changing the slave priority of a running process or job. However, only the superuser can increase the priority. The NQS manager can also change the slave priority of a running request by the following qmgr(1M) function.

    This operation changes the slave priority of every process for the request specified as the requestid to the priority value.

    3.3.6.4 RELATIONS WITH SCHEDULING PARAMETERS

    In order to fix the family scheduling priority to 39 regardless of the 'nice' value, slave priorities satisfying the following condition are required to be set individually:

    where the scheduling parameters that appear are those of the scheduling group to which the target multitasked program belongs.

    For the default scheduling group, the default SLAVEPRI value of 120 satisfies the above condition.

    3.3.7 Gang Scheduling Function

    The scheduling function that simultaneously assigns CPUs to a parallel-processing program, such as a microtasking or MPI program, is called a gang scheduling function. This function makes it possible to run multiple parallel-processing programs in a system at almost the same efficiency as when a single program is executed.

    In SUPER-UX, the gang scheduling function is implemented by allowing or prohibiting the execution of parallel-processing programs at fixed intervals. This reduces system overhead. However, the gang scheduling function must be used with care. Use this function after understanding the following descriptions.

    The gang scheduling function is a program product and corresponds to the following program product number.

    3.3.7.1 FUNCTION

    The SUPER-UX gang scheduling function allocates N CPUs to each parallel-processing program subject to scheduling so that the program can use those CPUs at any time. The allocation is changed by rescheduling at a fixed interval (the scheduling interval).

    A parallel-processing program to which CPUs are allocated can use the allocated CPUs at any time. However, an allocated CPU that the program is not using can be used by another process that is not a parallel-processing program. (Note 1)

    In SUPER-UX, CPUs are reserved for a parallel-processing program. Therefore, it is necessary to define how many CPUs a parallel-processing program uses. Also, the system must not allocate more CPUs than the reserved number to a parallel-processing program during execution.

    In the gang scheduling function, a parallel-processing program and the number of CPUs used by a parallel-processing program are defined as follows.

    Microtask/macrotask

    A program that is created as a multitasking program. When the Control Number of concurrent processors exceeds 1, the microtask or macrotask program is scheduled as a parallel-processing program. The number of CPUs to be used equals the Control Number of concurrent processors.

    Supplement

    In NQS, the SEt CPu Count subcommand of qmgr sets the Control Number of concurrent processors in units of queues. The Control Number of concurrent processors can also be set by qsub assuming that the value set to the queue is the upper limit.

    MPI program

    Gang scheduling is performed for the program using MPPG in MPI/SX. Even if MPPG extends over nodes, gang scheduling is performed for the program in multinode by synchronizing nodes.

    In MPI/SX, a logical node is created in each node, and an MPPG consists of multiple logical nodes. An upper limit (hereafter referred to as the Control Number of concurrent processors of a logical node) is defined for the number of CPUs allocated to the whole process group included in a logical node.

    This value is defined as the number of CPUs to be used in each logical node. The number of CPUs to be used by the entire MPI program is the sum of the Control Numbers of concurrent processors of all its logical nodes.

    The Control Number of concurrent processors of a logical node equals the Control Number of concurrent processors (of a process) set by NQS. (Note 2)

    In gang scheduling, all processes included in one logical node must belong to the same CPU resource block (hereafter, CPU RB). If a logical node includes processes belonging to multiple CPU RBs, those processes are not treated as targets of gang scheduling.

    Only the parallel-processing programs described above that are executed on a CPU RB with the gang scheduling attribute set are targets of gang scheduling.

    3.3.7.2 RELATIONSHIP WITH CPU RESOURCE BLOCK

    The following CPU RB definitions are also applied to the gang scheduling function.

    Especially note the following.

    You can specify the upper limit of the total CPUs that are used by a parallel-processing program. It is the Gcpu value.

    You can also specify for each CPU RB whether it is a target of gang scheduling. If an RB is set as a gang scheduling RB, its Gcpu value must be at least 2 and no larger than its Max value. Otherwise, its Gcpu value must be 0.

    The upper limit of the total of CPUs that are used by parallel-processing programs in a system is the total of Icpu values of CPU RB having the gang scheduling attribute. (Note 3)

    3.3.7.3 SCHEDULING

    Besides the above mentioned conditions, the following conditions must also be satisfied to perform scheduling.

    3.3.7.4 RESTRICTIONS

    The default parallelizing degree when executing a conventional microtask is the Max value of CPU RB. The following condition must be satisfied so as to execute the microtask efficiently using the default parallelizing degree.

    Max value = Gcpu value = Control Number of concurrent processors (Number of CPUs to be used).
    

    If the number of CPUs to be used exceeds the Icpu value of CPU RB, CPUs are not allocated to the microtask primarily. In this case, the following condition must be satisfied in order to increase the RB throughput.

    Icpu value = Control Number of concurrent processors (Number of CPUs to be used)
    

    It is recommended that the CPU RB and NQS queue be set as follows.

    Max value = Gcpu value = Icpu value = Control Number of concurrent processors
    		(Number of CPUs being used)
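
    The conditions on a CPU RB can be collected into a small checker. The following is a sketch only: it encodes the Gcpu rule from Section 3.3.7.2 and the recommended setting above, and the CpuRB class, its field names, and check_rb() are hypothetical and do not correspond to any SUPER-UX command.

```python
from dataclasses import dataclass


@dataclass
class CpuRB:
    max_cpu: int  # Max value
    gcpu: int     # Gcpu value
    icpu: int     # Icpu value
    gang: bool    # gang scheduling attribute


def check_rb(rb: CpuRB, control_number: int) -> list:
    """Return a list of problems; an empty list means the recommended setup is met."""
    problems = []
    if rb.gang:
        if not 2 <= rb.gcpu <= rb.max_cpu:
            problems.append("gang RB: Gcpu must be between 2 and Max value")
        if not (rb.max_cpu == rb.gcpu == rb.icpu == control_number):
            problems.append("recommended: Max = Gcpu = Icpu = Control Number")
    elif rb.gcpu != 0:
        problems.append("non-gang RB: Gcpu must be 0")
    return problems


print(check_rb(CpuRB(8, 8, 8, True), 8))   # []
print(check_rb(CpuRB(8, 4, 8, True), 4))
print(check_rb(CpuRB(8, 2, 8, False), 4))
```

    The second and third calls each report one problem: a gang RB whose values are not all equal, and a non-gang RB with a nonzero Gcpu value.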
    

    If the gang scheduling attribute is set on the default CPU RB, interactive response may degrade severely. The response of daemons, including the NQS daemon, degrades similarly. It is therefore better not to set the gang scheduling attribute on the default CPU RB.

    If forced to set the gang scheduling attribute to the default CPU RB, the Gcpu value and the Control Number of concurrent processors must be less than the Icpu value, and the parallelizing degree of the microtask must be equal to the Control Number of concurrent processors.

    Note 1

    CPUs that are not used are scheduled by the CPU scheduler. A parallel-processing program is scheduled by the gang scheduler. So, the CPUs that are not used are used by another process that is not a parallel-processing program.

    Note 2

    Strictly speaking, the Control Number of concurrent processors of a logical node equals the Control Number of concurrent processors of the process that generated the logical node. When NQS is used, the Control Number of concurrent processors is set for a job, so the Control Number of concurrent processors of a logical node equals that of the job.

    Note 3

    If all CPUs are used for gang scheduling, the following Icpu guarantee for a CPU RB that does not have the gang scheduling attribute cannot be honored during an allocation interval.

      It is guaranteed that CPUs can be used anytime up to the Icpu value.

    Therefore, CPUs that are allocated to a CPU RB that does not have the gang scheduling attribute cannot be used for gang scheduling.

    3.4 MEMORY SCHEDULING

    This section describes the memory control functions specific to the SUPER-UX system, as well as how to specify the system parameters related to memory control.

    3.4.1 Basic Functions

    3.4.1.1 REAL MEMORY METHOD

    The SX series high-speed calculation capabilities can be attributed to its use of a real memory method. The real memory method is notable in that a process (all images existing in a virtual process space) to be executed must be placed in main memory in its entirety. The memory control of the SUPER-UX system is based on a swapping method, which transfers an entire process between main memory and swap area, thus reducing the system overhead. (Segments shared by multiple processes, such as texts and shared memory, however, are not subject to swapping.) If swapping is set up incorrectly, it may adversely affect system performance when the processing load is heavy.

    3.4.1.2 PAGE SIZES

    To enable both the execution of large-scale programs and the efficient use of main memory, the SX series can use four page sizes: 32KB, 1MB, 4MB, and 16MB. A 32-KB page is called a small-page, and 1-MB, 4-MB, and 16-MB pages are called large-pages. Small-pages are used for the U block (four or nine pages long), which every process requires, and for executing small-scale programs such as commands. Large-pages are used for executing large-scale programs, notably those coded in FORTRAN.

    On the SX-4, the size of a large-page is set to 1MB on a system which has a main memory capacity of 8GB or less, and is set to 4MB on a system which has a main memory capacity of more than 8GB.

    On the SX-5, 4-MB and 16-MB pages coexist in a system. A program using 8G layout or 32G layout uses 4-MB pages and a program using 512G layout uses 16-MB pages.

    The system management such as the resource block configuration treats large-pages in 1MB or 4MB units. Therefore, the system administrator does not need to take 16-MB pages into consideration at system design.
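
    The page-size rules above can be summarized in a short sketch. The function and table names are hypothetical; the sizes and thresholds are the ones stated in this section.

```python
KB = 1024
MB = 1024 * KB
GB = 1024 * MB

SMALL_PAGE = 32 * KB


def sx4_large_page(main_memory_bytes: int) -> int:
    """Large-page size on the SX-4: 1MB up to 8GB of main memory, 4MB above."""
    return 1 * MB if main_memory_bytes <= 8 * GB else 4 * MB


# SX-5: the page size depends on the memory layout the program uses.
SX5_LAYOUT_PAGE = {"8G": 4 * MB, "32G": 4 * MB, "512G": 16 * MB}


def pages_needed(size_bytes: int, page_bytes: int) -> int:
    """Number of whole pages required to hold size_bytes (rounded up)."""
    return -(-size_bytes // page_bytes)


print(sx4_large_page(8 * GB) // MB)       # 1
print(sx4_large_page(16 * GB) // MB)      # 4
print(pages_needed(10 * MB + 1, 4 * MB))  # 3
print(4 * SMALL_PAGE // KB)               # 128
```

    The last line shows that a four-page U block occupies 128KB of small-pages.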

    For details about the memory layout, refer to Section 1.2 in the User's Guide.

    NOTE

    A 16-MB page is sometimes specifically called a huge-page. However, the SUPER-UX memory management functions treat it as a large-page.

    3.4.1.3 SYSTEM CONSTANTS

    How to set the system constants related to the basic functions of memory control, as well as related notes, is described next.

    Figure 3-4 SPMEM

    Figure 3-5 Fixed Pages and Maximum Process Size

    3.4.2 Process Size Limitation

    The SUPER-UX system allows the process size to be limited for all processes by specifying system constants, or for each process or job. Process size limitation can avoid unnecessary swapping and the exclusive use of system resources by a certain process, thus improving the system operation efficiency.

    3.4.2.1 LIMITATION BY SYSTEM CONSTANTS

    The SUPER-UX system allows the maximum process size to be limited by using the following parameters.

    NOTE

    The default value virtually specifies no upper limit for a small-page process.

    3.4.2.2 LIMITATION BY RESOURCE LIMIT FUNCTION

    To limit the size of an interactive process, each user is required to specify the limitation in the /etc/userlim file, which is referenced whenever a user performs a login to the system. For details, see userlim(4) and logindefs(4) in the SUPER-UX Programmer's Reference Manual.

    The amount of memory used for a batch job can be limited by using the NQS function for each queue, each job, or each process. A limitation specified by using the resource limit function cannot exceed that specified by using the system constants described previously. For details, see the SUPER-UX NQS User's Guide.

    3.4.3 Swapping

    3.4.4 Memory Resident Time

    The Memory Resident Time (MRT) is the minimum time a process remains in memory, measured from when it is swapped in or has memory allocated until it can be swapped out. Setting an MRT value prevents swapping from occurring too frequently. The MRT value is given in seconds and is determined by the following formula.

    where,

    Adjust the preceding value in light of the performance of the device used for the swap file, the size of the jobs submitted for execution, and their priority.

    The following describes how to set each parameter.

    Figure 3-6 Swapping with Small MRT

    Figure 3-7 Swapping with Big MRT

    Normally, MRT is not applied to a process waiting for I/O termination or resources. However, if the MEMPRI value is smaller than the value obtained by subtracting 20 from the value of sprocmrth, MRT is applied even if the process is in sleep mode.

    The system scheduling parameter sprocmrth has an initial value of the system constant SPROCMRTH. The value can be changed by the system call setdispval(2) during operation.
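
    The rule for sleeping processes can be written as a simple predicate. This is a sketch of the condition as stated above, not the kernel's code; mrt_applies() is a hypothetical name, and the sprocmrth value of 30 in the example is chosen only for illustration.

```python
def mrt_applies(sleeping: bool, mempri: int, sprocmrth: int) -> bool:
    """Return True when MRT is applied to the process."""
    if not sleeping:
        # MRT normally applies to processes that are not waiting.
        return True
    # A sleeping process is covered only when MEMPRI < sprocmrth - 20.
    return mempri < sprocmrth - 20


# Hypothetical threshold: sprocmrth = 30, so the cut-off is MEMPRI < 10.
print(mrt_applies(sleeping=True, mempri=5, sprocmrth=30))    # True
print(mrt_applies(sleeping=True, mempri=20, sprocmrth=30))   # False
print(mrt_applies(sleeping=False, mempri=20, sprocmrth=30))  # True
```

    Raising sprocmrth with setdispval(2) therefore widens the set of sleeping processes that MRT keeps in memory.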

    3.4.5 Tuning Example

    This section gives an example for tuning with the function explained in Section 3.4.4.

    3.4.6 Checking and Tuning the Memory Use Status

    If a program aborts because of memory shortage, or swapping occurs very frequently when the system is used, checking the system use status and system tuning are recommended. This section explains how to check and tune the system use status related to memory and swapping.

    3.4.6.1 PROGRAM ABORTS DUE TO MEMORY SHORTAGE

    If a program aborts because of memory shortage, the following message is output to the console screen:

    If this message is output frequently, the system design probably has a problem. Check whether the following problems exist:

    3.4.6.2 THROUGHPUT DETERIORATED BY FREQUENT SWAPPING

    If swapping occurs frequently, the CPU time used by the system increases, which decreases throughput. To check how often swapping occurs, use the sar(1M) command. The following explains how to examine swapping status from sar information.

    For details of the sar command, see Chapter 5, System Activities.

    If these checks show that swapping occurs frequently, tuning should be performed as follows.