Configurable scheduler framework

The configurable scheduler framework has been part of Kerrighed trunk since 2008-08-22 and has been enabled by default since svn revision 4685 (2008-10-08). To learn about the principles of the scheduler framework, please follow the links at the bottom of this page.


Enabling the configurable scheduler framework

Kernel configuration

To (re-)enable the scheduler framework, select "Cluster support" --> "Kerrighed support for global scheduling" --> "Run-time configurable scheduler framework" (CONFIG_KRG_SCHED_CONFIG). To be able to mimic the legacy scheduler, also enable the "Compile components needed to emulate the old hard-coded scheduler" option (CONFIG_KRG_SCHED_COMPAT). This option builds, together with the main Kerrighed module, the scheduler components (kernel modules) that can be used to rebuild the legacy scheduler, as shown below.

To let the scheduler framework load components' modules automatically, select "Loadable module support" --> "Automatic kernel module loading" (CONFIG_KMOD). Otherwise, each component's module must be loaded manually on every node before the components it provides can be configured.

Note: the scheduler framework depends on ConfigFs, which is automatically selected when enabling the framework.
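
Before rebooting into a new kernel, the options above can be checked against its configuration file. The following is only a minimal sketch and not part of the Kerrighed tools: the function name check_krg_sched_config and the idea of passing the .config path as an argument are assumptions made for illustration; only the CONFIG_* symbols come from this section.

```sh
# Hypothetical helper (not shipped with Kerrighed): report which of the
# options named in this section are enabled in a kernel .config file.
# Usage: check_krg_sched_config /path/to/.config
check_krg_sched_config() {
    krg_cfg_file=$1
    for krg_cfg_opt in CONFIG_KRG_SCHED_CONFIG CONFIG_KRG_SCHED_COMPAT CONFIG_KMOD; do
        # An enabled option appears as "OPTION=y" (built in) or "OPTION=m" (module).
        if grep -q "^${krg_cfg_opt}=[ym]\$" "$krg_cfg_file"; then
            echo "$krg_cfg_opt: enabled"
        else
            echo "$krg_cfg_opt: not enabled"
        fi
    done
    # Only the framework itself is mandatory; the function succeeds if it is enabled.
    grep -q '^CONFIG_KRG_SCHED_CONFIG=[ym]$' "$krg_cfg_file"
}
```

The function prints one status line per option and returns success only when CONFIG_KRG_SCHED_CONFIG is enabled, since the two other options are conveniences rather than requirements.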


Nodes configuration

To configure schedulers, the nodes of the cluster must have ConfigFs mounted. For instance, you can add the following line to /etc/fstab:

configfs        /config         configfs        defaults        0 0

Choosing /config as the mount point is not mandatory, but for convenience this is the mount point assumed by the krg_legacy_scheduler script shown below.
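
To find out whether ConfigFs is already mounted on a node, and where, the mounts table can be inspected. The sketch below assumes a table in the /proc/mounts format; the function name configfs_mount_point is hypothetical and not part of Kerrighed.

```sh
# Hypothetical helper: print the mount point of the first configfs entry
# found in a mounts table (defaults to /proc/mounts); fail if there is none.
configfs_mount_point() {
    # Fields in /proc/mounts: device, mount point, filesystem type, options, ...
    awk '$3 == "configfs" { print $2; found = 1; exit }
         END { exit !found }' "${1:-/proc/mounts}"
}

# Example (as root): mount ConfigFs on /config only if it is not mounted yet.
#   configfs_mount_point >/dev/null || mount -t configfs configfs /config
```

If the function reports a mount point other than /config, the paths used by krg_legacy_scheduler would need to be adapted accordingly.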


Scheduler configuration

With the scheduler framework, nodes boot with no scheduler configured. Until a scheduler is configured, no task is automatically migrated and no task is remotely forked. The cluster must be started (for instance with krgadm cluster start) before any scheduler can be configured.

A scheduler is configured using simple filesystem operations under the krg_scheduler directory at ConfigFs' root. All operations are automatically replicated on all nodes. Only the root user can configure schedulers.
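
Because schedulers are plain directories, the current configuration can be inspected with standard tools. The following sketch lists the configured schedulers; the function name list_schedulers is hypothetical, and the schedulers sub-directory and the /config default are taken from the krg_legacy_scheduler script rather than from a documented interface.

```sh
# Hypothetical helper: print the name of each configured scheduler, one per
# line. The optional argument is the krg_scheduler directory (assumed to be
# /config/krg_scheduler, as in the krg_legacy_scheduler script).
list_schedulers() {
    krg_root=${1:-/config/krg_scheduler}
    for krg_dir in "$krg_root"/schedulers/*/; do
        # With no match the pattern stays unexpanded, so test for a directory.
        [ -d "$krg_dir" ] || continue
        basename "$krg_dir"
    done
}
```

On a freshly booted cluster the function is expected to print nothing, which matches the "no scheduler configured" state described above.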

The krg_legacy_scheduler script (installed together with the Kerrighed tools such as krgcapset and migrate) configures a scheduler that behaves like the legacy hard-coded scheduler: remote fork() targets are selected in round-robin order (per source node), and processes are migrated to balance the CPU load using a simplified version of MOSIX's algorithms.

#!/bin/sh

# Round robin balancing with remote clone
SCHEDULER_PATH=/config/krg_scheduler/schedulers/rr

mkdir -p "$SCHEDULER_PATH/round_robin_balancer" && \
echo 1 > "$SCHEDULER_PATH/process_set/handle_all"


# Dynamic Mosix-like CPU load balancing with migration
SCHEDULER_NAME=mosix
# 2 seconds between each migration
MIGRATION_INTERVAL=2000000000
# Refresh remote node loads every second
POLL_INTERVAL=1000

SCHEDULER_PATH="schedulers/$SCHEDULER_NAME"
POLICY_PATH="${SCHEDULER_PATH}/mosix_load_balancer"
MOSIX_PROBE_PATH=probes/mosix_probe
MIGRATION_PROBE_PATH=probes/migration_probe

cd /config/krg_scheduler && \
mkdir "$MOSIX_PROBE_PATH" && \
mkdir "$MIGRATION_PROBE_PATH" && \
mkdir -p "$POLICY_PATH" && \
mkdir "${POLICY_PATH}/local_load/freq_limit_filter" && \
echo $MIGRATION_INTERVAL > "${POLICY_PATH}/local_load/freq_limit_filter/min_interval" && \
mkdir "${POLICY_PATH}/local_load/freq_limit_filter/threshold_filter" && \
v=$(cat "${MOSIX_PROBE_PATH}/norm_single_process_load/value") && \
echo "$v + $v / 2" | bc > "${POLICY_PATH}/local_load/freq_limit_filter/threshold_filter/threshold" && \
mkdir "${POLICY_PATH}/remote_load/remote_cache_filter" && \
echo $POLL_INTERVAL > "${POLICY_PATH}/remote_load/remote_cache_filter/polling_period" && \
mkdir "${POLICY_PATH}/single_process_load/remote_cache_filter" && \
ln -s "${MIGRATION_PROBE_PATH}/last_migration" "${POLICY_PATH}/local_load/freq_limit_filter/last_event/migration" && \
ln -s "${MOSIX_PROBE_PATH}/process_load" "${POLICY_PATH}/process_load/mosix" && \
ln -s "${MOSIX_PROBE_PATH}/norm_upper_load" "${POLICY_PATH}/remote_load/remote_cache_filter/mosix_upper" && \
ln -s "${MOSIX_PROBE_PATH}/norm_single_process_load" "${POLICY_PATH}/single_process_load/remote_cache_filter/mosix" && \
ln -s "${MOSIX_PROBE_PATH}/norm_mean_load" "${POLICY_PATH}/local_load/freq_limit_filter/threshold_filter/mosix_mean" && \
echo 1 > "${SCHEDULER_PATH}/process_set/handle_all"
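
The threshold written by the script is the normalized load of a single process plus half of it. The expression is evaluated with integer arithmetic, since bc uses a scale of 0 by default, so the division truncates. The sketch below reproduces the computation with shell arithmetic; the function name mosix_threshold and the sample values are illustrative only, as the real input is read from the norm_single_process_load probe source at run time.

```sh
# Hypothetical helper mirroring the script's bc expression "$v + $v / 2":
# the division is evaluated first and truncates towards zero.
mosix_threshold() {
    echo $(( $1 + $1 / 2 ))
}

# Illustrative values (not measured on a real cluster):
#   mosix_threshold 1000   prints 1500
#   mosix_threshold 3      prints 4 (3 / 2 truncates to 1)
```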


Tuning schedulers

Configured schedulers have three classes of tunables:

  • ConfigFS attributes (look at all echo commands in the krg_legacy_scheduler script),
  • impacted processes (the process_set sub-directory of a scheduler directory),
  • and impacted nodes (all node_set* attributes of a scheduler directory).


ConfigFS attributes

Scheduler directories contain various attributes, some of which can be used to tune the scheduler's components. Default tunable attributes exist for probes (described below) and for the top scheduler component (described in the following sub-sections). Scheduler modules may also define custom ConfigFS attributes with custom meanings.

Probes, which appear as the sub-directories of /config/krg_scheduler/probes, have the default attribute probe_period. This attribute holds the period, in milliseconds, at which the probe should refresh and publish its data. Not all probes use this attribute.
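
Changing a probe's period amounts to writing a number into its probe_period attribute. The sketch below wraps that write with two sanity checks; the function name set_probe_period and its argument order are assumptions, and whether a given probe (for instance mosix_probe) actually honours probe_period is not stated here.

```sh
# Hypothetical helper: set the refresh period (in milliseconds) of a probe.
# Arguments: probe name, period in ms, optional krg_scheduler directory.
set_probe_period() {
    krg_probe_dir=${3:-/config/krg_scheduler}/probes/$1
    # The probe directory must already exist (created with mkdir beforehand).
    [ -d "$krg_probe_dir" ] || { echo "no such probe: $1" >&2; return 1; }
    # Accept only a non-negative decimal integer.
    case $2 in
        ''|*[!0-9]*) echo "invalid period: $2" >&2; return 1 ;;
    esac
    echo "$2" > "$krg_probe_dir/probe_period"
}

# Example (as root): refresh mosix_probe every 500 ms.
#   set_probe_period mosix_probe 500
```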


Process set

The process_set sub-directory of a scheduler determines which processes are controlled by this scheduler. Process sets are allowed to overlap, and no mechanism is provided to prevent conflicting decisions from different schedulers.

The process_set directory contains the handle_all attribute and three sub-directories: single_processes, process_groups, and process_sessions.

  • The handle_all attribute is a boolean that can be used to put all processes of the cluster under control of the scheduler. krg_legacy_scheduler does this.
  • If handle_all is false (i.e. 0), processes that should be controlled by the scheduler must be explicitly listed using the single_processes, process_groups, and process_sessions sub-directories. Each of these directories contains one sub-directory per PID (resp. PGID, SID) of the processes (resp. UNIX process groups, UNIX process sessions) explicitly controlled. An existing process (resp. process group, process session) is added by simply creating the sub-directory.
    Note: only already running processes can be added to single_processes. Creating an entry for a PID that does not exist is permitted, but a future process using this PID will not be controlled by the scheduler. Likewise, process_groups and process_sessions only act on entries that referred to existing process groups and sessions at the time they were added. However, any new process (resp. process group) in a controlled process group (resp. process session) will be controlled by the scheduler.
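
Adding a single process to a scheduler can be sketched as follows. The function name add_process_to_scheduler and its argument order are hypothetical; the directory layout follows the description above. The liveness check with kill -0 is an addition of this sketch, meant to avoid creating an entry for a PID that does not exist, and it leaves handle_all untouched.

```sh
# Hypothetical helper: put one running process under control of a scheduler.
# Arguments: scheduler name, PID, optional krg_scheduler directory.
add_process_to_scheduler() {
    krg_pset=${3:-/config/krg_scheduler}/schedulers/$1/process_set
    [ -d "$krg_pset/single_processes" ] || { echo "no such scheduler: $1" >&2; return 1; }
    # Only processes that are running right now can be added, so check first.
    kill -0 "$2" 2>/dev/null || { echo "no such process: $2" >&2; return 1; }
    # Creating the sub-directory named after the PID adds the process.
    mkdir "$krg_pset/single_processes/$2"
}

# Example (as root): let the "rr" scheduler control the current shell.
#   add_process_to_scheduler rr $$
```

Process groups and sessions would be added the same way, with a PGID under process_groups or a SID under process_sessions.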


Node set

  • The node_set attribute is used by mosix_load_balancer and round_robin_balancer as the set of nodes on which controlled processes are allowed to run. Other scheduling policies can use it differently. Its format is:
    <node_id_range>[,<node_id_range>[...]], with <node_id_range> being <node_id>[-<node_id>]. Default is all online nodes.
  • The node_set_exclusive attribute is a boolean that controls whether this node set can overlap other schedulers' node set or not. Default is false (0).
  • The node_set_max_fit attribute is a boolean that controls whether this node set automatically expands to added nodes or not. Default is true (1).
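
The node_set format described above can be checked before a value is written. The sketch below is an assumption-laden illustration: the function name is_valid_node_set is hypothetical, node ids are assumed to be non-negative decimal integers, and the path in the usage comment assumes that node_set sits directly in the scheduler directory created by krg_legacy_scheduler.

```sh
# Hypothetical helper: check that a string follows the node_set format
#   <node_id_range>[,<node_id_range>[...]]
# where <node_id_range> is <node_id>[-<node_id>].
is_valid_node_set() {
    printf '%s\n' "$1" | grep -Eq '^[0-9]+(-[0-9]+)?(,[0-9]+(-[0-9]+)?)*$'
}

# Example (as root): restrict the "rr" scheduler to nodes 0 to 3 and node 5.
#   is_valid_node_set "0-3,5" && \
#       echo "0-3,5" > /config/krg_scheduler/schedulers/rr/node_set
```

The check is purely syntactic; it does not verify that the listed nodes exist or are online.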


Documentation


Source code

  • In Kerrighed trunk.
  • Latest quilt patchset, used to merge in Kerrighed trunk: 2008-08-22.