Chris King, Scott Lathrop, Steve Molloy, Paul Moore
Department of Electrical Engineering and Computer Science
University of Michigan
Ann Arbor, MI 48109
Concerns about the scalability of multithreaded network servers running on Linux have prompted us to investigate possible improvements to the Linux scheduler. The purpose of our research is to improve the scalability of the Linux scheduler in order to prepare it for industrial-strength computational chores. Our problem focuses on determining why the time spent in the Linux scheduler increases with the number of threads executing in the system. We determine that this problem is a direct result of the scheduler’s task selection process. Every time the schedule function is called, the scheduler calculates a “goodness” value for each ready task—a Θ(n) operation, where n is the number of ready tasks. The task with the highest “goodness” value is the next task to execute.
Given this information, we focus on incrementally improving the existing scheduling architecture in the Linux kernel rather than fundamentally redesigning it. We propose a scheduler design alternative based on the static and dynamic portions of “goodness”, implement that design, and compare our implementation with the current Linux scheduler. Our solution demonstrates a 90% increase in performance when there are 1200 ready tasks in the system and compares favorably with the present scheduler when there are only a few ready threads.
The Linux Operating System
is increasingly being used not only as a personal computer (PC) operating
system, but also as a cost-efficient alternative for network servers, web
servers, distributed workstations, and other large-scale platforms. Its growing popularity for such large-scale tasks stems mainly from its low price/commodity ratio, but is also a result of its multitasking environment, stable platform, and familiarity to users of other Unix-like systems--specifically application developers. Many companies, such as IBM, are beginning to offer software
support for their products on Linux [3], while other organizations use Linux as
web
servers [6], network
print servers providing access to over 2000 printers [10], or as an
enterprise-class network service provider for their customers.
However, there is a problem facing wide-scale Linux deployment. The root of Linux’s problem is its inability to scale. The shortfall specific to the Linux scheduler is its inability to manage a large number of concurrent processes, such as is typically found in a high-load web server environment. Studies show that when there is a large number (defined as 1000) of threads running, up to 50% of the kernel’s time is spent in its scheduler.
To be fair, we must point out that Linux was not designed for operation in a large-scale environment. Its original target was the Intel 80386 personal computer [8]. We also wish to emphasize that the focus of our efforts is on scalability in terms of performance. Our question of interest is “how does the performance of the scheduler degrade as the load on it increases?”
We originally proposed to investigate and determine the reason(s) causing this bottleneck, recommend alternative designs, and implement a solution. Our implementation (ELSC) is based on the “static” and “dynamic” properties of a task’s goodness. ELSC improves the scalability of the Linux operating system showing a 90% increase in performance when there are 1200 ready threads as compared to the current scheduler.
The paper’s remaining contents are organized as follows. We begin in Section 2 by describing previous work surrounding our research. In Section 3, we detail specific scheduler data structures and algorithms that we find hinder scalability. These hindrances ultimately affect our design decisions. Section 4 discusses our design proposal and details the implementation. Section 5 shows the results of benchmark tests comparing the current scheduler with our implementation. Section 6 details potential future work, and in Section 7 we provide our conclusions.
The basis of our research begins with observations of
various Linux scalability problems by IBM and the Center for Information
Technology Integration (CITI) at the University of Michigan. IBM’s Linux Technology Center in Austin,
Texas observed problems with the Linux scheduler and thread architecture when
they measured the performance of their Java Virtual Machine (JVM) using the
VolanoMark benchmark. VolanoMark, a well-known Java benchmark that simulates a chat room application, is
characterized by long-lasting (i.e. several minutes) network connections and
high thread counts. [5]
IBM profiled two Linux kernels running VolanoMark on their Linux JVM implementation in order to better understand where performance bottlenecks occurred. Because their JVM implementation uses a one-to-one threading model (one kernel thread is created for each Java thread) and VolanoMark uses a pair of Java threads on each end of a chat room connection, four native Linux threads (or Linux “tasks”) are created for each client-server connection. As the number of threads increased from 400 to 2000, IBM’s kernel profiles revealed that the percentage of kernel execution time spent in the scheduler function alone increased from 17% to 25% [2]. These results indicated a more general scalability problem with the Linux kernel that would affect any heavily multithreaded application.
IBM
conducted further tests to evaluate the scheduler’s performance while running
VolanoMark. They inserted
instrumentation code to measure run queue length and found that, on average,
there were 30 threads in the queue but occasionally that number increased to
400 threads. They then reorganized the Linux “task” (i.e. thread) data structure so that fields used to compute a task’s “goodness” value were in the same cache lines. Since the “goodness” value of a task determines its potential for
scheduling and is calculated for every task each time the scheduler is called,
this adjustment reduced the number of cache accesses required in order to
calculate goodness. This structural
reorganization resulted in a 35% reduction in the “goodness” calculation time
and a 7% increase in VolanoMark throughput.
Based on these observations, IBM recommended that (1) the Linux
scheduler be modified to efficiently schedule a large number of tasks, and (2)
that Linux support a many-to-many threading model. We chose to focus our research on the scheduler.
CITI,
an applied research and development center, has been researching Linux
scalability since 1998. Their focus is
primarily on supporting greater network server loads, file descriptor
scalability, improving memory and TCP bandwidth, and multiprocessor
scalability. To date we are their lead
investigators of the scheduler’s scalability problem.
Other researchers have recently devoted considerable effort to designing real-time scheduling for Linux. Atlas [1] designed and implemented a statistical rate monotonic scheduler for periodic scheduling of real-time tasks in Linux. Wang [9] presents a general framework for scheduling real-time applications in their version of Linux using an allocator and dispatcher abstraction. Regehr proposes incorporating flexibility into the scheduler by using dynamically loadable scheduler modules to fit an application’s requirements [7].
Although this research is focused on the scheduler, none of it focuses
on the scheduler’s scalability issue.
The Linux discussion groups provide evidence that the scheduler function has been, and continues to be, investigated, with the most recent changes being made to a task’s scheduling policy. Similar to our findings on what causes the scheduler scalability problem, there has been previous discussion within the Linux community concerning the Θ(n) scheduling algorithm. Counterarguments cite evidence that even heavily used web sites show only one runnable process 90% of the time, and never more than about fifteen. We believe that this assumption is incorrect based on the evidence presented by IBM running a typical web server application—chat rooms. We show that a scheduler design that satisfies both a small and a large number of ready tasks is possible.
To
the best of our knowledge, besides the research identified in this section, no
other group has published work on the scheduler’s scalability problem or
proposed changes enabling it to handle a large number of ready tasks.
Unfortunately, because of Linus Torvalds’ original decision to keep the kernel simple and small, coupled with his belief that there are not “a lot of major new innovations in store for the kernel” [8], the Linux developer community is justifiably hesitant to incorporate any modifications that may induce extra overhead. Our work shows that such changes are justified and cost little.
This section discusses some of the data structures and algorithms implemented in the current scheduler as of linux-2.3.99-pre3 (24 March 2000) [11]. This information assists in explaining our observations and design decisions.
The Linux execution context is referred
to as a task. A task is responsible for
maintaining the state of all address space information. This state includes pointers to its address
space, the task state relative to the kernel, the processor state (i.e.
register usage), task statistics (used for memory management and resource limit
enforcement), task credentials, file system information (including file descriptors),
IPC data, signal handlers, and thread group tracking. Figure 1 shows the task structure’s key fields used in
scheduling.
The state
field holds the current state of a task.
A task can be in one of six states with TASK_RUNNING being the task’s
state when it is in the run queue. A
task’s counter indicates the time remaining in its current quantum, while a task’s policy is either SCHED_OTHER for user tasks, or SCHED_FIFO or SCHED_RR (round robin) for real-time tasks. Additionally, a task’s policy can be set to
SCHED_YIELD when an interrupt handler wants the task to yield its
processor. The field has_cpu is set to 1 when the task is currently executing and 0 otherwise; processor is the number of the processor on which the task last ran; and run_list contains the next and
prev pointers that enable the forming
of the run queue’s circular doubly linked list. The fields next_task
and prev_task allow for the linkage
of all tasks in the system and mm points to the task’s virtual memory
mapping.
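For reference, a minimal sketch of the scheduling-related task structure fields described above follows. It is modeled on struct task_struct in the 2.3.99-pre3 kernel, but the exact types, ordering, and surrounding members are approximations, and many fields are omitted.

    struct task_struct {
            volatile long       state;       /* e.g. TASK_RUNNING while on the run queue */
            long                counter;     /* time remaining in the current quantum    */
            long                priority;    /* static priority added into goodness      */
            unsigned long       policy;      /* SCHED_OTHER, SCHED_FIFO, SCHED_RR        */
            int                 has_cpu;     /* 1 while the task is currently executing  */
            int                 processor;   /* processor the task last ran on           */
            struct list_head    run_list;    /* next/prev links forming the run queue    */
            struct task_struct *next_task;   /* linkage of all tasks in the system       */
            struct task_struct *prev_task;
            struct mm_struct   *mm;          /* the task's virtual memory mapping        */
            /* ... statistics, credentials, file system info, IPC data, signals ... */
    };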
Linux, unlike other UNIX implementations, does not package kernel threads into a lightweight process. Instead, every thread is a process as far as the kernel is concerned. This means that every Java thread created in a one-to-one threading model using Linux native threads is mapped to a user process and shows up as such if you use the process status (ps) command. This gives rise to possible alternatives for grouping these tasks.
The run queue in Linux is a circular doubly linked list containing all the tasks in the TASK_RUNNING state. Each call to schedule() results in a traversal of the entire linked list. This traversal is necessary in order to calculate a goodness value for each task in the run queue. This goodness value determines which task will run next and lies roughly in the range of ±1000, with +1000 and above indicating a real-time task and -1000 being the lowest. If the highest calculated weight is zero, the scheduler recalculates the tasks’ counters and repeats the “goodness” search. This is a Θ(n) operation, where n is the number of tasks in the run queue.
Unfortunately the run queue’s simplicity is its undoing when a large number of tasks are on the queue. We believe that if the Linux scheduler is to scale well, we have to implement a replacement that allows constant, or near-constant, time insertion and lookup of tasks.
The goodness() function returns a value associated with a task’s relative desirability for scheduling. The calculation depends on the following factors. First is processor affinity. If the last processor the task ran on is the processor for which we are scheduling, the task is given a significant advantage because it may still have some of its memory lines in that processor’s cache.
The second factor affecting the goodness() calculation is the task’s memory map address. If a task shares the same address space as the previously running task, then the task’s goodness value is increased by one because of the reduced context switch overhead involved.
The third and fourth factors affecting
the calculation are the task’s counter value and its priority. If its counter value is zero (i.e. its time
quantum is expended), that task’s goodness value is set to zero and the goodness()
function returns to the scheduler function with no further calculations. Otherwise the task’s priority and counter
values are added to its goodness value.
Finally, a task’s policy plays a role
in its goodness value. Any real time
task is given the maximum value of 1000 plus its real time priority in order to
guarantee scheduling unless another real time task is in the run queue with a
higher priority.
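To summarize the four factors, the following is a simplified sketch of the goodness() calculation as described above. It is paraphrased from the 2.3.99-pre3 kernel; the rt_priority field, the PROC_CHANGE_PENALTY affinity bonus, and the SMP guard are approximations of the real code, and corner cases such as SCHED_YIELD are omitted.

    /* Simplified sketch of goodness(); constants and SMP details approximate. */
    static int goodness(struct task_struct *p, int this_cpu, struct mm_struct *this_mm)
    {
            int weight;

            /* Real-time tasks: 1000 plus their real-time priority. */
            if (p->policy != SCHED_OTHER)
                    return 1000 + p->rt_priority;

            /* Expended quantum: return zero with no further calculation. */
            weight = p->counter;
            if (!weight)
                    return 0;

    #ifdef __SMP__
            /* Processor affinity: large bonus if the task last ran here. */
            if (p->processor == this_cpu)
                    weight += PROC_CHANGE_PENALTY;
    #endif
            /* Shared address space: +1 for the cheaper context switch. */
            if (p->mm == this_mm)
                    weight += 1;

            /* Add the static priority. */
            weight += p->priority;
            return weight;
    }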
We believe that this goodness()
function considers the appropriate factors to make an intelligent scheduling
decision. However, it is not necessary,
and in fact wasteful, to calculate goodness() for each ready task
whenever the scheduler is invoked.
The Linux kernel function schedule(), as in other operating systems, is called from over 500 other functions within the Linux kernel, indicating its significance to overall system performance. The scheduler function first executes any active bottom halves. Bottom halves are functions that are too substantial to run during an interrupt. Upon an interrupt, the interrupt handler performs a minimum amount of work and saves the necessary state so that the rest of the work can execute later in a bottom-half handler. We decided not to modify any bottom-half code because it would require extensive changes that do not affect our goal of improving the scheduler’s scalability.
After handling some additional administrative work while interrupts are turned off, the scheduler executes the heart of its code—a while loop that traverses the entire run queue (Fig 2). This loop is the bottleneck in our system and is what we focused on eliminating.
Linux schedule() function loop

    while (tmp != &runqueue_head) {
            p = list_entry(tmp, struct task_struct, run_list);
            if (can_schedule(p)) {
                    int weight = goodness(p, this_cpu, prev->active_mm);
                    if (weight > c)
                            c = weight, next = p;
            }
            tmp = tmp->next;
    }

Fig 2
Based on our previously discussed observations, we decided on five criteria to guide our design decisions and evaluation. These design criteria are listed below in priority order, followed by a brief discussion of our reasons for choosing each criterion.
1. Goodness is good, so use it.
2. KISS (Keep It Simple, Stupid): keep the implementation simple in order to limit overhead and kernel size.
3. Make the common case fast: must not change performance when there are only a few runnable tasks in the system.
4. Must limit changes to current task scheduling behavior.
5. Ignore real-time tasks and other task policies for now.
We respect the goodness() calculation because not only does it account for a task's priority, but goodness() also considers its processor affinity and address space. We believe that these considerations benefit the system more than any extra cycles we could save by optimizing the function. In order to test our theory that goodness() is not the system’s bottleneck, we implemented a patch simplifying its calculation and ran preliminary benchmarks. We saw no change in the benchmarks’ performance. Furthermore, IBM’s experiments showed that by modifying the scheduler algorithm to take the first task in the run queue rather than calculating goodness() for each task, time in the scheduler is reduced, but so is system throughput [2]. Therefore, we use the factors in goodness() to guide scheduling decisions.
What we did observe about the goodness value, however, is that its calculation consists of two parts: (1) a static part and (2) a dynamic part. The static part of the goodness calculation is a task’s priority and its counter. Although these values may change while the task is executing, their state is constant while the task is in the run queue awaiting processor dispatch. We call these two values the static part of goodness.
Processor affinity and shared address space are parts of the goodness() calculation dependent on the freed processor and the previously running task. Since the possibility exists that this portion of a task’s goodness calculation could change while the task is waiting in the run queue for processor dispatch, we call these two values the dynamic part of goodness.
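To make this split concrete, the sketch below separates the calculation into the two parts just described. The helper names static_goodness() and dynamic_goodness() are our own illustrative names, not kernel functions, and the PROC_CHANGE_PENALTY bonus is the same approximation used earlier.

    /* Static part: depends only on the task itself, so it remains valid
     * while the task waits in the run queue. */
    static int static_goodness(struct task_struct *p)
    {
            return p->counter + p->priority;
    }

    /* Dynamic part: depends on the freed processor and on the address
     * space of the task that last ran there. */
    static int dynamic_goodness(struct task_struct *p, int this_cpu,
                                struct mm_struct *this_mm)
    {
            int bonus = 0;

    #ifdef __SMP__
            if (p->processor == this_cpu)
                    bonus += PROC_CHANGE_PENALTY;   /* processor affinity */
    #endif
            if (p->mm == this_mm)
                    bonus += 1;                     /* shared address space */
            return bonus;
    }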
Because the Linux community is adamant about limiting overhead and pays meticulous attention to the details of cache alignment and constant-time execution in other areas of the scheduler, we state up front that our goal is to limit the overhead of our implementation and keep it as simple as possible. This criterion caused us to discard many initially well-conceived ideas.
Another principle we wish to maintain is the running time of the operating system in the common case—that is, when there is a single user using Linux on a desktop PC. Since this is the reason Linus created Linux and the reason so many developers are committed to maintaining it, we do not wish to compromise its excellent performance when there are only a few ready tasks in the run queue.
We desire to limit any changes to scheduling behavior so that current applications running on Linux will not see any drastic modifications to their task’s ordering. That is, tasks should be scheduled in relatively the same order that they currently are.
Finally, and probably subject to the most criticism, is the fact that we chose to ignore real-time tasks and the other task scheduling policy possibilities for now. The reasons are simple. After studying the kernel, we found no indication that policies other than SCHED_YIELD were ever assigned or changed anywhere else in the kernel. Only SCHED_YIELD is modified to indicate the yielding of a processor, but this value does not affect the current goodness calculation. Furthermore, these policies are inconsistent in how a task is placed in the run queue. In the current implementation, when a process is woken up it is always placed at the front of the run queue, even though its position has no bearing on whether the scheduler will select it as the next task to run. However, the round-robin policy moves a task to the end of the run queue when its counter is zero. This movement makes no sense because, as noted above, every task is re-examined anyway to recalculate its “goodness” for task selection.
Our understanding is that these fields and policies are parts of the scheduler still in the developmental stage and mainly used when Linux is running in a real-time environment. Since the policy field can be considered as part of the static goodness, it would be trivial to add it to our design although we do not know at this time what the side effects may be. Finally, because our goal is to improve the scheduler’s scalability and not fine-grained scheduling of specific type tasks, we decided to disregard a task’s policy when making scheduling decisions.
Although these criteria are aggressive, and perhaps contradictory in some ways, we believe that, in order to satisfy the vast number of Linux users, we have to meet our top three criteria if we want the Linux community to accept our recommended changes.
In order to obtain a running time that is less than linear in the number of tasks in the system, we had to either divide the run queue in some fashion using a variation of a multilevel feedback queue, or group together similar tasks that share the same address space. Then, rather than compute goodness() for all tasks in the system, we compute it for a subset of tasks--knowing that the goodness values of the remaining tasks are not high enough for selection in that scheduling round anyway.
Our design, ELSC, revolves around using the static
and dynamic parts of goodness to create this division of tasks.
Rather than store the tasks in a circular doubly linked list, we use a chained hash table as the run queue. The tasks
are indexed into the table by the most significant bits of their total static goodness values (Fig 3).
The hash function takes into account the task’s counter and
priority—hashing it to one of 512 buckets.
The function is very simple, consisting of two shifts and an
addition. The priority is shifted 22
bits to the left then added to the counter.
The sum is then shifted 23 bits to the right. We chose this hash because it is inexpensive in terms of execution cost, and the bucket a task hashes to signifies its desirability. In order to reduce lookup time when choosing a task to run, we keep an index to the highest occupied bucket.
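A minimal sketch of the hashing and insertion steps follows, using the shift constants stated above. The names elsc_hash, elsc_enqueue, run_hash, and top_bucket are illustrative, not the actual patch’s identifiers; the real implementation wraps these operations in the existing run queue functions.

    #define ELSC_BUCKETS 512                       /* 2^9 buckets */

    static struct list_head run_hash[ELSC_BUCKETS];
    static int top_bucket;                         /* highest occupied bucket */

    /* Hash a task by its static goodness: two shifts and an addition. */
    static inline int elsc_hash(struct task_struct *p)
    {
            return ((p->priority << 22) + p->counter) >> 23;
    }

    /* Insert a task at the tail of its bucket and track the top bucket. */
    static inline void elsc_enqueue(struct task_struct *p)
    {
            int bucket = elsc_hash(p);

            list_add_tail(&p->run_list, &run_hash[bucket]);
            if (bucket > top_bucket)
                    top_bucket = bucket;
    }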
Each bucket in the table contains a circular doubly linked list of tasks built using the current implementation’s functions. Each task in a list has the same upper nine bits of static goodness, where 2^9 is the hash table size. Upon entering the run queue or after completion of its CPU burst, the task’s hash value is determined and the task is inserted at the tail of that bucket’s list.
The schedule() function then traverses the table’s top list, calculating the dynamic goodness of each task in that bucket and selecting the task with the highest dynamic goodness. If all tasks in the top bucket are equally desirable for selection, then the scheduler chooses the first task for execution, since it is the oldest task in the top bucket. In order to protect against the worst case, that is, when all ready tasks hash to the same bucket, we bound the search to the first 16 tasks. Finally, if all tasks currently in the top bucket are executing or have a counter value of zero, the scheduler searches the next lower bucket for potential tasks and continues this operation until it finds a task it can dispatch, giving preference to tasks in the highest bucket. Once again, in order to limit the search through all of the buckets, a bound is set at two buckets.
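Task selection might then look roughly like the following sketch, reusing the run_hash/top_bucket structures and the dynamic_goodness() helper from the earlier sketches; the bounds of 16 tasks and two buckets are those stated above, and the helper name elsc_select is ours.

    /* Sketch of ELSC task selection; NULL means run the idle task. */
    static struct task_struct *elsc_select(int this_cpu, struct mm_struct *prev_mm)
    {
            struct task_struct *best = NULL;
            int best_weight = -1;
            int bucket, scanned_buckets = 0;

            for (bucket = top_bucket; bucket >= 0 && scanned_buckets < 2; bucket--) {
                    struct list_head *tmp;
                    int scanned_tasks = 0;

                    if (list_empty(&run_hash[bucket]))
                            continue;
                    scanned_buckets++;

                    for (tmp = run_hash[bucket].next; tmp != &run_hash[bucket]; tmp = tmp->next) {
                            struct task_struct *p = list_entry(tmp, struct task_struct, run_list);
                            int weight;

                            if (++scanned_tasks > 16)
                                    break;        /* bound the worst case */
                            if (p->has_cpu || !p->counter)
                                    continue;     /* running elsewhere or quantum expended */

                            weight = dynamic_goodness(p, this_cpu, prev_mm);
                            if (weight > best_weight) {   /* ties keep the oldest task */
                                    best_weight = weight;
                                    best = p;
                            }
                    }
                    if (best)
                            break;                /* prefer the highest occupied bucket */
            }
            return best;
    }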
Another way to look at this design is as a radix sort on the runnable tasks’ static goodness values. All tasks sorted into the highest bin are candidates for selection in the next scheduling cycle, with the task having the highest dynamic goodness being selected. The general algorithm for our design is then:
(1) Remove the interrupted task from the run queue.
(2) Calculate the static goodness of the interrupted task.
(3) Re-insert the interrupted task and update the top pointer if necessary.
(4) Select the next task from the top hash bucket based on the highest dynamic goodness and/or using the previously mentioned heuristics. If all ready tasks are executing, run the idle task.
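Tying the four steps together, the overall flow might look like the skeleton below, again using our illustrative helpers; idle_task() stands in for however the idle task is obtained, and locking and bookkeeping are omitted.

    /* Skeleton of the ELSC scheduling steps listed above (locking omitted). */
    static struct task_struct *elsc_schedule(struct task_struct *prev, int this_cpu)
    {
            struct task_struct *next;

            /* (1)-(3): remove the interrupted task, recompute its static
             * goodness via the hash, and re-insert it at the tail of its
             * bucket, updating the top pointer if necessary. */
            if (prev->state == TASK_RUNNING) {
                    list_del(&prev->run_list);
                    elsc_enqueue(prev);
            }

            /* (4): select from the top bucket(s) by dynamic goodness. */
            next = elsc_select(this_cpu, prev->active_mm);
            if (!next)
                    next = idle_task(this_cpu);   /* all ready tasks are executing */
            return next;
    }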
How does this design meet our criteria? Its advantages are that we continue to use
the idea of goodness to select the next task, keep the run queue data structure
relatively simple, and do not change the performance of the system when there
are a small number of runnable tasks.
Its disadvantage is that it may not exactly replicate the scheduling behavior of the current system, although this is difficult to determine. We believe that even though the scheduler behavior may have changed, the change is for the better. Our tests show that the current Linux scheduler is not always fair under certain conditions, but that our implementation promotes fairness among equally desirable tasks. We discuss these findings in Section 5.
The difference between our design and the current scheduler’s goodness calculation is that we chose not to recalculate the static part of goodness on each call to schedule(). Instead, static goodness is calculated only when a task enters the run queue after yielding its processor. Dynamic goodness is then the final selection criterion for determining the next task to dispatch. The number of goodness calculations is reduced to the number of tasks in the highest bucket, plus the one static goodness calculation for the previously running task. Furthermore, the goodness calculation performed when selecting a task is not the full goodness calculation, but rather only the dynamic portion of goodness.
Our implementation is kept relatively simple. Since our design initially allocates memory for the maximum hash table size, the only additional overhead between our data structure and the current implementation is that we introduce 512 (2^9) doubly linked list sentinels (8 bytes each) into the structure, for an additional overhead of 512 * 8 = 4K bytes. This allows the hash table to fit nicely onto one virtual memory page and results in only a 7% increase in the scheduler size.
Since the shifting of bits is a fast operation, our hashing function is efficient. The insertion and deletion of tasks is a trivial operation.
We wrapped our hash insertion and deletion functions with the existing
scheduler run queue functions.
In keeping with our original criteria, the new scheduler is
comparable to the current scheduler for everyday desktop use. In fact we used the scheduler during the
last few days of development work and did not feel any noticeable difference in
terms of speed or interactivity.
Informal tests involving kernel compiles showed the two schedulers to be
statistically the same.
An average running time analysis of our algorithm yields a Θ(1 + n/2^k) operation, where 2^k is the hash table size. Because we know that all tasks in the top bucket have the same upper k bits of static goodness, we limit the search of the top bucket to a constant number of tasks. Thus, our algorithm defaults to a constant-time operation—much better than Θ(n). Our design defaults to the current scheduler’s algorithm in the case when there are only a small number of ready tasks.
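As a rough illustration at the scale of our largest benchmark, and assuming the 1200 ready tasks are spread across the 2^9 = 512 buckets, the expected cost per call is

    Θ(1 + n/2^k) = Θ(1 + 1200/512) ≈ Θ(3.3),

that is, a handful of dynamic goodness calculations per invocation of schedule() rather than 1200 full goodness calculations, even before the 16-task search bound applies.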
Using our design criteria, the only possible disadvantage
to our system is that, as of now we cannot guarantee that task scheduling
behavior is exactly the same as the current scheduler although we argue that it
is not important. The only difference
between our design and the current scheduler is that if a task hashes to a
bucket significantly lower than the top bucket, the scheduler will not select
it as the next task to run even if processor affinity would give it enough
“points” to make it more desirable than other tasks in the top bucket.
Still, we believe the ELSC scheduler exhibits more correct
behavior by grouping tasks based upon time quantum and desired priority and
then selecting from within that group a task that can be executed most
efficiently. We could easily modify our
scheduler to hash based upon address space and processor affinity--grouping the
tasks into levels of efficiency; then, within a group, selecting a task based
upon desirability.
In order to compare the scalability of the current scheduler and our implementation we tested five benchmarks using the Linux operating system version 2.3.99-pre3 and our version 2.3.99-pre3-elsc. Version 2.3.99-pre3 is the most recent version of Linux as of 24 March 2000 [11].
We used two machines for our tests. The first is a generic desktop computer with an AMD K6-2 400 MHz processor. The second machine is an SMP computer with four 500 MHz Pentium III Xeon processors.
In analyzing our results we desired to answer two questions: does our scheduler show the same performance when there are a small number of threads in the system and does it scale when there are a large number of threads?
The remaining paragraphs in this section give a description of each benchmark, the purpose for the benchmark, followed by the results obtained running the benchmark on both the current scheduler and our implementation.
5.1 gcc Compiler.
Our first benchmark is the gcc
compiler. The benchmark measures the
time to compile the Linux kernel in order to determine how our scheduler
compares with the original scheduler in the common case—that is when there are
only a few tasks running in the system.
Tables I and II show our results.
Time Breakdown        2.3.99-pre3    2.3.99-pre3-elsc
User Time (secs)      443.29         444.09
System Time (secs)    35.73          35.33
CPU Time              99%            99%
Total Time            08:00.0        08:00.1

Table I - Uniprocessor
Time Breakdown        2.3.99-pre3    2.3.99-pre3-elsc
User Time (secs)      262.78         265.05
System Time (secs)    19.61          19.03
CPU Time              332%           332%
Total Time            01:24.9        01:25.4

Table II - Symmetric MultiProcessor
The tables show almost identical performance between the schedulers, with the current scheduler compiling the kernel slightly faster than our implementation. However, the total time spent in the kernel is interesting. On both the uniprocessor and the SMP machine, our implementation spends less time in the system, indicating that its scheduling algorithm is running slightly faster than the current version.
5.2 Thread
Spawning.
The second benchmark spawns 20 to 1000 threads and counts the number of times they yield in one second. To ensure that all threads are on the run queue when the benchmark begins, all threads RSVP via a semaphore to a control thread. Those threads then proceed to wait for a pipe file descriptor by calling select(). Thus, all threads have similar thread control blocks when they begin execution, as if they had all executed the same code. Once all the threads have RSVPed, the control thread gets the current time and writes a byte to the pipe. This action wakes up all threads and places them into the ready state simultaneously. An alarm handler signals the finish time by setting a shared stop byte that stops the benchmark.
The benchmark measures the average number of microseconds it takes the
kernel to handle a yield call.
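The structure of this benchmark might look roughly like the pthreads sketch below. It is our reconstruction from the description above (semaphore RSVP, select() on a pipe, an alarm-driven stop flag), not the original benchmark source, and error handling is omitted.

    /* Reconstruction of the thread-spawning benchmark described above:
     * threads RSVP on a semaphore, block in select() on a pipe, then call
     * sched_yield() in a loop until an alarm sets a shared stop flag. */
    #include <pthread.h>
    #include <semaphore.h>
    #include <sched.h>
    #include <signal.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/select.h>

    #define NTHREADS 1000

    static sem_t rsvp;                    /* workers signal readiness here  */
    static int pipefd[2];                 /* control thread writes one byte */
    static volatile sig_atomic_t stop;    /* set by the alarm handler       */
    static long yields[NTHREADS];         /* per-thread yield counts        */

    static void alarm_handler(int sig) { stop = 1; }

    static void *worker(void *arg)
    {
            long id = (long)arg;
            fd_set rfds;

            sem_post(&rsvp);                      /* RSVP to the control thread */
            FD_ZERO(&rfds);
            FD_SET(pipefd[0], &rfds);
            select(pipefd[0] + 1, &rfds, NULL, NULL, NULL);  /* wait for start byte */

            while (!stop) {                       /* yield until the alarm fires */
                    sched_yield();
                    yields[id]++;
            }
            return NULL;
    }

    int main(void)
    {
            pthread_t tid[NTHREADS];
            long i, total = 0;

            sem_init(&rsvp, 0, 0);
            pipe(pipefd);
            signal(SIGALRM, alarm_handler);

            for (i = 0; i < NTHREADS; i++)
                    pthread_create(&tid[i], NULL, worker, (void *)i);
            for (i = 0; i < NTHREADS; i++)
                    sem_wait(&rsvp);              /* wait for every RSVP  */

            alarm(1);                             /* run for one second   */
            write(pipefd[1], "g", 1);             /* wake all the workers */

            for (i = 0; i < NTHREADS; i++) {
                    pthread_join(tid[i], NULL);
                    total += yields[i];
            }
            printf("%ld yields in 1 second (~%.2f us per yield)\n",
                   total, total ? 1e6 / (double)total : 0.0);
            return 0;
    }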
The purpose of this benchmark is to see how fast the scheduler makes decisions as the number of threads increases. The results, shown in Figure 4, list the current Linux scheduler as “base” and our implementation as “elsc”. The results clearly indicate that the current scheduler runs in linear time while ELSC runs in constant time. This verifies earlier experiments by IBM showing that the current Linux scheduler does not scale. Additionally, when there are fewer than 50 ready tasks in the run queue, both schedulers show similar performance.
Furthermore, while running this benchmark we also observed that ELSC made fairer scheduling decisions as compared to the current scheduler—that is the number of yield calls made per thread was relatively consistent with ELSC, while in the current scheduler certain threads dominated the number of yield calls made.
5.3 Talking
Threads.
The third benchmark spawns an increasing number of server processes operating in separate address spaces. Each server process then spawns ten threads executing within the same address space as the server process. Those ten threads send a byte to the server thread at random intervals. The server thread receives each byte and then echoes it to the ten threads.
As with the previous benchmark, this benchmark uses the
RSVP/select( ) method to ensure all threads are on the run queue prior to
starting the timer. In order to spawn
1000 threads, their creation required regulation with semaphores. The benchmark measures the average number of
microseconds required to generate a message.
The purpose of this benchmark, as with the previous benchmark, is to measure how each scheduler handles an increase in the number of threads. This benchmark also stresses the address space portion of the dynamic goodness calculation, since it creates multiple tasks operating in separate address spaces. Finally, because of the number of synchronization elements used, the benchmark also increases confidence that ELSC does not disturb other areas of the operating system.
Even with all these ongoing actions, ELSC still handles 500 more threads than the current scheduler version (Figure 5). The results again show that our scheduler runs in relatively constant time—even with an increase in the number of ready tasks in the system. The slight slope is likely due to the increased number of servers broadcasting their messages back to their threads. Once again, when there are only a small number of ready tasks, the two schedulers show similar performance.
5.4 Java Counting
Threads.
The fourth micro-benchmark is written in Java. This program takes as input the number of
Java threads to create. The benchmark
first creates all the threads, then each thread starts incrementing a
counter. When a thread’s counter
reaches 10,000 it stops. The count is
high enough so that a thread’s time quantum expires at least once prior to
reaching 10,000--forcing a minimum of one yield per thread during the
simulation. The program stops executing
when all threads have counted to 10,000, and it records the total time for all
threads to reach this limit. We used
IBM’s Java Development Kit (JDK) 1.18 for Linux to obtain our results since
their JVM uses native threads in the implementation of Java threads. Java is run with a maximum heap size of 256
MB in order to minimize garbage collection time. Similar to the second and third benchmarks, this benchmark
stresses the scheduler because it forces it to choose between large numbers of
threads, resulting in a large run queue.
Since this is a Java implementation the JVM/kernel interface is also
stressed.
The benchmark results are shown in Table III and Figure 6. We ran this benchmark on the uniprocessor machine only, to obtain some initial results on the performance of our scheduler in a JVM/kernel environment. The graph shows that between 10 and 800 threads ELSC performs slightly better than the current scheduler. However, the graph clearly indicates that when 1000 threads are running, ELSC scales much better.
Number of Threads    Elapsed Time (msecs)
                     base       elsc
10                   12         11
100                  38         26
200                  53         34
400                  85         57
800                  114        108
1000                 645        135

Table III - Java Counting Threads
5.5 VolanoMark.
The final benchmark we ran is VolanoMark. As discussed in Section 2, VolanoMark is a well-known Java benchmark currently used to compare JVM implementations because it simulates a real-world scenario—chat rooms. The benchmark creates client connections in groups of twenty. After establishing an initial connection to the server, the clients send messages to the server. The server receives messages from each client and then broadcasts them so that every client can see the other clients’ messages. VolanoMark measures how long it takes for clients to broadcast their messages to the group. At the end of the test, it reports a score as the average number of messages transferred by the server per second [5].
Because we ran the benchmark using IBM’s JVM, VolanoMark also tested the schedulers’ ability to service the client and server threads. We ran all VolanoMark tests in loopback mode in order to factor out any network latency, and only ran it on the SMP machine because page swapping was dominating the time on the uniprocessor. Additionally, we set each thread’s stack size to 256K and the maximum heap size to 256 MB in order to minimize the frequency of Java’s garbage collection (java -ms 64MB -mx 128MB).
Each VolanoMark room creates 80 threads (20 clients per room x 2 threads/client x 2 threads/server). We varied the number of rooms from 1 to 15, resulting in 80 to 1200 threads, and had each client send 100 messages. The total elapsed time for each thread to send 100 messages and the total message throughput are shown in Figures 7 and 8 and Tables IV and V.
Again, as witnessed with the previous Java micro-benchmark, our scheduler performs slightly better than the current scheduler when there are fewer than 800 threads, and scales considerably better than the current Linux scheduler, showing almost a twofold speedup and a 90% message throughput improvement when there are 1200 threads running in VolanoMark.
Number of Threads    Elapsed Time (secs)
                     pre3       elsc
80                   4.58       4.55
160                  15.31      11.86
400                  63.30      41.51
800                  195.78     126.34
1200                 438.55     224.37

Table IV - VolanoMark Elapsed Time
Number of Threads    Average Throughput (messages/sec)
                     pre3       elsc
80                   8737       8795
160                  5337       6744
400                  3162       4818
800                  2046       3166
1200                 1387       2674

Table V - VolanoMark Throughput
Our results running a real-world benchmark show that our scheduler performs as well as, and perhaps slightly better than, the current scheduler when there are a small number of ready threads, and scales much more gracefully when there are more than 400 threads in the system.
Based upon what we have learned in
implementing the ELSC scheduler, we feel that there are two opportunities for
future work: further optimization and a
slightly different approach. We would
like to first optimize the ELSC scheduler. There are a few implementation
constants, such as search bounds, that we chose arbitrarily. A closer study of
running systems would allow us to fine tune these parameters and yield the most
effective system. Other optimizations
might involve the use of an extensible hash function to better scale the
structure for both a large and small number of running tasks. Finally, we could improve the implementation
through careful code inspection, optimizing for both uniprocessor and SMP
machines.
Another design proposal, that we
considered but did not have time to implement, is to sort the ready tasks
within each hash bucket by an address space/processor combination in an attempt
to "predict" the dynamic
portion of a goodness value. Further,
we would order each run queue using a priority queue, or similar
structure. This design has the
advantage of extremely fast lookups and comparisons. However, it is our opinion that the overhead required to maintain
the structures correctly may outweigh the advantages. We do believe though, that
“everything is worth trying once”, and the Linux scheduler is no
exception.
7 Conclusions
An
increasing number of organizations continue to evaluate, test, and use the
Linux operating system to address their computer system requirements because of
its low price/commodity ratio and ease of upgrading upon new releases. Several of these organizations are large
corporations and Internet service providers, interested in using Linux as their
corporate operating system and web server.
We have shown in this paper, however, that when the Linux scheduler is confronted with a large number of ready tasks, overall system performance and user responsiveness rapidly decline. In a large-scale enterprise or web server environment this decrease in performance is unacceptable.
In this paper we set out to incrementally improve the Linux scheduler’s scalability, desiring modifications that did not change its excellent desktop performance yet scaled appropriately when faced with a large number of ready tasks. While there is still work required in terms of optimizing our current design, we demonstrated that the ELSC scheduler satisfies both a small and a large number of ready tasks and offers a possible alternative to the current Linux scheduler.
The authors would like to acknowledge Peter Honeyman and Chuck Lever of CITI for their initial idea, guidance, and assistance; Ray Bryant and Bill Hartner of the IBM Linux Technology Center in Austin, Texas for their expertise and help in acquiring and using VolanoMark and the IBM Kernel Trace Facility for Linux; Brian Noble for sparking our interest in operating systems research; and finally, all the past, present, and future developers of Linux.
Copyright and Trademark Information
CITI is a registered trademark of the Center for Information Technology Integration, part of the Information Technology Division (ITD) at the University of Michigan.
Linux is a registered trademark of Linus Torvalds.
IBM Kernel Trace Facility is a trademark of IBM Corporation.
IBM Java Development Kit 1.18 (JDK 1.18) is a trademark of IBM Corporation.
Java is a trademark of Sun Microsystems, Inc. and refers to Sun's Java programming language.
VolanoChat and VolanoMark are registered trademarks of Volano LLC. The VolanoMark benchmark is Copyright © 1996-1999 by Volano LLC, All Rights Reserved.
References
[1] Atlas, A.
Design and implementation of statistical rate monotonic scheduling in
KURT Linux. In Proceedings 20th IEEE Real-Time Systems Symposium. Phoenix, AZ, December, 1999.
[2] Bryant, Ray and Hartner, Bill. Java, Threads, and Scheduling in Linux. IBM Linux Technology Center, IBM Software Group. http://www-4.ibm.com/software/developer/library/java2/index.html.
[3] Lohr, Steve. IBM goes countercultural with Linux. The New York Times On The Web, 20 March 2000. http://www10.nytimes.com/library/tech/00/03/biztech/articles/20soft.html
[4] Molnar, Ingo. Re: scheduling. mingo@chiara.csoma.elte.hu,
1 May 1998. http://www.uwsg.indiana.edu/hypermail/linux/kernel/9805.0/0056.html
[5] Neffenger,
John. The Volano Report. Volano LLC, 24 March 2000. http://www.volano.com/report.html.
[6] Orr, G. Building a Library Web Server on a Budget. Library Software Review, 17:3, September 1998, 171-176.
[7] Regehr,
John. Reply: A different approach to scheduling issues. jdr8d@cs.virginia.edu,
30 September 1998. http://www.uwsg.indiana.edu/hypermail/linux/kernel/9809.3/0933.html.
[8] Torvalds, Linus. The Linux Edge. Communications of the ACM 42, 4 Apr 1999, 38-39.
[9] Wang, Y.C.
Implementing a general real-time scheduling framework in the RED-Linux
real-time kernel. In Proceedings 20th IEEE Real-Time Systems
Symposium. Los Alamitos, CA, 1999,
246-55.
[10] Woodard,
B. Building an enterprise printing
system.
In Proceedings of the Twelfth Systems Administration Conference (LISA XII). USENIX Association, Berkeley, CA, 1998, 219-28.
[11]
Available: http://www.kernel.org/pub/linux/kernel/