Work scope for the Linux Scalability Project
Chuck Lever, Niels Provos, Naomaru Itoi
linux-scalability@citi.umich.edu
Abstract
|
We outline areas of research for the
Linux Scalability
research project, and list specific deliverables.
|
This working document is Copyright © 1998, 1999
Netscape Communications Corp.,
all rights reserved.
Trademarked material referenced in this document is copyright
by its respective owner.
|
Introduction
This document outlines areas
of research and development for the
Linux Scalability research project.
The primary goal of this
research is to
help make the Linux operating system
enterprise-ready
by improving its performance and scalability.
Linux is beginning to compete with other
UNIX and non-UNIX
operating systems in the server market,
and is becoming more popular among
ISPs and ESPs who are using it to provide
enterprise-class network services to their customers.
We are specifically interested
in finding immediate and practical
improvements to Linux that will increase
the performance of Netscape's network server
suite, which includes an LDAP directory
server, an IMAP electronic mail server,
and a web server, among others.
To achieve our primary goal, we will
select a number of areas of potential
improvement to the Linux operating system,
prioritize the areas
based on their estimated improvement pay-off
versus their implementation cost,
then implement the highest
priority improvements.
We will evaluate each improvement
using server and OS benchmarking methodologies
that are as close
to standard as possible to allow scientific
comparison with other research in this area.
Finally we will work with Linux developers to
incorporate our improvements into the baseline
Linux source code.
Improving existing features and
adding new ones to an operating system
to boost application-specific performance
via independent research
is an opportunity afforded
by freeware operating systems
such as Linux.
In the following sections, we outline
the areas where we think we can have the
most success.
In addition to technical achievements,
part of our work will include building
collaborative relationships among freeware
advocates, and with system and software vendors.
Network server performance issues
There are several common performance
and scalability issues recognized by
researchers and server developers.
Here we summarize
some of the most important factors we have
considered.
File descriptor scalability
As the number of network users and
clients grows, the number of concurrent
open file descriptors on network servers
can easily exhaust system limits. The
number of file descriptors maintained on
servers often grows proportionally with
the number of concurrent clients served.
For example, IMAP servers need to
maintain a socket to connect with each
client, and an open file descriptor for the
client's mailbox. A system-wide file
descriptor limit of 1024 prevents such a
server from supporting more than about
500 concurrent users.
Data throughput
Service scalability depends on the
ability of network servers to deliver
more and more data at higher and higher
rates. Operating system architecture
and implementation can have significant
effects on data bandwidth. To improve
a server's effectiveness, we need to
address issues in the operating system
and application that limit the amount
and rate of data flowing from the server's
disk to the network.
On some types of network servers, such
as mail servers, the disk read-write ratio
is significantly skewed towards writes.
Metadata updates and data writes are
among the most expensive disk and file
system operations.
Careful analysis of these operations
may be of great benefit.
Memory bandwidth is also important
in this regard.
Memory allocation and system memory
management can be optimized to make
good use of hardware memory caches.
As well, maintaining I/O data cached
in main memory can improve overall
server efficiency.
Network traffic generated by heavily-used
network servers exhibits unique
characteristics not easily reproduced when
analyzing server performance. Clients
are often situated behind high latency
network connections, resulting in a high
degree of server packet retransmission.
Packet retransmission creates unnecessary
levels of network congestion. Furthermore,
servers often maintain an increasingly
large number of concurrent connections
because most clients are slow to retrieve
data, and thus maintain their connection
for a longer time.
Research has suggested ways to improve TCP
congestion management and startup behavior.
The good news is that these changes can
be implemented on the server, benefiting
server network data throughput without
dependencies on client networking software.
Specific OS dependencies
Lock contention
Locks are used extensively in server applications,
so the performance of an OS's lock primitives
is very important.
Also, support for mutexes that can be shared
among processes is required.
Memory management
Server applications make heavy use of shared
regions, anonymous maps, and mapped files.
Special features like locking down regions
so they aren't swapped, fast
mmap(),
support for allocating very large shared
regions and memory areas, and efficient memory
allocation are especially useful.
For example
malloc() in the C run-time library
appears not to scale well across multiple
processors, since sharing the heap requires
a single global heap lock.
MP scalability
To provide more processing power to
a server application, we can add more
CPUs to a server, but first we must be
sure that the operating system and the
server application can take full
advantage of more than one or two
CPUs at a time.
Network servers are generally I/O bound,
however. Increasing the number of CPUs,
while not directly increasing the I/O
bandwidth of a system, may have other
benefits, such as increasing the amount
of CPU available for handling interrupts
and processing network protocols.
The very latest versions of Linux use MP
hardware significantly more efficiently than
some earlier versions do. However, there
is still room to improve.
Asynchronous events and thread dispatching
Network servers require an integrated
approach to asynchronous I/O and thread
dispatching.
Most modern server architectures make heavy
use of both asynchronous I/O and threads.
Asynchronous I/O support helps keep the
amount of kernel resources and number of
outstanding read buffers to a minimum.
Having an asynchronous I/O model that
is easy to program with and allows
reuse of server software among various OS
platforms is a big win.
Most importantly, an OS-provided integrated
asynchronous I/O and event dispatching facility
has been shown by researchers to be critical
to the performance and scalability of
internet servers.
More flexible and efficient system
call interfaces
Under some circumstances, Netscape
server products appear to perform better
on Windows NT than on UNIX platforms.
Many have conjectured that NT
has better system call interfaces
for network servers than UNIX.
A way of improving
server performance and scalability is to
help the server application itself make
more efficient use of the operating system
and the resources it provides. We
can do this by adding improved interfaces,
or by making the current interfaces,
such as
poll(), more efficient.
System interfaces should also support
64-bit files and filesystems, as well as
very large address spaces and more than
a few gigabytes of physical RAM.
Prioritizing our work
To decide which improvements provide
the most benefit, we will rate each
potential improvement in the following
categories.
Estimated pay-offs
-
Measurable throughput, performance, scalability improvements.
- These are the most important gains we hope
for. We can estimate these pay-offs
by using research of preexisting literature
and simple microbenchmarks.
-
Added stability during overload.
- While some improvements may provide
little, none, or even slightly negative
performance or scalability gains, they
might offer significant enhancements of
system behavior during overload conditions.
-
Synergy with other potential improvements.
- Several potential improvements to
Linux can be accomplished in
different ways. Choosing to implement
one improvement may make others
much simpler to implement.
Estimated implementation costs
-
Implementation time with existing resources.
- We want to make sure the
work we plan is feasible for our developers,
and can be completed successfully during the
course of this project.
-
Estimated complexity.
- Complexity
relates directly to the amount of
testing required, for example,
and also increases the likelihood
that improvements may introduce new
bugs. We are also concerned about introducing
improvements that require
significant changes to applications, especially
changes that are not backwards compatible.
-
Potential introduction of security or
scalability problems.
- While an improvement might be easily implemented,
it also might introduce other problems
that make it unsuitable, such as unstable
overload behavior, or unacceptable security exposures.
-
Amount of server re-engineering required.
- A strong bias towards work that requires
no or only incremental modification of system
interfaces is appropriate. This will improve the ability of
everyone to take advantage of our changes immediately.
-
Degree of expected acceptance by Linux developers.
- We'd like to provide
improvements that are likely to be accepted
as permanent modifications to the
Linux kernel, as maintained by the
Linux developers. Our improvements
lose value if they have to be installed or
included separately.
Coordination efforts
These efforts help build collaborative
relationships among research and
corporate entities to forward the mission of
our research.
EECS research on performance
and NT-like system call APIs
Graduate research at U-M's College
of Engineering may be approaching
several of these issues already. We would
like to coordinate with this work so we
don't duplicate it.
Input from Netscape server product teams
We will consult with members of
Netscape server product teams to itemize
and prioritize work that the teams would
like to see to improve Linux/server
product performance.
Collaboration with Linux development community
We will co-operate with members of
the Linux development community to
determine the current state of Linux,
and determine how that work
affects Netscape server product
performance. We will offer development
resources for work on
scalability and performance issues.
ISV relationships
We will work with interested system and software
vendors to build support for Linux as a server
platform. This work may range from helping Veritas
create a Linux version of their VxFS file system,
to working with Intel's performance engineers,
or working with the makers of Purify to port their
products to run on Linux.
Staged delivery
We've broken our project goals and deliverables
into three stages.
Project prioritization is based on what expertise
and resources are available to our project, and
what will have the highest pay-off and probability
of success.
In other words, we will start with "low-hanging
fruit," and as our resource base and experience
grows, we will tackle more difficult and
riskier problems.
Stage one
Initially, we are interested in providing
improvements that require no changes to
application architecture or to the system
interface.
These changes are easy and have a high
probability of pay-off, with little risk of
introducing new bugs or performance problems.
This is a period during which we will build
our expertise, and create ties to the Linux
development community.
We also anticipate forming relationships with
several ISVs.
Finally, we will construct and benchmark a
small local test harness in preparation for
measuring later implementations.
Our deliverables during stage one include
a finalized version of this work
scope document, scholarly papers and status
reports describing our progress, and the
construction of our local test harness.
We will also establish OS and application
benchmark levels with microbenchmarks and
application-level benchmarks.
The benchmark results will provide both
a base-level performance measurement, and
specific improvement initiatives.
Specific stage one projects include:
-
Removing file descriptor limits
-
This will involve increasing the default file descriptor
limits in select(), poll(),
and get_unused_fd(), as well as measuring and
analyzing file descriptor related kernel functions to
see that they scale well with the number of file descriptors.
We will also explore improvements and changes to the C
run-time library.
-
Improving memory bandwidth
-
We will implement and measure new versions of memset()
and memcpy() in the kernel and in the C run-time library
that can approach much more closely hardware memory bandwidth
maximums.
We will also tune malloc() in the kernel and the C run-time
library to help mitigate latencies in underlying system
resource providers, and to help these routines layout memory
in a more hardware cache-friendly way.
-
Improving TCP bandwidth
-
-
Several interesting innovations can help boost TCP
throughput, and reduce latency due to lost packets.
We will implement and study a new mechanism which connects recovery
processing on one connection to all other connections between
the server and a particular client.
We will also tune and improve current TCP recovery mechanisms,
including SACK, duplicate ACK, slow start, and fast retransmit.
Finally, we will dynamically analyze Linux's current TCP
implementation to check its compliance with TCP standards,
and that its congestion behavior is neighborly.
-
Improving mmap() efficiency
-
The mmap() system call is used pervasively in Linux
and in server applications.
We can examine and improve mmap()
performance on Linux to help make the whole system faster
and more scalable.
The mmap() system call uses system disk I/O and memory
resources heavily, so this work may also give us an excuse
to fiddle with pieces of the Linux VM system, kernel memory
allocation, the C run-time library, and disk device drivers.
-
Constructing and benchmarking our test harness
-
In order to measure scientifically our improvements, we will
require a local test harness on which to stress and benchmark
Linux and the various network server applications.
Selecting the specific hardware and choosing which versions
of server and benchmarking software will be part of stage
one.
Stage two
By stage two, we will explore solutions that
may involve some changes to Linux's system
API and/or to server application architecture.
We may also try some of the more risky
or more complicated improvements, now that
we have some experience under our belts.
We will also have built constructive relationships
with some ISVs and with parts of the Linux
development community.
Our deliverables during stage two include
the improvements themselves,
with scholarly papers and reports
describing our improvements.
Specific stage two projects include:
-
Implementing a caching send_file()
-
-
There already exists a Linux implementation of
send_file() but there may be room for improvement.
Integrating send_file() with the kernel's SKBUFF cache
will improve its performance significantly, for example.
This project would involve finding ways to make send_file()
more efficient, then re-architecting one or more servers
to use it.
-
Improving NSPR
-
We can focus some effort on the Linux version of NSPR to help it
make the best use of system resources, including new system
calls such as send_file(), or new features such as
asynchronous I/O completion notification.
-
Discovering Linux scalability limits
-
Linux may have some unfortunate system limits that we will
need to discover in order to address them within Netscape
server product software. Examples of such limits might be:
- small process address space size
- small kernel address space size
- inability to page most kernel data structures
- fixed limits on kernel data structure size
- 32-bit limits on file system interfaces (like VFS)
-
This work would attempt to stress various parts of the
Linux kernel to determine where its limits lie. We can
also engage in research and communication with Linux
developers to uncover architected limits, and find ways
to relieve the limits.
-
Exploring asynchronous I/O models
-
The lastest version of the Gnu C run-time library (glibc 2.1)
has asynchronous I/O support, based on the POSIX.4 spec, built
into it. Recent versions of the Linux development kernel
support glibc's aio API with POSIX.4 realtime signals, and
I/O completion notification via these signals.
-
This work would explore the usefulness of the current support,
and also implement other system interfaces that provide
different programming models, to compare the efficiency of
different implementations, and to see
which are easier to use in applications.
It may also involve modifications to base server product
code to try out some of the new interfaces.
-
High-performance filesystems
-
Linux is scheduled to get a journalling file system, as
well as support for 64-bit files, in the near future.
It is important that the underlying filesystem implementations
can realistically scale to provide these features.
Some such areas might include boosting the ability to create
many files in the same directory, providing support for
swapping memory-based filesystems,
improving the efficiency of metadata operations and
data writes, and supporting very large
filesystems via variable block sizes (for use with RAID
subsystems).
Stage three
During stage three, we will subject
our most successful improvements
from early stages to more stringent performance
and scalability testing by
working directly with Netscape's
server development
teams, and by providing some of our
work to service providers we know well,
such as Netcenter.
As part of our stage three efforts, we
might also focus on creating a Linux
Center of Excellence at CITI.
This Center of Excellence could host
other researchers and provide hardware
arenas for advanced research and
development of the Linux operating system.
Our deliverables during stage three
include scholarly papers and reports
describing the deployment and
performance measurement results.
We will also identify a comprehensive
set of performance measurement tools
and methodologies.
Finally, we will complete a Linux
server "Best Practices" guide which
describe Linux-specific configuration
options and enhancements to help
customers get the most out of their
Linux-based network servers.
Specific stage three projects include:
-
Measuring and improving SMP scalability
-
Our specific interest here is finding where SMP scalability
is constrained within the kernel, and relieving those areas.
Many recent versions of Linux have greatly improved SMP
support, including support for up to 16 processors, and
multiprocessor interrupt support.
However, there may be areas where resource contention or
scheduler limitations, for instance, might limit the amount
of real scalability obtainable.
-
Optimizing PCI performance on multiple buses
-
This work would require a server configuration with
high memory bandwidth and multiple PCI buses. We would
attempt to understand the interaction between the
operating system and multiple PCI buses, and try
some improvements based on our analysis.
As well, we will explore ways of improving the efficiency
of SCSI drivers by increasing the capacity of the
device driver to handle overlapped I/O and RAID.
-
Linux device driver support for ATM cards
-
ATM networking can help increase
server throughput over and above FDDI
or fast Ethernet technologies. Therefore,
we can explore much further the edges
of server performance with ATM. It's
not clear, though, how well ATM and
other types of high-performance networking
are supported in the Linux kernel.
-
Improved interrupt handling
-
This work would combine SMP enhancements with the addition of
a hybrid polling/interrupt-driven interrupt model to the kernel
to allow device drivers to handle batches of interrupts rather
than one interrupt at a time.
Such support already exists for serial devices; we may find that
it significantly improves the performance of disk and high-bandwidth
network devices, too.
-
Zero-copy networking
-
Reducing or eliminating data copy operations that
result from processing a network packet can help improve
application data and request bandwidth.
Mechanisms for improving networking
efficiency include checksum caching,
reducing the number of data copy operations, and moving data
directly from one driver to another without context
switches (using a mechanism such as IO-Lite). Some of these
changes are easy, but something like IO-Lite would be a
significant undertaking.
Benchmark methodologies
The current Linux development kernel (v2.1)
is about to be rolled over into the next
version of the stable kernel (v2.2).
The 2.1 kernel has been "feature-frozen" since
the Spring of 1998, meaning that bug fixes are
gladly accepted, but new features generally are
not.
Since the 2.2 kernel is so close, it is likely that
most or all of our enhancements will be added
to the 2.3 kernel when it arrives.
Whereever possible, we will work with
the current development tree, since it contains
a number of enhancements that are required by the
Netscape server products.
The development tree
contains many improvements to the
kernel, but is sometimes made unusable by work
in progress, so it may be a source of delays.
To provide truly useful measurements of
performance and scalability, we
will choose benchmarking systems that
performance researchers and Netscape's
own performance engineers use most
often. This permits comparison and
repetition of our work, increasing its
value over time. At the same time, we
recognize that standard benchmarks are
often inadequate for measuring certain
types of performance problems, so we
will use other benchmarks as well.
There will also be cases where we
want to examine directly the effects of
certain modifications to operating
system features. For analyzing OS-specific
modifications, McVoy's microbenchmarks
and the Byte Linux Benchmarks
will be useful. File system benchmarks,
such as Bonnie, the Modified Andrew
Benchmark, and SPEC's SDET and KENBUS
benchmarks, will provide cross-sections
of overall system performance.
We are especially interested in
application performance, so
application-specific benchmarks will also be used to
measure our progress. Webstone and
SPECweb96 appear to be the standard
web server benchmarks. However,
S-client and httperf have features that
would exercise pathological network
behavior, and may be useful in judging
networking improvements.
Directory-Mark is Netscape's directory server
benchmark of choice.
We have a strong bias towards Web-server
benchmarks, even though our work will initially
be focused on the Netscape Directory and Messaging
Servers, for several reasons:
-
Using freeware web servers and benchmarks means CITI
and others can remain without nondisclosure while still
making significant contributions.
-
Many web server performance issues are common to other
types of network servers, and are easier to measure
in web server.
-
There are a variety of web servers and web server
benchmarks available.
Hardware
High speed networking technologies
will be an integral part of our test harness.
Either switched fast Ethernet or
ATM will comprise our test harness
network.
We will also have multiple CPU
hardware on hand to implement and test
SMP changes. It may be advisable to use
the more powerful machines to drive
server loads on smaller machines in the
test harness to approach server performance
limits more quickly and repeatably.
Testing and evaluation of large-scale server
configurations is beyond the scope of this
project. We can go as far as understanding
compatibility and Linux-specific performance
issues with large-scale and esoteric configurations,
but our expertise is focused on software
optimization. Besides, we believe that all our
operating system optimizations will
benefit moderate and large-scale server
deployments. And, as our work progresses, we
will be better positioned later to investigate
large-scale server performance issues.
Milestones
In this section, we list project milestones by date.
- November 1, 1998
U-M has selected
GSRA. This work scope document is available to
Netscape and U-M researchers for
review. Stage one begins.
- December 1, 1998
Research agreement is executed by
Netscape and U-M.
Netscape has on-site employee up to
speed with orientation and project
supplies.
- January 1, 1999
Test harness hardware is available
to U-M researchers.
Initial reports on stage one improvements
available from U-M.
- February 1, 1999
Work scope draft is accepted. U-M
has working test harness, and initial draft
of development priorities. Research and
prototyping is nearing completion.
Scientific benchmark results of
test harness are available.
- March 1, 1999
U-M provides three to five papers
describing their work to date. A
technical exchange is scheduled during
March for Netscape and U-M to meet and
discuss progress.
- April 1, 1999
Stage two begins.
U-M makes developments available
for testing at Netscape. Two to four
more papers are available from U-M.
U-M and Netscape select arenas for further
testing in production environments.
- July 1, 1999
Stage three begins.
U-M is working with Netscape
and several service
providers to test and deploy some of the most
promising developments.
- September 1, 1999
U-M provides two to three papers
describing deployment experiences.
U-M provides final status paper for
Netscape to review.
- November 1, 1999
Netscape and U-M agree on final
work completion. Stage three is
complete.
If you have comments or suggestions, email
linux-scalability@citi.umich.edu
Revision: $Id: workscope.html,v 1.12 1999/11/12 20:15:46 cel Exp $