
Linux Web Server Optimizations

Chuck Lever, Netscape Communications Corp.
chuckl@netscape.com

$Id: web-opt.html,v 1.3 1999/11/12 20:12:54 cel Exp $

Abstract

This report summarizes a discussion on the linux-kernel mailing list about tuning a system to run a web server under Linux 2.2.



This document is Copyright © 1999 Netscape Communications Corp., all rights reserved. Trademarked material referenced in this document is copyright by its respective owner.


Introduction

In this report, we summarize a recent discussion of tuning suggestions for Linux systems running as web servers. This report is meant neither as an exhaustive survey nor as a source of benchmark data, but simply as a sophisticated bookmark for information resources and a touchstone for future work.


Summary

While most system tunables are already set to acceptable values in the default kernel, make sure you have boosted inode-max (/proc/sys/fs/inode-max), and have built the kernel with a large number of process slots (include/linux/tasks.h). If you need to run on a large-memory machine (more than 1GB), recompile as per the directions in the kernel tree (include/asm-i386/page.h). If your web server generates a significant swap load or heavily uses mmap'd files, increase the swap cluster size in /proc/sys/vm/page-cluster.

To improve networking capacity, adjust snd_cwnd and snd_cwnd_cnt (the TCP congestion window defaults) and raise /proc/sys/net/core/netdev_max_backlog.
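
For illustration, here is a minimal C sketch (not from the discussion below) that writes example values into the /proc files named above. The values are placeholders only; the right settings depend on your workload, and the program must run as root.

    /* Sketch: write example values into the /proc tunables discussed above.
     * The values below are illustrative, not recommendations. */
    #include <stdio.h>

    static int set_tunable(const char *path, const char *value)
    {
        FILE *f = fopen(path, "w");
        if (!f) {
            perror(path);
            return -1;
        }
        fprintf(f, "%s\n", value);
        return fclose(f);
    }

    int main(void)
    {
        set_tunable("/proc/sys/fs/inode-max", "16384");               /* larger inode table */
        set_tunable("/proc/sys/vm/page-cluster", "5");                /* bigger swap clusters */
        set_tunable("/proc/sys/net/core/netdev_max_backlog", "1000"); /* deeper device backlog */
        return 0;
    }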

Other web servers are generally faster than Apache, which ships with most Linux distributions. Apache tuning information can be found at http://www.apache.org/docs/misc/perf-tuning.html; disable most or all logging and reverse IP lookups if you need performance. Zeus (www.zeus.co.uk) is one faster alternative; Zeus tuning information can be found at http://support.zeustech.net/tuning.htm.

You can tune your file systems for better performance by creating them with a 4k block size instead of the default 1k block size: use "mke2fs -b 4096". Reducing the number of inodes also helps: use "mke2fs -i 16384". Both of these changes greatly reduce the time required to fsck and mount large file systems.

The rest of the discussion revolves around how to improve Linux and the Apache software to make better use of multiple CPUs via multithreading.



Discourse

Date: Fri, 16 Apr 1999 11:04:39 +0100 (BST)
From: Alan Cox 
To: Dick Balaska 
Cc: linux-smp@vger.rutgers.edu
Subject: Re: NT vs Linux info

> Linux SMP vs. NT SMP and Linux failed.  Miserably.  If indeed Linux 
> suffered a nervous breakdown and slid rapidly to 0 throughput, that 
> needs to be addressed.

Pick any result and pay for it.

> Here is the zdnet response to the test:
>
http://www.excite.com/computers_and_internet/tech_news/zdnet/?article=/news/19990415/2242246.inp

Interesting, speculative and also wildly inaccurate. NT won their tests because
someone (either the test lab or MS) very carefully chose to misconfigure the
system.

An out of the box Linux 2.0 setup with an SMP kernel dropped over it gives
very different graphs to those. An untuned 2.2.x x>=4 does likewise

Someone specifically sat down and said

o       I've heard 2.2.2 has an interoperability problem with NT where
        the odd NT/Windows ack patterns cause slow transfers

o       This scsi controller here appears to be known to be slow and not
        working SMP in this beta, lets specify that.

And mindcruft also didn't look at www.linux.org.uk at all for example.

This is not a technical matter, its exactly the same as if I decided to publish
a Linux thrashes NT flat benchmark using 16Mb machines to make NT look 
artificially worse than the dreadful figures it gives anyway.


Alan

Date: Fri, 16 Apr 1999 11:10:53 +0100 (BST)
From: Alan Cox 
To: Cacophonix Gaul 
Cc: linux-kernel@vger.rutgers.edu
Subject: Re: 2.2.5 optimizations for web benchmarks?

> for a web benchmark. I understand many of the tunables 
> in /proc/sys, but I'm not sure if I'm missing something 
> crucial.

You actually probably don't need to tune them

> Assuming fast hardware, with lots of RAM (ranging from 
> 512MB - 2GB), and with/without RAID (software), and a 
> fast network (yellowfin or eepro100), and ext2fs. 

For 2Gig of RAM you want to recompile as per the directions in the
kernel tree (include/asm-i386/page.h) to use an offset of 0x80000000.

> What should I change at 2.2.5 compile time? I'm 
> especially curious if increasing the initial tcp cwnd 
> to min (4380, 4*MSS) as suggested by IETF/tcpimpl would 

It will do for an http/1.0 test, it doesn't for real world. You can tell
if its a factor because you will see much less than 100% CPU usage and
that adding clients increases performance.

> connections of the benchmark (I presume I'd have to 
> change both snd_cwnd and snd_cwnd_cnt)

Yes

> Will I get any benefit from changing the tunables in eepro100.c and
> yellowfin.c ?

There is nothing there you need to touch

> inode-max
> bdflush
> buffermem
> freepages
> kswapd
> pagecache
> Any other I'm missing ?

You shouldnt need to touch those either.

> I'm using many of the common apache 1.4.3 
> optimizations. Is there anything I can do to improve 
> SMP performance, to help the built in affinity-scheme 
> of linux?

Apache does fine with the default CPU affinity. You probably need to tune
the apache configuration more than anything else.  You want the logging and
swap on a different disk to the pages and to the system preferably. For the
raid you want striping (obviously) and the raid howto/tools tell you how
to find the optimal stripe size.

If you want to answer the question "how fast is linux as a web server" 
consider benchmarking using Zeus (www.zeus.co.uk) too.


Alan

Date: Fri, 16 Apr 1999 17:31:31 +0800
From: David Luyer 
To: linux-kernel@vger.rutgers.edu
Subject: Re: 2.2.5 optimizations for web benchmarks? 


> If you want to answer the question "how fast is linux as a web server" 
> consider benchmarking using Zeus (www.zeus.co.uk) too.

Benchmark Boa.  The Debian package of it would be an easy target.  The
web server executable is around 48k in size.  (see www.boa.org)

David.

Date: Fri, 16 Apr 1999 10:48:12 +0100 (GMT)
From: Matthew Kirkwood 
To: Alan Cox 
Cc: Cacophonix Gaul , linux-kernel@vger.rutgers.edu
Subject: Re: 2.2.5 optimizations for web benchmarks?

On Fri, 16 Apr 1999, Alan Cox wrote:

> If you want to answer the question "how fast is linux as a web server"
> consider benchmarking using Zeus (www.zeus.co.uk) too.

Just ask the Mindcraft people how fast Linux is :-)

I've heard claims that Roxen (www.roxen.com) out-benches Zeus.  Is this
likely?  (Are there any comparative benchmarks out there?)

Matthew.

Date: Fri, 16 Apr 1999 23:04:03 +0200 (CEST)
From: Rik van Riel 
To: Cacophonix Gaul 
Cc: Linux Kernel 
Subject: Re: 2.2.5 optimizations for web benchmarks?

On Fri, 16 Apr 1999, Cacophonix Gaul wrote:

> I'd like some help with optimizing linux (and apache) 
> for a web benchmark. I understand many of the tunables 
> in /proc/sys,

I've recently updated the documentation for the sysctl
files:
        http://www.nl.linux.org/~riel/patches/

> What should I change at 2.2.5 compile time?

If you want to do an honest benchmark, you should (IMHO)
compile a mostly vanilla kernel...

> Does anyone have any empirical ideas about 
> _specific_ values that would work well for:
> inode-max

Large, really large (like, 8192 or slightly above that)

> bdflush

Set a low number of max dirty percentage and
high syncing times...

> buffermem
> pagecache

Forget these two files, they really don't do much
anymore (or need to, with the new algorithms).

> freepages
> kswapd

Leave at standard value.

> Any other I'm missing ?

If you plan on going into swap, set page-cluster to 5;
if you do a lot of fork()s / exit()s, set pagetable_cache
to something higher...

> I'm using many of the common apache 1.4.3 
> optimizations. Is there anything I can do to improve 
> SMP performance, to help the built in affinity-scheme 
> of linux?

If you're running with 50+ apache processes, processor
affinity isn't going to buy you much. Better make
sure that each Apache child can serve a lot of requests.

And don't do reverse IP lookups with the standard named :)

> The benchmark itself is specweb96, so the files are 
> distributed over a large range of sizes. I expect to 
> see 10K-20K simultaneous active connections.

10k active connections?  Have you found a way to run
this many processes simultaneously?
(is it included in the kernel and did I miss that event
or isn't it integrated yet?)

Rik -- Open Source: you deserve to be in control of your data.
+-------------------------------------------------------------------+
| Le Reseau netwerksystemen BV:               http://www.reseau.nl/ |
| Linux Memory Management site:  http://humbolt.geo.uu.nl/Linux-MM/ |
| Nederlandse Linux documentatie:          http://www.nl.linux.org/ |
+-------------------------------------------------------------------+

Date: Sat, 17 Apr 1999 01:51:17 +0100 (BST)
From: Stephen C. Tweedie 
To: Cacophonix Gaul 
Cc: linux-kernel@vger.rutgers.edu, Stephen Tweedie 
Subject: Re: 2.2.5 optimizations for web benchmarks?

Hi,

On Fri, 16 Apr 1999 00:15:31 -0700 (PDT), Cacophonix Gaul
 said:

> I'd like some help with optimizing linux (and apache) 
> for a web benchmark. 

OK.  Most of the important points have been covered already.  Especially
the tuning of the apache server itself is one of the most significant
issues.

> Does anyone have any empirical ideas about 
> _specific_ values that would work well for:
> inode-max, bdflush, buffermem, freepages, kswapd, pagecache

No advice right now: these should be OK out of the box.

However, having said that, there is a group of people working quite hard
to optimise the kernel fairly aggressively for large server loads like
this.  It's not necessarily something you want to get involved with if
you just want to take a fair benchmark of the existing kernels, but if
you really want to get the best out of a large memory machine then it
may be worthwhile looking up the scalability work being done at

        http://www.citi.umich.edu/projects/citi-netscape/

In particular we've been working on bottlenecks in the page, buffer and
dentry hashing mechanisms and have found fairly impressive performance
gains to be had by tuning that.  Updates will be posted as we come up
with tested patches.

Cheers,
 Stephen.

Date: 16 Apr 1999 16:14:43 GMT
From: Linus Torvalds 
To: linux-kernel@vger.rutgers.edu
Subject: Re: 2.2.5 optimizations for web benchmarks?

In article ,
Alan Cox  wrote:
>
>> inode-max
>> bdflush
>> buffermem
>> freepages
>> kswapd
>> pagecache
>> Any other I'm missing ?
>
>You shouldnt need to touch those either.

Depending on what the benchmark is, some of these _can_ have a quite
noticeable difference.

For example, if the benchmark consists of web-serving 50.000 different
small files, you want to make sure that you can have all of them cached
at the same time.  So you probably want to increase inode-max quite
noticeably. 

I agree that you should not need to touch any of the values for any REAL
WORLD benchmark.  But some of them can certainly be an issue for the
above kind of loads that aren't very realistic. 

>Apache does fine with the default CPU affinity. You probably need to tune
>the apache configuration more than anything else.  You want the logging and
>swap on a different disk to the pages and to the system preferably. For the
>raid you want striping (obviously) and the raid howto/tools tell you how
>to find the optimal stripe size.

And make sure you log as little as humanly possible.  In particular,
older versions of apache did reverse name lookups etc, which just killed
performance. 

>If you want to answer the question "how fast is linux as a web server" 
>consider benchmarking using Zeus (www.zeus.co.uk) too.

Yup,

                Linus

Date: Fri, 16 Apr 1999 19:28:35 -0700 (PDT)
From: Dean Gaudet 
To: Stephen C. Tweedie 
Cc: Cacophonix Gaul , linux-kernel@vger.rutgers.edu
Subject: Re: 2.2.5 optimizations for web benchmarks?

On Sat, 17 Apr 1999, Stephen C. Tweedie wrote:

> OK.  Most of the important points have been covered already.  Especially
> the tuning of the apache server itself is one of the most significant
> issues.

Uh I dunno.  Unless by tuning you mean "replace apache with something
that's actually fast" ;) 

Really, with the current multiprocess apache I've never really been able
to see more than a handful of percentage improvement from all the tweaks.
It really is a case of needing a different server architecture to reach
the loads folks want to see in benchmarks. 

That said, people might be interested to know that we're not dolts over at
Apache.  We have recognized the need for this... we're just slow.  I did a
pthread port last year, and threw it away because we had portability
concerns.  I switched to Netscape's NSPR library to solve the portability
concern[1].  That was last spring... then summer happenned and I found
other things to do.  In the interim IBM joined the apache team, and showed
us how the NPL sucks (patent clause), and that using NSPR would be a bad
thing. 

Months went on in this stalemate... we finished 1.3.0, .1, ...  We kept
hoping netscape's lawyers would see the light and fix the NPL.  That
didn't look hopeful -- so IBM started up a small team to redo the threaded
port, using everything I'd learned (without looking at my code... 'cause
it was NPL tainted), and port to pthreads. Their goal:  beat their own
webserver (Go).  This port is called apache-apr, and as of today someone
posted saying they'd served 2.6 million hits from apache-apr over a 4 day
period.  Not a record or anything, but an indication of stability. 

Oh, netscape fixed the patent clause.  Or they're supposed to be releasing
the fix.  But we're down the road far enough now we won't turn back. 

At this point apache-apr isn't in a state where we want zillions of people
using it, because it's probably still full of bugs.  But if you really
want it, visit http://dev.apache.org/ and dig around in our cvs stuff...
just don't expect hand holding.

Oh, to forestall anyone saying "apache should be a poll/event-loop style
server to go the fastest"... yes, you're bright (but probably wrong[2],
and if I digress any further I'll make myself puke).  Apache will never be
the fastest webserver, because that isn't our goal.  Our goal is
correctness, and useability.  Performance at this level is mostly a
marketing gimick. 

Dean

[1] NSPR had a feature that had me excited -- hybrid userland/kernel
threads.  I suspect these won't be necessary to do well on benchmark
loads.  But on real-life loads where there are lots of long haul clients,
these might be real nice... you won't need to chew up a kernel resource
for each client.  Something for linuxthreads folks to think about.

[2] insert discussion about kernel assisted http serving here, reference
IBM's and Sun's published 4-way xeon specweb numbers with kernel
accelerators

Date: Fri, 16 Apr 1999 19:10:29 +0200
From: Kristian Koehntopp 
To: linux-kernel@vger.rutgers.edu
Newsgroups: netuse.lists.linux-kernel
Subject: Re: 2.2.5 optimizations for web benchmarks?

In netuse.lists.linux-kernel you write:
>Roxen Challenger is written in Pike, which they say "somewhat resembles C
>and Java, but is not quite like either of those languages".  I haven't looked
>at it to see what it is, maybe just further developed from uLPC.

Pike is a much further developed uLPC. It is a byte-compiled
language, that is interpreted at runtime. The Roxen Challenger
web server and the Pike language are not exactly the speed combo
for several reasons (Pike being only one of them) and like to
have loads of memory. They are not slow either, but a
performance leader is surely not written in an interpreted
byte-compiled language. I would try phttpd
(http://www.signum.se/phttpd) for a C-based ultra-fast, low
memory footprint solution that is up to the task of tackling
Zeus. Don't know how phttpd does on Linux, though: I have only
seen it on Solaris.

The Challenger excels in GUI driven configurability and in HTML
generating features. The server allows you to create your own
markup, which is interpreted at page-serving time and is
converted to HTML on the fly. This is a lot like CFML in Cold
Fusion, only more powerful, more flexible and more reliable
under load and for free (GPLed system).

It is also recursive, that is, Challenger's tags can generate new
tags which are ultimately expanded into HTML after an arbitrary
number of iterations. And Challenger allows you to define
container tags which conditionally expand to HTML. Both of these
mean that the server can only deliver the first byte of a page
after the last byte of a page has been handled (actually, after
the last byte in a defined container has been handled). As you
may easily imagine, this system introduces potentially large
latencies (think SQL querying tags that build large TABLE
structures), which surely is going to ruin your benchmarks, but
is just great when you need to design dynamic content.

The Challenger server may not be the performance leader, but it
surely redefined comfort with regard to configuration and web
design. It is also great for separating page layout and page
content.

There are a bunch of very expensive, very powerful tools built
upon the GPLed Roxen Challenger Webserver. These form the Roxen
Platform, a commercial product. If you need a publishing system
product or support for your GPLed webserver, this is very much
the way to go.

>I wouldn't be surprised if Challenger has really good benchmarks, given how 
>good Spinner was for its time :-)

Given enough memory, Challenger has okay benchmarks that are
solid under load. Due to the construction of the system,
Challenger will never be performance leader - that is okay, too,
because the other features of the system are well worth that
price. 

After all, serving speed is a problem that can be solved by
buying a larger iron, too. Creating and maintaining content in
time is the much harder problem in most cases, and this is where
the Roxen Platform really shows its strengths.

Kristian

Date: Fri, 16 Apr 1999 23:25:35 -0700 (PDT)
From: Cacophonix Gaul 
To: Alan Cox 
Cc: linux-kernel@vger.rutgers.edu
Subject: Re: 2.2.5 optimizations for web benchmarks?



--- Alan Cox  wrote:

> > especially curious if increasing the initial tcp cwnd 
> > to min (4380, 4*MSS) as suggested by IETF/tcpimpl would 
> 
> It will do for an http/1.0 test, it doesn't for real world. You can tell
> if its a factor because you will see much less than 100% CPU usage and
> that adding clients increases performance.
> 
> > connections of the benchmark (I presume I'd have to 
> > change both snd_cwnd and snd_cwnd_cnt)

In my setup, changing the initial value of these more than doubles 
performance (with the same number of clients).


> > Will I get any benefit from changing the tunables in eepro100.c and
> > yellowfin.c ?
> 
> There is nothing there you need to touch
> 
> > inode-max
> > bdflush
> > buffermem
> > freepages
> > kswapd
> > pagecache
> > Any other I'm missing ?
> 
> You shouldnt need to touch those either.


There must be _something_ I'm missing. A typical procinfo shows:

Memory:      Total        Used        Free      Shared     Buffers      Cached
Mem:        517720      427572       90148       21972       26916      385984
Swap:      1312720           0     1312720

Bootup: Thu Sep  3 17:47:11 1998    Load average: 4.94 2.59 1.65 1/34 1776

user  :       0:00:01.10  13.7%  page in :        0  disk 1:        0r       8w
nice  :       0:00:00.00   0.0%  page out:       11  disk 2:        0r       0w
system:       0:00:05.38  67.1%  swap in :        0
idle  :       0:00:09.56 119.2%  swap out:        0
uptime:       1:17:38.42         context :    61598

irq  0:       802 timer                 irq 16:     26222 Intel EtherExpress P
irq  1:         0 keyboard              irq 17:     12774 Intel EtherExpress P
irq  2:         0 cascade [4]           irq 18:        18 ncr53c8xx, Intel Eth
irq  7:       122                       irq 19:         3 Intel EtherExpress P
irq 13:         0 fpu                   irq 129:     61598


Looks like the CPU is mostly idle, there is memory to spare (and no swapping),
and most of the files are being served from cache - hence the low disk read
numbers (although the disk is a pretty fast seagate cheetah). Logging is off.
The network is pretty clean - no retransmissions. Yet the specweb throughput
is below offered load.

I haven't had much time to investigate this yet...

> 
> If you want to answer the question "how fast is linux as a web server" 
> consider benchmarking using Zeus (www.zeus.co.uk) too.
> 

That was my first option - but the version I'm running is  disappointing,
and the binary doesn't give many options to tune.

Anyone know of a webserver that uses sendfile() on linux?

thanks.

Date: Sat, 17 Apr 1999 12:03:55 -0600 (MDT)
From: Dax Kelson 
To: Cacophonix Gaul 
Cc: Alan Cox , linux-kernel@vger.rutgers.edu
Subject: Re: 2.2.5 optimizations for web benchmarks?



On Fri, 16 Apr 1999, Cacophonix Gaul wrote:

> > If you want to answer the question "how fast is linux as a web server" 
> > consider benchmarking using Zeus (www.zeus.co.uk) too.
> > 
> 
> That was my first option - but the version I'm running is  disappointing,
> and the binary doesn't give many options to tune.
> 
> Anyone know of a webserver that uses sendfile() on linux?
> 
> thanks.

Yes.  Zeus can use sendfile().

Go read the tuning page....

http://support.zeustech.net/tuning.htm

Dax Kelson
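
For reference, Linux 2.2 provides sendfile(2), which copies file data to a socket inside the kernel. The sketch below shows one plausible way a server might use it to serve a file over an already-accepted connection; it is illustrative only and is not taken from Zeus or from the thread above.

    /* Sketch: send a whole file down a connected socket with sendfile(2).
     * 'client_fd' is assumed to be a connected TCP socket; error handling
     * is minimal. */
    #include <fcntl.h>
    #include <sys/sendfile.h>
    #include <sys/stat.h>
    #include <unistd.h>

    static int serve_file(int client_fd, const char *path)
    {
        struct stat st;
        off_t offset = 0;
        int file_fd = open(path, O_RDONLY);

        if (file_fd < 0)
            return -1;
        if (fstat(file_fd, &st) < 0) {
            close(file_fd);
            return -1;
        }

        while (offset < st.st_size) {
            ssize_t sent = sendfile(client_fd, file_fd, &offset, st.st_size - offset);
            if (sent <= 0)
                break;          /* error, or short write on a nonblocking socket */
        }
        close(file_fd);
        return offset == st.st_size ? 0 : -1;
    }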

Date: Wed, 21 Apr 1999 00:01:16 +0100 (BST)
From: Stephen C. Tweedie 
To: Dean Gaudet 
Cc: Stephen C. Tweedie , Cacophonix Gaul ,
     linux-kernel@vger.rutgers.edu
Subject: Re: 2.2.5 optimizations for web benchmarks?

Hi,

On Fri, 16 Apr 1999 19:28:35 -0700 (PDT), Dean Gaudet
 said:

> On Sat, 17 Apr 1999, Stephen C. Tweedie wrote:
>> OK.  Most of the important points have been covered already.  Especially
>> the tuning of the apache server itself is one of the most significant
>> issues.

> Uh I dunno.  Unless by tuning you mean "replace apache with something
> that's actually fast" ;) 

> Really, with the current multiprocess apache I've never really been able
> to see more than a handful of percentage improvement from all the
> tweaks.

Fair enough, although I know other people have definitely reported
bigger differences.  Is there a decent tuning writeup online that we can
direct people to in the future?  I'm helping Rik van Riel and a number
of folk on the linux performance lists to assemble some basic tuning
info and Apache is obviously one of the important components to cover.

> I did a
> pthread port last year, 

_Now_ we're talking interesting. :) 

> IBM started up a small team to redo the threaded port, using
> everything I'd learned (without looking at my code... 'cause it was
> NPL tainted), and port to pthreads. Their goal: beat their own
> webserver (Go).  This port is called apache-apr, and as of today
> someone posted saying they'd served 2.6 million hits from apache-apr
> over a 4 day period.  

So is this being actively developed?

Cheers,
 Stephen.

Date: Tue, 20 Apr 1999 16:22:38 -0700 (PDT)
From: Dean Gaudet 
To: Stephen C. Tweedie 
Cc: Cacophonix Gaul , linux-kernel@vger.rutgers.edu
Subject: Re: 2.2.5 optimizations for web benchmarks?



On Wed, 21 Apr 1999, Stephen C. Tweedie wrote:

> Fair enough, although I know other people have definitely reported
> bigger differences.  Is there a decent tuning writeup online that we can
> direct people to in the future?  I'm helping Rik van Riel and a number
> of folk on the linux performance lists to assemble some basic tuning
> info and Apache is obviously one of the important components to cover.

http://www.apache.org/docs/misc/perf-tuning.html

> > IBM started up a small team to redo the threaded port, using
> > everything I'd learned (without looking at my code... 'cause it was
> > NPL tainted), and port to pthreads. Their goal: beat their own
> > webserver (Go).  This port is called apache-apr, and as of today
> > someone posted saying they'd served 2.6 million hits from apache-apr
> > over a 4 day period.  
> 
> So is this being actively developed?

Yeah.

Dean

Date: Tue, 20 Apr 1999 21:19:41 -0400 (EDT)
From: Greg Lindahl 
To: Dean Gaudet 
Cc: linux-kernel@vger.rutgers.edu
Subject: Re: 2.2.5 optimizations for web benchmarks?

> > Fair enough, although I know other people have definitely reported
> > bigger differences.  Is there a decent tuning writeup online that we can
> > direct people to in the future?  I'm helping Rik van Riel and a number
> > of folk on the linux performance lists to assemble some basic tuning
> > info and Apache is obviously one of the important components to cover.
> 
> http://www.apache.org/docs/misc/perf-tuning.html

There is nothing in this document about tuning Linux to help Apache.
It does seem to cover tuning Apache itself quite well.

-- g

Date: Wed, 21 Apr 1999 10:03:06 -0500
From: Matthew Vanecek 
To: linux kernel list 
Subject: http://www.nfr.net/nfr/mail-archive/nfr-users/1999/Feb/0110.html

Has anyone seen this?  It's a pretty sad commentary on Linux packet
handling.  Is there truth to it, and if so, plans to fix it?

I found it from a link on:
http://www.anzen.com/products/nfr/testing/
-- 
Matthew Vanecek
Course of Study: http://www.unt.edu/bcis
Visit my Website at http://people.unt.edu/~mev0003
For answers type: perl -e 'print
$i=pack(c5,(41*2),sqrt(7056),(unpack(c,H)-2),oct(115),10);'
*****************************************************************
For 93 million miles, there is nothing between the sun and my shadow
except me. I'm always getting in the way of something...

Date: Wed, 21 Apr 1999 11:23:51 -0400 (EDT)
From: Greg Lindahl 
To: Matthew Vanecek 
Cc: linux-kernel@vger.rutgers.edu
Subject: Re: http://www.nfr.net/nfr/mail-archive/nfr-users/1999/Feb/0110.html

> Has anyone seen this?  It's a pretty sad commentary on Linux packet
> handling.  Is there truth to it, and if so, plans to fix it?

The bsd machines were sniffing 45,000 packets per second. Linux -- in
the default configuration -- can't even receive 45,000 packets per
second, because of the default setting of
/proc/sys/net/core/netdev_max_backlog.

-- g

Date: Wed, 21 Apr 1999 17:40:00 +0100 (GMT)
From: Matthew Kirkwood 
To: Matthew Vanecek 
Cc: linux kernel list 
Subject: Re: http://www.nfr.net/nfr/mail-archive/nfr-users/1999/Feb/0110.html

On Wed, 21 Apr 1999, Matthew Vanecek wrote:

> Has anyone seen this?  It's a pretty sad commentary on Linux packet
> handling.  Is there truth to it, and if so, plans to fix it?

NFR uses a BSD API which Linux doesn't support.

Under Linux, one syscall per-packet is required, which quite
seriously limits the rate at which packets can be sniffed.  It's
not exactly a "sad commentary" - it's missing feature which
proves to be a deficiency for this application.

I imagine that something like
ftp://ftp.inr.ac.ru/ip-routing/lbl-tools/kernel-turbopacket.dif.gz
might help to rectify the situation.

Matthew.

Date: Wed, 21 Apr 1999 11:03:34 -0700 (PDT)
From: Dean Gaudet 
To: Greg Lindahl 
Cc: linux-kernel@vger.rutgers.edu
Subject: Re: 2.2.5 optimizations for web benchmarks?



On Tue, 20 Apr 1999, Greg Lindahl wrote:

> > > Fair enough, although I know other people have definitely reported
> > > bigger differences.  Is there a decent tuning writeup online that we can
> > > direct people to in the future?  I'm helping Rik van Riel and a number
> > > of folk on the linux performance lists to assemble some basic tuning
> > > info and Apache is obviously one of the important components to cover.
> > 
> > http://www.apache.org/docs/misc/perf-tuning.html
> 
> There is nothing in this document about tuning Linux to help Apache.
> It does seem to cover tuning Apache itself quite well.

Right -- it's a document about tuning apache.  "Tuning linux" is even more
a black art and I wasn't about to write up everything.  Plus it changes
every couple kernel versions and libc versions anyhow.  It's a nightmare
to keep up to date any documentation surrounding linux internals.

I'm saying this from the point of view of the person who answers the
apache bugdb mail regarding linux problems.  90% of what I end up having
to say to people is "well gee it works fine on all machines I have access
to, maybe it's your distribution/kernel version/libc version/phase of the
moon/colour of your hair/..." there are just too many variables that make
linux inconsistent.  Take it as a light flame. 

I'm actually *afraid* of what problems will be phrased as apache bugs when
people start to learn how to "tune" their linux kernel.  Yay!  Another
dozen dimensions of freedom for inexperienced people to break their
system! 



Dean

Date: Wed, 21 Apr 1999 16:25:20 -0400 (EDT)
From: Greg Lindahl 
To: Dean Gaudet 
Cc: linux-kernel@vger.rutgers.edu
Subject: Re: 2.2.5 optimizations for web benchmarks?

> Right -- it's a document about tuning apache.  "Tuning linux" is even more
> a black art and I wasn't about to write up everything.  Plus it changes
> every couple kernel versions and libc versions anyhow.  It's a nightmare
> to keep up to date any documentation surrounding linux internals.

Well, the object of what's going on is to write things up. Some things
like the backlog queue defaulting to 300 have been the case for ages,
so that's well worth writing up. If you have any other suggestions,
even if they only apply to particular kernel versions, they would be
helpful.

> I'm actually *afraid* of what problems will be phrased as apache bugs when
> people start to learn how to "tune" their linux kernel.  Yay!  Another
> dozen dimensions of freedom for inexperienced people to break their
> system! 

I would like to write a tool which will be more predictable
behavior. "Um, did you get the latest kerntune and turn on the apache
setting?" This would be optimal, but I think it can be very good, even
with an inexperienced user.

-- g

Date: Thu, 22 Apr 1999 18:36:24 -0400
From: Joshua E. Rodd 
To: Gary Lawrence Murphy 
Cc: linux-kernel@vger.rutgers.edu
Subject: Re: Linus on Linux, Apache and Threads

Gary Lawrence Murphy wrote:
> In the excerpt from OpenSources on LinuxWorld, there is
> an omninous statement in Linus' musings which goes
>    "Of course Linux isn't being used to its full potential even
>     by today's web servers. Apache itself doesn't do the right
>     thing with threads, for example"
> What *should* Apache be doing?

Apache should not use a seperate process for each HTTP request.
It ought to one of two things:

 - Use a thread to handle each request, all in one process.
   There is already an alpha pthreads Apache out there, but
   it's virulently unstable (I used it on OS/2, and it had
   SEGV/BUS errors almost nonstop). If you enjoy playing with
   threads, check it out.

 - Be event-driven rather than procedure-driven by using
   select(2) to serve files. (Obviously CGI scripts and anything
   hard to do in an event-driven manner can be done with a
   new spawned/forked process.)

Note that on *.BSD, Apache's process-intensiveness is not an
issue because *.BSD kernels can fork at a mind-boggling rate.

Cheers,
Joshua.
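
For readers unfamiliar with the event-driven model suggested above, here is a minimal single-process select(2) loop. It is a sketch only: it echoes data back instead of parsing HTTP, ignores the FD_SETSIZE limit, and omits the non-blocking write handling a real server would need.

    /* Sketch: a single-process event loop built on select(2), in the spirit
     * of the suggestion above. Connection handling is a placeholder. */
    #include <sys/select.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int event_loop(int listen_fd)
    {
        fd_set all_fds;
        int max_fd = listen_fd;

        FD_ZERO(&all_fds);
        FD_SET(listen_fd, &all_fds);

        for (;;) {
            fd_set read_fds = all_fds;

            if (select(max_fd + 1, &read_fds, NULL, NULL, NULL) < 0)
                return -1;

            for (int fd = 0; fd <= max_fd; fd++) {
                if (!FD_ISSET(fd, &read_fds))
                    continue;
                if (fd == listen_fd) {
                    int conn = accept(listen_fd, NULL, NULL);
                    if (conn >= 0) {
                        FD_SET(conn, &all_fds);
                        if (conn > max_fd)
                            max_fd = conn;
                    }
                } else {
                    char buf[4096];
                    ssize_t n = read(fd, buf, sizeof(buf));

                    if (n <= 0) {               /* EOF or error: drop the connection */
                        close(fd);
                        FD_CLR(fd, &all_fds);
                    } else {
                        write(fd, buf, n);      /* stand-in for serving a response */
                    }
                }
            }
        }
    }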

Date: Thu, 22 Apr 1999 19:45:24 -0400 (EDT)
From: Greg Lindahl 
To: Joshua E. Rodd 
Cc: linux-kernel@vger.rutgers.edu
Subject: Re: Linus on Linux, Apache and Threads

> Apache should not use a seperate process for each HTTP request.

It doesn't. By default each child process serves 30 requests before
exiting; you can turn that up if the requests are simple. This
optimization is in there precisely because fork() isn't free.

>  - Use a thread to handle each request, all in one process.

Then you lose the memory protection that the current scheme gets you.

>    There is already an alpha pthreads Apache out there, but
>    it's virulently unstable

Doh! Hammer. Head. Bang. OK, I'm mostly joking, but...

>  - Be event-driven rather than procedure-driven by using
>    select(2) to serve files. (Obviously CGI scripts and anything
>    hard to do in an event-driven manner can be done with a
>    new spawned/forked process.)

That's how ircd works, but I wouldn't necessarily think it would be
better under load. select() is not cheap, and you have extra system
calls to avoid blocking while sending output, and you have to avoid
blocking on the disk.

Now if you invented a syscall which was specifically designed to
collect input from many sockets and write output to many sockets, that
might be a win.

-- g

Date: 23 Apr 1999 02:00:57 GMT
From: Stuart Lynne 
Reply-To: sl@fireplug.net
To: linux-kernel@vger.rutgers.edu
Newsgroups: list.linux-kernel
Subject: Re: Linus on Linux, Apache and Threads

In article <371FA468.5308ABE8@noah.dhs.org>,
Joshua E. Rodd  wrote:
>Gary Lawrence Murphy wrote:
>> In the excerpt from OpenSources on LinuxWorld, there is
>> an omninous statement in Linus' musings which goes
>>    "Of course Linux isn't being used to its full potential even
>>     by today's web servers. Apache itself doesn't do the right
>>     thing with threads, for example"
>> What *should* Apache be doing?
>
>Apache should not use a seperate process for each HTTP request.
>It ought to one of two things:
>
> - Use a thread to handle each request, all in one process.
>   There is already an alpha pthreads Apache out there, but
>   it's virulently unstable (I used it on OS/2, and it had
>   SEGV/BUS errors almost nonstop). If you enjoy playing with
>   threads, check it out.

Apached processes are long lived processing multiple requests. So doesn't it 
boil down to the difference between passing file descriptors to child processes
or a different thread? That and the fact that you probably can have a larger
number of active threads than processes for any given amount of memory you 
want to use?

This is not to say that the file passing model is optimal. A select() loop
should be faster. But the apache team seems to be more interested in robust
and portable with reasonable performance than performance at any cost.

> - Be event-driven rather than procedure-driven by using
>   select(2) to serve files. (Obviously CGI scripts and anything
>   hard to do in an event-driven manner can be done with a
>   new spawned/forked process.)
>
>Note that on *.BSD, Apache's process-intensiveness is not an
>issue because *.BSD kernels can fork at a mind-boggling rate.

For static pages that can be fetched without additional processing, normal
apache child processes should not have to fork additional processes, so this
should not be an issue. CGI requests would benefit from fast fork()ing.

-- 
Stuart Lynne       604-461-7532      
PGP Fingerprint: 28 E2 A0 15 99 62 9A 00  88 EC A3 EE 2D 1C 15 68

Date: Thu, 22 Apr 1999 19:28:16 -0700 (PDT)
From: Alex Belits 
To: sl@fireplug.net
Cc: linux-kernel@vger.rutgers.edu
Subject: Re: Linus on Linux, Apache and Threads

On 23 Apr 1999, Stuart Lynne wrote:

> >   it's virulently unstable (I used it on OS/2, and it had
> >   SEGV/BUS errors almost nonstop). If you enjoy playing with
> >   threads, check it out.
> 
> Apached processes are long lived processing multiple requests. So doesn't it 
> boil down to the difference between passing file descriptors to child processes
> or a different thread? That and the fact that you probably can have a larger
> number of active threads than processes for any given amount of memory you 
> want to use? 

  AFAIK Apache on Unix doesn't pass file descriptors -- every process does
all the work by itself. Apache on Windows passes fds between threads,
however that was done not as an attempt to make a usable
multithreaded design but to make something that will work on Windows, where
the multiple-process design simply won't work.

> This is not to say that the file passing model is optimal. A select() loop
> should be faster. But the apache team seems to be more interested in robust
> and portable with reasonable performance than performance at any cost.

  It's rather complex issue -- select() has limited scalability, and
poll() that is supported as a syscall in later version of the kernel
scales somewhat better. Worse yet, 2.0.x kernels had 256 fds per process
limit, and that is also fixed in later versions. Another problem is that
select() loop shouldn't call any blocking syscalls, and sending large
amount of data through select() loop requires more syscalls and more user
time than blocking i/o.

  In my FTP/HTTP server (fhttpd) I use a combination of main process that
uses select() or poll(), then passes fd (along with parsed request) to one
of many processes that use blocking I/O to reply to the client. The goal
of that design was mostly flexibility -- processes can be specialized
(file sending, PHP, possibly interpreters of other languages,
some applications), so main process can parse request and make decision,
which process should handle it. Lack of blocking in the main loop saves a
lot of trouble, however the performance of this thing depends on
select()/poll() scalability and the time of context switch between
processes.

-- 
Alex

----------------------------------------------------------------------
 Excellent.. now give users the option to cut your hair you hippie!
                                                  -- Anonymous Coward
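
The descriptor passing that fhttpd's design relies on is done with sendmsg(2) and SCM_RIGHTS ancillary data over a Unix-domain socket. The sketch below shows the sending side only and is not code from fhttpd; the receiver would use recvmsg(2) with a matching control buffer.

    /* Sketch: the sending half of descriptor passing over a Unix-domain
     * socket, using SCM_RIGHTS ancillary data. Illustrates the mechanism
     * described above; not code from fhttpd. */
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/uio.h>

    int send_fd(int unix_sock, int fd_to_pass)
    {
        char byte = 0;
        struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
        union {
            struct cmsghdr align;               /* forces correct alignment */
            char buf[CMSG_SPACE(sizeof(int))];
        } control;
        struct msghdr msg;
        struct cmsghdr *cmsg;

        memset(&msg, 0, sizeof(msg));
        msg.msg_iov = &iov;
        msg.msg_iovlen = 1;
        msg.msg_control = control.buf;
        msg.msg_controllen = sizeof(control.buf);

        cmsg = CMSG_FIRSTHDR(&msg);
        cmsg->cmsg_level = SOL_SOCKET;
        cmsg->cmsg_type = SCM_RIGHTS;
        cmsg->cmsg_len = CMSG_LEN(sizeof(int));
        memcpy(CMSG_DATA(cmsg), &fd_to_pass, sizeof(int));

        return sendmsg(unix_sock, &msg, 0) == 1 ? 0 : -1;
    }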

Date: Fri, 23 Apr 1999 19:05:01 +1200
From: Chris Wedgwood 
To: Joshua E. Rodd 
Cc: Gary Lawrence Murphy , linux-kernel@vger.rutgers.edu
Subject: Re: Linus on Linux, Apache and Threads

> Apache should not use a seperate process for each HTTP request.

It doesn't. It uses a separate process for each simultaneous request,
but each of these processes may serve many requests over its
lifetime.


> It ought to one of two things:
> 
>  - Use a thread to handle each request, all in one process.

No, that would be insane. Perhaps use threads the way it now uses
processes, but not one thread per request -- that would be death as
far as performance goes.

>  - Be event-driven rather than procedure-driven by using
>    select(2) to serve files. (Obviously CGI scripts and anything
>    hard to do in an event-driven manner can be done with a
>    new spawned/forked process.)

Maybe... but there are still probably better ways of doing this at
the extreme high end anyhow (oh, and poll would probably be preferred
for gobs of FDs).

> Note that on *.BSD, Apache's process-intensiveness is not an
> issue because *.BSD kernels can fork at a mind-boggling rate.

*BSD kernels can fork at a very high rate, as can linux, but neither
of them need to fork all that fast for reasons I've already outlined.



-cw

Date: Fri, 23 Apr 1999 09:14:30 -0700 (PDT)
From: Ian D Romanick 
To: Chuck Lever 
Cc: gale@syntax.dera.gov.uk, linux-kernel@vger.rutgers.edu
Subject: Re: Linus on Linux, Apache and Threads

> On Fri, 23 Apr 1999, Tony Gale wrote:
> > On 23-Apr-99 Chris Wedgwood wrote:
> > >> It ought to one of two things:
> > >> 
> > >>  - Use a thread to handle each request, all in one process.
> > > 
> > > No, that would be insane. Perhaps use threads the way it now uses
> > > processes, but not one thread per request -- that would be death as
> > > far as performance goes.
> > 
> > Depends. You can use a thread pool and queue requests which are then
> > picked up by the threads.
> 
> how do you propose to do that efficiently?  is there a nice way in
> Unix/Linux to hand out incoming network requests to a pool of threads?

Huh?  It's not that hard of a problem.  You have one (or several) thread
that just reads from network sockets.  They package up the requests and put
them on the end of a queue.  The other threads just pull requests off the
head of the queue.  The trick is all in waking up the sleeping threads when
the queue becomes non-empty without having the "thundering herd" problem.

It is even possible, if you have only one thread adding to the queue, to
implement the queue so that the writing thread doesn't need a lock.  Only
the reading threads need a lock.  If you have multiple writing threads you
would need a read lock and a write lock.  This enables massive amounts of
concurency, and is a huge win on mp systems.
-- 
"With a touch more confidence and a liberal helping of ignorance I would have 
been a famous evangelist."
                        -- Stranger In A Strange Land
PLENTY of ignorance at http://www.cs.pdx.edu/~idr

Date: Fri, 23 Apr 1999 13:55:42 -0400 (EDT)
From: Chuck Lever 
To: Ian D Romanick 
Cc: linux-kernel@vger.rutgers.edu
Subject: Re: Linus on Linux, Apache and Threads

On Fri, 23 Apr 1999, Ian D Romanick wrote:
> > On Fri, 23 Apr 1999, Tony Gale wrote:
> > > On 23-Apr-99 Chris Wedgwood wrote:
> > > >> It ought to one of two things:
> > > >> 
> > > >>  - Use a thread to handle each request, all in one process.
> > > > 
> > > > No, that would be insane. Perhaps use threads the way it now uses
> > > > processes, but not one thread per request -- that would be death as
> > > > far as performance goes.
> > > 
> > > Depends. You can use a thread pool and queue requests which are then
> > > picked up by the threads.
> > 
> > how do you propose to do that efficiently?  is there a nice way in
> > Unix/Linux to hand out incoming network requests to a pool of threads?
> 
> Huh?  It's not that hard of a problem.  You have one (or several) thread
> that just reads from network sockets.  They package up the requests and put
> them on the end of a queue.  The other threads just pull requests off the
> head of the queue.

i wasn't trying to suggest it was hard to code or understand.  my question
is how to do this efficiently.  has anyone compared the performance of
this model with the performance of the same application implemented
using NT's completion ports?

every Unix model i've seen that uses the "single thread waits for event
and wakes up worker threads" has suffered from a variety of problems that
can cripple its performance, like:

how does the event waiter thread determine whether there is a
suitable/idle worker thread to awaken?

> The trick is all in waking up the sleeping threads when
> the queue becomes non-empty without having the "thundering herd" problem.

agreed!  any ideas?

        - Chuck Lever
--
corporate:      
personal:        or 

The Linux Scalability project:
        http://www.citi.umich.edu/projects/linux-scalability/

Date: Fri, 23 Apr 1999 12:59:59 -0500
From: Manoj Kasichainula 
To: linux-kernel@vger.rutgers.edu
Subject: Re: Linus on Linux, Apache and Threads

On Fri, Apr 23, 1999 at 11:48:47AM -0400, Chuck Lever wrote:
> is there a nice way in
> Unix/Linux to hand out incoming network requests to a pool of threads?

Grab a tarball from http://dev.apache.org/from-cvs/apache-apr/ to look
at an attempt to do this in a multiprocess multithreaded web server.
See the pthreads/ subdirectory in the tarball for the server itself.
The code is most definitely in pre-alpha state.

We have two methods in the code (selectable with a #define) for
distributing connections to threads. One is with a pool of threads in
an accept() loop (one per listening socket) pushing connections onto a
queue, and another pool of worker threads popping connections off the
queue and handling them.

Another is very similar to the Apache 1.3 model. Every thread is in a
loop of

accept_mutex_lock()
poll() all listening sockets
accept a connection
accept_mutex_unlock()
process_connection()

-- 
Manoj Kasichainula - manojk at io dot com - http://www.io.com/~manojk/
"Violence is the first refuge of the violent." - Aaron Allston

Date: Fri, 23 Apr 1999 19:59:01 +0100 (BST)
From: Alan Cox 
To: Greg Lindahl 
Cc: mev0003@unt.edu, linux-kernel@vger.rutgers.edu
Subject: Re: http://www.nfr.net/nfr/mail-archive/nfr-users/1999/Feb/0110.html

> The bsd machines were sniffing 45,000 packets per second. Linux -- in
> the default configuration -- can't even receive 45,000 packets per
> second, because of the default setting of
> /proc/sys/net/core/netdev_max_backlog.

I've benched a Linux box with the standard settings doing over 55,000 packets
per second _routing_ not just receiving.

The fun with NFR isnt the device backlog, its that BSD has a hack built into
it basically solely for sniffing tools to use, and Linux doesn't.

Alan

Date: Fri, 23 Apr 1999 11:34:10 -0700 (PDT)
From: Ian D Romanick 
To: Chuck Lever 
Cc: idr@cs.pdx.edu, linux-kernel@vger.rutgers.edu
Subject: Re: Linus on Linux, Apache and Threads

> On Fri, 23 Apr 1999, Ian D Romanick wrote:
> > Huh?  It's not that hard of a problem.  You have one (or several) thread
> > that just reads from network sockets.  They package up the requests and put
> > them on the end of a queue.  The other threads just pull requests off the
> > head of the queue.
> 
> i wasn't trying to suggest it was hard to code or understand.  my question
> is how to do this efficiently.  has anyone compared the performance of
> this model with the performance of the same application implemented
> using NT's completion ports?
> 
> every Unix model i've seen that uses the "single thread waits for event
> and wakes up worker threads" has suffered from a variety of problems that
> can cripple its performance, like:
> 
> how does the event waiter thread determine whether there is a
> suitable/idle worker thread to awaken?
> 
> > The trick is all in waking up the sleeping threads when
> > the queue becomes non-empty without having the "thundering herd" problem.
> 
> agreed!  any ideas?

It could be done several ways.  Does Linux have Tanenbaum style up/down
semaphores?  If not, it shouldn't be too hard to do using pthread_mutex.
Each time an element is put on the event queue, you up the semaphore.
Before getting the dequeue lock on the queue, the thread would down the
semaphore.

At this point, you have a pretty easy model, and all of the hard work is in
implementing the down function.  I would say that the semaphore could be
implemented with a lock, a counter, and a queue of pthread_t objects.  The
trick is that when you down a semaphore that is already zero, you put the
thread on the queue and have it go to sleep.  Then when you up a semaphore
that has sleeping threads, you wake up the first thread.

The problem is that I don't know how to do the whole sleep/wake up thing
with pthreads.  It seems as though it could be done condition variables, but
then you still have the thundering herd problem.  I suppose that you could
allocate one condition variable per thread, but that doesn't seem very
elegant either.

-- 
"With a touch more confidence and a liberal helping of ignorance I would have 
been a famous evangelist."
                        -- Stranger In A Strange Land
PLENTY of ignorance at http://www.cs.pdx.edu/~idr
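
One common way to get the up/down behaviour asked about above is to build a counting semaphore from a pthread mutex and a condition variable; pthread_cond_signal wakes a single sleeping worker rather than the whole pool. A minimal sketch, not taken from the original mail:

    /* Sketch: a counting ("up/down") semaphore built from a pthread mutex and
     * a condition variable. An up() wakes one sleeping worker at a time, so
     * the whole pool is not released on every enqueue. */
    #include <pthread.h>

    struct counting_sem {
        pthread_mutex_t lock;
        pthread_cond_t  nonzero;
        unsigned int    count;
    };

    void csem_init(struct counting_sem *s, unsigned int initial)
    {
        pthread_mutex_init(&s->lock, NULL);
        pthread_cond_init(&s->nonzero, NULL);
        s->count = initial;
    }

    void csem_down(struct counting_sem *s)   /* worker: call before dequeuing a request */
    {
        pthread_mutex_lock(&s->lock);
        while (s->count == 0)
            pthread_cond_wait(&s->nonzero, &s->lock);
        s->count--;
        pthread_mutex_unlock(&s->lock);
    }

    void csem_up(struct counting_sem *s)     /* acceptor: call after enqueuing a request */
    {
        pthread_mutex_lock(&s->lock);
        s->count++;
        pthread_cond_signal(&s->nonzero);    /* wake one waiter, not all of them */
        pthread_mutex_unlock(&s->lock);
    }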

Date: Fri, 23 Apr 1999 14:24:23 -0400 (EDT)
From: Greg Lindahl 
To: Alan Cox 
Cc: linux-kernel@vger.rutgers.edu
Subject: Re: http://www.nfr.net/nfr/mail-archive/nfr-users/1999/Feb/0110.html

> > The bsd machines were sniffing 45,000 packets per second. Linux -- in
> > the default configuration -- can't even receive 45,000 packets per
> > second, because of the default setting of
> > /proc/sys/net/core/netdev_max_backlog.
> 
> I've benched a Linux box with the standard settings doing over 55,000 packets
> per second _routing_ not just receiving.

But you had a tail-wind. I take it that the bottom half can execute
more than 100x a second if the machine is otherwise idle. But if
anything comes along and uses a whole timeslice, the backlog queue
fills (default size 300) and you start dropping packets on the floor.
Yes? No?

And routing is fairly efficient; it's all in the kernel, at
least. Sniffing, on the other hand, consumes a fair bit of extra CPU
time getting the data up to the user process and consuming it.

> The fun with NFR isnt the device backlog, its that BSD has a hack built into
> it basically solely for sniffing tools to use, and Linux doesn't.

That may be the key to getting to *really* high packet rates. But Linux,
in their test, slowed down as the packet rate increased. That's what
made me suspect the backlog. But it's just a guess.

-- g

Date: Sat, 24 Apr 1999 01:17:14 +0100 (BST)
From: Alan Cox 
To: Ian D Romanick 
Cc: cel@monkey.org, idr@cs.pdx.edu, linux-kernel@vger.rutgers.edu
Subject: Re: Linus on Linux, Apache and Threads

> Each time an element is put on the event queue, you up the semaphore.
> Before getting the dequeue lock on the queue, the thread would down the
> semaphore.

That isnt the performance issue. The scaling issue tends to be the "wake all
/ wake one" stuff - which is hard to do well for wake one.

Alan

Date: Sat, 24 Apr 1999 01:14:37 +0100 (BST)
From: Alan Cox 
To: Greg Lindahl 
Cc: alan@lxorguk.ukuu.org.uk, linux-kernel@vger.rutgers.edu
Subject: Re: http://www.nfr.net/nfr/mail-archive/nfr-users/1999/Feb/0110.html

> But you had a tail-wind. I take it that the bottom half can execute
> more than 100x a second if the machine is otherwise idle. But if
> anything comes along and uses a whole timeslice, the backlog queue
> fills (default size 300) and you start dropping packets on the floor.
> Yes? No?

No. The BH isnt scheduled, it follows the interrupts, tasks cannot hold off
a bh.

> > The fun with NFR isnt the device backlog, its that BSD has a hack built into
> > it basically solely for sniffing tools to use, and Linux doesn't.
> 
> That may be the key to getting to *really* high packet rates. But Linux,
> pin their test, slowed down as the packet rate increased. That's what
> made me suspect the backlog. But it's just a guess.

Its partly the packet backlog. This is why I dumped the whole NFR discussion
nobody involved with the entire thing had done any serious investigation into
why and how to solve it.

On the other hand I've had a short conversation with another company doing
similar tools which has been rational and basically ended at "look at
X, Y and Z. If you want to write a BPF driver for linux using the
sock filter hooks then go ahead, let me know if there are any other
problems in the filter structure that might make it hard"

Date: Sat, 24 Apr 1999 01:26:45 +0100 (BST)
From: Alan Cox 
To: Alan Cox 
Cc: lindahl@cs.virginia.edu, alan@lxorguk.ukuu.org.uk, linux-kernel@vger.rutgers.edu
Subject: Re: http://www.nfr.net/nfr/mail-archive/nfr-users/1999/Feb/0110.html

> > made me suspect the backlog. But it's just a guess.
> 
> Its partly the packet backlog. This is why I dumped the whole NFR discussion

Erp - I mean its partly the _socket_ backlog - thats not the same as the
bh backlog.

The perils of trying to catch up with 2500 emails

Date: Fri, 23 Apr 1999 17:20:40 -0700 (PDT)
From: Dan Hollis 
To: Alan Cox 
Cc: Greg Lindahl , linux-kernel@vger.rutgers.edu
Subject: Re: http://www.nfr.net/nfr/mail-archive/nfr-users/1999/Feb/0110.html

On Sat, 24 Apr 1999, Alan Cox wrote:
> On the other hand I've had a short conversation with another company doing
> similar tools which has been rational

Indeed the NFR group being a BSD outfit from the beginning dont like
Linux, they set up their tests knowing ahead of time it would do badly
with their BSD-specific code, did the comparison tests, then jumped up and
down going "nyah nyah, Linux sucks". They have been anti-Linux for as long
as I can remember.

-Dan

Date: 24 Apr 1999 12:51:44 +0200
From: Andi Kleen 
To: Chuck Lever 
Cc: linux-kernel@vger.rutgers.edu
Subject: Re: Linus on Linux, Apache and Threads

cel@monkey.org (Chuck Lever) writes:

> On Fri, 23 Apr 1999, Tony Gale wrote:
> > On 23-Apr-99 Chris Wedgwood wrote:
> > >> It ought to one of two things:
> > >> 
> > >>  - Use a thread to handle each request, all in one process.
> > > 
> > > No, that would be insane. Perhaps use threads the way it now uses
> > > processes, but not one thread per request -- that would be death as
> > > far as performance goes.
> > 
> > Depends. You can use a thread pool and queue requests which are then
> > picked up by the threads.
> 
> how do you propose to do that efficiently?  is there a nice way in
> Unix/Linux to hand out incoming network requests to a pool of threads?

You have multiple threads doing an accept on a single listen socket. As
soon as a thread finished work it calls accept and gets the next ready 
connection handed from the kernel.


-Andi

-- 
This is like TV. I don't like TV.
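
The model described above can be sketched as a fixed pool of threads, each blocking in accept() on the same listening socket. This is illustrative only; NUM_WORKERS and the empty request handling are placeholders, and whether every sleeper is woken per connection (the thundering herd debated in the replies that follow) depends on the kernel's wakeup behaviour.

    /* Sketch: a fixed pool of threads, each blocking in accept() on the same
     * listening socket, as described above. */
    #include <pthread.h>
    #include <sys/socket.h>
    #include <unistd.h>

    #define NUM_WORKERS 8

    static void *accept_worker(void *arg)
    {
        int listen_fd = *(int *)arg;

        for (;;) {
            int conn = accept(listen_fd, NULL, NULL);
            if (conn < 0)
                continue;   /* e.g. another thread won the race for this connection */
            /* parse the request and send the reply here */
            close(conn);
        }
    }

    void start_pool(int *listen_fd)
    {
        pthread_t tid[NUM_WORKERS];

        for (int i = 0; i < NUM_WORKERS; i++)
            pthread_create(&tid[i], NULL, accept_worker, listen_fd);
    }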

Date: Sat, 24 Apr 1999 12:40:08 +0200
From: Olaf Titz 
To: linux-kernel@vger.rutgers.edu
Subject: OT: multithreaded web server implementation (Re: Linus on Linux,
     Apache and Threads)

> Huh?  It's not that hard of a problem.  You have one (or several) thread
> that just reads from network sockets.  They package up the requests and put
> them on the end of a queue.  The other threads just pull requests off the
> head of the queue.  The trick is all in waking up the sleeping threads when
> the queue becomes non-empty without having the "thundering herd" problem.

Even simpler: put the worker threads themselves on a queue. The
acceptor thread dequeues a worker thread and hands it the request, the
worker threads re-enqueue themselves after doing work. A queue-watcher
thread spawns and enqueues new workers whenever the queue runs dry (or
takes some out of the [end of] a queue when there are too many
unused). A thread can hold a keep-alive connection and not re-enqueue
itself, it can exit (implicitly not re-enqueuing itself) and do
similar stuff it likes, the queue-watcher will always take care. The
acceptor could be split into several threads (easing virtual hosts?),
etc.

I expect the queue manager code for this to be very short and
efficient, because it does little more than managing a double headed
linked list and do some wakeup calls.

You can even have multiple queues this way: specialized worker threads
either for request types (file/CGI/servlet/etc) or for virtual hosts.
Would perhaps need some research whether it is better to handle just a
FD or a preparsed request line to the worker threads. In any case,
threads use shared memory, so you don't really pass down the stuff
any sort of pipe.

Olaf

Date: Sat, 24 Apr 1999 05:35:59 -0700 (PDT)
From: Alex Belits 
To: Andi Kleen 
Cc: Chuck Lever , linux-kernel@vger.rutgers.edu
Subject: Re: Linus on Linux, Apache and Threads

On 24 Apr 1999, Andi Kleen wrote:

> 
> You have multiple threads doing an accept on a single listen socket. As
> soon as a thread finished work it calls accept and gets the next ready 
> connection handed from the kernel.

 ...or will be awakened on the connection that was handled by another
thread (because of "wake everyone" handling), and accept() will fail,
causing the infamous "thundering herd".

-- 
Alex

----------------------------------------------------------------------
 Excellent.. now give users the option to cut your hair you hippie!
                                                  -- Anonymous Coward

Date: 24 Apr 1999 14:23:42 +0200
From: Andi Kleen 
To: olaf@bigred.inka.de
Cc: linux-kernel@vger.rutgers.edu
Subject: Re: Linus on Linux, Apache and Threads

olaf@bigred.inka.de (Olaf Titz) writes:

> > how does the event waiter thread determine whether there is a
> > suitable/idle worker thread to awaken?
> 
> like this?
[complicated queue management code snipped]

This is essentially equivalent to what accept() does in the kernel.
So why not use that directly (together with some sample statistics code
that starts new worker threads as needed)?

-Andi

-- 
This is like TV. I don't like TV.

Date: Sat, 24 Apr 1999 14:37:03 +0200
From: Andi Kleen 
To: Alex Belits , Andi Kleen 
Cc: Chuck Lever , linux-kernel@vger.rutgers.edu
Subject: Re: Linus on Linux, Apache and Threads

On Sat, Apr 24, 1999 at 02:35:59PM +0200, Alex Belits wrote:
> On 24 Apr 1999, Andi Kleen wrote:
> 
> > 
> > You have multiple threads doing an accept on a single listen socket. As
> > soon as a thread finished work it calls accept and gets the next ready 
> > connection handed from the kernel.
> 
>  ...or will be awakened on the connection that was handled by another
> thread (because of "wake everyone" handling), and accept() will fail,
> causing the infamous "thundering herd".

If the load is high enough it doesn't matter, because there will
be always enough connections to be returned to an accept after a wakeup.
If it isn't the threads pool should adapt and use less threads which
avoids the problem (and a few lost wakeups in the transitions don't harm, 
because the machine has enough free cycles).

Do you have any real data that this doesn't happen?

-Andi
-- 
This is like TV. I don't like TV.

Date: Sat, 24 Apr 1999 05:32:57 -0700 (PDT)
From: Alex Belits 
To: Olaf Titz 
Cc: linux-kernel@vger.rutgers.edu
Subject: Re: OT: multithreaded web server implementation (Re: Linus on Linux,
     Apache and Threads)

On Sat, 24 Apr 1999, Olaf Titz wrote:

> You can even have multiple queues this way: specialized worker threads
> either for request types (file/CGI/servlet/etc) or for virtual hosts.
> It would perhaps need some research whether it is better to hand just
> an FD or a preparsed request line to the worker threads. In any case,
> threads use shared memory, so you don't really pass the data down
> any sort of pipe.

  This model, but with processes instead of threads, is used in fhttpd.
Requests are preparsed in the main process because otherwise specialized
"worker" processes can't be chosen before passing fd.

-- 
Alex

----------------------------------------------------------------------
 Excellent.. now give users the option to cut your hair you hippie!
                                                  -- Anonymous Coward

Date: Sat, 24 Apr 1999 23:14:39 +0200
From: Olaf Titz 
To: linux-kernel@vger.rutgers.edu
Subject: Re: Linus on Linux, Apache and Threads

> This is essentially equivalent to what accept() does in the kernel.
> So why not use that directly (together with some sample statistics code
> that starts new worker threads as needed)?

Really? I thought this simplest approach - have a number of threads or
even independent processes which all do accept() themselves - is not
done because of the thundering herd problem.

I think the reason is this: you can't bind more than one socket to an
address to listen on, so all processes share the same listening FD
(dup() wouldn't help because it just creates a new reference to the
same socket.) When this FD becomes ready, all processes waiting on it
get woken up.

Olaf
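
The shared listening FD arrangement described here is essentially a
pre-forked accept loop, sketched below. Error handling is omitted and the
constants are arbitrary; this illustrates the pattern, not code from any
server discussed in the thread.

    /* Sketch: N pre-forked children all blocking in accept() on the same
     * listening socket inherited across fork(). */
    #include <netinet/in.h>
    #include <sys/socket.h>
    #include <unistd.h>

    #define NCHILDREN 8

    int main(void)
    {
        struct sockaddr_in addr = { 0 };
        int i, listen_fd = socket(AF_INET, SOCK_STREAM, 0);

        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port = htons(8080);
        bind(listen_fd, (struct sockaddr *)&addr, sizeof(addr));
        listen(listen_fd, 128);

        for (i = 0; i < NCHILDREN; i++) {
            if (fork() == 0) {
                for (;;) {
                    /* all children sleep here on the same socket; a new
                     * connection wakes them all, but only one accept()
                     * succeeds -- the "thundering herd" */
                    int conn = accept(listen_fd, NULL, NULL);
                    if (conn < 0)
                        continue;
                    /* ... serve the request ... */
                    close(conn);
                }
            }
        }
        for (;;)
            pause();                     /* parent just waits */
    }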

Date: Sat, 24 Apr 1999 14:44:11 -0700 (PDT)
From: Marc Slemko 
To: Alex Belits 
Cc: linux-kernel@vger.rutgers.edu
Subject: Re: Linus on Linux, Apache and Threads

On Sat, 24 Apr 1999, Alex Belits wrote:

> On 24 Apr 1999, Andi Kleen wrote:
> 
> > 
> > You have multiple threads doing an accept on a single listen socket. As
> > soon as a thread finished work it calls accept and gets the next ready 
> > connection handed from the kernel.
> 
>  ...or will be awakened on the connection that was handled by another
> thread (because of "wake everyone" handling), and accept() will fail,
> causing the infamous "thundering herd".

If that is a problem (and there are various reasons why it may or may not
be in various situations), then someone should fix the kernel so it
doesn't have that problem.  Several other OSes have.

Date: Sat, 24 Apr 1999 18:43:52 -0400 (EDT)
From: Chuck Lever 
To: Andi Kleen 
Cc: Alex Belits , linux-kernel@vger.rutgers.edu
Subject: Re: Linus on Linux, Apache and Threads

On Sat, 24 Apr 1999, Andi Kleen wrote:
> > > You have multiple threads doing an accept on a single listen socket. As
> > > soon as a thread finished work it calls accept and gets the next ready 
> > > connection handed from the kernel.
> > 
> >  ...or will be awakened on the connection that was handled by another
> > thread (because of "wake everyone" handling), and accept() will fail,
> > causing the infamous "thundering herd".
> 
> If the load is high enough it doesn't matter, because there will
> be always enough connections to be returned to an accept after a wakeup.

right, but before you get to this point, there is a performance drop.

> If it isn't the threads pool should adapt and use less threads which
> avoids the problem (and a few lost wakeups in the transitions don't harm, 
> because the machine has enough free cycles).

this is sounding more complicated by the minute.  you also want to tune
this so that you have just the right number of threads active to keep the
L1/L2 caches working at their most efficient.  is there any guarantee that
waiting in accept() won't cause round-robin behavior rather than just
picking the first couple of threads on the list?

in other words, it's best to have the number of worker threads be close to
the number of physical CPUs; otherwise, if the threads are scheduled in
round-robin fashion, they could constantly knock each others' working set
out of the CPU caches.

if waiting in accept() does cause the thundering herd problem, that might
be a good thing - the thread that wins will probably have the best cache
foot-print.

> Do you have any real data that this doesn't happen?

i'm wondering if anyone has studied the performance difference between
using this kind of model, and using Windows NT completion ports?  i've
heard lots of speculation that using a completion port model is
significantly more efficient than accept().

        - Chuck Lever
--
corporate:      
personal:        or 

The Linux Scalability project:
        http://www.citi.umich.edu/projects/linux-scalability/

Date: Sun, 25 Apr 1999 01:15:55 +0200
From: Andi Kleen 
To: Chuck Lever , Andi Kleen 
Cc: Alex Belits , linux-kernel@vger.rutgers.edu
Subject: Re: Linus on Linux, Apache and Threads

On Sun, Apr 25, 1999 at 12:43:52AM +0200, Chuck Lever wrote:
> On Sat, 24 Apr 1999, Andi Kleen wrote:
> > > > You have multiple threads doing an accept on a single listen socket. As
> > > > soon as a thread finished work it calls accept and gets the next ready 
> > > > connection handed from the kernel.
> > > 
> > >  ...or will be awakened on the connection that was handled by another
> > > thread (because of "wake everyone" handling), and accept() will fail,
> > > causing the infamous "thundering herd".
> > 
> > If the load is high enough it doesn't matter, because there will
> > be always enough connections to be returned to an accept after a wakeup.
> 
> right, but before you get to this point, there is a performance drop.

Then you have enough cycles left, so it doesn't matter. The server always
has to be a bit oversized to handle traffic peaks; in non-peak situations
you can afford to be a bit less efficient (complicating the code to optimize
this would be wasted time).

Also if the threads pool size is adapting quickly enough it shouldn't
be that bad.

> 
> > If it isn't the threads pool should adapt and use less threads which
> > avoids the problem (and a few lost wakeups in the transitions don't harm, 
> > because the machine has enough free cycles).
> 
> this is sounding more complicated by the minute.  you also want to tune
> this so that you have just the right number of threads active to keep the
> L1/L2 caches working at their most efficient.  is there any guarantee that
> waiting in accept() won't cause round-robin behavior rather than just
> picking the first couple of threads on the list?

There is no such guarantee (except perhaps if you play with nice values[1]),
but it does not matter when you keep statistics about the number of requests
per unit time. As soon as the average time a thread has to wait in accept()
goes below some threshold, add more threads. If it gets above the threshold,
kill threads and lower it. This costs you a few gettimeofday() calls if you
don't use a time keeper thread (a thread that updates a timestamp counter in
shared memory).


> 
> in other words, it's best to have the number of worker threads be close to
> the number of physical CPUs; otherwise, if the threads are scheduled in
> round-robin fashion, they could constantly knock each others' working set
> out of the CPU caches.
> 
> if waiting in accept() does cause the thundering herd problem, that might
> be a good thing - the thread that wins will probably have the best cache
> foot-print.

Only real tests can show. Anyway, if the kernel accept() behaviour
should really cause problems (I would guess it doesn't, by intuition, but I
have no data), then accept() should be fixed, not complicated code added
to the user application.


-Andi

[1] I wouldn't suggest that because of the possible nasty interactions
with other server processes running on the same machine.

-- 
This is like TV. I don't like TV.
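
The statistics-driven pool sizing suggested above could be gathered with
something like the sketch below. The EWMA weighting and the watermark
policy are invented for illustration; only the accept() and gettimeofday()
calls are real interfaces.

    /* Sketch: measure how long each worker waits in accept() and keep a
     * running average for a manager thread to act on. */
    #include <stdlib.h>
    #include <sys/socket.h>
    #include <sys/time.h>

    static double avg_wait_usec;         /* guarded by a lock in real code */

    int timed_accept(int listen_fd)
    {
        struct timeval before, after;
        double waited;
        int conn;

        gettimeofday(&before, NULL);
        conn = accept(listen_fd, NULL, NULL);
        gettimeofday(&after, NULL);

        waited = (after.tv_sec - before.tv_sec) * 1e6
               + (after.tv_usec - before.tv_usec);
        avg_wait_usec = 0.9 * avg_wait_usec + 0.1 * waited;   /* cheap EWMA */
        return conn;
    }

    /* manager policy (hypothetical thresholds):
     *   avg_wait_usec below LOW_WATER  -> workers are starved, spawn more
     *   avg_wait_usec above HIGH_WATER -> workers idle too long, retire some */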

Date: Mon, 26 Apr 1999 18:09:03 +0100 (BST)
From: Stephen C. Tweedie 
To: Chuck Lever 
Cc: Andi Kleen , Alex Belits ,
     linux-kernel@vger.rutgers.edu
Subject: Re: Linus on Linux, Apache and Threads

Hi,

On Sat, 24 Apr 1999 18:43:52 -0400 (EDT), Chuck Lever 
said:

> this is sounding more complicated by the minute.  you also want to tune
> this so that you have just the right number of threads active to keep the
> L1/L2 caches working at their most efficient.  is there any guarantee that
> waiting in accept() won't cause round-robin behavior rather than just
> picking the first couple of threads on the list?

The default behaviour should be at least reasonable.  If you have
multiple threads in accept, then yes, you'll get thundering herd in the
kernel: all threads in wait_for_connect() will be made runnable.
However, the scheduler will prefer to actually schedule those threads
which are already on the best CPU, so those threads will tend to be the
ones which win the accept().  The other threads will get scheduled later
and will either find another connection ready to be accepted, or will
sleep again without ever leaving the accept syscall.

Doing a real wake-one is more difficult, but SysV IPC semaphores should
at least avoid the thundering herd.  If you have one demultiplexor thread
adding work to a queue and waking up worker threads by up()ing a SysV
semaphore, exactly one thread at a time will be woken up.  That _will_
be done in strict FIFO, however; if you have multiple CPUs idle at
the time, the thread will end up being scheduled on its last CPU
(assuming it was last run on one of the idle processors).

--Stephen
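
A rough sketch of the arrangement Stephen describes: a SysV semaphore
counts queued work items, so up()ing it wakes exactly one of the down()ed
workers. The shared-memory queue itself is omitted and the names are
illustrative.

    /* Sketch: demultiplexor up()s the semaphore once per queued item;
     * each worker down()s it before dequeueing. */
    #include <sys/types.h>
    #include <sys/ipc.h>
    #include <sys/sem.h>

    union semun { int val; };            /* Linux leaves this to the caller */

    static int semid;

    void queue_sem_init(void)
    {
        union semun arg;
        semid = semget(IPC_PRIVATE, 1, IPC_CREAT | 0600);
        arg.val = 0;                     /* no work queued yet */
        semctl(semid, 0, SETVAL, arg);
    }

    /* demultiplexor: call after appending an item to the shared queue */
    void queue_notify(void)
    {
        struct sembuf up = { 0, +1, 0 };
        semop(semid, &up, 1);
    }

    /* worker: block until at least one item is available */
    void queue_wait(void)
    {
        struct sembuf down = { 0, -1, 0 };
        while (semop(semid, &down, 1) < 0)
            ;                            /* retry; real code would check errno */
    }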

Date: Mon, 26 Apr 1999 20:56:26 +0100 (BST)
From: Stephen C. Tweedie 
To: Alan Cox 
Cc: Ian D Romanick , cel@monkey.org, linux-kernel@vger.rutgers.edu,
     Stephen Tweedie 
Subject: Re: Linus on Linux, Apache and Threads

Hi,

On Sat, 24 Apr 1999 01:17:14 +0100 (BST), alan@lxorguk.ukuu.org.uk (Alan
Cox) said:

>> Each time an element is put on the event queue, you up the semaphore.
>> Before getting the dequeue lock on the queue, the thread would down the
>> semaphore.

> That isnt the performance issue. The scaling issue tends to be the "wake all
> / wake one" stuff - which is hard to do well for wake one.

SysV semaphores do wake-one for simple up/down counted use.  The 2.2.6
implementation does have a couple of problems right now: it will _only_
wake one at a time even if you up() more than once, and it will do so in
strict FIFO, which is not necessarily going to direct the wakeup to the
best thread (ie. the one last scheduled on a CPU which is now idle).

The multiple-wakeup issue can be fixed pretty easily.  Selecting the
best next thread if you have multiple waiters, without doing a wake-all,
is harder.

--Stephen

Date: Mon, 26 Apr 1999 22:19:31 +0200 (CEST)
From: Ingo Molnar 
To: Stephen C. Tweedie 
Cc: Alan Cox , Ian D Romanick ,
     cel@monkey.org, linux-kernel@vger.rutgers.edu
Subject: Re: Linus on Linux, Apache and Threads


On Mon, 26 Apr 1999, Stephen C. Tweedie wrote:

> > That isnt the performance issue. The scaling issue tends to be the
> > "wake all / wake one" stuff - which is hard to do well for wake one. 
> 
> SysV semaphores do wake-one for simple up/down counted use.  The 2.2.6
> implementation does have a couple of problems right now: it will _only_
> wake one at a time even if you up() more than once, and it will do so in
> strict FIFO, which is not necessarily going to direct the wakeup to the
> best thread (ie. the one last scheduled on a CPU which is now idle). 

i dont think this is a RL issue, most SysV semaphores (in SAP and Oracle) 
are mutexes.

-- mingo

Date: Mon, 26 Apr 1999 22:52:48 +0100 (BST)
From: Stephen C. Tweedie 
To: Ingo Molnar 
Cc: Stephen C. Tweedie , Alan Cox ,
     Ian D Romanick , cel@monkey.org, linux-kernel@vger.rutgers.edu
Subject: Re: Linus on Linux, Apache and Threads

Hi,

On Mon, 26 Apr 1999 22:19:31 +0200 (CEST), Ingo Molnar
 said:

> i dont think this is a RL issue, most SysV semaphores (in SAP and Oracle) 
> are mutexes.

That's not the point.  Many server systems do use a single
demultiplexing process with multiple worker threads.  

Right now, they typically use something like pipes to share the work
amongst worker threads.  Threads pick the data up in FIFO order when new
work enters the queue.  The problem is that every mechanism we have for
this will wake up ALL idle worker threads in the kernel: only one will
actually succeed in grabbing the message, but the others will still
consume kernel-mode CPU time.

The only place I can see where we avoid wake-all is in the SysV
semaphore code.  Using a message queue in shared memory plus sysV
semaphores to actually wake up the clients _will_ avoid the thundering
herd effect.  So it's a genuine issue to consider whether or not this is
worth it.

--Stephen

Date: Tue, 27 Apr 1999 13:01:09 +0000
From: Richard Dynes 
To: linux-kernel@vger.rutgers.edu
Subject: Re: Linus on Linux, Apache and Threads


Tony Gale wrote:
>
> Can I just get some clarification here. pthread_cond_signal is
> defined as waking *one* thread, thus:
>
>       pthread_cond_signal restarts one of the threads  that  are
>       waiting on the condition variable cond.  If no threads are
>       waiting on cond, nothing happens. If several  threads  are
>       waiting  on  cond, exactly one is restarted, but it is not
>       specified which.
>
> Are you saying that the kernel will in fact wake up *all* the threads
> (either just internally, or in actuality)?

I'm interested in this too.  Tony's point was my immediate thought:
pthread_cond_signal is an obvious implementation of a solution to the
'thundering herd' problem. But it has apparently been discounted.

I had assumed that this was because of some limitation to the linux
pthread implementation.  I had read (somewhere, recently) that linux
pthreads in a process were not scheduled independently by the kernel,
but were scheduled only within that process, thus only get a single
kernel scheduling slot.  This is a real limitation. But Matt Ranney
has said on this list that: 

         Linux pthreads are all kernel processes, 

True?  I'll go find the answers on my own, but I'm curious about the
answer to Tony Gale's question.

-Richard

-- 
    Richard Dynes
    rdynes@varcom.com

Date: Tue, 27 Apr 1999 14:20:59 +0000
From: Richard Dynes 
To: linux-kernel@vger.rutgers.edu
Subject: [Fwd: Re: Linus on Linux, Apache and Threads]


Read the FAQ, I tell myself:

http://sunsite.doc.ic.ac.uk/Mirrors/sunsite.unc.edu/pub/Linux/docs/faqs/Threads-FAQ/html/
http://pauillac.inria.fr/~xleroy/linuxthreads/
http://www.serpentine.com/~bos/threads-faq/

-Richard

-- 
    Richard Dynes
    rdynes@varcom.com

Date: Tue, 27 Apr 1999 16:16:36 +0100 (BST)
From: Tony Gale 
To: Richard Dynes 
Cc: linux-kernel@vger.rutgers.edu
Subject: RE: [Fwd: Re: Linus on Linux, Apache and Threads]


Except these don't actually answer the question. There are a couple
of issues here:

        o Signals
        o Wake-one-thread
        o PIDs

Signals: LinuxThreads has traditionally used SIGUSR1 and SIGUSR2 to
do its internal work. Looking at the source code, it looks like
glibc 2.1 with recent kernels will use the RT signals instead. This
is good.

Wake-one-thread: The function pthread_cond_signal is supposed to wake
a single thread that is waiting on the condition variable. However, a
number of people in this forum have hinted that this is not so, and
that all threads are awakened. Hence, lots of talk of thundering
herds. Is this all just a mistake? If so, this is bad.

PIDs: LinuxThreads assigns a different process id to each thread,
even though they have lightweight context switching - this leads to
much confusion with people who don't know about CLONE. [Haven't
checked the glibc 2.1 position on this one.]

I'm beginning to think that there is a misconception with linux and
the wake-one-thread issue, and that it does, indeed, wake a single
thread. I see no reason, given the signalling mechanism and the PID
stuff that it should do anything different.

-tony


On 27-Apr-99 Richard Dynes wrote:
> 
> Read the FAQ, I tell myself:
> 
> http://sunsite.doc.ic.ac.uk/Mirrors/sunsite.unc.edu/pub/Linux/docs/faqs/Threads-FAQ/html/
> http://pauillac.inria.fr/~xleroy/linuxthreads/
> http://www.serpentine.com/~bos/threads-faq/
> 
> -Richard

---
E-Mail: Tony Gale 
Moebius always does it on the same side.

The views expressed above are entirely those of the writer
and do not represent the views, policy or understanding of
any other person or official body.

Date: 27 Apr 1999 11:45:03 -0700
From: Matt Ranney 
To: linux-kernel@vger.rutgers.edu
Subject: Re: Linus on Linux, Apache and Threads

Ingo Molnar  writes:

> > Are you saying that the kernel will in fact wake up *all* the threads
> > (either just internally, or in actuality)?
> 
> i think in glibc 2.1 it uses RT queued signals, thus it's a true 1:1
> wakeup. Not sure though.

After trying both, I can state with certainty that using
pthread_cond_broadcast() vs. pthread_cond_signal() results in VERY
different levels of context switching and associated performance
degradation.  Of course, this still doesn't prove that there isn't
extra overhead with the waking of just one thread.

As I said to someone in private email on the subject, the overhead of
a thread context switch might indeed be less than that of a process,
but it's still quite significant in some applications.
-- 
Matt Ranney - mjr@ranney.com

Date: Tue, 27 Apr 1999 16:30:24 -0600
From: Larry McVoy 
To: Mike Touloumtzis 
Cc: linux-kernel@vger.rutgers.edu
Subject: Re: Linus on Linux, Apache and Threads 

: Some OSes have true lightweight thread switching (it takes place entirely
: in userspace).  It's very fast...

Last I checked, Linux process context switches were faster than Solaris
thread context switches.

The speed issue is a red herring anyway.  People get seduced by the
oh-so-fast user level threads (which definitely can context switch
faster than processes) and then forget to realize that each of those
oh-so-lightweight threads has a stack.  If each stack is a page (and 
maybe an allocated page followed by an unallocated page so it can page
fault and get auto-filled by the kernel), and you have 1000 threads,
that's 4MB in stacks.  Not so lightweight.

My favorite thread quote:

        Threads are like salt.  I like salt, you like salt, we all
        like salt, but we eat more pasta than salt.  

: Many Un*x OSes have two-tiered thread
: implementations (both kernel thread switching and userspace thread
: switching).  This works pretty well but is complex.

Really complex.

Date: Wed, 28 Apr 1999 04:22:34 +0200
From: Bernd Eckenfels 
To: linux-kernel@vger.rutgers.edu
Subject: Re: Linus on Linux, Apache and Threads

In article <3725B515.5CFD6E9D@varcom.com> you wrote:
> I'm interested in this too.  Tony's point was my immediate thought:
> pthread_cond_signal is an obvious implementation of a solution to the
> 'thundering herd' problem. But it has apparently been discounted.

The reason for this is that you do accept() (or select()) in multiple
threads rather than a sleep that can be interrupted by cond_signal, since
the "signal" is issued by the kernel.

Greetings
Bernd

Date: Fri, 30 Apr 1999 00:53:16 +0100 (BST)
From: Stephen C. Tweedie 
To: Greg Lindahl 
Cc: Alan Cox , linux-kernel@vger.rutgers.edu,
     Stephen Tweedie 
Subject: Re: 2.2.5 optimizations for web benchmarks?

Hi,

On Mon, 26 Apr 1999 19:08:36 -0400 (EDT), Greg Lindahl
 said:

>> If you are sharing the virtual memory space it means you don't take
>> a TLB flush

> If I recall correctly, the Sybase folks described this as a major win
> across many OSes. On the other hand, Apache in particular may not
> access enough memory to make a huge difference. 

Remember, this is TLB flushes, not cache flushes, we're talking about.
*Every* memory access after a TLB flush needs to reload the TLB, even if
it is just a read from already-cached memory.  Even just spinning in
kernel space may require TLB refills after a flush (although newer
Pentia do let you mark certain page tables as global, so a mm replace
won't evict those TLBs).

--Stephen

Date: Thu, 29 Apr 1999 19:58:31 -0400 (EDT)
From: Greg Lindahl 
To: Stephen C. Tweedie 
Cc: linux-kernel@vger.rutgers.edu
Subject: Re: 2.2.5 optimizations for web benchmarks?

> > If I recall correctly, the Sybase folks described this as a major win
> > across many OSes. On the other hand, Apache in particular may not
> > access enough memory to make a huge difference. 
> 
> Remember, this is TLB flushes, not cache flushes, we're talking about.

Yes. If Apache isn't accessing that much memory, it doesn't take that
many TLB reloads no matter how often it is flushed. It's only when
you're repeatedly accessing many TLB entries that it's critical to not
flush.

i.e. high pressure on TLB, more damage from a flush.

-- g

Date: Fri, 30 Apr 1999 01:04:22 +0100 (BST)
From: Stephen C. Tweedie 
To: Greg Lindahl 
Cc: Stephen C. Tweedie , linux-kernel@vger.rutgers.edu
Subject: Re: 2.2.5 optimizations for web benchmarks?

Hi,

On Thu, 29 Apr 1999 19:58:31 -0400 (EDT), Greg Lindahl
 said:

>> > If I recall correctly, the Sybase folks described this as a major win
>> > across many OSes. On the other hand, Apache in particular may not
>> > access enough memory to make a huge difference. 
>> 
>> Remember, this is TLB flushes, not cache flushes, we're talking about.

> Yes. If Apache isn't accessing that much memory, it doesn't take that
> many TLB reloads no matter how often it is flushed. It's only when
> you're repeatedly accessing many TLB entries that it's critical to not
> flush.

Yes, but the amount of memory you touch is not simply related to the
number of TLBs.  You're probably going to take an exit path out of the
kernel which returns through 3 or 4 levels of function calls all over
the kernel.  That's not a lot of memory, and it may well already be in
cache, but there's a TLB hit for each such access.  Then there's the
task struct and the stack --- another two.  You'll have one for the
apache stack, a bunch for the call unwind inside apache and in libc, and
more for every single item of data referenced.

In other words, the problem with TLB refill costs is that even a small
amount of code/data reference is going to touch many TLBs if there is
poor locality of reference, and that's exactly what you expect on a
function return from deep in the kernel.  The TLB cost is
disproportionately high considering the amount of memory actually
referenced. 

--Stephen

Date: Thu, 29 Apr 1999 20:22:04 -0400 (EDT)
From: Greg Lindahl 
To: Stephen C. Tweedie 
Cc: linux-kernel@vger.rutgers.edu
Subject: Re: 2.2.5 optimizations for web benchmarks?

> Yes, but the amount of memory you touch is not simply related to the
> number of TLBs.  You're probably going to take an exit path out of the
> kernel which returns through 3 or 4 levels of function calls all over
> the kernel.

User programs do occasionally access memory, above and beyond what the
kernel uses...

> In other words, the problem with TLB refill costs is that even a small
> amount of code/data reference is going to touch many TLBs

Fine. Are you then claiming that it's impossible for a program to
touch a lot more than the minimum number of TLB entries, or are you
claiming that you think this is a big effect even for apache due to
the mandatory hits? The first is clearly not true (I write programs
which are extremely TLB-intensive); the second I have no idea about.

-- g

Date: Fri, 30 Apr 1999 00:59:48 +0100 (BST)
From: Stephen C. Tweedie 
To: Steve Dodd , Ingo Molnar 
Cc: Alan Cox , Matt Ranney ,
     linux-kernel@vger.rutgers.edu, Stephen Tweedie 
Subject: Re: 2.2.5 optimizations for web benchmarks?

Hi,

On Tue, 27 Apr 1999 13:26:48 +0100, Steve Dodd  said:

> Does the scheduler prefer processes which share vm space with the
> current task?  

Not yet.

> As I see it, threads are just special processes, so the scheduler may
> just switch to a completely different process anyway and incur the TLB
> flush.

Correct.  Actually, Ingo, this is not a bad idea --- are your new
scheduler patches doing this?  Giving a small goodness boost to related
threads when one thread blocks is not much different in principle to the
CPU-binding boost we already have.  In both cases, we really only end up
changing the order in which we consider runnable tasks, we don't
actually credit the chosen tasks with any extra "counter" cycles.  The
biggest problem might be that in the case where we have one
highly-threaded task, we'd need to be careful not to starve out other
processes.  

--Stephen

Date: Fri, 30 Apr 1999 08:37:37 +0200 (CEST)
From: Ingo Molnar 
To: Stephen C. Tweedie 
Cc: Steve Dodd , Alan Cox ,
     Matt Ranney , linux-kernel@vger.rutgers.edu
Subject: Re: 2.2.5 optimizations for web benchmarks?


On Fri, 30 Apr 1999, Stephen C. Tweedie wrote:

>                        [...] Giving a small goodness boost to related
> threads when one thread blocks is not much different in principle to the
> CPU-binding boost we already have.

goodness() has been doing this for a long time:

        /* .. and a slight advantage to the current MM */
        if (p->mm == prev->mm)
                weight += 1;

-- mingo

Date: Fri, 30 Apr 1999 08:50:36 +0200 (CEST)
From: Ingo Molnar 
To: Stephen C. Tweedie 
Cc: Greg Lindahl , Alan Cox ,
     linux-kernel@vger.rutgers.edu
Subject: Re: 2.2.5 optimizations for web benchmarks?


On Fri, 30 Apr 1999, Stephen C. Tweedie wrote:

> > across many OSes. On the other hand, Apache in particular may not
> > access enough memory to make a huge difference. 
> 
> Remember, this is TLB flushes, not cache flushes, we're talking about.
> *Every* memory access after a TLB flush needs to reload the TLB, even if
> it is just a read from already-cached memory.  Even just spinning in
> kernel space may require TLB refills after a flush (although newer
> Pentia do let you mark certain page tables as global, so a mm replace
> won't evict those TLBs).

yes and this is a pretty important effect. Also, a TLB miss on a Xeon with
all the page table info cached takes 3 cycles (and does not cause a
pipeline stall, so the real cost can be lower than this). So the cost of
a TLB flush is not as high as it seems at first sight.

-- mingo

Date: Tue, 04 May 1999 05:20:47 +0000
From: Dan Kegel 
Subject: Re: Apache performance under high loads

FYI.  Two interesting responses today.  I'll quote them
on http://www.kegel.com/mindcraft_redux.html as soon
as the authors give permission.

Still waiting for that writeup from the person at Compaq
alluded to by Jim Gettys.
- Dan

#1------------------------------------------------------------------
Michael Taht wrote:
> 
> Dear Karthik:
> 
> An old linux-kernel thread that I was on about two years ago where I
> had isolated a similar weird web performance wall under linux. It
> turned out that linux did not handle more than 1800 connections (in
> various states) well due to several factors.
> 
> 1) Kernel timers were stored in a singly linked list that had to be
> searched for openings. TCP uses a minimum of (I think) 16 kernel
> timer calls per connection.
> 
> On the hardware I had available at the time (ppro 200), at about
> 1800 outstanding connections, the box would slow to a crawl. It was
> neat in that I could only duplicate the bug in the world of 28.8
> connections or with apache - zeus performed well in testing on a
> 100Mbit lan, but performed about the same as apache in the real
> world. This led to 2) finding and closing large numbers of
> outstanding tcp connections (as you get in a benchmark or in real
> world slow tcp connections) in linux was very inefficient.
> 
> The symptoms were very similar to the ones you are describing in bug
> number 4268.
> Plenty of cpu apparently available, but web performance flatlined.
> Take a look at netstat and see how many connections are outstanding.
> Take a look at sched.c to see how timers are being allocated.
> 
> Several solutions were proposed back then. For timers, mine (based
> on the old SCO-like hash method) was fast, and deterministic, and
> the code released just a few hours later by someone whose name I
> can't recall, was MUCH faster, if less deterministic. I assumed that
> this superior code had made it into the kernel by now. And I had
> assumed that the work that miller was doing was fixing the network
> stuff for 2.2.
> 
> I haven't looked at the current scheduler code. At one point I was
> working on a more advanced scheduler (better gang scheduling, more
> flexible prioritisation, processor affinity) but my company decided
> just to throw more boxes at the problem.
> 
> I will revisit my trees later today and see what's up on this side.

#2------------------------------------------------------------------
Ariel Faigon wrote:
> 
> Hi Dan,
> 
> Thanks for writing http://www.kegel.com/mindcraft_redux.html
> 
> A couple of items you may find interesting.
> 
> 1) For a long time the web performance team at SGI has noted
>    that among the three web servers we have been benchmarking:
>    Apache, Netscape (both enterprise and fasttrack), and
>    Zeus,  Apache is (by far) the slowest.  In fact an SGI
>    employee (Mike Abbott) has done some optimizations which
>    made Apache run 3 (!) times faster on SPECWeb on IRIX.
>    It is our intention to make these patches public soon.
> 
> 2) When we tried to test our Apache patches on IRIX the expected
>    3x speedup was easy to achieve.  However when we ported our
>    changes to Linux (2.2.5), we were surprised to find that
>    we don't even get into the scalability game.  A 200ms delay
>    in connection establishment in the TCP/IP stack in Linux 2.x
>    was preventing Apache from responding to anything more than 5
>    connections per second.  We have been in touch with David Miller
>    on this and sent him a patch by Feng Zhou which eliminates
>    this bottleneck.  This patch I believe has made it into the
>    2.2.6 kernel.   So now we are back into optimizing Apache.

Date: Wed, 5 May 1999 14:54:40 -0400 (EDT)
From: Phillip Ezolt 
To: linux-kernel@vger.rutgers.edu
Cc: jg@pa.dec.com, greg.tarsa@digital.com
Subject: Overscheduling DOES happen with high web server load.

Hi all,

In doing some performance work with SPECWeb96 on Alpha/Linux with apache,
it looks like "schedule" is the main bottleneck. 

(Kernel v2.2.5, Apache 1.3.4, egcs-1.1.1, iprobe-4.1)

When running a SPECWeb96 strobe run on Alpha/linux, I found that when the
CPU is pegged, 18% of the time is spent in the scheduler.

Using Iprobe, I got the following function breakdown: (only functions >1%
are shown)

Begin            End                                    Sample Image Total
Address          Address          Name                   Count   Pct   Pct
-------          -------          ----                   -----   ---   ---
0000000000000000-00000000000029FC /usr/bin/httpd        127463        18.5 
00000001200419A0-000000012004339F   ap_vformatter        15061  11.8   2.2 
FFFFFC0000300000-00000000FFFFFFFF vmlinux               482385        70.1 
FFFFFC00003103E0-FFFFFC000031045F   entInt                7848   1.6   1.1
FFFFFC0000315E40-FFFFFC0000315F7F   do_entInt            48487  10.1   7.0
FFFFFC0000327A40-FFFFFC0000327D7F   schedule            124815  25.9  18.1
FFFFFC000033FAA0-FFFFFC000033FCDF   kfree                 7876   1.6   1.1
FFFFFC00003A9960-FFFFFC00003A9EBF   ip_queue_xmit         8616   1.8   1.3
FFFFFC00003B9440-FFFFFC00003B983F   tcp_v4_rcv           11131   2.3   1.6
FFFFFC0000441CA0-FFFFFC000044207F   do_csum_partial      43112   8.9   6.3 
                                    _copy_from_user 

I can't pin it down to the exact source line, but the cycles are spent in
close proximity of one another. 

FFFFFC0000327A40 schedule vmlinux
FFFFFC0000327C1C   01DC      2160 (  1.7) *
FFFFFC0000327C34   01F4     28515 ( 22.8) **********************
FFFFFC0000327C60   0220      1547 (  1.2) *
FFFFFC0000327C64   0224     26432 ( 21.2) *********************
FFFFFC0000327C74   0234     36470 ( 29.2) *****************************
FFFFFC0000327C9C   025C     24858 ( 19.9) *******************       

(For those interested, I have the disassembled code. )

Apache has a fairly even cycle distribution, but in the kernel, 'schedule' 
really sticks out as the CPU burner. 

I think that the linear search for next runnable process is where time is
being spent. 

As an independent test, I ran vmstat while SPECWeb was running.

The leftmost column is the number of processes waiting to run.  These numbers
are above the 3 or 4 that are normally quoted. 


 procs                  memory    swap        io    system         cpu
 r b w  swpd  free  buff cache  si  so   bi   bo   in   cs  us  sy  id
 0 21 0   208  5968  5240 165712   0   0 4001  303 10263 6519  31  66   4
26 27 1   208  6056  5240 165848   0   0 2984   96 5623 3440  29  60  11
 0 15 0   208  5096  5288 166384   0   0 4543  260 10850 7346  32  66   3
 0 17 0   208  6928  5248 164936   0   0 5741  309 13129 8052  32  65   3
37 19 1   208  5664  5248 166144   0   0 2502  142 6837 3896  33  63   5
 0 14 0   208  5984  5240 165656   0   0 3894  376 12432 7276  32  65   3
 0 19 1   208  4872  5272 166248   0   0 2247  124 7641 4514  32  64   4
 0 17 0   208  5248  5264 166336   0   0 4229  288 8786 5144  31  67   2
56 16 1   208  6512  5248 165592   0   0 2159  205 8098 4641  32  62   6
94 18 1   208  5920  5248 165896   0   0 1745  191 5288 2952  32  60   7
71 14 1   208  5920  5256 165872   0   0 2063  160 6493 3729  30  62   8
 0 25 1   208  5032  5256 166544   0   0 3008  112 5668 3612  31  60   9
62 22 1   208  5496  5256 165560   0   0 2512  109 5661 3392  28  62  11
43 22 1   208  4536  5272 166112   0   0 3003  202 7198 4813  30  63   7
 0 26 1   208  4800  5288 166256   0   0 2407   93 5666 3563  29  60  11
32 17 1   208  5984  5296 165632   0   0 2046  329 7296 4305  31  62   6
23 7 1   208  6744  5248 164904   0   0 1739  284 9496 5923  33  65   2
14 18 1   208  5128  5272 166416   0   0 3755  322 9663 6203  32  65   3
 0 22 1   208  4256  5304 167288   0   0 2593  156 5678 3219  31  60   9
44 20 1   208  3688  5264 167184   0   0 3010  149 7277 4398  31  62   7
29 24 1   208  5232  5264 166248   0   0 1954  104 5687 3496  31  61   9
26 23 1   208  5688  5256 165568   0   0 3029  169 7124 4473  30  60  10
 0 18 1   208  5576  5256 165656   0   0 3395  270 8464 5702  30  63   7      

It looks like the run queue is much longer than expected. 

I imagine this problem is compounded by the number of times "schedule" is
called. 

On a webserver that does not have all of the web pages in memory, an httpd
process's life is the following:

1. Wake up for a request from the network.
2. Figure out what web page to load.
3. Ask the disk for it.
4. Sleep (Schedule()) until the page is ready.

This means that schedule will be called a lot. In addition, a process will wake
and sleep in a time much shorter than its allotted time slice. 

Each time we schedule, we have to walk through the entire run queue. This
causes fewer requests to be serviced, which leaves more processes stuck on
the run queue, which makes the walk down the run queue even longer...

Bottom line, under a heavy web load, the linux kernel seems to spend an
unnecessary amount of time scheduling processes.

Is it necessary to calculate the goodness of every process at every schedule?
Can't we make the goodnesses static?  Monkeying with the scheduler is big 
business, and I realize that this will not be a v2.2 issue, but what about 
v2.3? 

--Phil

Digital/Compaq:                     HPSD/Benchmark Performance Engineering
Phillip.Ezolt@compaq.com                            ezolt@perf.zko.dec.com

ps.  For those interested in more detail there will be a
WIP paper describing this work presented at Linux Expo. 
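
For context, the scheduler behaviour being profiled here is essentially a
linear scan with a per-task goodness computation, as in the sketch below.
The types and the goodness formula are stand-ins, not the 2.2 source; only
the shape of the loop matters.

    /* Simplified shape of the selection loop: every call walks the whole
     * run queue and recomputes a goodness value for each runnable task. */
    struct runnable {
        int counter;                     /* remaining time slice */
        int bonus;                       /* stand-in for priority adjustments */
        struct runnable *next_run;
    };

    static int goodness(const struct runnable *p)
    {
        return p->counter + p->bonus;    /* illustrative only */
    }

    struct runnable *select_next(struct runnable *runqueue_head)
    {
        struct runnable *p, *next = NULL;
        int weight, best = -1000;

        for (p = runqueue_head; p != NULL; p = p->next_run) {
            weight = goodness(p);        /* O(1) per task ... */
            if (weight > best) {
                best = weight;
                next = p;                /* ... but O(run queue length) per call */
            }
        }
        return next;
    }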

Date: Thu, 6 May 1999 07:38:11 +0200
From: Andi Kleen 
To: rgooch@atnf.csiro.au
Cc: linux-kernel@vger.rutgers.edu, ezolt@perf.zko.dec.com
Subject: Re: Overscheduling DOES happen with high web server load.

>Why don't we see the time taken by the goodness() function?

Because goodness() is inline. 


>> Apache has a fairly even cycle distribution, but in the kernel, 'schedule' 
>> really sticks out as the CPU burner. 
>> 
>> I think that the linear search for next runnable process is where time is
>> being spent. 

>Could well be, especially if the context switches are happening
>between threads rather than separate processes. Thread switches are
>*really* fast under Linux.

??? current apache doesn't use threads, it uses processes.

As a wild guess of the cause: AFAIK apache uses multiple processes
in an accept() on a single socket (please correct me if I'm wrong, it has
been a long time since I last looked at the apache source). accept() always
wakes up all processes waiting on the socket when an event occurs. Some
people speculated that this could cause the thundering herd problem; it
is possible that these benchmarks gave the first proof of this problem.
The long run queues are a strong clue in this direction.

A fix would be to do a special version of wake_up() that only wakes up
the first waiter and use that in the TCP socket data-ready callbacks.


-Andi

Date: Thu, 6 May 1999 04:49:20 -0700
From: davem@redhat.com
To: rgooch@atnf.csiro.au
Cc: ezolt@perf.zko.dec.com, linux-kernel@vger.rutgers.edu, jg@pa.dec.com,
     greg.tarsa@digital.com
Subject: Re: Overscheduling DOES happen with high web server load.

   Date:        Thu, 6 May 1999 10:42:18 +1000
   From: Richard Gooch 

   Indeed. As a separate question, we may wonder why so many
   processes/threads are being used, and whether that number
   could/should be reduced. Perhaps the server is doing something
   silly. But that's an aside. Instead, I'd like to explore ways of
   reducing the (already low) scheduler overhead.

No, it is more beneficial to find out why so many tasks are waking up
so much in the first place, _THIS_ is the bug.

This smells of galloping herd phenomenon to me.  And my nose says it's
accept() in this case, and my nose further states that implementing
wake-one semantics in accept() might even make this part of the
profiling go away and increase our TCP server performance
dramatically.  Several people have suggested this to me in private
discussions, and I was skeptical at first, but now I'm starting to
believe it.

This has nothing to do with how many threads are being used, it has to
do with threads getting woken up only to find they have no work to do,
which is the only way the run queues can grow large enough for these
sorts of user processes.

Later,
David S. Miller
davem@redhat.com

Date: Thu, 06 May 1999 08:23:21 +0000
From: Dan Kegel 
To: linux-kernel@vger.rutgers.edu
Cc: Bruce Weiner 
Subject: 2.2.7 fixes Apache problem? (Was: Re: 2.2.5 optimizations for web 
    benchmarks?)

Cacophonix Gaul  wrote:
> I'd like some help with optimizing linux (and apache) 
> for ... specweb96... I expect to 
> see 10K-20K simultaneous active connections.

A few days later, he submitted an Apache bug report,
http://bugs.apache.org/index/full/4268, saying:

> I'm running some specweb tests on apache, and with the specific config file
> from the distribution, performance drastically drops after reaching
> high loads - and remains poor even after the specweb tests stop.

This happened on kernel 2.2.5 with Apache 1.3.4 and 1.3.6.

Fast forward two weeks to the present.  He now reports:
>  The mystery continues. I got round to trying out 1.3.6 again this evening,
>  this time on 2.2.7. I did _not_ see the performance drop off. Just to verify,
>  I rechecked on the stock 2.2.5 kernel, and the drop off is there.
>  
>  So _something_ has been fixed between 2.2.5 and 2.2.7 that has made this problem
>  go away. I'll keep plugging away as I get spare time to see if I can get the
>  problem to occur. 

Makes one want to repeat Mindcraft's Apache benchmark with 2.2.7,
doesn't it?
- Dan

http://www.kegel.com/mindcraft_redux.html

Date: Thu, 6 May 1999 09:31:58 -0700 (PDT)
From: Dean Gaudet 
To: davem@redhat.com
Cc: rgooch@atnf.csiro.au, ezolt@perf.zko.dec.com, linux-kernel@vger.rutgers.edu,     jg@pa.dec.com, greg.tarsa@digital.com
Subject: Re: Overscheduling DOES happen with high web server load.

As shipped, apache-1.3.6 on linux uses fcntl() file locking to prevent
more than one process from being inside accept().  I'm not sure if the dec
folks have rebuilt the server with -DSINGLE_LISTEN_UNSERIALIZED_ACCEPT...
if they have, then there's no protection around accept() in servers
listening on a single port. 

At any rate, for apache 1.3.x we require some form of locking (fcntl() in
linux' case) when there are multiple listening sockets... so you also need
to solve the thundering herd problem for fcntl() if it has one.

Last time I brought up wake-one accept(), Alan said it is a hard problem.
Maybe wake-one fcntl() is easier. 

Dean
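
The fcntl() serialization Dean refers to amounts to wrapping accept() in
an advisory lock, roughly as in the sketch below. This is a simplified
rendering of the idea, not the Apache source; the lock-file setup and
error handling are omitted.

    /* Sketch: serialize accept() across pre-forked children with an
     * advisory fcntl() lock, so only one child at a time sleeps in accept(). */
    #include <errno.h>
    #include <fcntl.h>
    #include <sys/socket.h>
    #include <unistd.h>

    static int lock_fd;                  /* opened before fork(), inherited */

    static void accept_lock(void)
    {
        struct flock l = { 0 };
        l.l_type = F_WRLCK;
        l.l_whence = SEEK_SET;
        while (fcntl(lock_fd, F_SETLKW, &l) < 0 && errno == EINTR)
            ;                            /* restart if interrupted */
    }

    static void accept_unlock(void)
    {
        struct flock l = { 0 };
        l.l_type = F_UNLCK;
        l.l_whence = SEEK_SET;
        fcntl(lock_fd, F_SETLK, &l);
    }

    void child_main_loop(int listen_fd)
    {
        for (;;) {
            accept_lock();               /* only the lock holder blocks in accept() */
            int conn = accept(listen_fd, NULL, NULL);
            accept_unlock();
            if (conn >= 0) {
                /* ... handle the request ... */
                close(conn);
            }
        }
    }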

Date: Thu, 6 May 1999 14:01:16 -0400 (EDT)
From: Phillip Ezolt 
To: Dean Gaudet 
Cc: davem@redhat.com, rgooch@atnf.csiro.au, linux-kernel@vger.rutgers.edu,
     jg@pa.dec.com, greg.tarsa@digital.com
Subject: Re: Overscheduling DOES happen with high web server load.

On Thu, 6 May 1999, Dean Gaudet wrote:

> 
> 
> On Thu, 6 May 1999, Phillip Ezolt wrote:
> 
> > Dean,
> > > As shipped, apache-1.3.6 on linux uses fcntl() file locking to prevent
> > > more than one process from being inside accept().  I'm not sure if the dec
> > > folks have rebuilt the server with -DSINGLE_LISTEN_UNSERIALIZED_ACCEPT...
> > > if they have, then there's no protection around accept() in servers
> > > listening on a single port.
> > 
> > Apache was built with the following flags:
> > (-DSINGLE_LISTEN_UNSERIALIZED_ACCEPT is not among them.)
> > 
> > ./configure --prefix=/usr --libexecdir=/usr/lib/apache
> > --sysconfdir=/etc/httpd/conf --datadir=/home/httpd
> > --includedir=/usr/include/apache --logfiledir=/var/log/httpd
> > --localstatedir=/var --runtimedir=/var/run --proxycachedir=/var/cache/httpd
> > --enable-module=all --enable-shared=max --disable-rule=WANTHSREGEX
> > 
> > Would it be better to build with -DSINGLE_LISTEN_UNSERIALIZED_ACCEPT?  Is
> > this something that would help or hurt the "thundering herd" problem? 
> 
> The mindcraft folks did build with -DSINGLE_LISTEN_UNSERIALIZED_ACCEPT
> (you stick it into the CFLAGS environment variable before invoking
> ./configure).  My guess is that you'll see the same results. 

Is that "same results" WORSE or BETTER at thundering herd than without it? 

> 
> I suspect that with accept() or fcntl() we'll need an option to enable the
> wake_one() behaviour -- otherwise it's a pain to deal with process
> priorities and such.  Essentially a flag which says "all processes waiting
> on this are of equal priority, no matter what their cpu time, when their
> last time slice was, blah blah blah".  (And then the kernel should wake
> the one which went to sleep most recently ;)
> 
> Dean
> 
> 
> 
> 

While fixing apache to play nice with linux may be a good solution to the
SPECWeb problem, I think that this test, in general, reveals a critical flaw
in the linux scheduler.

I have seen the same kind of problem here with other programs that have a 
large number of running processes.  Linux slows to a crawl, and it is NOT
a result of swapping.  This is really a scalability issue.  If linux is
going to move beyond its current segment of the market, (low-cost servers),
these things have to be fixed. 

This problem will not go away with small tweaks, such as Richard Gooch's 
separate real-time queue.

They will buy some time, but it won't get rid of the real issue.  This linear
search is bad.  

Right now we are recalculating goodnesses again and again and again. 
Really, only a few values will change.  This is extra work that is a waste
of CPU time.  In an ideal world, we can find the next runnable process in
O(1), not O(runqueue len) time.

I've worked a little here with Greg Gaertner, who designed a scheduler for a
Cray version of Unix, and we have come up with an alternate solution.

However, it would involve a major rethinking of how linux scheduling works.
(Priority is NOT stored as tick values, processes age as time goes on, realtime
processes don't age and have a high priority).   The current version limps, but
it doesn't fully work, or deal with SMP correctly.  It is basically an array of
linked lists that keep track of the current highest priority.   It can tell 
immediately what the next process to run is.  I'll present the proposed 
solution at LE, and hopefully, we'll be able to talk about it more then.

--Phil

Digital/Compaq:                     HPSD/Benchmark Performance Engineering
Phillip.Ezolt@compaq.com                            ezolt@perf.zko.dec.com

Date: Thu, 6 May 1999 17:58:55 -0700 (PDT)
From: Linus Torvalds 
To: davem@redhat.com
Cc: alan@lxorguk.ukuu.org.uk, ezolt@perf.zko.dec.com,
     dgaudet-list-linux-kernel@arctic.org, rgooch@atnf.csiro.au,
     linux-kernel@vger.rutgers.edu, jg@pa.dec.com, greg.tarsa@digital.com
Subject: Re: Overscheduling DOES happen with high web server load.



On Fri, 7 May 1999 davem@redhat.com wrote:
> 
> [ Linus, please correct me if I'm wrong below, this is about the
>   wake-one scheme you described to us earlier today. ]

Okey-dokey..

>    > [ for everyone else's benefit Linus's suggestion is for the task to
>    >   indicate, when placing himself on the run queue, that he is
>    >   "wake one" capable, then the wake up routines stop doing work
>    >   when they hit the first task which is marked this way and is not
>    >   running already ]
> 
>    Unfortunately you also need to consider a common case where another task
>    POSIXly requires waking. The classic is select(). Such tasks should always
>    be woken.
> 
> I think Linus meant another thing, and I worded it incorrectly above, sorry.

Yes. My way of handling this will _always_ wake up everything that is not
marked exclusive, and that obviously very much includes select().

> I believe he intended that the wakeup scheme be:
> 
> 1) Wakes up everyone not marked as "wake one" capable, this deals with
>    the select issue and is the crux behind why he suggests this scheme.

> 2) Amongst (only) the "wake one" capable tasks, the first one which is
>    not already running is woken, and then no further "wake one" capable
>    tasks are poked.

Well, it's even easier than the above. What you do is:

 - the wakeup-queue is a linked list (surprise, surprise, that's
   how it works already)
 - you add the "exclusive" entries to the end of the list (or at least
   after any other non-exclusive ones: depending on how the list is
   organized one or the other may be the more efficient way to handle it).
   They are also marked some way - preferably just a bit in the task
   state. 
 - when you do a wakeup, you do exactly what you do now: walk the list,
   waking up each process. The ONLY difference is that if you find an
   exclusive process (easy to test for: you already look at the task state
   anyway) that you woke up, you stop early and go home. You mark the one
   you woke non-exclusive _or_ you make the exclusivity test also verify
   that the thing is not running, so two consecutive wakeup calls will
   always wake up two exclusive processes.

So:
 - nonexclusive waiters are always woken up. They are at the head of the
   list.
 - _one_ exclusive process is always woken up. 

> This means you do not indicate "wake one" capability for the listen
> socket polling case, for example.

Indeed. That would be extremely stupid.

The more complex case is for things like "read()", where you _can_ make
use of the exclusive code if you want to (for things that have queue-like
behaviour and thus have a notion of an exclusive head-of-queue: pipes,
sockets, and some character devices), but you have to make sure that if
you don't empty the queue you do another "wakeup()" on the wait queue,
otherwise there might be another exclusive reader that doesn't wake up
when you stop reading.

For accept(), that's not even an issue, as you can just guarantee that if
you wake up and there's a socket to be accepted, you will accept it (so
you never have to wake up anybody else for the half-baked case). 

Note that the thing that makes this really safe is that you're guaranteed
that the _only_ things that change behaviour are the ones you expressly
asked to change. That was always a problem with "wake_up_one()", where it
had non-local changes to behaviour (a wake_up_one() would change how a
sleep somewhere else behaved). 

                        Linus
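
In outline, the wakeup walk described above looks like the sketch below.
The types are minimal stand-ins, not the real 2.2 wait queue structures;
the point is the early exit after the first exclusive sleeper is woken.

    /* Sketch: wake every non-exclusive waiter, then stop after waking the
     * first exclusive waiter that was actually asleep. */
    struct task {
        int sleeping;                    /* stand-in for the task state */
        int exclusive;                   /* set by the sleeper, per the scheme above */
    };

    struct wait_entry {
        struct task *task;
        struct wait_entry *next;         /* exclusive entries live at the tail */
    };

    static void make_runnable(struct task *t)   /* stand-in for wake_up_process() */
    {
        t->sleeping = 0;
    }

    void wake_up_mixed(struct wait_entry *head)
    {
        struct wait_entry *p;

        for (p = head; p != NULL; p = p->next) {
            struct task *t = p->task;

            if (!t->sleeping)
                continue;                /* already running: skip, keep walking */

            make_runnable(t);

            if (t->exclusive)
                break;                   /* woke one exclusive sleeper: done */
        }
    }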

Date: Fri, 7 May 1999 02:56:20 +0100 (BST)
From: Alan Cox 
To: Linus Torvalds 
Cc: davem@redhat.com, alan@lxorguk.ukuu.org.uk, ezolt@perf.zko.dec.com,
     dgaudet-list-linux-kernel@arctic.org, rgooch@atnf.csiro.au,
     linux-kernel@vger.rutgers.edu, jg@pa.dec.com, greg.tarsa@digital.com
Subject: Re: Overscheduling DOES happen with high web server load.

>  - you add the "exclusive" entries to the end of the list (or at least
>    after any other non-exclusive ones: depending on how the list is
>    organized one or the other may be the more efficient way to handle it).
>    They are also marked some way - preferably just a bit in the task
>    state. 

For most wake_one situations you want to schedule the last thread to go idle
as it will have the most context in cache. NT actually does this sort of
stuff.

>    anyway) that you woke up, you stop early and go home. You mark the one
>    you woke non-exclusive _or_ you make the exclusivity test also verify
>    that the thing is not running, so two consecutive wakeup calls will
>    always wake up two exclusive processes.

Ok

> that the _only_ things that change behaviour are the ones you expressly
> asked to change. That was always a problem with "wake_up_one()", where it
> had non-local changes to behaviour (a wake_up_one() would change how a
> sleep somewhere else behaved). 

That makes a lot of sense.

Alan

Date: Fri, 7 May 1999 01:26:34 +0100 (BST)
From: Alan Cox 
To: davem@redhat.com
Cc: alan@lxorguk.ukuu.org.uk, ezolt@perf.zko.dec.com,
     dgaudet-list-linux-kernel@arctic.org, rgooch@atnf.csiro.au,
     linux-kernel@vger.rutgers.edu, jg@pa.dec.com, greg.tarsa@digital.com
Subject: Re: Overscheduling DOES happen with high web server load.

> Do these play patches implement it the way Linus has suggested to us?
> 
> He makes a lot of sense, because in the cases I have studied he is
> right, only the task going to sleep has the correct knowledge about
> whether wake-one semantics can work or not.

For my playing they dont. I can demonstrate that for a listening socket
the worst case is we take a wake-all for some weird cases. At least
to my satisfaction, and providing I ignore select on listening for testing
cases. I broke select and have an if(port==80) type check for wake one - it's
ugly, OK ...

Linus theory btw doesnt work either.

> [ for everyone else's benefit Linus's suggestion is for the task to
>   indicate, when placing himself on the run queue, that he is
>   "wake one" capable, then the wake up routines stop doing work
>   when they hit the first task which is marked this way and is not
>   running already ]

Unfortunately you also need to consider a common case where another task
POSIXly requires waking. The classic is select(). Such tasks should always
be woken.

You need to be a bit smarter, that's all. wake_one must wake the first
sleeping (that is important - a task on the wait queue already running being
woken alone raises all sorts of funky races) wake_one mode task and anyone
who is 'wake me always'.

Not much more complex. With a very scalable system it would be good to keep
a count of tasks in wake all to avoid an O(n) list walk for the non select
cases.

Alan

Date: Fri, 7 May 1999 01:05:43 +0100 (BST)
From: Alan Cox 
To: Phillip Ezolt 
Cc: dgaudet-list-linux-kernel@arctic.org, davem@redhat.com,
     rgooch@atnf.csiro.au, linux-kernel@vger.rutgers.edu, jg@pa.dec.com,
     greg.tarsa@digital.com
Subject: Re: Overscheduling DOES happen with high web server load.

> While fixing apache to play nice with linux may be a good solution to the
> SPECWeb problem, I think that this test, in general, reveals a critical flaw
> in the linux scheduler.

I think it shows up two things. One is the thundering herd problem on accept.
I've got some play patches for that now. They turn out (for that case) fairly
easy to do and to do roughly right.

Accept is only part of the problem though.

> This problem will not go away with small tweaks, such as Richard Gooch's 
> separate real-time queue.

Agreed

> Really, only a few values will change.  This is extra work that is a waste
> of CPU time.  In an ideal world, we can find the next runnable process in
> O(1), not O(runqueue len) time.

Currently a sleep/wake_one is O(1). That is hard to keep with an O(1) 
scheduler.

p->counter has a fixed range. This means we can bucket sort the tasks for
O(1) insert delete.

If as well as the bucket pointers the list is chained then you can handle
the next process as O(1) for uniprocessor. update_process_times
continues to be O(1) but does slightly more work to move the task between
lists.

Why haven't I done this yet ?

Im stumped on how to handle the p->mm == prev->mm bonus without walking
down the list of that priority and the previous one. Likewise the SMP
bonuses, although I can see one way to do it which doesnt sound nice
for a large array of CPU's - keep each runnable task array once per cpu
with that CPU's priorities.

> immediately what the next process to run is.  I'll present the proposed 
> solution at LE, and hopefully, we'll be able to talk about it more then.

Excellent.  I look forward to this.

Alan
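
The bucket idea outlined above might look like the following sketch, using
the remaining time slice as the bucket index. The names and bounds are
illustrative, and the p->mm affinity bonus mentioned as the sticking point
is deliberately not handled.

    /* Sketch: runnable tasks kept in an array of lists indexed by their
     * remaining time slice, so insert, delete and "pick next" avoid a full
     * walk of the run queue. */
    #define MAX_COUNTER 64               /* the counter has a small fixed range */

    struct rq_task {
        int counter;                     /* remaining time slice, 0..MAX_COUNTER */
        struct rq_task *run_next;        /* chain within one bucket */
    };

    static struct rq_task *bucket[MAX_COUNTER + 1];
    static int highest;                  /* highest index that may be populated */

    void enqueue_task(struct rq_task *p)
    {
        p->run_next = bucket[p->counter];
        bucket[p->counter] = p;
        if (p->counter > highest)
            highest = p->counter;
    }

    struct rq_task *pick_next(void)
    {
        struct rq_task *p;

        while (highest > 0 && bucket[highest] == NULL)
            highest--;                   /* amortized against earlier enqueues */

        p = bucket[highest];
        if (p != NULL)
            bucket[highest] = p->run_next;
        return p;                        /* NULL means the run queue is empty */
    }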

Date: Fri, 7 May 1999 00:44:15 -0700
From: davem@redhat.com
To: alan@lxorguk.ukuu.org.uk
Cc: alan@lxorguk.ukuu.org.uk, ezolt@perf.zko.dec.com,
     dgaudet-list-linux-kernel@arctic.org, rgooch@atnf.csiro.au,
     linux-kernel@vger.rutgers.edu, jg@pa.dec.com, greg.tarsa@digital.com,
     torvalds@transmeta.com
Subject: Re: Overscheduling DOES happen with high web server load.

   From: alan@lxorguk.ukuu.org.uk (Alan Cox)
   Date: Fri, 7 May 1999 01:26:34 +0100 (BST)

   Linus theory btw doesnt work either.

I think they do, and it's an issue of miscommunication :-)

[ Linus, please correct me if I'm wrong below, this is about the
  wake-one scheme you described to us earlier today. ]

   > [ for everyone else's benefit Linus's suggestion is for the task to
   >   indicate, when placing himself on the run queue, that he is
   >   "wake one" capable, then the wake up routines stop doing work
   >   when they hit the first task which is marked this way and is not
   >   running already ]

   Unfortunately you also need to consider a common case where another task
   POSIXly requires waking. The classic is select(). Such tasks should always
   be woken.

I think Linus meant another thing, and I worded it incorrectly above, sorry.

I believe he intended that the wakeup scheme be:

1) Wakes up everyone not marked as "wake one" capable, this deals with
   the select issue and is the crux behind why he suggests this scheme.

2) Amongst (only) the "wake one" capable tasks, the first one which is
   not already running is woken, and then no further "wake one" capable
   tasks are poked.

And furthermore, only in specific places like accept() do you indicate
the "wake one" capability when adding yourself to the wait queue.  And
in such places you make damn sure that you "eat" the event or do
another wakeup if you cannot for some reason (failed allocation of
some structure, etc.)

This means you do not indicate "wake one" capability for the listen
socket polling case, for example.

Linus, did I get it right? :-)

Later,
David S. Miller
davem@redhat.com

Date: Thu, 6 May 1999 17:35:53 -0700 (PDT)
From: Dean Gaudet 
To: Alan Cox 
Cc: davem@redhat.com, rgooch@atnf.csiro.au, ezolt@perf.zko.dec.com,
     linux-kernel@vger.rutgers.edu, jg@pa.dec.com, greg.tarsa@digital.com
Subject: Re: Overscheduling DOES happen with high web server load.

On Fri, 7 May 1999, Alan Cox wrote:

> Why do you use fcntl locks not System5 semaphores - which oddly enough I
> think on Linux do have wake one semantics...

I didn't switch to sysvsems because I was worried about them not working
in older kernels... whereas we've been using fcntl() for a long time. 

I actually did some (crude) timings before apache 1.3.x on linux 2.0.35 --
fcntl() versus flock() versus sysvsem().  I found flock() was the fastest,
by some small amount.  I bit the bullet and switched to using flock(),
because I figured mail clients had it tested well.  But then a few random
bug reports came in which were solved by the user switching back to
fcntl(), so for 1.3.6 I switched linux back to fcntl().  (Interested folks
can visit bugs.apache.org, search for flock.) 

SysV sems on other platforms have some annoying limitations, such as a small
max # of processes that can attach to a semaphore.  Plus sysvsems aren't
cleaned up on process exit... whereas a fcntl()-locked file can be
unlink()ed as long as everyone already has the fd.

In theory you can edit src/include/ap_config.h, search for LINUX, and
change the USE_FCNTL_SERIALIZED_ACCEPT to USE_SYSVSEM_SERIALIZED_ACCEPT
... rebuild, and try again. 

Ah, you might want this patch, the sysvsem code isn't as well tested as
the other code. 

Dean

Index: src/include/ap_config.h
===================================================================
RCS file: /home/cvs/apache-1.3/src/include/ap_config.h,v
retrieving revision 1.257
diff -u -r1.257 ap_config.h
--- ap_config.h 1999/05/04 02:57:13     1.257
+++ ap_config.h 1999/05/07 00:33:38
@@ -498,7 +498,7 @@
  * folks to tweak their Configuration to get flock.
  */
 #ifndef USE_FLOCK_SERIALIZED_ACCEPT
-#define USE_FCNTL_SERIALIZED_ACCEPT
+#define USE_SYSVSEM_SERIALIZED_ACCEPT
 #endif

 #define SYS_SIGLIST    _sys_siglist
Index: src/main/http_main.c
===================================================================
RCS file: /home/cvs/apache-1.3/src/main/http_main.c,v
retrieving revision 1.435
diff -u -r1.435 http_main.c
--- http_main.c 1999/05/05 20:42:58     1.435
+++ http_main.c 1999/05/07 00:33:39
@@ -741,17 +741,21 @@

 static void accept_mutex_on(void)
 {
-    if (semop(sem_id, &op_on, 1) < 0) {
-       perror("accept_mutex_on");
-       clean_child_exit(APEXIT_CHILDFATAL);
+    while (semop(sem_id, &op_on, 1) < 0) {
+       if (errno != EINTR) {
+           perror("accept_mutex_on");
+           clean_child_exit(APEXIT_CHILDFATAL);
+       }
     }
 }

 static void accept_mutex_off(void)
 {
-    if (semop(sem_id, &op_off, 1) < 0) {
-       perror("accept_mutex_off");
-       clean_child_exit(APEXIT_CHILDFATAL);
+    while (semop(sem_id, &op_off, 1) < 0) {
+       if (errno != EINTR) {
+           perror("accept_mutex_off");
+           clean_child_exit(APEXIT_CHILDFATAL);
+       }
     }
 }
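
For reference, the fcntl()-serialized accept that Apache falls back to looks
roughly like the following minimal sketch (the lock-file path and helper
names are assumptions, not Apache's actual code). Each child takes a write
lock around accept(); the lock file can be unlink()ed once every child holds
the fd, as noted above.

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static int lock_fd;

/* Open the shared lock file once, before forking the children. */
static void accept_mutex_init(const char *path)   /* path is an assumption */
{
    lock_fd = open(path, O_CREAT | O_WRONLY, 0600);
    if (lock_fd < 0) {
        perror("open lockfile");
        exit(1);
    }
    unlink(path);   /* fine to unlink: the children inherit the open fd */
}

static void accept_mutex_on(void)
{
    struct flock fl = { .l_type = F_WRLCK, .l_whence = SEEK_SET };
    while (fcntl(lock_fd, F_SETLKW, &fl) < 0) {
        if (errno != EINTR) {
            perror("accept_mutex_on");
            exit(1);
        }
    }
}

static void accept_mutex_off(void)
{
    struct flock fl = { .l_type = F_UNLCK, .l_whence = SEEK_SET };
    if (fcntl(lock_fd, F_SETLKW, &fl) < 0) {
        perror("accept_mutex_off");
        exit(1);
    }
}

/* In each child:
 *     accept_mutex_on();
 *     conn = accept(listenfd, NULL, NULL);
 *     accept_mutex_off();
 */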

Date: Fri, 7 May 1999 10:11:37 -0400 (EDT)
From: Phillip Ezolt 
To: Andrea Arcangeli 
Cc: Richard Gooch , linux-kernel@vger.rutgers.edu,
     jg@pa.dec.com, greg.tarsa@digital.com
Subject: Re: Overscheduling DOES happen with high web server load.

On Fri, 7 May 1999, Andrea Arcangeli wrote:

> On Thu, 6 May 1999, Phillip Ezolt wrote:
> 
> >Although this would probably speed up the code, the underlying problem
> >is still there. (The linear search for the next process)  The patch basically
> 
> I really don't think the linear search is a big issue. You had at _max_ 90
> tasks running at the same time. I think the big issue is to avoid the
> unneeded schedule()s. If you avoid them you drop from 40000 schedule/sec to
> 3000 schedule/sec...


Ok, you are right.  The real problem is we are calculating goodness 
O(A*B).

A= Number of processes on the runqueue
B= Number of times schedule is called

The real answer is to cut out all unnecessary work.  If we can decrease
B significantly, it may be almost irrelevant how long A takes.

If you look closely, as the test ramps up, the number of overschedules 
DOES drop to around 3000 schedule/sec.  I think that the 40000 happens when
the machine is mostly idle.  (Compare the id column with the cs column).

However 3000 is still too much, no? 

>  procs                  memory    swap        io    system         cpu
>  r b w  swpd  free  buff cache  si  so   bi   bo   in   cs  us  sy  id

>  0 0 0     0 226872  1544  9816   0   0    0    0 1099 39056   2   2  96
>  0 0 0     0 226872  1544  9816   0   0    0    0 1082 39054   1   2  96
>  0 0 0     0 226872  1544  9816   0   0    0    0 1079 39118   2   2  96
>  0 0 0     0 226872  1544  9816   0   0    0    1 1099 39116   2   2  96
>  0 30 0     0 224744  1616  9816   0   0   75    0 1519 35529   4   8  89
>  0 29 0     0 223120  1672 10376   0   0  451    0 1369 34011   6   8  86
>  0 30 0     0 221968  1744 10776   0   0  344    0 1370 32861   4   9  87
>  8 32 0     0 219312  1816 11208   0   0  399    0 1401 27527   6  10  84
>  0 37 0     0 216648  1864 11984   0   0  406    0 1516 22204   8  13  79
>  0 57 0     0 210360  1920 12944   0   0  643    0 1603 13209  14  18  68
>  4 85 0     0 198544  1976 14048   0   0  730    0 1774 7218  20  30  49
>  0 96 0     0 187520  2016 15176   0   0  743    0 1783 5522  20  34  47
>  0 93 0     0 175776  2048 16632   0   0 1156   14 1993 3728  22  42  37
>  0 96 0     0 173080  2088 18392   0   0 1388    6 2037 4427  14  33  53
>  0 89 0     0 171296  2128 20056   0   0 1365    3 2068 4655  12  34  54
>  0 92 0     0 169960  2160 21176   0   0  840    3 1971 4445  13  32  55
>  0 94 0     0 168320  2192 22720   0   0 1213    2 2036 4314  14  32  54
>  0 86 0     0 166584  2224 24256   0   0 1310    3 2158 4194  13  37  50
>  1 82 0     0 164504  2248 26144   0   0 1539    3 2250 3879  15  37  48
>  0 88 0     0 162992  2296 27488   0   0 1073    3 2232 3799  16  37  47
>  0 87 0     0 161264  2336 29128   0   0 1284    4 2356 4200  16  35  49
>  0 85 0     0 158936  2368 31136   0   0 1817    2 2230 4457  14  34  52
> 12 71 0     0 157096  2400 32632   0   0 1328    3 2304 3636  16  39  46
>  0 79 0     0 155168  2440 34464   0   0 1599    2 2351 3985  15  38  47
>  0 87 0     0 153432  2480 35840   0   0 1299    3 2291 3705  17  38  45
>  3 70 0     0 150880  2520 38088   0   0 1948    2 2416 4069  16  37  47
>  0 72 0     0 148496  2552 40336   0   0 2013    4 2731 3902  17  39  44
>  0 79 0     0 146976  2600 41720   0   0 1154    2 2626 3539  18  41  41
> 17 73 0     0 144952  2648 43704   0   0 1886    2 2445 3487  18  41  42
>  0 79 0     0 143056  2688 45464   0   0 1595    2 2211 3856  14  39  47
>  0 76 0     0 140192  2728 47920   0   0 2284    2 2880 3059  20  46  35
>  0 79 0     0 138832  2768 49224   0   0 1242    3 2442 3681  16  41  43
>  0 70 0     0 136288  2816 51544   0   0 2171    4 3014 3583  20  41  38
> 15 64 0     0 134432  2872 53176   0   0 1466    3 2875 3007  20  45  35
>  0 67 0     0 132448  2928 54984   0   0 1690    3 3134 2712  22  48  30
>  5 63 0     0 130704  2984 56656   0   0 1519    3 2825 3006  23  41  36
>  0 70 0     0 127936  3040 58952   0   0 2070    2 3159 2584  23  48  29

> 
> And using an heap would impact all cases where the machine is not
> overloaded but it has only 5/6 tasks running all the time.
> 
> BTW, Is your http client freely available?

Hmph.  It is the SPECWeb96 client.  Unfortunately, it is not freely available. 
Check out http://www.spec.org/ for more info.  

It might make sense for Red Hat or someone to purchase a copy for system
performance testing.  It actually might be able to head off some of this
Mindcraft hoopla.

> 
> Andrea Arcangeli
> 
> 


--Phil

Digital/Compaq:                     HPSD/Benchmark Performance Engineering
Phillip.Ezolt@compaq.com                            ezolt@perf.zko.dec.com

Date: Fri, 7 May 1999 11:52:01 -0400 (EDT)
From: Greg Lindahl 
To: Phillip Ezolt 
Cc: linux-kernel@vger.rutgers.edu
Subject: Re: Overscheduling DOES happen with high web server load.

> Ok, you are right.  The real problem is we are calculating goodness 
> O(A*B).
> 
> A= Number of processes on the runqueue
> B= Number of times schedule is called

This is the amount of work we are doing, but I think you're on to the
wrong solution. Phil did a test: he patched schedule() to pick the
first schedulable process instead of the best one. That dropped the
amount of time consumed in schedule() from 20% to 1%. That's the same
as reducing the work from O(A*B) to O(B).

Why is schedule called so frequently? The thundering herd. You get a
new connection, everyone wakes up, only one gets work, and everyone
else goes back to sleep, each causing O(A) work to reschedule.  M hits
per second, N processes, B=M*N. A=N. So the total work is O(M*N^2).

These are separate problems. The thundering herd is fixed by
wake_one. The cost of scheduling is still a problem; if we had
many-cpu SMP linux boxes with high loads, all the cpus would sit
around waiting for the scheduler lock. Of course, there would be other
problems too. Since we can't fix all possible thundering herd
situations, I think we should fix the scheduler too.

Phil, did the SpecWeb score rise with the patch? It should have. Of
course, that hacked scheduler is pretty broken...

-- g

Date: Fri, 7 May 1999 20:59:20 -0700
From: David S. Miller 
To: ak@muc.de
Cc: lindahl@cs.virginia.edu, linux-kernel@vger.rutgers.edu,
     ezolt@perf.zko.dec.com
Subject: Re: Overscheduling DOES happen with high web server load.

   From: Andi Kleen 
   Date:        Fri, 7 May 1999 20:24:39 +0200

   It was shown earlier in the thread that no thundering herd occurred
   in the test, because apache serializes the accept with a lock.

There is no global consensus on this fact.  Apache may be using flock
to serialize, but nobody has stated that flock does not present the
same problem on the kernel side (everyone waking up and fighting for
the flock, one winning and everyone else going back to sleep).

The scheduling rate is absurd, and so are the run queue lengths which
must be leading to this behavior.  I am willing to be proven
otherwise, but my reading of the data is still galloping herd.

Someone could put this to rest by implementing a quick profiling hack
in the kernel: do something similar to what the timer-based profiling
in the kernel does already, but instead record profiling ticks for
calls to the scheduler, attributing each tick to whoever called
the scheduler (essentially you're profiling the WCHAN).

Data produced from this during one of these tests would prove
extremely useful.

Later,
David S. Miller
davem@redhat.com

Date: Fri, 7 May 1999 14:37:01 -0700 (PDT)
From: Dean Gaudet 
To: linux-kernel@vger.rutgers.edu
Subject: Re: Overscheduling DOES happen with high web server load.

On 6 May 1999, Linus Torvalds wrote:

> In article ,
> Dean Gaudet   wrote:
> >
> >Last time I brought up wake-on accept(), Alan said it is a hard problem. 
> >Maybe wake-one fcntl() is easier. 
> 
> No, wake-on-accept is the _much_ easier one, please don't use fcntl
> locking.

For a single listening socket, what you say is feasible.  For multiple
listening sockets, with multiple tasks using accept() we really need
some other wake-one interface outside of accept(). 

Suppose multiple tasks go into select() to find a socket which is ready
to accept.  When a connection arrives on one of the listening sockets,
all the tasks are awakened.  Then they all rush to the accept().
One succeeds.  If the listening socket is blocking, the rest are stuck
in accept(), and the webserver is somewhat screwed (because it now has
a bunch of children blocked in accept() on some arbitrary socket which
may not get another request for days).

If the listening sockets are non-blocking, all the children rush back
up to select()... and all we've accomplished is extending the wake-all
loop from kernel to userland with an extra syscall to boot.

Dean
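
A compressed sketch of the non-blocking case Gaudet describes (the listener
setup is assumed and error handling is trimmed): every child falls out of
select() when a connection arrives, but only one accept() wins and the rest
see EAGAIN and loop straight back.

#include <errno.h>
#include <sys/select.h>
#include <sys/socket.h>

/* Assumed: listenfd[] are already bound, listening, and set O_NONBLOCK. */
static void child_loop(int listenfd[], int nlisten)
{
    for (;;) {
        fd_set rfds;
        int maxfd = -1;

        FD_ZERO(&rfds);
        for (int i = 0; i < nlisten; i++) {
            FD_SET(listenfd[i], &rfds);
            if (listenfd[i] > maxfd)
                maxfd = listenfd[i];
        }

        /* Every sleeping child wakes up here when a connection arrives... */
        if (select(maxfd + 1, &rfds, NULL, NULL, NULL) < 0)
            continue;

        for (int i = 0; i < nlisten; i++) {
            if (!FD_ISSET(listenfd[i], &rfds))
                continue;
            int conn = accept(listenfd[i], NULL, NULL);
            if (conn < 0) {
                /* ...but only one wins; the rest lose the race and get EAGAIN */
                continue;
            }
            /* handle_request(conn); close(conn); -- omitted */
        }
    }
}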

Date: Sat, 8 May 1999 23:51:50 -0700
From: David S. Miller 
To: cacophonix@yahoo.com
Cc: linux-kernel@vger.rutgers.edu, dank@alumni.caltech.edu
Subject: Re: 2.2.7 fixes Apache problem? (Was: Re: 2.2.5 optimizations for web
    benchmarks?)

   Date:        Sat, 8 May 1999 16:48:40 -0700 (PDT)
   From: Cacophonix Gaul 

   I'm not sure if I can say that 2.2.7 _fixes_ the apache problem or
   merely _masks_ it. 

It does fix the problem.  There were erroneous 200ms delays at the
start and end of every TCP connection when talking to broken BSD and
Microsoft stacks; a workaround for these systems was implemented by
me in 2.2.7, which cures it.

Plain and simple.

Later,
David S. Miller
davem@redhat.com

Date: Sun, 09 May 1999 05:01:13 +0000
From: Dan Kegel 
To: Cacophonix Gaul 
Cc: linux-kernel@vger.rutgers.edu
Subject: Re: 2.2.7 fixes Apache problem? (Was: Re: 2.2.5 optimizations for web 
    benchmarks?)

Cacophonix Gaul wrote:
> With 2.2.7, the tcp performance (to certain clients) has
> .. gone up so much, that apache is no longer bottlenecked
> in my setup - I'm now constrained by disk I/O performance...
> the apache problem I refer to is the problem where
> apache is pushed "over the limit", and performance drops
> to below 30 connections/second. After reaching this state,
> and _after_ the specweb run is over, performance remains 
> low. _Every_ connection after that incurs a ~4 second
> latency (even in unloaded conditions), until apache is
> restarted.  ...  The problem does not occur in 2.2.7, but 
> that's possibly because I'm unable to saturate apache on my
> system.

Can you try a smaller fileset?  I bet if you used a 64MByte
fileset, like Mindcraft did, you might be able to fit it 
in RAM.  With the disk bottleneck out of the way, maybe you
could push Apache over the limit again.
- Dan

Date: Sun, 9 May 1999 04:40:11 -0700 (PDT)
From: Alex Belits 
To: masp0008@stud.uni-sb.de
Cc: linux-kernel@vger.rutgers.edu
Subject: Re: Why does Mindcraft insist on 4* 100BaseTX?

On Sun, 9 May 1999, Manfred Spraul wrote:

> I've read the Mindcraft Open benchmark invitation,
> and Mindcraft (ie Microsoft) require 4*100BaseTX.
> 
> Is that a common installation, or do you usually
> use gigabit ethernet?
> I know that Windows NT can bind the 4 interrupts to
> the 4 CPUs and improve the interrupt throughput,
> but has anyone performed tests with Linux and this
> hardware combination?
> 
> I'm sure Microsoft did these tests before they
> chose the hardware.

  This configuration is necessary to create a high enough load with clients
that don't have gigabit ethernet.

-- 
Alex

Date: Sun, 9 May 1999 16:38:03 +0200 (CEST)
From: Andrea Arcangeli 
To: Greg Lindahl 
Cc: Phillip Ezolt , linux-kernel@vger.rutgers.edu
Subject: Re: Overscheduling DOES happen with high web server load.

On Fri, 7 May 1999, Greg Lindahl wrote:

>Why is schedule called so frequently? The thundering herd. You get a
>new connection, everyone wakes up, only one gets work, and everyone
                                    ^^^^^^^^^^^^^^^^^^ wrong
>else goes back to sleep, each causing O(A) work to reschedule.  M hits

The reality until 2.2.7 is that everyone wakes up, need_resched of the
current task is set to 1 so that CPU will issue a schedule()
very shortly, but none of the woken-up tasks gets rescheduled and the
schedule() becomes a no-op. This is the main waste of resources. I think
it's just fixed in pre-2.2.8-4 though ;).

Andrea Arcangeli

Date: Sun, 9 May 1999 16:42:24 -0400 (EDT)
From: Greg Lindahl 
To: masp0008@stud.uni-sb.de
Cc: linux-kernel@vger.rutgers.edu
Subject: Re: Why does Mindcraft insist on 4* 100BaseTX?

> I've read the Mindcraft Open benchmark invitation,
> and Mindcraft (ie Microsoft) require 4*100BaseTX.

It isn't that expensive to put 1 gigabit port on many modern 100mbit
switches. However, many company networks are arranged so that the
servers plug into N separate 100mbit switches.

But this points out a bit of the unreality of the test. If you're
thinking about someone serving web pages to the Internet, I assure you
that 400 megabits of bandwidth costs a hell of a lot more than 1 4-cpu
PC. Or two. Or ten.

-- g

Date: Mon, 10 May 1999 10:35:32 -0700
From: David S. Miller 
To: mingo@chiara.csoma.elte.hu
Cc: krooger@debian.org, linux-kernel@vger.rutgers.edu
Subject: Re: [patch] new scheduler

   Date:        Mon, 10 May 1999 12:21:58 +0200 (CEST)
   From: Ingo Molnar 

   And this all is a rather stupid testcase with no RL significance
   IMO, designed to show alleged recalculation costs. Jonathan, WHERE
   is that 'MAJOR bottleneck'?

Ok, then what I personally want is a firm quantification of where the
scheduling cost is coming from in the web server benchmarks.

I'm willing to accept any well founded explanation, and this is where
most of the concern has been coming from.

If it's galloping herd from some event queue, this should be painfully
easy to test for.  My suggested scheme would be to have a "counter per
PC value" type array similar to what the normal kernel profiler uses,
but instead you record caller-PC values for entry into __wake_up().

Furthermore you could "scale" the counter bumps by adding, instead of
'1' for each __wake_up() call, the number of tasks woken during that
call.

The important thing to capture is "who is doing the wakeups" and "how
much waking up each time".  You need to be slightly careful for some
of the networking stuff, because the true source of the wake up could
be 2 or 3 stack frames above the __wake_up() invocation.

Dump these values after a web benchmark run, and the answers should
just be there.

Any takers?

Later,
David S. Miller
davem@redhat.com
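
A rough sketch of the kind of caller-PC histogram being asked for (the slot
count and names are invented; GCC's __builtin_return_address(0) stands in
for "who called us"); each bump is scaled by the number of tasks the call
woke, as suggested above.

/* Sketch of a wakeup profiler: index a counter array by the caller's PC
 * and scale each bump by how many tasks that call woke up. */
#define WAKEUP_PROF_SLOTS 4096

static unsigned long wakeup_prof[WAKEUP_PROF_SLOTS];
static void *wakeup_prof_pc[WAKEUP_PROF_SLOTS];     /* remember a PC per slot */

static void wakeup_profile(void *caller_pc, unsigned long tasks_woken)
{
    unsigned long slot = ((unsigned long)caller_pc >> 4) % WAKEUP_PROF_SLOTS;

    wakeup_prof_pc[slot] = caller_pc;
    wakeup_prof[slot] += tasks_woken;
}

/* Called at the end of a (hypothetical) instrumented __wake_up():
 *
 *     wakeup_profile(__builtin_return_address(0), woken);
 *
 * After a benchmark run, dumping the two arrays and resolving the PCs
 * against System.map shows who did the wakeups and how much waking up
 * each call site did.
 */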

Date: Tue, 11 May 1999 02:24:28 +0200 (CEST)
From: Andrea Arcangeli 
To: Linus Torvalds 
Cc: David S. Miller , Alan Cox ,
     ezolt@perf.zko.dec.com, dgaudet-list-linux-kernel@arctic.org,
     rgooch@atnf.csiro.au, linux-kernel@vger.rutgers.edu, jg@pa.dec.com,
     greg.tarsa@digital.com
Subject: [patch] wake_one for accept(2) [was Re: Overscheduling DOES happen
    with high web server load.]

On Thu, 6 May 1999, Linus Torvalds wrote:

>Well, it's even easier than the above. What you do is:

> - the wakeup-queue is a linked list (surprise, surprise, that's
>   how it works already)
> - you add the "exclusive" entries to the end of the list (or at least
>   after any other non-exclusive ones: depending on how the list is
>   organized one or the other may be the more efficient way to handle it).
>   They are also marked some way - preferably just a bit in the task
>   state. 
> - when you do a wakeup, you do exactly what you do now: walk the list,
>   waking up each process. The ONLY difference is that if you find an
>   exclusive process (easy to test for: you already look at the task state
>   anyway) that you woke up, you stop early and go home. You mark the one
>   you woke non-exclusive _or_ you make the exclusivity test also verify
>   that the thing is not running, so two consecutive wakeup calls will
>   always wake up two exclusive processes.

>So:
> - nonexclusive waiters are always woken up. They are at the head of the
>   list.
> - _one_ exclusive process is always woken up. 

I implemented the thing you described above. It seems to work :).

To test it I developed a proggy that forks a child and then, in the parent,
pthread_create()s 60 server threads. Each server does an accept(2) (blocking)
loop. The previously forked child instead does only a connect(2) loop.

Without the wake-one patch below, in order to fill all the
connecting/client TCP ports I need many seconds of very high load on the
machine. With the patch applied it takes less than a second, and the machine
does not seem overloaded during that short time. I have no exact numbers
though (I measured only with my eyes so far ;).

I am not 100% sure the patch has no deadlock conditions, but it looks
safe. The wait_for_connect code first checks if there is a connection
available before looking at signals.

Index: kernel/sched.c
===================================================================
RCS file: /var/cvs/linux/kernel/sched.c,v
retrieving revision 1.1.1.9
diff -u -r1.1.1.9 sched.c
--- linux/kernel/sched.c        1999/05/07 00:01:50     1.1.1.9
+++ linux/kernel/sched.c        1999/05/11 00:05:35
@@ -718,31 +674,20 @@
                goto move_rr_last;
 move_rr_back:

-       switch (prev->state) {
-               case TASK_INTERRUPTIBLE:
-                       if (signal_pending(prev)) {
-                               prev->state = TASK_RUNNING;
-                               break;
-                       }
-               default:
-                       del_from_runqueue(prev);
-               case TASK_RUNNING:
-       }
-       prev->need_resched = 0;

-repeat_schedule:

        /*
         * this is the scheduler proper:
         */
+       prev->need_resched = 0;

+repeat_schedule:
+       if (prev->state != TASK_RUNNING)
+               goto prev_not_runnable;
+repeat_schedule_runnable:
+       c = prev_goodness(prev, prev, this_cpu);
+       next = prev;
+prev_not_runnable_back:

        p = init_task.next_run;
-       /* Default process to select.. */
-       next = idle_task(this_cpu);
-       c = -1000;
-       if (prev->state == TASK_RUNNING)
-               goto still_running;
-still_running_back:

        /*
         * This is subtle.
@@ -836,13 +781,22 @@
                        p->counter = (p->counter >> 1) + p->priority;
                read_unlock(&tasklist_lock);
                spin_lock_irq(&runqueue_lock);
+               if (prev->state != TASK_RUNNING)
+                       add_to_runqueue(prev);
                goto repeat_schedule;
        }

-still_running:
-       c = prev_goodness(prev, prev, this_cpu);
-       next = prev;
-       goto still_running_back;
+prev_not_runnable:
+       if (prev->state & TASK_INTERRUPTIBLE && signal_pending(prev))
+       {
+               prev->state = TASK_RUNNING;
+               goto repeat_schedule_runnable;
+       }
+       del_from_runqueue(prev);
+       /* Default process to select.. */
+       next = idle_task(this_cpu);
+       c = -1000;
+       goto prev_not_runnable_back;

 handle_bh:
        do_bottom_half();
@@ -879,6 +833,7 @@
 {
        struct task_struct *p;
        struct wait_queue *head, *next;
+       int wake_one = 0;

         if (!q)
                goto out;
@@ -897,6 +852,13 @@
                p = next->task;
                next = next->next;
                if (p->state & mode) {
+                       if (p->state & TASK_WAKE_ONE)
+                       {
+                               if (wake_one)
+                                       continue;
+                               p->state &= ~TASK_WAKE_ONE;
+                               wake_one = 1;
+                       }
                        /*
                         * We can drop the read-lock early if this
                         * is the only/last process.
@@ -1198,7 +1158,7 @@
        read_lock(&tasklist_lock);
        for_each_task(p) {
                if ((p->state == TASK_RUNNING ||
-                    p->state == TASK_UNINTERRUPTIBLE ||
+                    p->state & TASK_UNINTERRUPTIBLE ||
                     p->state == TASK_SWAPPING))
                        nr += FIXED_1;
        }
Index: kernel/signal.c
===================================================================
RCS file: /var/cvs/linux/kernel/signal.c,v
retrieving revision 1.1.1.3
diff -u -r1.1.1.3 signal.c
--- linux/kernel/signal.c       1999/05/07 00:01:50     1.1.1.3
+++ linux/kernel/signal.c       1999/05/10 23:12:04
@@ -387,7 +387,7 @@

 out:
        spin_unlock_irqrestore(&t->sigmask_lock, flags);
-        if (t->state == TASK_INTERRUPTIBLE && signal_pending(t))
+        if (t->state & TASK_INTERRUPTIBLE && signal_pending(t))
                 wake_up_process(t);

 out_nolock:
Index: net/ipv4/tcp.c
===================================================================
RCS file: /var/cvs/linux/net/ipv4/tcp.c,v
retrieving revision 1.1.1.6
diff -u -r1.1.1.6 tcp.c
--- linux/net/ipv4/tcp.c        1999/04/28 20:46:58     1.1.1.6
+++ linux/net/ipv4/tcp.c        1999/05/10 23:56:51
@@ -1575,7 +1575,7 @@

        add_wait_queue(sk->sleep, &wait);
        for (;;) {
-               current->state = TASK_INTERRUPTIBLE;
+               current->state = TASK_INTERRUPTIBLE | TASK_WAKE_ONE;
                release_sock(sk);
                schedule();
                lock_sock(sk);
Index: include/linux/sched.h
===================================================================
RCS file: /var/cvs/linux/include/linux/sched.h,v
retrieving revision 1.1.1.7
diff -u -r1.1.1.7 sched.h
--- linux/include/linux/sched.h 1999/05/07 00:01:33     1.1.1.7
+++ linux/include/linux/sched.h 1999/05/11 00:10:01
@@ -79,6 +79,7 @@
 #define TASK_ZOMBIE            4
 #define TASK_STOPPED           8
 #define TASK_SWAPPING          16
+#define TASK_WAKE_ONE          32

 /*
  * Scheduling policies

(As usual there is also a not-too-related change in my patch ;). This
time it's in the scheduler: the case where prev is still runnable was
considered a slow path in pre-2.2.8-5, while it is really the fast path, so
I included the fix in the patch because I was forced to change that part
of the code a bit to take care of the WAKE_ONE bit.)

The patch above will apply correctly against pre-2.2.8-5 (really here I
just also killed the ancient TASK_SWAPPING state, but I removed that
change to show only the worthwhile changes).

I am not sure I've understood the problem well, and whether the real issue
is the case where many threads are sleeping in accept(2) and then, at the
first connection, all the threads get woken up but only the first one
scheduled really becomes the peer of the connection. The reason I am
not sure is that here apache seems to issue an accept in only one task
and flock in all the other brother tasks... (so to get the improvement with
my current apache we would have to make flock wake-one instead of accept...).

Comments?

Andrea Arcangeli

Date: Tue, 11 May 1999 00:58:46 -0700
From: David S. Miller 
To: andrea@e-mind.com
Cc: torvalds@transmeta.com, alan@lxorguk.ukuu.org.uk, ezolt@perf.zko.dec.com,
     dgaudet-list-linux-kernel@arctic.org, rgooch@atnf.csiro.au,
     linux-kernel@vger.rutgers.edu, jg@pa.dec.com, greg.tarsa@digital.com
Subject: Re: [patch] wake_one for accept(2) [was Re: Overscheduling DOES happen
    with high web server load.]

   Date: Tue, 11 May 1999 02:24:28 +0200 (CEST)
   From: Andrea Arcangeli 

   -        if (t->state == TASK_INTERRUPTIBLE && signal_pending(t))
   +        if (t->state & TASK_INTERRUPTIBLE && signal_pending(t))

Andrea, watch out!  The precedence rules of C have bitten you here.

You want:

   +        if ((t->state & TASK_INTERRUPTIBLE) && signal_pending(t))

Later,
David S. Miller
davem@redhat.com

Date: Tue, 11 May 1999 19:46:54 +0100 (BST)
From: Stephen C. Tweedie 
To: Alan Cox 
Cc: Linus Torvalds , davem@redhat.com,
     ezolt@perf.zko.dec.com, dgaudet-list-linux-kernel@arctic.org,
     rgooch@atnf.csiro.au, linux-kernel@vger.rutgers.edu, jg@pa.dec.com,
     greg.tarsa@digital.com
Subject: Re: Overscheduling DOES happen with high web server load.

Hi,

On Fri, 7 May 1999 02:56:20 +0100 (BST), alan@lxorguk.ukuu.org.uk (Alan
Cox) said:

> For most wake_one situations you want to schedule the last thread to go idle
> as it will have the most context in cache. 

No, you want to find the thread which was most recently active on a
currently idle CPU if there is one.  That's a big difference.  If you
have multiple idle CPUs, just waking up the most recent thread isn't any
guarantee of cache locality.

--Stephen

Date: Tue, 11 May 1999 12:14:49 -0700 (PDT)
From: Linus Torvalds 
To: Stephen C. Tweedie 
Cc: Alan Cox , davem@redhat.com,
     ezolt@perf.zko.dec.com, dgaudet-list-linux-kernel@arctic.org,
     rgooch@atnf.csiro.au, linux-kernel@vger.rutgers.edu, jg@pa.dec.com,
     greg.tarsa@digital.com
Subject: Re: Overscheduling DOES happen with high web server load.



On Tue, 11 May 1999, Stephen C. Tweedie wrote:
> 
> No, you want to find the thread which was most recently active on a
> currently idle CPU if there is one. 

Too complex. I would say "wake up a recent process, and let the scheduler
try to figure out what CPU is the most advantageous".

If you have idle CPU's the choice is pretty much always going to be to try
to get a new CPU for the new connection - regardless of where the cache
was.

If you don't have idle CPU's, that means that somebody else filled your
CPU already, and you might as well just try to find the most recent
process and if possible re-instate it on the same CPU it was on last time. 

So I think you're right in theory, but wrong in practice, and that Andrea
is right in practice and wrong in theory.

                        Linus

Date: Thu, 13 May 1999 08:29:01 +0200 (CEST)
From: Ingo Molnar 
To: Andrea Arcangeli 
Cc: Phillip Ezolt , linux-kernel@vger.rutgers.edu,
     jg@pa.dec.com, greg.tarsa@digital.com
Subject: Re: [RFT] 2.2.8_andrea1 wake-one [Re: Overscheduling DOES happen with
    high web server load.]


On Wed, 12 May 1999, Andrea Arcangeli wrote:

> Note: it also has my wake-one on accept that just addresses completely the

> I would like if you would make comparison with a clean 2.2.8 (or with
> pre-2.3.1 even if I have not seen it yet).

note that pre-2.3.1 already has a wake-one implementation for accept() ... 
and more coming up. 

-- mingo

Date: Wed, 12 May 1999 23:55:23 -0700 (PDT)
From: Dean Gaudet 
To: Andrea Arcangeli 
Cc: Phillip Ezolt , linux-kernel@vger.rutgers.edu,
     jg@pa.dec.com, greg.tarsa@digital.com
Subject: Re: [RFT] 2.2.8_andrea1 wake-one [Re: Overscheduling DOES happen with
    high web server load.]



On Wed, 12 May 1999, Andrea Arcangeli wrote:

> Note: it also has my wake-one on accept that just addresses completely the
> overscheduling problem. But to get the performance benefit from it you must
> make sure that _all_ apache tasks are sleeping in accept(2) and not in
> flock(2)/fcntl(2)/whatever. Maybe you'll need to patch apache to achieve

no patch required, just do this to configure apache: 

env CFLAGS='-DSINGLE_LISTEN_UNSERIALIZED_ACCEPT' ./configure

... and make sure you have only one listening socket.

Dean

Date: Thu, 13 May 1999 16:26:39 +0100 (BST)
From: Malcolm Beattie 
To: linux-kernel@vger.rutgers.edu
Subject: Re: 2.3.x wish list?

David S. Miller writes:
>    Date:      Wed, 12 May 1999 09:14:17 -0400
>    From: 
>     - ext2 is showing its age on larger partitions; my 3 9 gig drives
>       take about a half hour to fsck, and up to a minute just to mount.
>       My database server will need support for +2gig files by the end
>       of the year.  "go to a 64 bit machine" is not reasonable for
>       everyone.  8^)
> 
> Solved by Stephen Tweedie's ongoing logging filesystem work.

Even without that, the performance of fsck (and mount, come to that) is
massively improved by doing a mke2fs with 4k blocks instead of 1k.
Here are some figures I sent to linux-raid a few months ago:

    Hardware: 350 MHz Pentium II PC, 512 MB RAM, BT958D SCSI adapter.
              Sun D1000 disk array with 6 x 9 GB 10000 RPM disks.
    Software: Linux 2.0.36 + latest RAID patch.
    Filesystem configured as a single 43 GB RAID5 ext2 filesystem with
    4k blocks and 64k RAID5 chunk-size.

    I created 25 subdirectories on the filesystem and in each untarred
    four copies of the Linux 2.2.1 source tree (each is ~4000 files
    totalling 63 MB untarred).

    fsck took 8 minutes.

    Then I added 100 subdirectories in each of those subdirectories and
    into each of those directories put five 1MB files. (The server is
    actually going to be an IMAP server and this mimics half-load quite
    well). The result is 18 GB used on the filesystem.

    fsck took 10.5 minutes.

    Then I added another 100 subdirectories in each of the 25 directories
    and put another five 1MB files in each of those. The result is 30 GB
    used on the filesystem.

    fsck took 13 minutes.

The upshot is that although it's not as fast at fscking as a
journalled filesystem, with 4k blocks it's adequate for many more
uses than you'd expect if you stick with the default 1k blocks.
(This is separate from the current 2GB VFS limit on 32-bit
architectures of course.)

--Malcolm

Date: Fri, 14 May 1999 14:44:08 +0000
From: Dan Kegel 
To: new-httpd@apache.org, linux-kernel@vger.rutgers.edu
Subject: /dev/poll vs. aio_ (was: Re: Proposal: Get rid of most accept mutex
    calls on hybrid server.)

Dean Gaudet wrote:
> (A person at Sun wrote:)
> > As of Solaris 7 a scheme referred to as /dev/poll was implemented such that
> > pollfd_t's are registered with the underlying FS (i.e. UFS, SOCKFS, ...)
> > and the FS does asynchronous notification. The end result is that poll()
> > now scales to tens of thousands of FDs per LWP (as well as a new API for
> > /dev/poll such that you open /dev/poll and do write()s (to register a number
> > of pollfd's) and read()s (to wait for, or in the case of nonblocking check
> > for, pollfd event(s)), using the /dev/poll API memory is your only limit
> > for scalability.

> Now that's real nice.  I've been advocating this on linux kernel for a
> long time.  Say hello to completion ports the unix way.  I'm assuming they
> do the "right thing" and wake up in LIFO order, and allow you to read
> multiple events at once.

I have yet to use aio_ or F_SETSIG, but reading ready fd's from /dev/poll
makes more sense to me than listening for realtime signals from aio_,
which according to http://www.deja.com/getdoc.xp?AN=366163395
can overflow, in which case the kernel sends a SIGIO to say 'realtime 
signals overflowed, better do a full poll'.  I'm contemplating writing
a server that uses aio_; that case kind of defeats the purpose of
using aio_, and handling it sounds annoying and suboptimal.

/dev/poll would never overflow in that way.

- Dan
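
For comparison, the realtime-signal mechanism referred to above (F_SETSIG
readiness signals, with SIGIO as the overflow fallback) looks roughly like
this on Linux -- a hedged sketch with an assumed signal number and no error
handling:

#define _GNU_SOURCE
#include <fcntl.h>
#include <signal.h>
#include <unistd.h>

#define READY_SIG (SIGRTMIN + 1)   /* assumed choice of realtime signal */

/* Arm one fd: readiness is queued as READY_SIG with si_fd filled in. */
static void arm_fd(int fd)
{
    fcntl(fd, F_SETOWN, getpid());
    fcntl(fd, F_SETSIG, READY_SIG);
    fcntl(fd, F_SETFL, fcntl(fd, F_GETFL) | O_ASYNC | O_NONBLOCK);
}

static void event_loop(void)
{
    sigset_t sigs;
    sigemptyset(&sigs);
    sigaddset(&sigs, READY_SIG);
    sigaddset(&sigs, SIGIO);
    sigprocmask(SIG_BLOCK, &sigs, NULL);

    for (;;) {
        siginfo_t si;
        int sig = sigwaitinfo(&sigs, &si);

        if (sig == READY_SIG) {
            /* si.si_fd is ready; go handle it */
        } else if (sig == SIGIO) {
            /* the realtime signal queue overflowed: fall back to a full
             * poll() over every fd, as the message above describes */
        }
    }
}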

Date: Fri, 14 May 1999 16:09:21 -0400 (EDT)
From: Phillip Ezolt 
To: Andrea Arcangeli 
Cc: linux-kernel@vger.rutgers.edu, jg@pa.dec.com, greg.tarsa@digital.com
Subject: Great News!! Was: [RFT] 2.2.8_andrea1 wake-one  

Hi all, (especially Andrea)
        I've been doing some more SPECWeb96 tests, and with Andrea's
patch to 2.2.8 (ftp://e-mind.com/pub/andrea/kernel/2.2.8_andrea1.bz2)

**On identical hardware, I get web-performance nearly identical to Tru64!**

Previously, Linux response times had been ~100ms while Tru64's had been ~4ms.

However, with this patch applied, Linux response times almost mirror Tru64's.

Tru64   ~4ms
2.2.5   ~100ms
2.2.8   ~9ms
2.2.8_a ~4ms

I realize that 2.3.1 has a more efficient wakeone patch applied, and haven't
yet had a chance to try it.  (Maybe tonight)

Time spent in schedule has decreased, as shown by this Iprobe data:

2.2.8 (pure)

Begin            End                                    Sample Image Total
Address          Address          Name                   Count   Pct   Pct
-------          -------          ----                   -----   ---   ---
0000000000000000-0000000120006F2F /usr/bin/httpd        121077        19.2 
0000000120041A00-00000001200433FF   ap_vformatter        14777  12.2   2.3 
FFFFFC0000300000-00000000FFFFFFFF vmlinux               428086        67.9 
FFFFFC0000315FA0-FFFFFC00003160DF   do_entInt            40185   9.4   6.4 
FFFFFC0000327D20-FFFFFC000032805F   schedule            126434  29.5  20.0 
FFFFFC00003B9CC0-FFFFFC00003BA0BF   tcp_v4_rcv           11701   2.7   1.9 
FFFFFC00003DB3A0-FFFFFC00003DBA5F   make_request          6879   1.6   1.1 
FFFFFC00004446E0-FFFFFC0000444ABF   do_csum_partial      27835   6.5   4.4 
                                    _copy_from_user           
FFFFFC0000445340-FFFFFC0000445513   __copy_user           9722   2.3   1.5 


2.2.8 (w/2.2.8_andrea1.bz)

Begin            End                                    Sample Image Total
Address          Address          Name                   Count   Pct   Pct
-------          -------          ----                   -----   ---   ---
0000000000000000-0000000120006F2F /usr/bin/httpd        121882        22.5 
0000000120041A00-00000001200433FF   ap_vformatter        15166  12.4   2.8 
0000020000590000-0000020000772FFF /lib/libc-2.0.7.so     66412        12.2 
00000200005F4E20-00000200005F4F7F   memcpy                6168   9.3   1.1 
FFFFFC0000300000-00000000FFFFFFFF vmlinux               343294        63.2 
FFFFFC0000316020-FFFFFC000031615F   do_entInt            42469  12.4   7.8 
FFFFFC0000327DA0-FFFFFC000032811F   schedule             37676  11.0   6.9 
FFFFFC0000328120-FFFFFC00003281FF   __wake_up            21703   6.3   4.0 
FFFFFC00003AF940-FFFFFC00003AFA7F   wait_for_connect      5489   1.6   1.0 
FFFFFC00003BA3C0-FFFFFC00003BA7BF   tcp_v4_rcv            7012   2.0   1.3 
FFFFFC0000444F80-FFFFFC000044535F   do_csum_partial      27679   8.1   5.1 
                                    _copy_from_user           
FFFFFC0000445BE0-FFFFFC0000445DB3   __copy_user           9188   2.7   1.7 

The number of SPECWeb96 MaxOps per second has jumped as well.

**Please, put the wakeone patch into the 2.2.X kernel if it isn't already. **

--Phil

Compaq                              HPSD/Benchmark Performance Engineering
Phillip.Ezolt@compaq.com                            ezolt@perf.zko.dec.com


On Wed, 12 May 1999, Andrea Arcangeli wrote:

> On Fri, 7 May 1999, Andrea Arcangeli wrote:
> 
> >I'll provide you a patch shortly to try out.
> 
> Phillip, could you try it out:
> 
>       ftp://e-mind.com/pub/andrea/kernel/2.2.8_andrea1.bz2
> 
> under heavy web load? (should run fine on alpha too as far as stock-2.2.8
> is just fine too)
> 
> Note: it also has my wake-one on accept that just addresses completely the
> overscheduling problem. But to get the performance benefit from it you must
> make sure that _all_ apache tasks are sleeping in accept(2) and not in
> flock(2)/fcntl(2)/whatever. Maybe you'll need to patch apache to achieve
> that (I also saw a patch floating on the list; maybe you only need to
> grab that patch and apply/reverse it over the apache tree).
> 
> I would like if you would make comparison with a clean 2.2.8 (or with
> pre-2.3.1 even if I have not seen it yet).
> 
> Andrea Arcangeli
> 
> 
> 

Date: Mon, 17 May 1999 12:54:43 +0200
From: Juergen Schmidt 
To: linux-kernel@vger.rutgers.edu
Subject: Bad apache performance with Linux SMP

Hello,

first of all, please excuse this off-topic posting. I'm doing a test
with Apache on a 4-CPU box (Siemens Primergy 870) with Linux/Apache and
NT/IIS. Now I have some *very* strange results and want to make sure
that you -- the developers -- have the chance to comment on them
*before* I publish them.
If there's a better way to do this, please feel free to tell me.

So now to the facts:

I've got a Siemens Primergy 870 with:

4 CPUs PII Xeon 450 MHz 
2 GByte RAM
Mylex RAID Controler DAC 960 (64MB RAM), RAID5
2 x Intel EEpro 100 (only one is used, the other is not ifconfig'ed but
detected)

Linux 2.2.8 (SuSE 6.1., glibc based)
Apache 1.3.6 
(NT 4.0 SP4, IIS 4.0)

Network: switched 100 MBit/s, half duplex

Neither system is intensively tuned -- apart from the obvious stuff, like
disabling unneeded modules, turning off hostname lookups, ...

I'm measuring plain HTTP-GET on a static html-file with 8
(Linux-)clients each running up to 64 processes (for a total of 512)
doing HTTP-GET-requests in a tight loop. All files come from a partition
on the RAID-array (RAID 5) which is used for logging, too. File size is
4KByte (I measured 1k and 8k with similar results).

Results:

-- with one CPU both NT/IIS and Linux/Apache deliver about the same
performance (+- 10%). Even Linux-SMP and non-SMP only differ by < 10%.

-- with 4 CPUs NT/IIS gives slightly more requests/sec (< 10%): that's
mainly because my setup doesn't generate a really heavy load by doing plain
HTTP GETs. One CPU obviously is sufficient for this task here (as in
most real-life setups with plain HTML serving).

-- Linux with 4 CPUs is disastrous: it delivers significantly fewer RPS
than the single-CPU version -- about a factor of 4! Only at high loads
(256 procs and up) does it catch up.

One other strange thing (but I still have to double-check this):

- Linux 4 CPUs: I get slightly better performance if the processes fetch
a random page (out of 10000 files, so that they still fit in the buffer
cache). All the other combinations (Linux 1 CPU, NT) are a bit slower
compared to fetching only one page.

For clarity I included a picture (it says more than a lot of words).

The red line is the performance of Linux with 4 CPUs, blue with random
files, green Linux with one CPU, pink is NT with 1 CPU (4 CPUs are slightly
above the Linux curve). The drop at 512 processes is due to connect()
failures -- I'm currently investigating this too.

Do you have any ideas, what's happening there?
Or even better, how to fix this?

It seems to me that this might be exactly the factor of 4 those Mindcraft
people measured. Do you think this is possible?

bye, juergen

BTW: In another test with cgi-scripts I set NR_OPEN in the linux kernel
from 1024 to 2048 (in include/linux/limits.h, fs.h and posix_types.h)
and recompiled apache. 
So I got rid of the open() errors that occurred under heavy load. But
now I get only about half the rps with <=128 processes. Did I
forget something?

PS: Please CC me per mail, as I have not subscribed to the list.
-- 
Juergen Schmidt   Redakteur/editor  c't magazin      PGP-Key available
Verlag Heinz Heise GmbH & Co KG, Helstorferstr. 7, D-30625 Hannover
EMail: ju@ct.heise.de - Tel.: +49 511 5352 300 - FAX: +49 511 5352 417

Date: Tue, 18 May 99 14:13:22 EDT
From: Larry Sendlosky 
To: Andrea Arcangeli 
Cc: Phillip Ezolt , linux-kernel@vger.rutgers.edu,
     jg@pa.dec.com, greg.tarsa@digital.com, larry@scrugs.lkg.dec.com,
     maurice.marks@compaq.com
Subject: Re: Great News!! Was: [RFT] 2.2.8_andrea1 wake-one


Hi Andrea,

Your 2.2.8 patch really helps apache performance on a single cpu system,
but there is really no performance improvement on a 2 cpu SMP system.

I'm running Webstone 2.5 file tests on a DP264. Basically, Linux SMP
without your patches gets maybe 5% better performance with 2 CPUs compared
to only 1 CPU. With your 2.2.8 patches, single CPU performance increases
about 14%, approaching Tru64 Unix performance levels on the same
hardware. (Phil says he's seen the same with SpecWeb). However, there
is no improvement with 2 CPUs. The only thing I see for the better
(I can't Iprobe - it's not yet ported to EV6) is that user mode,
as reported by vmstat, goes from about 18-19% to 23-24% (no patches vs
your 2.2.8 patch).

larry


______________________________________________________________________
 Larry Sendlosky                                            AMT
 larry@scrugs.lkg.dec.com (978) 506-6640                    Compaq
----------------------------------------------------------------------

Date: 18 May 1999 15:02:08 +0200
From: Andi Kleen 
To: Juergen Schmidt 
Cc: linux-kernel@vger.rutgers.edu
Subject: Re: Bad apache performance with Linux SMP

Juergen Schmidt  writes:
> Do you have any ideas, what's happening there?
> Or even better, how to fix this?

One culprit is most likely that the data copy for TCP sending runs completely
serialized. This can be fixed by replacing the

                        skb->csum = csum_and_copy_from_user(from,
                                        skb_put(skb, copy), copy, 0, &err);

in tcp.c:tcp_do_sendmsg with
                        
                        unlock_kernel(); 
                        skb->csum = csum_and_copy_from_user(from,
                                        skb_put(skb, copy), copy, 0, &err);
                        lock_kernel(); 

The patch does not violate any locking requirements in the kernel, because
the kernel lock could have been dropped at any time anyway when the
copy_from_user slept to swap a page in.
(I'm not sure if running a published benchmark with such a patch is fair
though.  On the other hand Microsoft did so many hidden changes in their
service packs that probably everything is allowed ;)

Another problem is that Linux 2.2 by default uses only 1GB of memory. This
can be tuned by changing the PAGE_OFFSET constant in include/asm/page.h and
arch/i386/vmlinux.lds from 0xc0000000 to 0x80000000 or so and recompiling
(the tradeoff is that it limits the per-process virtual memory to ~1.8GB,
but increases the overall physical memory that can be mapped).

> 
> It seems to me, that this might exactly be the factor 4 those mindcraft
> people have measured. Do you think, this is possible?
> 
> bye, juergen
> 
> BTW: In another test with cgi-scripts I set NR_OPEN in the linux kernel
> from 1024 to 2048 (in include/linux/limits.h, fs.h and posix_types.h)
> and recompiled apache. 
> So I got rid of the open() errors that occured under heavy load. But
> therefore I get only about half the rps with <=128 processes. Did I
> forget something?

Probably increasing the global file table size. 

Try:
        
echo 32768 > /proc/sys/fs/file-max
echo 65536 > /proc/sys/fs/inode-max

Overall it should be clear that the current Linux kernel doesn't scale
to 4 CPUs for system load (user load is fine). I blame the Linux vendors
for advertising that it does, although it is not true.

If you're interested I can send you a profiling patch that shows how much
of the system CPU time is spent in locks. Another easy way is to boot
with profile=2 and to run /usr/sbin/readprofile to see where the time is spent.

Work to fix all these problems is underway. 

-Andi


-- 
This is like TV. I don't like TV.

Date: Tue, 18 May 1999 19:44:25 +0200
From: Juergen Schmidt 
To: Andi Kleen 
Cc: linux-kernel@vger.rutgers.edu
Subject: Re: Bad apache performance with Linux SMP

Andi Kleen wrote:
> One culprit is most likely that the data copy for TCP sending runs completely
> serialized. This can be fixed by replacing the
> 
>                         skb->csum = csum_and_copy_from_user(from,
>                                         skb_put(skb, copy), copy, 0, &err);
> 
> in tcp.c:tcp_do_sendmsg with
> 
>                         unlock_kernel();
>                         skb->csum = csum_and_copy_from_user(from,
>                                         skb_put(skb, copy), copy, 0, &err);
>                         lock_kernel();

Bingo !!! 

This raised performance from 270 rps to 802 rps when 64 clients were
pulling a 4k HTML page. Single-CPU performance lies at about 890 rps -- but
the new numbers are just from a very short run. (BTW: NT/IIS on 4 CPUs
delivers 840 rps :-)

Or with apaches ab:

ab -c 8 -t 120 127.0.0.1/4k.html

produces:

2.2.8 4 CPUs:                350.95 
2.2.8 4 CPUs with patch:    1334.19
2.2.8 no SMP:               1540.22

BTW: I'm going to release my test program under the GPL after I clean it
up a little. ab is not working for me, because it dies with "broken
pipe" when I try it over the network -- perhaps because it is doing
non-blocking I/O...
   
> The patch does not violate any locking requirements in the kernel, because
> the kernel lock could have been dropped at any time anyway when the
> copy_from_user slept to swap a page in.

I'd like to hear some comments from other people on this. Is this a
proper patch or is it dangerous in any way?

Linus, Alan, would you recommend running a machine with that patch?

> (I'm not sure if running a published benchmark with such a patch is fair
> though.

My intention is not to publish benchmark results and let others explain
them. I want to understand what I'm measuring and why. For this, your
suggestion is excellent.

If I can even present a patch that might help people, that's a lot better
than shouting out the latest "records".


> Another problem is that Linux 2.2 by default uses only 1GB of memory. This
> can

I've already patched that :-) 

> Probably increasing the global file table size.
> 
> Try:
> 
> echo 32768 > /proc/sys/fs/file-max
> echo 65536 > /proc/sys/fs/inode-max

will do, asap

> Overall it should be clear that the current Linux kernel doesn't scale
> to 4 CPUs for system load (user load is fine). I blame the Linux vendors
> for advertising that it does, although it is not true.

Thanks for the open statement.

> If you're interested I can send you a profiling patch that shows how much
> of the system CPU time is spent in locks. Another easy way is to boot
> with profile=2 and to run /usr/sbin/readprofile to see where the time is
> spent.

Yes, please send me patch. I'll try the other way, too.

> Work to fix all these problems is underway.

Will it come into 2.2 or 2.3 only ?

Thanks for your help, juergen

-- 
Juergen Schmidt   Redakteur/editor  c't magazin      PGP-Key available
Verlag Heinz Heise GmbH & Co KG, Helstorferstr. 7, D-30625 Hannover
EMail: ju@ct.heise.de - Tel.: +49 511 5352 300 - FAX: +49 511 5352 417

Date: Tue, 18 May 1999 20:25:20 +0200
From: Andi Kleen 
To: Juergen Schmidt 
Cc: Andi Kleen , linux-kernel@vger.rutgers.edu
Subject: Re: Bad apache performance with Linux SMP

On Tue, May 18, 1999 at 07:44:25PM +0200, Juergen Schmidt wrote:
> Andi Kleen wrote:
> > One culprit is most likely that the data copy for TCP sending runs
> > completely serialized. This can be fixed by replacing the
> > 
> >                         skb->csum = csum_and_copy_from_user(from,
> >                                         skb_put(skb, copy), copy, 0, &err);
> > 
> > in tcp.c:tcp_do_sendmsg with
> > 
> >                         unlock_kernel();
> >                         skb->csum = csum_and_copy_from_user(from,
> >                                         skb_put(skb, copy), copy, 0, &err);
> >                         lock_kernel();
> 
> Bingo !!! 
> 
> This raised performance from 270 rps to 802 rps when 64 clients were
> pulling a 4k HTML-page. Single CPU perfomance lies by 890 rps -- but the
> new numbers are just from a very short run. (BTW: NT/IIS on 4 CPUs
> deliver 840 rps :-)

Cool :)

>    
> > The patch does not violate any locking requirements in the kernel, because
> > the kernel lock could have been dropped at any time anyway when the
> > copy_from_user slept to swap a page in.
> 
> I'd like to here some comments from other people on this. Is this a
> proper patch or is it dangerous in any way ?
> 
> Linus, Alan, would you recommend to run a machine with that patch ?

They are both away (in Finland and at LinuxExpo).

Various more extensive versions of this idea (it is originally from Stephen
Tweedie I think) have been tested and developed by Ingo Molnar, David Miller
and others. Unfortunately some versions caused crashes, but those were
explained by compiler overoptimizations (egcs 1.1 moved some instructions
across the optimization barrier in the locking macro, which caused races
under heavy load; gcc 2.7.2 should be fine). For the tcp_do_sendmsg case
there were no problems AFAIK.

If you don't believe me :) you can verify it:

        csum_and_copy_from_user
                does user access
                hits a page that is not present
CPU raises page not present exception
        calls arch/i386/mm/fault.c:do_page_fault() 
        calls mm/memory.c:handle_mm_fault()
        calls handle_pte_fault (same file)
        calls finally do_swap_page (same file)
        and there at the end is:
                unlock_kernel();
                return 1;
now the exception processing ends.
in the interrupt return code (arch/i386/kernel/entry.S:ret_with_reschedule)
it may switch to another process etc.


> > Work to fix all these problems is underway.
> 
> Will it come into 2.2 or 2.3 only ?

I expect the work will be first tested in 2.3, and then after some time
backported to a 2.2 "enterprise" release (similar to the 2.0.30 release
which backported the 2.1 socket hash optimizations to make Linux scale
to a really huge number of sockets)


-Andi
-- 
This is like TV. I don't like TV.

Date: Tue, 18 May 1999 11:10:57 +0200
From: Alexander Kjeldaas 
To: Richard Gooch , linux-kernel@vger.rutgers.edu
Subject: Re: send_sigio() scalability

On Mon, May 17, 1999 at 01:17:59PM +1000, Richard Gooch wrote:
>   Hi, all. I just noticed that send_sigio() walks the task list,
> looking for the process(es) to send a signal to. This appears to be a
> potential scalability problem, as a large number of tasks is going to
> slow this down.
> 
> Has anyone done any benchmarking to evaluate the effect of this? In
> the absence of numbers, how about some convincing handwaving? Is it
> worth exploring options to fix this?
> 
> I can think of one quick and simple hack to fix this for 90% (maybe
> 99%) of cases: record the task pointer at fcntl() time. Then at
> send_sigio() time, if the recorded pid and task match, skip the
> task list walk.
> 

In the case where send_sigio is sending a signal to a specific
process, why isn't it using find_task_by_pid()? For the other case,
I'm working on a more general solution to most of the for_each_task()
uses in the kernel.  I made a patch last year that adds fast
for_each_task_in_pgrp(), and for_each_task_in_session() macros.  I
have measurements that prove that the patch helps a lot for the
fork()+exit() case.  Please look at

http://www.guardian.no/~astor/pidhash/pidhash.gif

for a graph of this that I made for the patch for Linux 2.1.90.  It
shows that when you have a lot of processes running, doing a
fork()+exit() takes a long time.  This is due to for_each_task()'ing
in exit to send signals.  I'd guess that send_sigio() will perform
similar to the above graph.  

The general idea of the patch is to sort the entries in the
task_list: primarily by session id, secondarily by process group, and
tertiarily by pid.  In addition to that, we make sure that all pgids,
sids and pids are available in the pidhash-hash-table.  So to traverse
all tasks in a process group, you just look up that pgid in the
pidhash and traverse the task list until the pgid changes.  Likewise
with sids.

So the patch generally does the following:
  - When forking, insert the new process behind its parent instead of
    at the start/end of the task-list.
  - When changing pgid or sid, change position of the process in the
    task-list.
  - Make sure all kernel threads have pid==pgid==sid
  - For some special places that want to do a "signal each child", we have
    to traverse the whole task list if one of the children is ptraced, so
    we keep a counter of the number of ptraced children of a process
    so we can optimize the common case.


I've started to port the patch to 2.2.8.

astor

-- 
 Alexander Kjeldaas, Fast Search & Transfer, Trondheim, Norway
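
The traversal described above can be sketched with toy structures (everything
here is invented for illustration; it is not the actual pidhash patch): look
the pgid up in a hash to find the first task of the group, then walk the
sorted task list until the pgid changes.

#include <stdio.h>

/* Toy task list, sorted so that tasks sharing a pgid are adjacent. */
struct task {
    int pid, pgid;
    struct task *next_task;          /* global, sorted task list      */
    struct task *pidhash_next;       /* hash chain: id -> first task  */
};

#define PIDHASH_SZ 1024
static struct task *pidhash[PIDHASH_SZ];

static unsigned pid_hashfn(int id) { return (unsigned)id % PIDHASH_SZ; }

static struct task *find_task_by_pgid(int pgid)
{
    for (struct task *t = pidhash[pid_hashfn(pgid)]; t; t = t->pidhash_next)
        if (t->pgid == pgid)
            return t;                /* first task of that group */
    return NULL;
}

/* Visit every task in one process group without touching the rest of the
 * task list: start at the hash hit, stop when the pgid changes. */
#define for_each_task_in_pgrp(t, pgid)                        \
    for ((t) = find_task_by_pgid(pgid);                       \
         (t) && (t)->pgid == (pgid);                          \
         (t) = (t)->next_task)

int main(void)
{
    struct task c = { 30, 30, NULL, NULL };
    struct task b = { 21, 20, &c,   NULL };
    struct task a = { 20, 20, &b,   NULL };
    pidhash[pid_hashfn(20)] = &a;    /* pretend insertion kept things sorted */

    struct task *t;
    for_each_task_in_pgrp(t, 20)
        printf("pid %d in pgrp 20\n", t->pid);   /* prints 20 and 21 only */
    return 0;
}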

Date: Tue, 18 May 1999 22:16:33 +1000
From: Richard Gooch 
To: Alexander Kjeldaas 
Cc: linux-kernel@vger.rutgers.edu, lm@bitmover.com
Subject: Re: send_sigio() scalability

Alexander Kjeldaas writes:
> On Mon, May 17, 1999 at 01:17:59PM +1000, Richard Gooch wrote:
> >   Hi, all. I just noticed that send_sigio() walks the task list,
> > looking for the process(es) to send a signal to. This appears to be a
> > potential scalability problem, as a large number of tasks is going to
> > slow this down.
> > 
> > Has anyone done any benchmarking to evaluate the effect of this? In
> > the absence of numbers, how about some convincing handwaving? Is it
> > worth exploring options to fix this?
> > 
> > I can think of one quick and simple hack to fix this for 90% (maybe
> > 99%) of cases: record the task pointer at fcntl() time. Then at
> > send_sigio() time, if the recorded pid and task match, skip the
> > task list walk.
> 
> In the case where send_sigio is sending a signal to a specific
> process, why isn't it using find_task_by_pid()?

Good question. I think the single-process case is the general
case. Certainly, for a WWW server which wants to use Linux completion
ports, it doesn't make sense to have multiple processes signalled on
an event, since then we get back to the thundering hurd(sic) problem.

I'll send in a patch to use find_task_by_pid(). Thanks for pointing it
out.

Larry: I know this isn't the Grand Unified Abstraction[tm] you were
hoping for, but how does this solution grab you? It's certainly less
hackish than the simple fix I suggested, and should work reasonably
well provided the underlying hash function distributes well. I quite
like it.

> For the other case, I'm working on a more general solution to most
> of the for_each_task() uses in the kernel.  I made a patch last year
> that adds fast for_each_task_in_pgrp(), and
> for_each_task_in_session() macros.  I have measurements that prove
> that the patch helps a lot for the fork()+exit() case.  Please look
> at

> http://www.guardian.no/~astor/pidhash/pidhash.gif

Looks impressive. I'm glad someone is looking at this problem.

> for a graph of this that I made for the patch for Linux 2.1.90.  It
> shows that when you have a lot of processes running, doing a
> fork()+exit() takes a long time.  This is due to for_each_task()'ing
> in exit to send signals.  I'd guess that send_sigio() will perform
> similar to the above graph.

That feels right.

> I've started to port the patch to 2.2.8.

How about doing it for 2.3.3 instead? My guess is that 2.2.x is
off-limits for this kind of development.

                                Regards,

                                        Richard....

Date: Wed, 19 May 1999 07:40:28 -0700
From: Dan Kegel 
To: "linux-kernel@vger.rutgers.edu" 
Subject: nonblocking disk I/O?

Dean Gaudet wrote:
> sendfile() blocks as well [on disk i/o]. ...
> There is no "completion" call for sendfile() -- you need a 
> completion call in order to do things asynchronously.
> 
> Or you can peek at the linux kernel source, mm/filemap.c, search for
> do_generic_file_read, notice the wait_on_page() call.

Dang.  I notice that the subject of nonblocking disk I/O
has come up several times in the past (e.g.
http://www.deja.com/getdoc.xp?AN=373588318
http://x31.deja.com/getdoc.xp?AN=459141949 ).

It'd be real nice to be able to write a single-threaded
http server that didn't block all clients when one
client needed to do disk I/O.  As it stands, this seems
impossible with Linux.  (And aio_read won't help, I hear
it uses threads, which would be cheating.)

Is this something that we could add to the wish list for 2.3?

Would it require adding something like the minischeduler
built in to RPC (net/sunrpc/sched.c)?
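
(Editor's note: for illustration, the shape of interface a single-threaded
server would want is roughly POSIX AIO, sketched below.  As the message
above says, the glibc aio implementation of the day services these calls
with threads rather than true kernel-level asynchronous disk I/O, so this
shows the desired API, not a solution; names and buffer sizes are
arbitrary.)

    /* Editorial sketch: start a disk read without blocking the event
     * loop, then poll for completion later.  Link with -lrt. */
    #include <aio.h>
    #include <errno.h>
    #include <string.h>

    static char buf[8192];

    int start_read(int fd, struct aiocb *cb)
    {
            memset(cb, 0, sizeof(*cb));
            cb->aio_fildes = fd;
            cb->aio_buf    = buf;
            cb->aio_nbytes = sizeof(buf);
            cb->aio_offset = 0;
            return aio_read(cb);                  /* returns immediately */
    }

    int read_done(const struct aiocb *cb)
    {
            return aio_error(cb) != EINPROGRESS;  /* done (or failed)? */
    }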

- Dan

p.s. This linuxhq.com thing is getting me down. 
I can't read the mailing list the way I like to...
even Jim Pick's alternate site (204.209.212.113) seems
broken...

Date: Thu, 03 Jun 1999 15:24:02 +0200
From: Juergen Schmidt 
To: linux-kernel@vger.rutgers.edu, new-httpd@apache.org
Subject: Linux and Apache performance, Update

Hello all,

I promised to keep you updated on my results, concerning the tests of my
comparison between NT/IIS and Linux/Apache.

The general result is: with one 100 MBit interface nothing really
spectacular happens. Differences are in the range of 10%.

With two interfaces, Linux clearly loses.

I set up a page with some detailed info
(http://www.heise.de/ct/Redaktion/ju/linux-perf.html).
Please don't link to this page, as it will disappear as soon as my
article is ready. It is meant as information for you, not as a public
resource.

Some quite interesting results concern SMP performance and Andrea's
patches.

Comments are welcome (I have not subscribed to the lists, so please CC
me).

thanks again for your help, ju

-- 
Juergen Schmidt   Redakteur/editor  c't magazin      PGP-Key available
Verlag Heinz Heise GmbH & Co KG, Helstorferstr. 7, D-30625 Hannover
EMail: ju@ct.heise.de - Tel.: +49 511 5352 300 - FAX: +49 511 5352 417

Date: Thu, 03 Jun 1999 09:55:33 -0700
From: Dan Kegel 
To: ju@ct.heise.de
Cc: linux-kernel@vger.rutgers.edu, new-httpd@apache.org
Subject: re: Linux and Apache performance, Update 

Juergen,
thanks for keeping us informed!

First:
I saw one possible problem straight off:
you compiled Apache with 
  -D USE_FCNTL_SERIALIZED_ACCEPT 
To get the benefit of the wake-one kernel patches,
you have to compile instead with
  -DSINGLE_LISTEN_UNSERIALIZED_ACCEPT

As far as I know, the way you compiled Apache, 
the kernel is still waking up all processes
("the thundering herd") when it should only
be waking up one.  This hurts SMP performance.

Please have a look at the section "Suggestions
for future benchmarks" in 
http://www.kegel.com/mindcraft_redux.html

Second: I see your load generating program uses
multiple processes, and you mention that it can't
generate enough load to really hit an SMP server hard.
Perhaps you should try a different load client, e.g. 
http://www.acme.com/software/http_load/  It
might be able to load your server down more effectively.

Thanks,
Dan

Date: 03 Jun 1999 18:53:54 +0200
From: Andi Kleen 
To: dank@alumni.caltech.edu
Cc: linux-kernel@vger.rutgers.edu, ju@ct.heise.de
Subject: Re: Linux and Apache performance, Update

dank@alumni.caltech.edu (Dan Kegel) writes:

> Juergen,
> thanks for keeping us informed!
> 
> First:
> I saw one possible problem straight off:
> you compiled Apache with 
>   -D USE_FCNTL_SERIALIZED_ACCEPT 
> To get the benefit of the wake-one kernel patches,
> you have to compile instead with
>   -DSINGLE_LISTEN_UNSERIALIZED_ACCEPT

In the multiple interfaces case apache has to handle
multiple listen sockets with poll - and the current
thundering herd fix doesn't work in that situation because
that would break old programs.

It doesn't look too good right now - but in half a year this
will hopefully be very different.


-Andi

Date: Fri, 04 Jun 1999 00:44:21 -0700
From: Dan Kegel 
To: Alex Belits 
Cc: Andi Kleen , linux-kernel@vger.rutgers.edu, ju@ct.heise.de,
     new-httpd@apache.org
Subject: Re: Linux and Apache performance, Update

Alex Belits wrote:
> On 3 Jun 1999, Andi Kleen wrote:
> > In the multiple interfaces case apache has to handle
> > multiple listen sockets with poll - and the current
> > thundering herd fix doesn't work in that situation because
> > that would break old programs.
> 
>   If all interfaces are used for the same server, INADDR_ANY can be used
> instead of multiple sockets. Will thundering herd fix work in that case?

http://www.apache.org/docs/bind.html says:
"By default, it listens to all addresses on the machine, and to the port 
as specified by the Port directive in the server configuration."
As long as the user doesn't use Listen directives, Apache
shouldn't need to use multiple listen sockets, so we should be fine.
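
(Editor's note: for reference, a single wildcard listen socket looks like
the sketch below; with only one socket to block in accept() on, the
wake-one behaviour applies and no poll() over several listen sockets is
needed.  The helper name is made up for illustration.)

    /* Editorial sketch: one INADDR_ANY socket covers every interface. */
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <string.h>
    #include <sys/socket.h>

    int listen_any(unsigned short port)
    {
            struct sockaddr_in sin;
            int s = socket(AF_INET, SOCK_STREAM, 0);

            if (s < 0)
                    return -1;
            memset(&sin, 0, sizeof(sin));
            sin.sin_family      = AF_INET;
            sin.sin_addr.s_addr = htonl(INADDR_ANY);   /* all local addresses */
            sin.sin_port        = htons(port);
            if (bind(s, (struct sockaddr *) &sin, sizeof(sin)) < 0 ||
                listen(s, 128) < 0)
                    return -1;
            return s;
    }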

BTW, it looks like http://www.apache.org/docs/misc/perf-tuning.html
doesn't know yet about Unixes with wake-one semantics on new connection
arrival.  Maybe Dean could update it?
- Dan

Date: 4 Jun 1999 07:45:51 GMT
From: Linus Torvalds 
To: linux-kernel@vger.rutgers.edu
Subject: Re: zero-copy TCP fileserving

In article ,
Charles K Hardin   wrote:

>but isn't the real question - should the copy even take place? i have no
>doubt that if a copy occurs, some cheap computation (ie. a cksum) can
>easily be hidden in the data transfer through the CPU.

>but, why should the copy even occur? there is easily enough research
>lingering around these days to show that zero copying is good (Unet for
>instance, as well as ExoKernel). These were direct access to user space
>without copies, but the same philosophy can hold for kernel space.

Zero copy looks good on benchmarks.

It very seldom wins in real life.  You tend to actually want to _do_
something with the data in most cases, and if the memcpy is even close
to being your limiting factor, that real computation is never going to
have a chance in hell..

Zero-copy is mainly useful for routing or for truly pure packet serving. 
The ExoKernel numbers, for example, aren't really from a web-server even
though that's what they claim.  What they really did was an "ethernet
packet server", feeding canned responses to canned input.  It has some
resemblance to web-serving, but not all that much.

Also, many of the zero-copy schemes depend on doing mmu tricks, which
often suck for latency even on a single CPU, and are _truly_ horrible in
SMP environments.  They get good throughput numbers, but latency numbers
are usually not quoted (or latency was bad enough to start with that it
doesn't much show up as a red flag - quite common). 

There are good arguments for avoiding copying unnecessarily.  However,
often trying to drive that logic to its extreme is only going to make
other issues so much worse that it really isn't worth it in any normal
load. 

[ Side tracking from another comment in this thread ]

It's also rather dangerous to look at "scalability" as being the
all-important goal to reach for.  In many cases scalability does not
equal performance.  For example, not only is a single gigabit card a
much more realistic scenario than having four 100Mbit cards and
"scaling" from one to four, but it's actually going to perform better.
Scaling is only good if it was fast to begin with ;)

Oh.  And ask your MIS department whether they want to try to load-
balance four 100Mbit networks by hand, or whether they want to add a
gigabit switch somewhere? There are those kinds of issues too..

		Linus

Date: Fri, 4 Jun 1999 07:02:55 -0700
From: Jim Gettys 
To: Jon P. deOng 
Cc: Andi Kleen , dank@alumni.caltech.edu,
     linux-kernel@vger.rutgers.edu, ju@ct.heise.de
Subject: Re: Linux and Apache performance, Update


> Sender: owner-linux-kernel@vger.rutgers.edu
> From: "Jon P. deOng" 
> Date:         Thu, 03 Jun 1999 15:54:15 -0700
> To: Andi Kleen , dank@alumni.caltech.edu
> Cc: linux-kernel@vger.rutgers.edu, ju@ct.heise.de
> Subject: Re: Linux and Apache performance, Update
> -----
> Have you tried benchmarking using linux and the zeus web server. We have
> found it "Really Whips the Llamas Ass" when compared to apache/linux and
> nt/iis4 for that matter.
> my .02
> jpd
> 

This is completely irrelevant to most people: Apache has over 50% of the
web servers on the Internet, for good reason (works well on most systems,
and has the functionality most people want and need).

So the common, valid, comparison most real people will want to see is
Apache/Linux vs. IIS/NT; those are the viable options for most people.
Even Microsoft's Ballmer admits that Apache has the functionality most
people want (which is why Apache's market share has continued to increase,
and IIS has at best held steady).

A benchmark from Zeus, while academically interesting, is irrelevant
to real people making decisions in the world: most will want to
run Apache on Linux, despite Zeus's better characteristics on benchmarks.

So one way or the other, we need to get Apache on Linux to work at least
as well as it does on other UNIX systems.  Progress is being made here,
from what I see (see recent threads on the topic, where with current
patches, performance on a uniprocessor is roughly comparable to other
UNIXes now), but we have further to go (on SMPs, with multiple network
adaptors, where things are not scaling the way they should be able to).
                                - Jim

Date: Fri, 4 Jun 1999 16:25:32 -0400 (EDT)
From: Greg Ganger 
To: linux-kernel@vger.rutgers.edu
Cc: Greg Ganger 
Subject: Re: zero-copy TCP fileserving


While I applaud Linus for sticking to his philosophical guns, I hope
that few people are compelled to ignore 20 years of networking and
OS research (and practice) based on his misinformed commentary.


> Zero copy looks good on benchmarks.
> 
> It very seldom wins in real life.  You tend to actually want to _do_
> something with the data in most cases, and if the memcpy is even close
> to be your limiting factor, that real computation is going to never have
> a chance in hell.. 

Actually, network data servers (e.g., NFS, FTP, non-CGI HTTP, ...)
really do little to no "real computation".  Further, computation
rates continue to grow faster than memcpy rates (particularly, when
dealing with I/O devices).  As a result, those in industry who build
these kinds of products do, in fact, understand the importance of copy
elimination, and they spend significant energy to achieve it.


> Zero-copy is mainly useful for routing or for truly pure packet serving. 
> The ExoKernel numbers, for example, aren't really from a web-server even
> though that's what they claim.  What they really did was a "ethernet
> packet server", feeding canned responses to canned input.  It has some
> resemblance to web-serving, but not all that much. 

This paragraph is complete hogwash.  First, Linus clearly does not know
what he is talking about with respect to the exokernel's Cheetah web
server -- although it did not support CGI scripts (much like NetApp's
servers don't), it did in fact do HTTP/1.0 for real.  Far more importantly,
though, web service can in fact benefit significantly from zero-copy
techniques.  If you choose to ignore the lessons taught by the exokernel
work, perhaps you will be more compelled by the more recent Rice work
(IO-lite, which won Best Paper at the recent OSDI'99 conference).

Further, other domains (e.g., IPC, high-speed I/O, cluster computing)
benefit significantly from zero-copy cross-domain transfers.  There
are any number of research projects (e.g., U-net, Fbufs) and industry
efforts (e.g., VIA, SiliconTCP) that clearly demonstrate the importance
of copy avoidance.


> Also, many of the zero-copy schemes depend on doing mmu tricks, which
> often suck for latency even on a single CPU, and are _truly_ horrible in
> SMP environments.  They get good throughput numbers, but latency numbers
> are usually not quoted (or latency was bad enough to start with that it
> doesn't much show up as a red flag - quite common). 
> 
> There are good arguments for avoiding copying unnecessarily.  However,
> often trying to drive that logic to it's extreme is only going to make
> other issues so much worse that it really isn't worth it in any normal
> load. 

No arguments here; it is always important to balance performance with
issues of complexity and other systemic properties.  However, such
platitudes hardly provide compelling evidence for ignoring 20 years
of networking/OS research and architecting unnecessary copies into the
core of an OS that wants to be taken seriously...

Greg Ganger
Carnegie Mellon University

Date: Fri, 4 Jun 1999 13:57:58 -0700
From: David S. Miller 
To: ganger@gauss.ece.cmu.edu
Cc: linux-kernel@vger.rutgers.edu
Subject: Re: zero-copy TCP fileserving

   From: Greg Ganger 
   Date:        Fri, 4 Jun 1999 16:25:32 -0400 (EDT)

   As a result, those in industry who build these kinds of products
   do, in fact, understand the importance of copy elimination, and
   they spend significant energy to achieve it.

Guess where that energy goes?  It goes into your latency, and that's a
fact.  Take a look at the latency most of them get; it sucks.

Ask yourself, why can't such networking stacks move a byte through the
TCP stack, end to end, on the order of 100usec?  It's because they
have all of the complexity in there to provide a zero copy framework.
For them it does cost something, even when you don't use it.

I have a hard time just blindly consuming the "increasing
computational speed vs. memory speed" argument, because that logic
leads just to bolting on more crap to the system and thus detracting
from the latency reduction which we should be realizing due to the
increased CPU power.

Furthermore, make no mistake, for transmit we will at some point have
a zero copy scheme available.  But when we get it, it will be done
cheaply and in a well thought out manner.  And you can be certain that
when it does happen, the end to end latency will not suffer like it
does on other systems for the cases where zero copy makes no sense at
all.

Linus knows what he is talking about, latency and simplicity are two
extremely important qualities to preserve.

Later,
David S. Miller
davem@redhat.com

Date: Fri, 4 Jun 1999 16:52:11 -0400 (EDT)
From: Greg Lindahl 
To: Greg Ganger 
Cc: linux-kernel@vger.rutgers.edu
Subject: Re: zero-copy TCP fileserving

> Further, computation
> rates continue to grow faster than memcpy rates (particularly, when
> dealing with I/O devices).  As a result, those in industry who build
> these kinds of products do, in fact, understand the importance of copy
> elimination, and they spend significant energy to achieve it.

With all due respect, I do supercomputing on LinuxAlpha clusters, and
you are wrong. Copy rates to I/O devices are growing slowly (PCI is
too slow for gigabit networking :-(), but memcpy in main memory is
fast and is getting faster. The latest generation of Alphas tripled
main memory memcpy while only doubling CPU power, and Intel ought to
be able to match this. It *is* important to keep copies to as few as
possible, but getting it all the way to _zero_ isn't necessarily a
huge win worth a huge cost. Does great for microbenchmarks, but not
applications.

> Further, other domains (e.g., IPC, high-speed I/O, cluster computing)
> benefit significantly from zero-copy cross-domain transfers.  There
> are any number of research projects (e.g., U-net, Fbufs) and industry
> efforts (e.g., VIA, SiliconTCP) that clearly demonstrate the importance
> of copy avoidance.

Yeah, right. I'm using a user-level interface in the style of U-Net.
The main win is not having to go into the kernel to send or receive
messages. Even if my communications system supported 0 copy, I'd have
to rewrite the program more than a bit to take advantage of 0 copy.

And I already know that my 9.2 gigaflop result (64 old alphas with
myrinet) on MM5 isn't communications limited.

Summary: copies *are* bad. But it isn't necessarily worth a huge
effort to get rid of all of them.

Greg Lindahl
High Performance Technologies, Inc
http://legion.virginia.edu/centurion/Applications.html

Date: Fri, 4 Jun 1999 22:58:16 +0200 (CEST)
From: Ingo Molnar 
To: Greg Ganger 
Cc: linux-kernel@vger.rutgers.edu
Subject: Re: zero-copy TCP fileserving


On Fri, 4 Jun 1999, Greg Ganger wrote:

> Actually, network data servers (e.g., NFS, FTP, non-CGI HTTP, ...)
> really do little to no "real computation". [...]

and that's about it. You have just listed _3_ applications which happen to
make good use of zero-copy - but that's it. Yes, file servers are an
important category, but a fileserver is mostly IO-limited anyway ...

> This paragraph is complete hogwash.  First, Linus clearly does not know
> what he is talking about with respect to the exokernel's Cheetah web
> server -- although it did not support CGI scripts [..]

it 'just' doesn't support CGI scripts? What is your current estimate of
the ratio between dynamic and static content amongst, say, the top 100
web sites on the web? Can we say 0% static pages? [yes, i accept
Dean Gaudet's point that webpages can be split up into dynamic and static
parts; probably that will be the future.]

> servers don't), it did in fact do HTTP/1.0 for real.  Far more importantly,
> though, web service can in fact benefit significantly from zero-copy
> techniques.  If you choose to ignore the lessons taught by the exokernel
> work, perhaps you will be more compelled by the more recent Rice work
> (IO-lite, which won Best Paper at the recent OSDI'99 conference).

these things are not being ignored, really. It's just a limited subset of
applications - and zero-copy makes a real-life difference only in a small
part of those uses. It can make a difference, but due to its limited use it
is only acceptable in a generic OS if the framework puts no complexity into
other parts of the kernel. _This_ is the main difference. Yes, we like
zero-copy, but only if it's a by-product of another, useful concept. (eg.
the newest pagecache in the works will enable us to do receivefile(),
pushfile(), copyfile() and movefile() - basically IO-lite.) I very much
dislike zero-copy-maniac designs which give up just about everything to
get nice bandwidth numbers.

> > Also, many of the zero-copy schemes depend on doing mmu tricks, which
> > often suck for latency even on a single CPU, and are _truly_ horrible in
> > SMP environments.  They get good throughput numbers, but latency numbers
> > are usually not quoted (or latency was bad enough to start with that it
> > doesn't much show up as a red flag - quite common).
> > 
> > There are good arguments for avoiding copying unnecessarily.  However,
> > often trying to drive that logic to it's extreme is only going to make
> > other issues so much worse that it really isn't worth it in any normal
> > load. 
> 
> No arguments here; it is always important to balance performance with
> issues of complexity and other systemic properties.  However, such

no, it doesn't have to be balanced. When making bandwidth vs. latency
decisions, latency is _the_ top goal. Yes, we want to have bandwidth too -
if possible.

> platitudes hardly provide compelling evidence for ignoring 20 years
> of networking/OS research and architecting unnecessary copies into the
> core of an OS that wants to be taken seriously...

don't forget that Linux only became possible because 20 years of OS
research was carefully studied, analyzed, discussed and thrown away.

-- mingo

Date: Fri, 4 Jun 1999 22:07:15 +0100 (BST)
From: Alan Cox 
To: Greg Ganger 
Cc: linux-kernel@vger.rutgers.edu, ganger@gauss.ece.cmu.edu
Subject: Re: zero-copy TCP fileserving

> While I applaud Linus for sticking to his philosophical guns, I hope
> that few people are compelled to ignore 20 years of networking and
> OS research (and practice) based on his misinformed commentary.

A while ago I also did the research. Linus is right for most cases, and the
cases that matter otherwise are streaming existing data - ie sendfile. That's
all you really need to be zero copy. That and message passing, which is a
different game (there it's almost pure latency, not bandwidth) but with the
same needs.

> server -- although it did not support CGI scripts (much like NetApp's
> servers don't), it did in fact do HTTP/1.0 for real.  Far more importantly,
> though, web service can in fact benefit significantly from zero-copy
> techniques.  If you choose to ignore the lessons taught by the exokernel

Static web serving is shipping canned responses off disk (or hopefully out
of disk cache). NFS is canned responses, video streaming is canned
responses. All of this is 'slap on a header and dump the disk to the 
network card'. Everything else is cache optimisation.

Thus I'm not actually sure you are disagreeing; it's just that your definition
of 'canned' is different.

> work, perhaps you will be more compelled by the more recent Rice work
> (IO-lite, which won Best Paper at the recent OSDI'99 conference).

IO-lite type stuff is on the 2.3 plan - thats why Stephen is working on the
kiovec stuff. 

> Further, other domains (e.g., IPC, high-speed I/O, cluster computing)
> benefit significantly from zero-copy cross-domain transfers.  There
> are any number of research projects (e.g., U-net, Fbufs) and industry
> efforts (e.g., VIA, SiliconTCP) that clearly demonstrate the importance
> of copy avoidance.

They are almost entirely based on message passing. Anyone can do zero
copy message passing with clever hardware. That's what VI architecture is
(don't call it VIA; they are a chip manufacturer and quite fed up with being
confused).

Alan

Date: Fri, 4 Jun 1999 18:29:25 -0400 (EDT)
From: Greg Ganger 
To: Alan Cox 
Cc: ganger@gauss.ece.cmu.edu, linux-kernel@vger.rutgers.edu
Subject: Re: zero-copy TCP fileserving


I think I agree with you here -- in fact, had I known that copy
avoidance via IO-lite like buffer management was forthcoming, I
would not have chimed in.  I like Linux's growth and potential,
and I was told that influential people were arguing against copy
avoidance on the grounds that it doesn't matter.  The current
code base and the message to which I reacted both fairly strongly
supported that impression.

If what was really being said is simply that copy avoidance matters
most in the important (but certainly not all-encompassing) class
of lightly-processed (per-byte) data movement activities, then I
agree and I look forward to the new copy-eliminating buffer
management support.

Greg

Date: 8 Jun 1999 11:31:37 -0000
From: felix@convergence.de
To: alan@lxorguk.ukuu.org.uk, hpa@transmeta.com
Cc: linux-kernel@vger.rutgers.edu
Subject: Re: Preparations for ZD's upcoming Apache/Linux benchmark

In local.linux-kernel, you wrote:
> > Besides, it's (a) optional and (b) localized (doesn't mess around with
> > any other kernel code).  You wouldn't want it in a machine that wasn't
> > a web server as its main function, that's for sure.
> Or was a real world web server. 
> 
> But how does khttpd compare to Zab's phhttpd work, which is user space ?

Do you have a list of URLs of fast web servers?
I never heard of phhttpd and I didn't get the URL of khttpd because I
missed a few days worth of email.  The fastest free user-space web
server I know is thttpd from http://www.acme.com/software/thttpd/, which
uses select().

Felix

Date: Tue, 8 Jun 1999 11:36:28 -0400 (EDT)
From: Zach Brown 
To: linux-kernel@vger.rutgers.edu
Subject: Re: Preparations for ZD's upcoming Apache/Linux benchmark


> Once again, do in kernel space what *makes sense* to do in kernel
> space.  In this case, static serving with a policy from user space
> makes pretty good sense to do in kernel space (like knfsd vs unfsd)
> whereas it would be idiotic to do dynamic serving or set policy there.

I think we're far from the case where doing http protocol work in the
kernel makes sense.  Comparing serving static http to nfs in terms of
reasons to put it in kernel space is comparing apples to oranges.  On the
moon.  There are a ton of things we can do to scale real http serving.

we have threads+blocking sendfile().  now remember, in the real world you
have 87 gazillion "long term" modem connections sucking data down. they're
going to be sitting around tossing packets out tcp and occasionally
needing more data.   blocking sendfile handles this case happily..
threads sit there spewing data into tcp from the page cache..
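
(Editor's note: that loop looks roughly like the sketch below; each
per-connection worker pushes a file straight from the page cache into its
socket and simply accepts that sendfile() may block on disk I/O.  The
helper name is made up for illustration.)

    /* Editorial sketch: per-connection worker using blocking sendfile().
     * Linux 2.2's sendfile(out_fd, in_fd, &offset, count); glibc declares
     * it in <sys/sendfile.h>. */
    #include <sys/sendfile.h>
    #include <sys/stat.h>
    #include <sys/types.h>
    #include <unistd.h>

    int send_whole_file(int sock, int filefd)
    {
            struct stat st;
            off_t off = 0;

            if (fstat(filefd, &st) < 0)
                    return -1;
            while (off < st.st_size) {
                    ssize_t n = sendfile(sock, filefd, &off, st.st_size - off);
                    if (n <= 0)
                            return -1;     /* error or unexpected end of file */
            }
            return 0;
    }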

but this sucks in the universe where it's important to fill n 100mb pipes
with unrealistic traffic so you can put your OS on the cover of magazines.
In this world you churn through tons and tons of very quick connections.
the overhead of thread management and scheduling and stuff starts to
stink.

so this is where PHhttpd turns out to be good.  I hacked it up to play
with stephen's siginfo patches and just tossed the http stuff around it
because I needed something to generate IO.  Oops, insta static http
engine. It caches a set of files with precomputed headers on the front and
spins on sigwaitinfo() for work to do.  It goes like smoke, ingo saw it do
3500 connections/second over localhost with a single thread of execution
under 2.3.  It avoids the thundering herd problem completely by passing
the siginfo events on listening sockets between the threads.  With large
tcp buffers it ends up serving connections in a single go with
accept()/fcntl()/read()/write()/close().
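
(Editor's note: the core of that model, stripped down, is sketched below;
phhttpd's real code differs and, as noted, needs the siginfo kernel
patches on 2.2.  Each socket is switched to queued real-time-signal
notification with F_SETSIG, and one thread collects readiness events with
sigwaitinfo().  Helper names are made up for illustration.)

    /* Editorial sketch of the rt-signal ("siginfo") I/O model.  F_SETSIG
     * is Linux-specific; _GNU_SOURCE exposes it in <fcntl.h>. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <signal.h>
    #include <unistd.h>

    #define IO_SIG  (SIGRTMIN + 1)

    int setup_io_signal(void)                       /* call once at startup */
    {
            sigset_t set;

            sigemptyset(&set);
            sigaddset(&set, IO_SIG);
            /* keep IO_SIG blocked so events stay queued for sigwaitinfo() */
            return sigprocmask(SIG_BLOCK, &set, NULL);
    }

    int watch_fd(int fd)
    {
            if (fcntl(fd, F_SETSIG, IO_SIG) < 0)    /* queue IO_SIG, not SIGIO */
                    return -1;
            if (fcntl(fd, F_SETOWN, getpid()) < 0)  /* deliver to this process */
                    return -1;
            return fcntl(fd, F_SETFL,
                         fcntl(fd, F_GETFL) | O_ASYNC | O_NONBLOCK);
    }

    int next_ready_fd(void)
    {
            sigset_t set;
            siginfo_t info;

            sigemptyset(&set);
            sigaddset(&set, IO_SIG);
            if (sigwaitinfo(&set, &info) < 0)
                    return -1;
            return info.si_fd;                      /* fd with pending I/O */
    }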

and it's entirely useless in the real world.  Tell it to do content
encoding, cgi, modules, keepalives, blah blah and it will stare blankly at
you and point out that NT in just the right conditions can spew out tons
of data in labs and uh-oh what will the pointy hairs think.

So I'm all for hacking a better static model into apache that maintains
apache's ultra-configurability.  And yes, that is in the works.  But I
personally find it exceedingly silly to put http in the kernel at this
point, for the usual avoid-code-in-the-kernel reasons.  Run phhttpd
alongside apache and do url magic if you so desire (not that it's anywhere
ready for prime time, and you need kernel patches for siginfo to work,
etc).

-- zach

[oh, hpa, having written this I realize it might be taken as a flame at
you.  It's not at all; this is aimed more at the list in general :) ]

Date: Tue, 08 Jun 1999 09:17:14 -0700
From: Dan Kegel 
To: "linux-kernel@vger.rutgers.edu" 
Subject: Re: Preparations for ZD's upcoming Apache/Linux benchmark

felix@convergence.de asked:
> Do you have a list of URLs of fast web servers?
> I never heard of phhttpd and I didn't get the URL of khttpd because I
> missed a few days worth of email.  The fastest free user-space web
> server I know is thttpd from http://www.acme.com/software/thttpd/, which
> uses select().

I have a list at http://www.kegel.com/c10k.html

(I don't list phhttpd yet, as I haven't heard a URL for it yet.)
- Dan

Date: Tue, 8 Jun 1999 14:19:03 -0400 (EDT)
From: Zach Brown 
To: linux-kernel@vger.rutgers.edu
Subject: Re: Preparations for ZD's upcoming Apache/Linux benchmark

On 8 Jun 1999 felix@convergence.de wrote:

> Do you have a list of URLs of fast web servers?

an initial snapshot can be found near

ftp://ftp.zabbo.net/pub/users/zab/phhttpd

as stated, it's an incomplete weekend hack.  it's not meant to be a working
web server, but it shows the siginfo i/o model well.  included are patches
against 2.2 and 2.3 to make siginfo work (cheers to sct/ingo for doing the
real work there).

however I'll happily take patches that add the rest of the coating to make
it a functional web server if people really want to make a fast-ass static
server out of it.

> missed a few days worth of email.  The fastest free user-space web
> server I know is thttpd from http://www.acme.com/software/thttpd/, which
> uses select().

unfortunately select() has problems scaling.  not only is the actual
implementation using it nasty due to bookkeeping (poll fixes this), but
the concept of passing around these big masks to/from the kernel is yucky.
(poll doesn't fix this)

in phhttpd at least the siginfo fed engine is almost twice as fast as the
poll() stuff, and the poll() syscalls show up prominently in profile runs
when compared to the siginfo rt signal queue stuff.  course, my poll()
code might just suck a lot ;)

-- zach

- - - - - -
007 373 5963


This document was written as part of the Linux Scalability Project. For more information, see our home page.
If you have comments or suggestions, email
linux-scalability@citi.umich.edu

Copyright © 1999 Netscape Communications Corporation. All rights reserved.