Standard (non-concurrent) tests:
--------------------------------

- python: standard python benchmark "pybench".

- php/bs: "bs" benchmark for php, through apache on localhost.

- gzip/bzip2: compressing a certain amount of repetitive ASCII data
  with default parameters.

- Lisp compiler: compiling the "Screamer" constraint programming
  package in CMU Common Lisp 18e in fast mode (no debugging, no type
  checks).

- Lisp screamer: running the Screamer regression tests.

- mplayer configure: configure script inside the Mplayer
  distribution.  Heavy on /bin/sh scripting but also contains many
  very small C compilations.

- mplayer gmake: compile mplayer after this configure run.

- kernel: compiling the FreeBSD kernel used on this benchmarking setup.

- Linux kernel: compiling Linux kernel 2.6.7 using the Redhat-8
  compiler, gcc-3.2 with Redhat messings.  This test is run several
  times with different amounts of allowed concurrency.  The "user CPU
  time" chart for the -j > 1 runs will be nonsense on SMP
  machines.  NOTE: this is running a Linux compiler (a Linux binary)
  in the FreeBSD Linux emulation layer.  It is *not* using a
  crosscompiler.

- freebsd: FreeBSD `make world`

- mozilla: compile Mozilla 1.7.10 in the default configuration in
  FreeBSD ports.

- mozilla-layout and mozilla-content: re-run these two subdirectories
  inside the Mozilla distribution.  These are the two subdirectories
  with the heaviest amount of C++ template expansion and should be
  most different from the pure C compilation tests (e.g. the kernels).

Compilation tests do not include time for extracting tarfiles, running
configure scripts or doing `make depend`.


Mencoder tests:
---------------

- mpeg4 2 a, mpeg4 7 b etc.

  - "2" (or "7" in other tests) stands for vqscale=2 (or 7).  Lower is
    better quality, bigger files.

  - "a" is moderate quality options = v4mv:trell

  - "b" is high quality options = 
    trell:mbd=2:vmax_b_frames=1:v4mv:subq=8:vb_strategy=0:vlelim=0:vcelim=0:
    cmp=2:subcmp=2:precmp=2:predia=1:dia=1:vme=4

  - that means:
    mpeg4 2 a = big files, medium quality
    mpeg4 2 b = big files, high quality
    mpeg4 7 a = smaller files, medium quality
    mpeg4 7 b = smaller files, high quality

- the mjpeg tests have their parameters printed in the charts.
  vqscale tests are variable bitrate, where 2 = bigger files and 7 =
  smaller files.  "vhq" means higher quality video, but this is lower
  quality than the parameters I give for the mpeg4 tests.

All Mencoder tests are run twice: once with a generic mplayer, once
with the fresh mplayer we compiled in earlier tests and which is
allowed to use autodetection of the CPU in use.


Remarks on parallel tests:
--------------------------

Parallel testing always follows the same scheme: one kind of command
is run in the foreground and one kind of command in the background.
There can be any number of background commands and any number of
foreground commands.  All background commands are the same and all
foreground commands are the same.

The test then fires up all foreground commands and all background
commands simultaneously.  Foreground commands are *not* run one after
another; everything runs in parallel.

Background commands need to be much shorter than foreground commands.
Typically a foreground program takes a minute and a background program
a few tenths of a second.

Example: 3x Lisp in foreground and 3x plain http fetch in background
runs 6 processes in parallel.

The benchmark's overall time is determined by when the last of the
foreground commands exits.  Since all foreground commands are the same
they should exit at roughly the same time.  The time it took for the
foreground commands to complete is noted and reported as the main
results time.

The suite then checks how many times the background programs ran.  The
background commands run as <N> parallel instances, and each individual
instance is in an endless loop.

Once the foreground commands are done, the suite counts how many times
the background command was executed, in all loops combined.
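
The scheme described above can be sketched roughly as follows.  This
is an illustration only, not the suite's actual code: the function name
`run_parallel` and the use of threads to drive the background loops are
assumptions made for the example.

```python
import subprocess
import sys
import threading
import time


def run_parallel(fg_cmd, n_fg, bg_cmd, n_bg):
    """Run n_fg copies of fg_cmd while n_bg loops keep re-running bg_cmd.

    Returns (elapsed_seconds, runs_per_background_loop).
    """
    stop = threading.Event()
    counts = [0] * n_bg

    def bg_loop(i):
        # "Endless" loop: rerun the background command until told to stop.
        while not stop.is_set():
            subprocess.run(bg_cmd, check=False)
            if not stop.is_set():
                counts[i] += 1  # count only runs that finished in time

    threads = [threading.Thread(target=bg_loop, args=(i,)) for i in range(n_bg)]

    start = time.monotonic()
    foreground = [subprocess.Popen(fg_cmd) for _ in range(n_fg)]
    for t in threads:
        t.start()

    # The overall time ends when the last foreground command exits.
    for p in foreground:
        p.wait()
    elapsed = time.monotonic() - start

    # Make sure no background loop survives into the next test.
    stop.set()
    for t in threads:
        t.join()
    return elapsed, counts
```

A call such as `run_parallel(fg_cmd, 3, bg_cmd, 3)` would correspond to
the "3x foreground, 3x background" case, with both commands given as
argument lists.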

More concrete:
- 3x Lisp foreground, 3x plain http fetch in the background.
- Let's say the last of the three Lisp processes exits after 1 minute.
- At that time three separate endless loops of fetch have completed a
  number of runs.
- The benchmarking program picks up the number of times each loop ran
  and sums them up.

Example: during the 60 seconds that the foreground programs took,
background loop 1 made 100 runs, background loop 2 made 85 runs and
background loop 3 made 110 runs.  The result is 295 runs through the
background loop and the benchmarking program will report roughly 5
runs/second (295 / 60 = 4.9).  The result is printed in the charts
with "BG for ...".
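
The arithmetic behind the background figure is simple enough to spell
out.  The sketch below is only an illustration; the function name
`background_rate` is made up for this example and is not taken from the
suite.

```python
def background_rate(loop_counts, fg_seconds):
    """Sum the runs of all background loops and convert to runs/second."""
    total_runs = sum(loop_counts)
    return total_runs, total_runs / fg_seconds


# The numbers from the example: three loops, 60 seconds of foreground time.
total, rate = background_rate([100, 85, 110], 60)
print(total, round(rate, 2))  # 295 4.92
```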


Parallel benchmarks come in these flavors:
------------------------------------------

- mencoder in the foreground and runs of the full php benchmark
  suite through apache in the background.

- Lisp compilation runs in the foreground and plain http fetch in the
  background. 

- The Linux kernel compilation is run with various -j options.

- Compiling Mozilla and doing FreeBSD's `make world` at the same
  time.  However, this particular test does not count the number of
  runs like the others do.  The time reported is whatever finishes
  last.  To make the runs more even, the Mozilla compilation does a
  full compilation and then the two subdirectories again.

The "plain http fetches" test is fetching a file of a certain size (20
MB right now) from an apache through localhost using the FreeBSD
`fetch` program.  To take the performance of the harddrive out of the
timing, I use a sparse file.  That means no actual disk blocks exist,
but apache has to deliver 20 MB through the TCP and IP layers to a
userland program on the same host.  All that while some foreground
program(s) eat up CPU.
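
One common way to create such a sparse file is to extend an empty file
to the desired length without writing any data.  The sketch below
assumes a filesystem that supports sparse files and takes "20 MB" to
mean 20 * 1024 * 1024 bytes; the function name and path are
hypothetical, and this is not necessarily how the file used by the
suite was made.

```python
import os


def make_sparse_file(path, size_bytes):
    """Create a file of size_bytes without writing any data blocks."""
    with open(path, "wb") as f:
        # Extending an empty file with truncate() leaves a hole, which
        # filesystems with sparse-file support do not back with blocks.
        f.truncate(size_bytes)


# Hypothetical usage:
#   make_sparse_file("/path/to/docroot/sparse-20mb.bin", 20 * 1024 * 1024)
```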


General setup notes:
--------------------

I use one single setup on one harddrive.

This setup as it is now is intended to compare different hardware
platforms.  I just connect my standard installation to a new machine
and run the suite.  It is not yet suitable to compare different
versions of the base OS, much less to compare different OSes.

I apologize for not having made a more general suite so far.  It is
planned.  However, this is very hard work and I use this
fixed-installation suite for now until I am more confident what kind
of tests are useful in the first place.


Current standard setup:
-----------------------

- FreeBSD-6.0-beta5
- ports tree from FreeBSD-6.0-beta2.  All applications from this tree
- no CPU-specific compilation flags
- custom kernels with internal tests turned off
- malloc debug turned off
- Linux compilations done in linux-base-8 and linux-dev-8

One run through the suite takes approximately 8 hours on a 2.6 GHz
AMD64 machine.

For multi-processor or multi-core machines the same kernel with SMP
turned on is used.


Future outlook on this suite:
-----------------------------

Once I stabilize the tests and separate the useful from the useless
ones, I will change the script to be more general.

First I will change it so that you can run it on any FreeBSD version
(to monitor kernel changes' effect on performance).

Then I will isolate those tests that can be run on non-FreeBSD
systems so that you can run it on other Unixes.

Note however that the suite relies on gcc pretty heavily.  In
addition, Linux versions differ in compilation speed due to the fact
that the grade of insanity when expanding the glibc include files
varies.  It will always be required to compare different OS kernels by
dragging around a complete chroot environment containing the reference
compiler, including libs and include files.

(of course you can opt to test a different compiler, but using a
different compiler and kernel at the same time will be nonsense)


Various details
---------------

This suite is careful not to run tests too soon after one another, so
that delayed effects do not influence follow-up tests.  An example is
that after a `make clean` or other deleting operations there is a sync
and a wait to make sure that softupdates don't kick in while a new
timed test is already running.
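
The "sync and wait" step could look roughly like the sketch below.
The function name `settle` is hypothetical, and the wait time is left
to the caller because no specific value is given here.

```python
import os
import time


def settle(wait_seconds):
    """Flush pending filesystem writes, then pause before the next test."""
    os.sync()                 # Unix only: schedule dirty buffers for writing
    time.sleep(wait_seconds)  # give delayed work time to finish
```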

For the parallel tests the suite is careful to make sure no old
background loops are around when new tests start.  This is the most
tricky part of the suite, BTW.

This suite is a first-class hardware and overclocking test.  Prime95
pales in comparison.


Why do I use user CPU time, not user+sys?
-----------------------------------------

Because the SMP kernel running on an SMP machine causes higher sys
values for equally fast CPUs, even when just one program is running.
It actually causes slightly higher user values, too (from cache
thrashing when choosing a different CPU) but that is minor.  To use
user+sys I would have to run the single-threaded kernel for
single-threaded benchmarks.  That costs a lot of time, basically twice
as much.  Since user and user+sys for the same kernel correlate
precisely (from what I have seen) it is better to just take user and
spend the time which would be required for a second run to test
something actually different.


Graphs and charting
-------------------

The graphs are made with the pycharts package, writing Python programs
from a Perl script (these fools had to use indentation for block
boundaries - why? Anyway...).

All results are normalized to one reference machine, which is set to
100.
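
As an illustration of this normalization, the sketch below scales every
machine's value so that the reference machine comes out as 100.  It
assumes a plain linear scaling and uses made-up machine names and
numbers; the real charting script may differ in detail.

```python
def normalize(results, reference):
    """Scale all values so that the reference machine becomes 100."""
    ref_value = results[reference]
    return {name: value / ref_value * 100 for name, value in results.items()}


# Made-up timings in seconds, purely for illustration.
print(normalize({"ref": 50.0, "other": 75.0}, "ref"))
# {'ref': 100.0, 'other': 150.0}
```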

Charts usually come in an unlimited variant and in one variant which
concentrates on the area around 100 (the "cutoff" variant).

Charts are made for wall clock time and for user CPU time.

Parallel tests are usually not reported in the user CPU time charts.

The charting script builds an average of all values that it finds for
one given platform in the results file.  It then goes over the values
a second time and kicks out all values that are out of bounds.  For
now, values are considered out of bounds if they are more than 2 times
or less than 0.5 times the first average (which includes the values
themselves, so this test is really not that strict).  I will tighten
that up after I clean up my results so far.
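
The two-pass filter can be sketched as follows.  This is an
illustration and not the charting script itself: the function name is
hypothetical, and averaging the surviving values (falling back to the
first average if nothing survives) is an assumption about what happens
after the filtering.

```python
def filtered_average(values, low=0.5, high=2.0):
    """Average the values, drop outliers, then average what is left."""
    first = sum(values) / len(values)
    # Second pass: keep only values within [low, high] times the first average.
    kept = [v for v in values if low * first <= v <= high * first]
    if not kept:
        # Assumption: if every value is rejected, use the first average.
        return first, kept
    return sum(kept) / len(kept), kept


# Made-up values: the first average is 175, so 400 (> 350) is dropped.
print(filtered_average([100, 102, 98, 400]))  # (100.0, [100, 102, 98])
```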