Standard (non-concurrent) tests:
--------------------------------

- python: standard Python benchmark "pybench".
- php/bs: "bs" benchmark for PHP, through apache on localhost.
- gzip/bzip2: compressing a certain amount of repetitive ASCII data
  with default parameters.
- Lisp compiler: compiling the "Screamer" constraints programming
  package in CMU Common Lisp 18e in fast mode (no debugging, no type
  checks).
- Lisp screamer: running the Screamer regression tests.
- mplayer configure: configure script inside the Mplayer distribution.
  Heavy on /bin/sh scripting but also contains many very small C
  compilations.
- mplayer gmake: compile mplayer after this configure run.
- kernel: compiling the FreeBSD kernel used on this benchmarking setup.
- Linux kernel: compiling Linux kernel 2.6.7 using the Redhat-8
  compiler, gcc-3.2 with Redhat messings. This test is run several
  times with different amounts of allowed concurrency. The "user CPU
  time" chart for the -j > 1 runs will be nonsense on SMP machines.
  NOTE: this is running a Linux compiler (a Linux binary) in the
  FreeBSD Linux emulation layer. It is *not* using a crosscompiler.
- freebsd: FreeBSD `make world`
- mozilla: compile Mozilla 1.7.10 in the default configuration in
  FreeBSD ports.
- mozilla-layout and mozilla-content: re-run these two subdirectories
  inside the Mozilla distribution. These are the two subdirectories
  with the heaviest amount of C++ template expansion and should be
  most different from the pure C compilation tests (e.g. the kernels).

Compilation tests do not include time for extracting tarfiles, running
configure scripts or doing `make depend`.

Mencoder tests:
---------------

- mpeg4 2 a, mpeg4 7 b etc.
- "2" (or "7" in other tests) stands for vqscale=2 (or 7). Lower is
  better quality, bigger files.
- "a" is moderate quality options = v4mv:trell
- "b" is high quality options =
    trell:mbd=2:vmax_b_frames=1:v4mv:subq=8:vb_strategy=0:vlelim=0:vcelim=0:
    cmp=2:subcmp=2:precmp=2:predia=1:dia=1:vme=4
- that means:
    mpeg4 2 a = big files, medium quality
    mpeg4 2 b = big files, high quality
    mpeg4 7 a = smaller files, medium quality
    mpeg4 7 b = smaller files, high quality
- the mjpeg tests have their parameters printed in the charts.
  vqscale tests are variable bitrate, where 2 = bigger files and
  7 = smaller files. "vhq" means higher quality video, but this is
  lower quality than the parameters I give for the mpeg4 tests.

All Mencoder tests are run twice: once with a generic mplayer, once
with the fresh mplayer we compiled in earlier tests and which is
allowed to use autodetection of the CPU in use.

Remarks on parallel tests:
--------------------------

Parallel testing always follows the same scheme: one kind of command
is run in the foreground and one kind of command in the background.
There can be any number of background commands and any number of
foreground commands. All background commands are the same and all
foreground commands are the same.

The test then fires up all foreground commands and all background
commands simultaneously. Foreground commands are *not* run one after
another, everything is parallel.

Background commands need to be much shorter than foreground commands.
Typically a foreground program takes a minute and a background program
a few tenths of a second.

Example: 3x Lisp in foreground and 3x plain http fetch in background
runs 6 processes in parallel.

The benchmark's overall time is determined by when the last of the
foreground commands exits. Since all foreground commands are the same
they should exit at roughly the same time.

The time it took for the foreground commands to complete is noted and
reported as the main results time. The suite then looks up how many
times the background programs ran.
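The scheme above can be sketched roughly as follows. This is not the
suite's own code: Python callables stand in for the real commands,
threads stand in for processes, and the name `run_parallel` is made up
for this illustration.

```python
import threading
import time


def run_parallel(foreground, background, n_fg, n_bg):
    """Run n_fg foreground calls and n_bg endless background loops at once.

    Returns (elapsed_seconds, total_background_runs, runs_per_second).
    Hypothetical helper; the callables are stand-ins for real commands.
    """
    stop = threading.Event()
    counts = [0] * n_bg          # completed runs per background loop

    def bg_loop(i):
        # Endless loop; only completed runs are counted.
        while not stop.is_set():
            background()
            counts[i] += 1

    fg_threads = [threading.Thread(target=foreground) for _ in range(n_fg)]
    bg_threads = [threading.Thread(target=bg_loop, args=(i,))
                  for i in range(n_bg)]

    start = time.monotonic()
    for t in fg_threads + bg_threads:
        t.start()                # everything starts at the same time
    for t in fg_threads:
        t.join()                 # wait for the last foreground command
    elapsed = time.monotonic() - start
    snapshot = list(counts)      # counts at the moment the foreground is done

    stop.set()                   # make sure no old background loops survive
    for t in bg_threads:
        t.join()

    total = sum(snapshot)
    return elapsed, total, total / elapsed


if __name__ == "__main__":
    elapsed, total, rate = run_parallel(
        foreground=lambda: time.sleep(0.5),
        background=lambda: time.sleep(0.01),
        n_fg=3, n_bg=3)
    print(f"{elapsed:.2f} s, {total} background runs, {rate:.1f} runs/second")
```

The snapshot is taken before the background loops are told to stop, so
a background run that is still in flight when the last foreground
command exits is not counted.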
The background commands run as parallel instances, and each individual
instance is in an endless loop. When the foreground commands are done,
the suite looks up how many times the background command was executed,
in all loops combined.

More concrete:
- 3x Lisp foreground, 3x plain http fetch in the background.
- Let's say the last of the three Lisp processes exits after 1 minute.
- At that time three separate endless loops of fetch have completed a
  number of runs.
- The benchmarking program picks up the number of times each loop ran
  and sums them up.

Example: during the 60 seconds that the foreground programs took,
background loop 1 made 100 runs, background loop 2 made 85 runs and
background loop 3 made 110 runs. The result is 295 runs through the
background loop and the benchmarking program will report about
5 runs/second (295 / 60).

The result is printed in the charts with "BG for ...".

Parallel benchmarks come in these flavors:
------------------------------------------

- mencoder in the foreground and runs over the full php benchmark
  suite through apache in the background.
- Lisp compilation runs in the foreground and plain http fetch in the
  background.
- The Linux kernel compilation is run with various -j options.
- Compiling Mozilla and doing FreeBSD's `make world` at the same time.
  However, this particular test does not count the number of runs
  like the others do. The time reported is whatever finishes last.
  To make the runs more even, the Mozilla compilation does a full
  compilation and then the two subdirectories again.

The "plain http fetches" test is fetching a file of a certain size
(20 MB right now) from an apache through localhost using the FreeBSD
`fetch` program. To take the performance of the harddrive out of the
timing, I use a sparse file. That means no actual disk blocks exist,
but apache has to deliver 20 MB through the TCP and IP layers to a
userland program on the same host. All that while some foreground
program(s) eat up CPU.
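For illustration, a sparse file like the one used for the fetch test
can be created as sketched below. This is not how the suite itself
does it; the function name and the file name are hypothetical, 20 MB
is taken as 20 * 1024 * 1024 bytes, and whether the file really ends
up without allocated blocks depends on the filesystem.

```python
import os
import tempfile


def make_sparse_file(path, size_bytes):
    """Create a file of size_bytes without writing any data."""
    with open(path, "wb") as f:
        # Extend the empty file to the target size. Nothing is written,
        # so filesystems that support holes allocate no data blocks.
        f.truncate(size_bytes)
    return os.stat(path)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "sparse-20mb.bin")   # hypothetical name
        st = make_sparse_file(path, 20 * 1024 * 1024)
        # On Unix, st_blocks counts allocated 512-byte blocks.
        print("apparent size:", st.st_size)
        print("allocated bytes:", st.st_blocks * 512)
```

On a filesystem with sparse file support the allocated size should be
far below the apparent size, while a reader still receives the full
20 MB of zero bytes.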

General setup notes:
--------------------

I use one single setup on one harddrive. This setup as it is now is
intended to compare different hardware platforms. I just connect my
standard installation to a new machine and run the suite.

It is not yet suitable to compare different versions of the base OS,
much less to compare different OSes.

I apologize for not having made a more general suite so far. It is
planned. However, this is very hard work and I use this
fixed-installation suite for now until I am more confident what kind
of tests are useful in the first place.

Current standard setup:
-----------------------

- FreeBSD-6.0-beta5
- ports tree from FreeBSD-6.0-beta2. All applications from this tree
- no CPU-specific compilation flags
- custom kernels with internal tests turned off
- malloc debug turned off
- Linux compilations done in linux-base-8 and linux-dev-8

One run through the suite takes approximately 8 hours on a 2.6 GHz
AMD64 machine.

For multi-processor or multi-core machines the same kernel with SMP
turned on is used.

Future outlook on this suite:
-----------------------------

Once I stabilize the tests and separate the useful from the useless
ones, I will change the script to be more general.

First I will change it so that you can run it on any FreeBSD version
(to monitor kernel changes' effect on performance).

Then I will isolate those tests that can be run on non-FreeBSD so that
you can run it on other Unixes. Note however that the suite relies on
gcc pretty heavily. In addition, Linux versions differ in compilation
speed because the grade of insanity when expanding the glibc include
files varies.

Comparing different OS kernels will always require dragging around a
complete chroot environment containing the reference compiler,
including libs and include files.
(Of course you can opt to test a different compiler, but changing
compiler and kernel at the same time will be nonsense.)

Various details
---------------

This suite is careful not to run tests too close after another, so
that delayed effects do not influence follow-up tests. An example is
that after a `make clean` or other deleting operations there is a sync
and a wait to make sure that softupdates don't kick in when we already
run a new timed test.

For the parallel tests the suite is careful to make sure no old
background loops are around when new tests start. This is the most
tricky part of the suite, BTW.

This suite is a first-class hardware and overclocking test. Prime95
pales in comparison.

Why do I use user CPU time, not user+sys?
-----------------------------------------

Because the SMP kernel running on an SMP machine causes higher sys
values for equally fast CPUs, even when just one program is running.
It actually causes slightly higher user values, too (from cache
thrashing when choosing a different CPU) but that is minor.

To use user+sys I would have to run the single-threaded kernel for
single-threaded benchmarks. That costs a lot of time, basically twice
as much. Since user and user+sys for the same kernel correlate
precisely (from what I have seen) it is better to just take user and
spend the time which would be required for a second run to test
something actually different.

Graphs and charting
-------------------

The graphs are made with the pycharts package, writing Python programs
from a Perl script (these fools had to use indentation for block
boundaries - why? Anyway...).

All results are normalized to one reference machine, which is set to
100.

Charts usually come in an unlimited variant and in one variant which
concentrates on the area around 100 (the "cutoff" variant).

Charts are made for wall clock time and for user CPU time. Parallel
tests are usually not reported in the user CPU time charts.

The charting script builds an average of all values that it finds for
one given platform in the results file. It then goes over the values
a second time and kicks out all values that are out of bounds. For
now, values are considered out of bounds if they are more than 2 times
or less than 0.5 times the first average (which includes the value
itself, so this test is really not that strict). I will tighten that
up after I clean up my results so far.
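
The two-pass filter described above can be sketched like this. It is
an illustration only, not the charting script itself (which is Perl):
the function name is made up, the bounds are passed as parameters so
they can be tightened later, and values are assumed to be positive
timings.

```python
def filter_outliers(values, low=0.5, high=2.0):
    """Drop values outside [low, high] times the first average.

    Returns (kept_values, first_average). Hypothetical helper; the
    first average includes the outliers themselves, as in the text.
    """
    if not values:
        return [], None
    first_avg = sum(values) / len(values)      # pass 1: plain average
    kept = [v for v in values                  # pass 2: kick out outliers
            if low * first_avg <= v <= high * first_avg]
    return kept, first_avg


if __name__ == "__main__":
    kept, first_avg = filter_outliers([10, 11, 9, 40])
    # first average is 17.5, so the bounds are 8.75 and 35; 40 is dropped
    print(kept, first_avg)
```

Because the outlier itself pulls the first average up, a value has to
be far off before it is rejected, which is why the test is lenient.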