The following is a (long -- sorry) amalgamation of various articles
culled from Sun-Spots and Sun-Managers regarding what has come to
be termed 'The Pmeg Thrashing Problem' that seems endemic to
SparcStations. We experienced severe performance degradation on our
machines with high context switch, high interrupt count, lots of free
memory and not so high cpu usage. We had plenty of swap and 28 MB of
memory. Our job mix involves XDM, xterminal users, dynamically
linked X windows applications (lots of 'em) and extensive usage of
System V shared memory segments. Seems like it was custom tailored
to bring this poor machine to its knees :-}. As I plowed through
Kbytes of old Sun-Spots and Sun-Managers articles, a golden thread
emerged which seemed to describe our problem. I thought it might
be useful to collect it all together and present it in this forum.
It is presented in roughly chronological order and includes private
correspondence never before published (does that sound like an Enquirer
byline or what?). I have done some editing out of names and non-related
information [indicated by square brackets like this -- ml]. Each
note is prefaced by a header like the following. Enjoy!
+=========================================================================
The first message I hold that hints about this [extensively abridged]:
+=========================================================================
Date: Thu, 14 Jun 90 07:45:18 EDT
From: sid@think.com
To: sun-managers@eecs.nwu.edu
Subject: Synopsis of 4.03 Memory Management Question
Thanks to everyone that took the time to respond to my query about
performance problems with a 4/370. No one had a solution to my
problem. A few basically said no one has an answer to my problems. I
will give a brief synopsis of the responses below and then
follow that with the actual letters.
[...]
He also says there is a Usenix Conference article describing
the memory system. If I can find it, I will let you guys
know. If anyone has a copy, drop me a line.
[Conference Proceedings of the Summer, 1987 Usenix Technical Conference and
Exhibition: "Virtual Memory Architecture in SunOS", Gingell, Moran & Shannon;
"Shared Libraries in SunOS", Gingell, Lee, Dang & Weeks -- ml]
[...]
Mark Weiser sent a note about problems with the memory
manager when the maps are larger than 32MB. I do not
think this is the problem with this machine, but I will
look at it with this in mind.
The Letters
**************************************************************
[...]
From briggs.pa@Xerox.COM Wed Jun 13 20:28:07 1990
Sid,
I forwarded your question to someone who had experienced memory problems
here at PARC and had been tracking them down with Sun. Here's his response.
----- Begin Forwarded Messages -----
From: Mark Weiser <mark@arisia.Xerox.COM>
To: briggs@Xerox.COM
Yes, Sun's memory management is not good at all for lots of memory.
Our main trouble is that the memory mapping hardware on sparcstations
cannot map more than 32MB at once, and even if two processes are sharing
pages they need different maps. The result is that the 32MB limit
is easily exceeded even on a 16 MB machine.
When it is exceeded, even slightly, SunOS does not behave well at all,
essentially thrashing in the kernel trying to keep the MMU filled.
The symptom of this is high % system time even for your cpu bound
processes.
But it does not show up as paging and swapping activity, so I am not
sure that is the same as Sid's troubles.
To answer his specific question, no, there is no way in 4.0.3 or 4.1 to
limit the amount of memory allocated to files. However, 4.1 is supposed
to do this more intelligently. 4.1 does NOT manage the MMU more
intelligently.
+=========================================================================
Further amplification by Sid which nailed the problem accurately:
+=========================================================================
Date: Tue, 3 Jul 90 09:21:15 EDT
From: sid@think.com
To: sun-managers@eecs.nwu.edu
Subject: SunOS 4.03 Memory Management Problems
I sent a note out about a month ago regarding anomalous behaviour
we were seeing on a Sun 4/370 with 56 Megabytes of memory. The
system started to page heavily while there were still 20 megabytes
(35%) of memory free. I did not get any solutions from this
mailing list (a first I think) but I did pick up something from
comp.sys.sun on the Usenet that has solved the problem. Here is the
note from the Usenet:
From: murthy@algron.cs.cornell.edu (Chet Murthy)
Subject: Re: SparcStation I Memory Leak? (a possible answer)
Date: 14 Jun 90 22:10:45 GMT
X-Refs: Original: v9n211
X-Sun-Spots-Digest: Volume 9, Issue 210, message 12
murthy@algron.cs.cornell.edu (Chet Murthy) writes:
>I have been running a large LISP application on a SparcStation I for a
>while now, and I have noticed some really awful problems with the
>allocation of memory as time goes on.
Well, after some talking with a Sun OS ambassador at a new products
session, I found out some interesting stuff.
The phenomenon is called "pmeg stealing". I'm not sure what's going on,
exactly, but the idea seems to be that somebody in the kernel is stealing
memory from the pool, and not putting it back.
So it looks like there's less and less. The fix, from someone who may
choose to remain anonymous (otherwise, he can raise his hand - I didn't
figure this out myself) is to turn off the swapper, leaving only the pager
running:
To turn off swap:
% su
# adb -wk /vmunix /dev/mem
nosched?W 1
^D
# reboot
And I've gotten conflicting reports as to whether it is fixed in 4.1 or
not. So we'll just have to wait and see...
murthy@cs.cornell.edu
We have applied this "fix" to two of our multiuser systems and it has
had wonderful results. The system starts paging activity at around
500K bytes of free memory instead of 20 Mbytes. We have not seen any
failures yet that can be attributed to the swapper being turned off.
Your mileage may vary....
sid@think.com
+=========================================================================
The Definitive Article by Gordon Irlam at the University of Adelaide:
+=========================================================================
Date: 9 Jul 90 00:09:14 GMT
From: gordoni@chook.ua.oz.au (Gordon Irlam)
Subject: Sun-4 MMU Performance
Sun-Spots Digest: v9n257
A Guide to Sun-4 Virtual Memory Performance
===========================================
Gordon Irlam, Adelaide University.
(gordoni@cs.ua.oz.au or gordoni@chook.ua.oz.au)
Throughput on a Sparcstation drops substantially once the amount of active
virtual memory exceeds 16M, and by the time it reaches 25M the machine can
be running up to 10 times slower than normal. This is the conclusion I
reach from running a simplistic test program on an otherwise idle
Sparcstation.
Note that the limit involves the amount of ACTIVE virtual memory used.
Additional virtual memory may be consumed by processes that remain idle
without incurring any penalty. (SunOS usually steals pages from idle
processes, so strictly speaking such memory is not normally considered to
be part of the virtual memory consumed.) Also note that it is 16M of
active VIRTUAL memory. Dynamically linked libraries, shared text
segments, and copy on write forking mean that the amount of PHYSICAL
memory used could conceivably be as little as half this value. I would
guess that any physical memory that is added to a typical Sparcstation
beyond around 14M will effectively only be used as a disk cache.
This problem exists on all Sun-4 systems. The problem is a result of
poorly designed MMU hardware, and the failure of the operating system to
attempt to minimize the effects of the design. Sun-4's have a fixed
amount of memory that can be used for storing page tables; on
Sparcstations in particular this memory area is far too small.
This posting quantifies to some extent the performance losses resulting
from the Sun-4 memory management subsystem, describes the cause of the
problem, and suggests work-arounds that may be useful in overcoming some
of the worst effects of the problem. This posting is based in part on a
previous posting on the subject and the helpful responses received, many
thanks.
1. Sparcstation Virtual Memory Performance
-------------------------------------------
The following table shows the throughput of a Sparcstation-1 as a function
of active virtual memory. The program used to obtain these figures is
included at the end of this posting. The program forks several times and
each child spends its life sequentially accessing pages of a shared 2M
data segment over and over again. Forking and the use of a shared data
segment allow the test program to be run on a machine with very little
physical memory but otherwise do not significantly affect the results
obtained. The first two columns show a sudden performance drop beyond
16M. The remaining columns contain raw data that can be used to
understand what is happening.
virtual relative elapsed user system translation swap
memory speed time time time faults ins
(Mb) (sec) (sec) (sec)
2 1.00 3.5 2.7 0.8 1224 1
4 1.09 6.4 5.3 1.1 1840 1
6 1.14 9.2 8.1 1.2 2442 0
8 1.15 12.2 10.7 1.4 2729 0
10 1.17 15.0 13.3 1.7 3381 0
12 1.17 18.0 16.1 1.9 4121 0
14 1.12 21.8 19.6 2.1 5275 0
16 1.08 25.9 22.6 3.1 8746 2
18 0.57 55.3 29.1 25.9 98251 6
20 0.40 87.7 34.4 53.0 200296 7
22 0.25 151.3 41.8 109.0 406885 12
24 0.11 388.3 61.9 325.3 1202899 20
26 0.12 371.9 62.6 304.5 1118388 22
28 0.06 764.8 91.8 655.4 2412144 39
30 0.03 1607.1 156.3 1446.2 5316313 56
32 0.02 2601.0 221.5 2373.1 8665839 88
Note that the test program is designed to illustrate the nature of the
virtual memory problem in a simple fashion, not to provide realistic
estimates of expected system performance. Realistic performance estimates
can be much better made after having taken into account the issues raised
in sections 3 and 4 below. In particular, system performance will
probably not degrade as rapidly as shown in the above table.
From this table it can be clearly seen that once the amount of active
virtual memory exceeds 16M the system suddenly finds itself having to
handle an incredibly large number of page faults. This causes a drastic
increase in the amount of system time consumed, which results in a
devastating drop in throughput.
I emphasize here that the machine does not run out of physical memory at
16M. It has plenty of free memory during all of the tests - the free list
is several megabytes in size, and the machine does not page fault to disk.
2. A Few Minor Details
-----------------------
This section can be skipped.
The first few figures show a relative speed slightly greater than 1.00.
This is because the cost of invoking the initial image is amortized over a
greater number of processes.
When the tests were repeated, those that had a very low throughput produced
figures that varied by around 30%. The slightest perturbation of the
machine when it is very heavily loaded is found to significantly alter the
elapsed time. Where a test was run several times, the figures
presented above are those with the smallest elapsed time.
The amount of user time consumed grows at a faster rate beyond 16M of
active virtual memory than below 16M. This may be a result of
inaccuracies in the process accounting subsystem. Alternatively it could
be some sort of user cost resulting from context invalidations.
The swapping figures are not significant. They are the result of a
strange feature of SunOS. Once all the page tables for a process's data
segment have been stolen the process is conceptually swapped. This
involves moving text pages that are not currently shared onto the free
list. In this case no such pages exist. But even if they did no disk
activity would occur because the free list has plenty of space. On a real
system this quirk typically adds significantly to the performance
degradation that occurs once the virtual memory limit has been exceeded.
The possibility that the sudden increase in system time beyond 16M is a
result of context switching can be discounted by running similar tests in
which each process uses 4M instead of 2M. A sudden performance drop will
be observed at around 20M. This figure is slightly higher than 16M
because fewer page tables are wasted mapping less than the maximum
possible amount of memory.
The above figures were obtained under SunOS 4.0.3, however subsequent
measurements have shown that essentially identical results are obtained
under SunOS 4.1.
3. Implications for a Real System
----------------------------------
The amount of active virtual memory at which a sudden drop in throughput
occurs, and the severity of the drop should not be viewed as precise
parameters of the system. In a real system the observed performance will
be heavily dependent on process sizes, memory access patterns, and context
switching patterns. For instance, the elapsed time given above for 32M of
active virtual memory would have been five times larger if every data
access had resulted in a page fault. Alternatively, on a real system
locality of address references could have had the opposite effect and
reduced the elapsed time by a factor of 5. The context switching rate has
a significant effect on the performance obtained when the system is short
of pmegs since it determines how long a process will be given to run
before having its pmegs stolen from it. If the context switching rate is
too high processes will get very little useful work done since they will
be spending all their time faulting on the pages of their resident sets,
and never getting a chance to execute when all the pages of their resident
sets are resident.
Because the performance losses are a function of the amount of virtual
memory used, dynamically linked libraries, shared code pages, and copy on
write forking mean that it is possible for these problems to occur on a
machine with substantially less physical memory than the 16M of virtual
memory at which the problem started to occur.
On the other hand locality of reference will reduce the severity of the
problem. Large scientific applications that don't display much data
reference locality will be an exception.
The impression I have is that virtual memory performance will not normally
be a serious problem on a Sparcstation with less than 16M of physical
memory, with between 16M and 32M it could be a problem depending upon the
job mix, and it will almost certainly be a serious problem on any
Sparcstation with 32M or more. If it isn't a problem on a machine with
32M or more you have almost certainly wasted your money buying the extra
memory as you do not appear to be using it.
[It's a sorry tale to go out and buy lots of memory to stop a system
thrashing, install it, turn the machine on, and find the system still
thrashes, but thanks to the large disk cache you have just installed it is
now able to do so at previously unheard of rates.]
A giveaway indication that the virtual memory system is a problem on a
running system is the presence of swapping, as shown by "vmstat -S 5", but
with a free list of perhaps a megabyte or more in size. This swapping
does not involve real disk traffic. Pages are simply being moved back and
forth onto the free list. Note that if you are only running one or two
large processes this swapping behavior will probably not be observed.
Regardless of whether you see this behavior or not vmstat should also be
showing the system spending most of its time in system mode.
The ratio of user time to system time obtained using vmstat should give
you a rough estimate of the cost associated with the virtual memory
management problems. You can get a more accurate estimate by looking at
the number of translation faults (8.7 million in the previous table), and
the time taken to handle them (2400 seconds). Then compute the time taken
to handle a single fault (280us). Now look at a hatcnt data structure in
the kernel using adb.
# adb -k /vmunix /dev/mem
physmem 17f4
hatcnt/8D
_hatcnt:
_hatcnt: 2129059 2034884 19942909 3173659
2685512 0 0 0
$q
#
The 4th word is the total number of pmeg allocations (see below) since the
system has been booted (3173659), while the 5th word is the number of pmeg
allocations that stole a pmeg from another process (2685512). Estimating,
say, 32 faults per stolen pmeg allocation, you can work out the total time
the system has spent handling these faults (7 hours). This time can then
be compared to the total amount of time the system has been up (48 hours).
On a non-Sparcstation Sun-4 you should estimate around 16 faults per
stolen pmeg allocation, rather than 32.
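[The arithmetic above can also be written down as a short C program.
This is only a sketch: the constants are the example figures quoted in
this posting (the 32M row of the table, the hatcnt output, and 48 hours
of uptime), and should be replaced with the numbers observed on your own
machine. -- ml]
/* pmeg_cost.c, back-of-the-envelope cost of pmeg stealing.
 * Compile: cc -O pmeg_cost.c -o pmeg_cost
 */
#include <stdio.h>
int main(void)
{
    double faults     = 8665839.0;  /* translation faults, 32M row of the table */
    double fault_secs = 2373.1;     /* system time spent handling them (sec)    */
    double per_fault  = fault_secs / faults;   /* roughly 280 microseconds      */
    double stolen     = 2685512.0;  /* 5th word of hatcnt: pmeg steals          */
    double per_steal  = 32.0;       /* ~32 faults per steal on a Sparcstation,
                                     * ~16 on other Sun-4s                      */
    double uptime_hrs = 48.0;       /* from "uptime"                            */
    double lost_hrs = stolen * per_steal * per_fault / 3600.0;
    printf("time per fault:         %.0f us\n", per_fault * 1e6);
    printf("time lost to stealing:  %.1f hours\n", lost_hrs);
    printf("fraction of uptime:     %.0f%%\n", 100.0 * lost_hrs / uptime_hrs);
    return 0;
}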
4. The Sun-4 Memory Management Architecture
-------------------------------------------
The 4/460 has a three-level address translation scheme; all other Sun-4
machines have a two-level scheme. Sparcstations have 4k pages; all other
machines have 8k pages. The level 2 page tables (level 3 tables on the
4/460) are referred to by Sun as page management entry groups, or simply
pmegs. Each pmeg on a Sparcstation contains 64 entries, and since the
pages are 4k in size this means that a single pmeg can map up to 256k of
virtual memory. On all other Sun-4 machines the pmegs contain 32 entries,
but the page size is 8k, so that once again a single pmeg can map up to
256k.
Most systems use high speed static RAM to cache individual page table
entries and hence speed up address translations. This is not done on
Sun-4's. Instead all page tables (pmegs) are permanently stored in high
speed static RAM. This results in address translation hardware that is
both simple and reasonably fast. The downside however is that the number
of pmegs that can be stored is limited by the amount of static RAM
available. On the Sparcstations the static RAM can store up to 128 pmegs,
giving a total mapping of up to 32M. A 4/1xx, or 4/3xx can map up to 64M,
a 4/2xx can map up to 128M, and a 4/4xx can map up to 256M of virtual
memory.
32M is the maximum amount of virtual memory that can be mapped on a
Sparcstation; however, since a pmeg can only be used to map pages within a
single contiguous 256k-aligned range of virtual addresses, the amount of
virtual memory mapped when a machine runs out of pmegs will be
substantially less. This is particularly evident when it is realized that
separate pmegs will be assigned to map the text, data, and stack sections
of each process, and some of these will probably be much smaller than 256k.
Currently under SunOS pmegs are never shared between processes even if
they may map identical virtual addresses to identical physical addresses,
as could be the case with a common text segment. Dynamically linked
libraries are also probably bad in this respect as they will require
several pmegs per process, whereas if the process was statically linked
the number of pmegs consumed would be reduced because pmegs would only be
consumed mapping in the routines that are actually used.
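[To make the 256k granularity concrete, the following illustrative C
program works through the arithmetic: a segment needs one pmeg for every
256k-aligned chunk of virtual address space it touches, however sparsely,
and pmegs are never shared between processes. The addresses and segment
sizes are invented for the example. -- ml]
/* pmegs_needed.c, illustrate the pmeg consumption arithmetic.
 * Compile: cc -O pmegs_needed.c -o pmegs_needed
 */
#include <stdio.h>
#define PMEG_SPAN (256 * 1024UL)    /* virtual memory mapped by one pmeg */
/* Number of pmegs needed to map [start, start+len), len > 0. */
static unsigned long pmegs_needed(unsigned long start, unsigned long len)
{
    unsigned long first = start / PMEG_SPAN;
    unsigned long last  = (start + len - 1) / PMEG_SPAN;
    return last - first + 1;
}
int main(void)
{
    /* A hypothetical process: 48k of text, 120k of data+bss, 16k of stack,
     * and a 600k dynamically linked library mapped high in the address space. */
    unsigned long text  = pmegs_needed(0x00002000UL,  48 * 1024UL);
    unsigned long data  = pmegs_needed(0x00040000UL, 120 * 1024UL);
    unsigned long stack = pmegs_needed(0xf7ffc000UL,  16 * 1024UL);
    unsigned long lib   = pmegs_needed(0xf7700000UL, 600 * 1024UL);
    printf("text %lu, data %lu, stack %lu, library %lu: %lu pmegs per copy\n",
           text, data, stack, lib, text + data + stack + lib);
    printf("a Sparcstation has only 128 pmegs to share among all processes\n");
    return 0;
}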
When a process needs to access a page that is not referenced by any of the
pmegs that are currently being stored, and no free pmegs exist, it steals
a pmeg belonging to another process. When the other process next goes to
access a page contained in this pmeg it will get a translation fault and
also have to steal a pmeg from some other process. When it gets a pmeg
back, however, all the page table entries associated with that pmeg will
have been marked invalid, and thus the process will receive additional
address translation faults when it goes to access each of the 64 pages
that are associated with the pmeg (32 pages on a machine other than a
Sparcstation).
The problem is compounded by SunOS swapping out processes whose resident
set size is zero. If all the pmegs belonging to a process get stolen from
it the kernel determines that the process's resident set size is zero, and
promptly swaps the process out. Fortunately this swapping only involves
moving all of the process's pages onto the free list, and not to disk.
But the CPU load associated with doing this appears to be substantial, and
there is no obvious justification for doing it.
5. Working Around the Problem
------------------------------
Although the problems with the Sun-4 MMU hardware architecture probably
can't be completely overcome by modifying SunOS, a number of actions can
probably be taken to diminish its effect.
Applications that have a large virtual address space and whose working set
is spread out in a sparse manner are problematic for the Sun-4 memory
management architecture, and the only alternatives may be to upgrade to a
more expensive model, or switch to a different make of computer. Large
numerical applications, certain Lisp programs, and large database
applications are the most likely candidates.
A reasonable solution to the problem in many cases would be for SunOS to
keep a copy of all pmegs in software. Alternatively a cache of the active
pmegs could be kept. In either case when a page belonging to a stolen
group is next accessed, the entire pmeg can be loaded instead of each page
in the pmeg causing a fault and being individually loaded. Doing this
would probably involve between 50 and 100 lines of source code.
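[To make the proposed fix concrete, here is a rough, self-contained
sketch of such a software pmeg cache. It is not SunOS source, and the
names, sizes, and the stand-in hardware routine are invented; it only
shows the mechanism: each group keeps a software copy of its
translations, and a fault on a stolen group reloads the whole pmeg in
one operation instead of taking a separate fault for each of its 64
pages. -- ml]
/* pmeg_cache_sketch.c, conceptual sketch of a software pmeg cache.
 * Compile: cc -O pmeg_cache_sketch.c -o pmeg_cache_sketch
 */
#include <stdio.h>
#include <string.h>
#define NPTE_PER_PMEG 64        /* Sparcstation: 64 entries per pmeg */
#define NHW_PMEGS     128       /* Sparcstation: 128 hardware pmegs  */
struct pmeg {                   /* one page table group              */
    unsigned int pte[NPTE_PER_PMEG];  /* software copy of translations */
    int hw_slot;                /* hardware slot holding it, or -1   */
};
static struct pmeg *hw_slot_owner[NHW_PMEGS]; /* group in each hardware slot */
static int next_victim;                       /* trivial round-robin choice  */
/* Stand-in for writing one page table entry into the MMU's static RAM. */
static void hw_write_pte(int slot, int index, unsigned int pte)
{
    (void)slot; (void)index; (void)pte;       /* no real hardware here */
}
/* Fault handler sketch: make sure group g is resident in some hardware
 * slot.  Because a software copy exists, one fault reloads all 64
 * entries; the victim keeps its own software copy for later. */
static void load_pmeg(struct pmeg *g)
{
    int slot, i;
    if (g->hw_slot != -1)
        return;                               /* already loaded        */
    slot = next_victim;                       /* steal a slot          */
    next_victim = (next_victim + 1) % NHW_PMEGS;
    if (hw_slot_owner[slot] != NULL)
        hw_slot_owner[slot]->hw_slot = -1;    /* victim becomes stolen */
    for (i = 0; i < NPTE_PER_PMEG; i++)       /* bulk reload           */
        hw_write_pte(slot, i, g->pte[i]);
    hw_slot_owner[slot] = g;
    g->hw_slot = slot;
}
int main(void)
{
    struct pmeg a, b;
    memset(&a, 0, sizeof a); a.hw_slot = -1;
    memset(&b, 0, sizeof b); b.hw_slot = -1;
    load_pmeg(&a);              /* first touch: whole group loaded at once */
    load_pmeg(&b);
    load_pmeg(&a);              /* still resident: no faults, no work      */
    printf("a in hardware slot %d, b in hardware slot %d\n",
           a.hw_slot, b.hw_slot);
    return 0;
}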
It may also be desirable for pmegs to be shared between processes when
both pmegs map the same virtual addresses to the same physical addresses.
This would be useful for dynamically linked libraries, and shared text
sections. Unfortunately this is probably difficult to do given SunOS's
current virtual memory software architecture.
Until a solution similar to the one proposed above is available a number
of other options can be used as a stop gap measure. None of these
solutions is entirely satisfactory, and depending on your job mix in
certain circumstances they could conceivably make the situation worse.
Preventing "eager swapping" will significantly improve performance in many
cases. Despite this swapping not necessarily involving any disk traffic,
we noticed a significant improvement on our machines when we did this; the
response time during times of peak load probably improved by between a
factor of 5 and 10.
The simplest way to prevent eager swapping is to prevent all swapping.
This is quite acceptable provided you have sufficient physical memory to
keep the working sets of all active processes resident.
# adb -w /vmunix
nosched?W 1
$q
# reboot
Or use "adb -w /sys/sun4c/OBJ/vm_sched.o", to fix any future kernels you
may build.
A better solution is to only prevent eager swapping, although doing this
is slightly more complex. The following example shows how to do this for
a Sparcstation under SunOS 4.0.3. The offset will probably differ
slightly on a machine other than a Sparcstation, or under SunOS 4.1,
although hopefully not by too much.
# adb -w /sys/sun4c/OBJ/vm_sched.o
sched+1f8?5i
_qs+0xf8: orcc %g0, %o0, %g0
bne,a _qs + 0x1b8
ld [%i5 + 0x8], %i5
ldsb [%i5 + 0x1e], %o1
sethi %hi(0x0), %o3
sched+1f8?W 80a02001
_qs+0xf8: 0x80900008 = 0x80a02001
sched+1f8?i
_qs+0xf8: cmp %g0, 0x1
$q
# "rebuild kernel and reboot"
Another important thing to do is try to minimize the number of context
switches that are occurring. How to do this will depend heavily on
the applications you are running. Make sure you consider the effect
of trivial applications such as clocks and load meters. These can
significantly increase the context switching rate, and consume
valuable pmegs. As a rough guide when a machine is short of pmegs, up
to 40 context switches per second will probably be acceptable on a
Sparcstation, while a larger machine should be able to cope with maybe
100 or 200 context switches per second. These figures will depend on
the number of pmegs consumed by the average program, the number of
pmegs the machine has, and whether context switching is occurring
between the same few programs, or amongst different programs. The
above values are based on observations of machines that are mainly
used to run a large number of reasonably small applications. They are
probably not valid for a site that runs a few large applications.
Finally if you are using a Sparcstation to support many copies of a small
number of programs that are dynamically linked to large libraries, you
might want to try building them to use static libraries. For instance
this would be the case if you are running 10 or more xterms, clocks, or
window managers on the one machine. The benefit here is that page
management groups won't be wasted mapping routines into processes that
aren't ever called. The libraries have to be reasonably large for you to
gain any benefit from doing this since only 1 page management group per
process will be saved for every 256k of library code. And you need to be
running multiple copies of the program so that the physical memory cost
incurred by not sharing the library code used by other applications is
amortized amongst the multiple instances of this application.
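[As a rough illustration of this trade-off, the following C fragment
works through the arithmetic with invented numbers (library size, amount
of library code actually used, and instance count); it ignores alignment,
which may cost one extra pmeg per mapping. Substitute figures for your
own applications. -- ml]
/* static_link_savings.c, estimate pmeg savings from static linking.
 * Compile: cc -O static_link_savings.c -o static_link_savings
 */
#include <stdio.h>
#define PMEG_SPAN (256 * 1024L)
static long pmegs_for(long bytes)       /* round up to whole pmegs */
{
    return (bytes + PMEG_SPAN - 1) / PMEG_SPAN;
}
int main(void)
{
    long lib_size  = 1400 * 1024L;  /* shared library mapped into each process */
    long used_code = 150 * 1024L;   /* library code actually pulled in when
                                     * the program is linked statically        */
    long instances = 12;            /* e.g. a dozen xterms on one machine      */
    long dynamic_pmegs = pmegs_for(lib_size) * instances;  /* pmegs not shared */
    long static_pmegs  = pmegs_for(used_code) * instances;
    printf("dynamically linked: %ld pmegs, statically linked: %ld, saved: %ld\n",
           dynamic_pmegs, static_pmegs, dynamic_pmegs - static_pmegs);
    printf("physical memory cost: roughly %ldk, one unshared copy of the used\n",
           used_code / 1024);
    printf("routines, amortized over the %ld instances\n", instances);
    return 0;
}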
6. A Small Test Program
------------------------
The test program below can be run without any problems on any Sun-4 system
with 8M of physical memory or more. Indeed it will probably work on a
Sun-4 system with as little as 4M. The test program is intended to be
used to illustrate in a simple fashion the nature of the problem with the
Sun-4 virtual memory subsystem. It is not intended to be used to measure
the performance of a Sun-4 under typical conditions. It should however
allow you to get a rough feel for the amount of active virtual memory that
you will typically be able to use.
When running the test program make sure no-one else is using the system.
And then run "vmtest 64", and "vmstat -S 5" concurrently for a minute or
two to make sure that no paging to disk is occurring - "fre" should be
greater than 256, and po and sr should both be 0. Note that swapping to
the free list may be occurring due to a previously mentioned quirk of
SunOS. Once the system has settled down kill vmtest and vmstat.
To run the test program use the command "time vmtest n", where n is the
amount of virtual memory to use in megabytes. More detailed information
can be determined using a command similar to the following and comparing
the output of vmstat before and after each test.
$ for mb in 2 4 8 16 32 64
do
(echo "=== $mb megabytes ===";
vmstat -s;
time vmtest $mb;
vmstat -s) >> vm_stat.results
done
$
You may want to alter the process size or context switching rate to see
what sort of effects these have on the results. Larger processes mean
that fewer pmegs are wasted mapping less than a full address range. Hence
the amount of active virtual memory that can be used before problems start
to show up will increase. A faster context switching rate will reduce the
amount of time a process gets to execute before being descheduled. If
there is a pmeg shortage by the time the process is next scheduled many or
all of its pmegs will be gone. Adjusting the context switching rate to a
typical value seen on your system may be informative.
/* vmtest.c, test Sun-4 virtual memory performance.
*
* Gordon Irlam (gordoni@cs.ua.oz.au), June 1990.
*
* Compile: cc -O vmtest.c -o vmtest
* Run: time vmtest n
* (Test performance for n Megabytes of active virtual memory,
* n should be even.)
*/
#include <stdio.h>
#include <sys/wait.h>
#define STEP_SIZE 4096 /* Will step on every page. */
#define MEGABYTE 1048576
#define LOOP_COUNT 5000
#define PROCESS_SIZE (2 * MEGABYTE)
#define CONTEXT_SWITCH_RATE 50
char blank[PROCESS_SIZE]; /* Shared data. */
main(argc, argv)
int argc;
char *argv[];
{
int size, proc_count, pid, proc, count, i;
if (argc != 2 || sscanf(argv[1], "%d", &size) != 1 || size > 500) {
fprintf(stderr, "Usage: %s size\n", argv[0]);
exit(1);
}
/* Touch zero fill pages so that they will be shared upon forking. */
for (i = 0; i < PROCESS_SIZE; i += STEP_SIZE)
blank[i] = 0;
/* Fork several times creating processes that will use the memory.
* Children will go into a loop accessing each of their pages in turn.
*/
proc_count = size * MEGABYTE / PROCESS_SIZE;
for (proc = 0; proc < proc_count; proc++) {
pid = fork();
if (pid == -1) fprintf(stderr, "Fork failed.\n");
if (pid == 0) {
for (count = 0; count < LOOP_COUNT; count++)
for (i = 0; i < PROCESS_SIZE; i += STEP_SIZE)
if (blank[i] != 0) fprintf(stderr, "Optimizer food.\n");
exit(0);
}
}
/* Loop waiting for children to exit. Don't block, instead sleep for
* short periods of time so as to create a realistic context switch rate.
*/
proc = proc_count;
while (proc > 0) {
usleep(2 * (1000000 / CONTEXT_SWITCH_RATE));
if (wait3(0, WNOHANG, 0) != 0) proc--;
}
}
+=========================================================================
The response from Sun:
+=========================================================================
Date: 25 Jul 90 03:15:49 GMT
From: jblind@griffith.eng.sun.com (Joanne Blind-Griffith)
Sun-Spots Digest: v9n274
Subject: Re: Sun-4 MMU Performance
X-Art: Usenet #11
Sun Microsystem's Response to
A Guide to Sun-4 Virtual Memory Performance
Joanne Blind-Griffith, Product Manager, Sun Microsystems.
The recent Sun-Spots posting by Gordon Irlam is essentially accurate in
describing the hardware limitations of the Sun MMU. As he points out,
whether this limitation is encountered on any particular machine depends
on which Sun hardware is involved and what sort of applications are being
used. It is our experience that this limitation is rarely encountered
with applications which show typical locality of reference. Most common
applications and job mixes will never encounter this limit. However, some
very large applications, and some applications which share memory between
many processes, will encounter this limit.
The Sun MMU design results in a very fast MMU with a minimum of hardware.
The Sun MMU is best thought of as a cache for virtual-to-physical
mappings. As with all caches, the cache was designed to be large enough
for the sort of typical applications to be run on the machine. Nearly all
applications achieve a very high hit rate on this cache. However, like
any cache, there are applications that will exceed the capacity of the
cache, greatly lowering the hit rate. Since this cache (i.e., the Sun
MMU) is loaded by software, the cost of a cache miss can be quite
expensive.
We have improved the algorithms that manage the Sun MMU. The improvement
involves adding another level of caching between the MMU management
software and the upper levels of the kernel. This is a classic space/
time tradeoff where a little bit of space for this software cache saves a
lot of time in reloading the MMU for those applications which exceed the
hardware limits of the MMU. In addition, many other changes have been
made to the MMU management software to improve performance in general and
to reduce the effects of some worst case behaviour.
Following are the test results using Gordon's vmtest program run on a 12MB
SPARCstation 1+ with the improved MMU management software:
virtual elapsed user system
memory time time time
(MB) (sec) (sec) (sec)
2 2 2.3 0.6
4 5 4.7 0.8
8 10 9.4 1.1
10 13 11.9 1.2
12 16 14.3 1.4
14 18 16.8 1.5
16 21 19.5 1.7
18 25 22.5 1.9
20 27 25.3 2.0
22 30 27.6 2.2
24 33 30.4 2.5
26 36 33.3 2.5
28 39 35.7 2.7
30 41 38.1 2.9
32 44 40.8 3.1
Note that the performance is essentially linear through 32MB.
This improved MMU management software will be included in the next release
of SunOS. It will be available as a patch for SunOS 4.1 (Sun4c and Sun4
platforms) and 4.1 PSR A at the end of July, and for SunOS 4.0.3c (Sun4c
machines) in early August.
+=========================================================================
Heretofore unpublished article from Gordon Irlam, circa the previous note:
+=========================================================================
In a previous article I discussed how the performance of a Sun4, or
Sparcstation drops substantially once the amount of active virtual
memory exceeds some machine dependent limit. The problem was caused
by a large number of page faults resulting from a shortage of hardware
page tables, or pmegs.
Sun have developed a fix for this problem that involves maintaining a
software cache of the pmegs so that the hardware page tables can be
rapidly reloaded. I believe that this fix will shortly be available
as a patch to SunOS 4.1 for both Sun4 and Sun4c machines.
The patch improves performance to an extent which I would not have
thought possible. The results of running my test program on a
Sparcstation 1 are presented below.
old new old new
virtual relative relative elapsed elapsed
memory speed speed time time
(Mb) (sec) (sec)
2 1.00 1.00 3.5 3.6
4 1.09 1.06 6.4 6.8
6 1.14 1.14 9.2 9.5
8 1.15 1.14 12.2 12.6
10 1.17 1.15 15.0 15.6
12 1.17 1.16 18.0 18.7
14 1.12 1.12 21.8 22.5
16 1.08 1.07 25.9 27.0
18 0.57 1.08 55.3 30.1
20 0.40 1.09 87.7 33.1
22 0.25 1.09 151.3 36.2
24 0.11 1.08 388.3 40.0
26 0.12 1.08 371.9 43.4
28 0.06 1.09 764.8 46.3
30 0.03 1.09 1607.1 49.7
32 0.02 1.11 2601.0 52.1
64 very slow 1.09 very big 105.4
96 time stops 1.08 infinity 159.8
(Note the old elapsed times were under SunOS 4.0.3, the new ones are
under SunOS 4.1, thus the absolute times are not directly comparable.
Furthermore for the new tests the system was not in single user mode,
so a number of daemons etc. will also have been running.)
For 32M of virtual memory the results show a 50 fold improvement in
performance.
The fix has solved the problems we were experiencing completely, and I
imagine it will do likewise for most other sites with similar
problems. It is however plausible that sites that run unusual
programs that consume very large amounts of virtual memory, or use
virtual memory in a very sparse fashion may continue to experience
some performance degradation. But hopefully nothing like what they
are currently experiencing.
Gordon Irlam, Adelaide University.
+=========================================================================
Heretofore unpublished private e-mail Gordon sent to Sun;
explaining why he chose not to post the previous article:
+=========================================================================
Date: Tue, 21 Aug 1990 20:00:42 +0930
From: Gordon Irlam <cs.adelaide.edu.au!gordoni>
Subject: Re: DBE performance report
To: [somebody at Sun whom I choose to leave unnamed -- ml]
[...]
I am currently not planning to post the message I sent you.
1) Most of what I say has been covered by an article that Sun posted.
2) I have found that some problems still exist as far as virtual
memory performance is concerned. These are not shown up by my
test program. These problems are an order of magnitude less
(maybe even two orders less) than the problems we were having
and only relate to processes with very large sparse address spaces.
They are not a problem for normal processes, irrespective of
how many processes are run. This is not a problem for our site;
it would only be relevant to sites running fairly special applications
- Lisp, and some large array handling programs would be the
most likely candidates. I was hoping to do some measurements on
this and include this as part of my posting - specifying both
the nature of such processes that would have problems and the
extent of the problems. I now don't think I will get around to
doing this. These problems are a result of the Sun4 MMU architecture
and cannot be fixed in software.
If you feel my posting of something that corroborates what Sun have said
would help you please let me know. It would however include something
mentioning that some (much less severe) problems may still exist for some
very special applications.
Gordon.
+=========================================================================
Yet more heretofore unpublished private e-mail Gordon sent to Sun;
more technical discussion on the aspects of pmeg stealing:
+=========================================================================
Return-Path: <gordoni@cs.adelaide.edu.au>
Date: Fri, 24 Aug 1990 20:28:48 +0930
From: Gordon Irlam <gordoni@cs.adelaide.edu.au>
To: [somebody at Sun whom I choose to leave unnamed -- ml]
Subject: Re: PMEG stealing fix
> The new implementation of the SUN sun4/sun4c hat layer (AKA pmeg
> patch) DOES IMPROVE the performance of a single application with a
> sparse address space.
I agree totally. I am very impressed by the performance of the new
hat layer. What I was alluding to in the mail I sent to Allan was the
lack of any hardware table walking mechanism. So that if the memory
access pattern is very sparse performance will still be poor. I think
loading a pmeg takes roughly 200us, so you certainly don't want to be
accessing new pmegs that are not stored in the static RAM more frequently
than that.
Whether machines should have table walking hardware is an
interesting question. MIPS justify its omission on the grounds
that doing it in software constitutes on average something like a
3% performance penalty. However for some applications especially
large LISP applications it can be perhaps 80%. I guess this is
part of the trend away from true general purpose computers.
Back to Suns. Gut feeling, no quantitative justification. The
current number of pmegs on a Sparcstation is too small, and by the
time you make it bigger the cost is probably similar to what it
would have cost to provide hardware table-walking instead. I
also think the idea of pme's existing in contiguous groups is a bad
idea; caching pme's would be much more flexible.
Partial justification. In view of the current OO-hype I think it
is reasonable to anticipate large object oriented systems that
consist of large numbers of objects interacting with one another.
I suspect such systems are likely to have considerably less
locality, and be considerably larger than most of the software run
today. I imagine that Sun4's in general and Sparcstations in
particular will be a poor platform for such systems. I think the
problems will be particularly severe if persistent object-oriented
systems become at all popular.
+=========================================================================
Private e-mail responding to attributed queries from myself:
+=========================================================================
Date: Wed, 12 Sep 1990 14:46:52 +0930
From: Gordon Irlam <gordoni@cs.adelaide.edu.au>
To: mark@DRD.Com
Subject: Re: Sun 4 & MMUs: Reprise
mark> For programs dynamically linked (involving shared libraries), are
pmegs allocated for all potential members of the shared library?
Is this the rationale for savings by linking statically?
A pmeg is needed for each 256k of contiguous address space.
Statically linked executables are unlikely to require more than 1
pmeg to address all of the library routines they use since the
total size of the routines they actually use probably will be only
100 or 200k. With shared libraries virtual space is allocated for
the entire library, including the many routines that are not used.
Thus although you might only use a few routines from the library,
they will be scattered over a larger address range, which means
that more pmegs are required to map them. The X libraries caused
us the most problems because they are fairly large.
If you want to you can work out exactly how many pmegs shared
libraries are using, by using the "trace" command and keeping track
of the mmap system calls. If you have never heard of "trace"
before (I hadn't until recently), give it a go; it's fairly impressive
and useful.
mark> The point is made that shared text and data between processes
still involves non-shared pmegs (i.e., pmegs mapping the same
pages aren't shared). Is this also true for multiple processes
attaching to System V shared memory segments? Does each process
have its own set of pmegs mapping the shared memory?
I don't know anything about the system V shared memory stuff. But
based on my general knowledge of the memory management
architecture, I think this is almost certainly the case - each process
has its own set of pmegs. The lowest levels of the memory
management code don't appear to export anything to the higher
levels that would allow any other way of it being implemented.
mark> It is implied that more than 16 megabytes of physical memory is
typically going to be used as a disk cache (rather than for text,
data and what-not). What's the basis for this claim? That the
virtual memory pages mapped by the pmegs are only 50% utilized on
average?
Yes. The pmegs can map up to 32M, and on average have about 50%
occupancy. This figure is only a very coarse approximation to the
true situation. Typically I would guess it could be anything from
10M to 24M, say.
Sun have released a patch to SunOS (see Sun Spots vol. 9, No. 274,
Msg 11). That overcomes the pmeg problems for normal
applications. It works by caching the old pmeg values and loading
the entire pmeg, instead of faulting each pmeg entry in one at a
time. If you are running normal applications this will solve your
problems completely. The CPU cost involved would be no more than
that of a context switch. If you running some large numerical or
artificial intelligence applications, you may continue to have some
problems, although they will be substantially reduced. That is
programs that use large amounts of virtual memory in a very sparse
manner can still have problems. This is because Suns currently do
not have any hardware page table walking mechanism, there is little
that can be done about this using software - if you run Lisp you
might have a look at tuning the behavior of the garbage colector to
reduce the number of faults, but if that fails you will either have
to get a bigger machine, or switch to a different architecture
(stay away from Mips too, since they don't have hardware table
walking). At our site we run fairly standard applications, and the
pmeg patch from Sun has solved our problems completely.
Gordon.
+=========================================================================
Private e-mail responding to my query about existance of 4.0.3 patch:
+=========================================================================
Date: Wed, 12 Sep 90 08:11:25 PDT
From: jblind@Eng.Sun.COM (Joanne Blind-Griffith)
To: mark@DRD.Com
Subject: Re: Sun-4 MMU (Sun-Spots v9i274m11)
Yes, there is a PMEG patch for SunOS 4.0.3c. You can obtain this patch
by calling the Answer Center (1-800-USA-4SUN); however, they may ask you
to verify whether you're experiencing a PMEG thrashing problem by running
'hatstat'.
Joanne Blind-Griffith
Desktop System Software Product Manager
+=========================================================================
Private e-mail responding to my request for permission to post all this:
+=========================================================================
Date: Fri, 14 Sep 1990 10:00:58 +0930
From: Gordon Irlam <gordoni@cs.adelaide.edu.au>
To: mark@DRD.Com
Subject: Re: Sun 4 & MMUs: Reprise
> Would you mind if I posted the note you sent me to sun-spots?
Not at all. Feel free to edit what I sent as appropriate. The important
thing to get across is that the patches will solve the problems completely
for normal applications, and it is only a few very unusual applications
that may still have trouble. These unusual applications are likely to have
similar problems on other machines that do not have hardware page table
walking, such as MIPS-based machines. Do you know if the RS/6000 has
hardware table walking? I imagine it does, but I am not certain.
Gordon Irlam
Adelaide University
(gordoni@cs.adelaide.edu.au)
PS. We don't have any problems now that we have installed the pmeg patch.
=======================================
Final note: the figures given above were for the database accelerator
package; this turned into the pmeg patch, but some further tuning occurred
along the way. The figures are probably not accurate for the final
pmeg patch that Sun produced.
Gordon.