The following is a (long -- sorry) amalgamation of various articles
culled from Sun-Spots and Sun-Managers regarding what has come to
be termed 'The Pmeg Thrashing Problem' that seems endemic to
SparcStations. We experienced severe performance degradation on our
machines with high context switch, high interrupt count, lots of free
memory and not so high cpu usage. We had plenty of swap and 28 MB of
memory. Our job mix involves XDM, xterminal users, dynamically
linked X windows applications (lots of 'em) and extensive usage of
System V shared memory segments. Seems like it was custom tailored
to bring this poor machine to its knees :-}. As I plowed through
Kbytes of old Sun-Spots and Sun-Managers articles, a golden thread
emerged which seemed to describe our problem. I thought it might
be useful to collect it all together and present it in this forum.
It is presented in roughly chronological order and includes private
correspondence never before published (does that sound like an Enquirer
byline or what?). I have done some editing out of names and non-related
information [indicated by square brackets like this -- ml]. Each
note is prefaced by a header like the following. Enjoy!
+=========================================================================
The first message I hold that hints about this [extensively abridged]:
+=========================================================================
Date: Thu, 14 Jun 90 07:45:18 EDT
From: sid@think.com
To: sun-managers@eecs.nwu.edu
Subject: Synopsis of 4.03 Memory Management Question
Thanks to everyone that took the time to respond to my query about
performance problems with a 4/370. No one had a solution to my
problem. A few basically said no one has an answer to my problems. I
will give a brief synopsis of the responses below and then
follow that with the actual letters.
[...]
He also says there is a Usenix Conference article describing
the memory system. If I can find it, I will let you guys
know. If anyone has a copy, drop me a line.
[Conference Proceedings of the Summer, 1987 Usenix Technical Conference and
Exhibition: "Virtual Memory Architecture in SunOS", Gingell, Moran & Shannon;
"Shared Libraries in SunOS", Gingell, Lee, Dang & Weeks -- ml]
[...]
Mark Weiser sent a note about problems with the memory
manager when the maps are larger than 32MB. I do not
think this is the problem with this machine, but I will
look at it with this in mind.
The Letters
**************************************************************
[...]
From briggs.pa@Xerox.COM Wed Jun 13 20:28:07 1990
Sid,
I forwarded your question to someone who had experienced memory problems
here at PARC and had been tracking them down with Sun. Here's his response.
----- Begin Forwarded Messages -----
From: Mark Weiser <mark@arisia.Xerox.COM>
To: briggs@Xerox.COM
Yes, Sun's memory management is not good at all for lots of memory.
Our main trouble is that the memory mapping hardware on sparcstations
cannot map more than 32MB at once, and even if two processes are sharing
pages they need different maps. The result is that the 32MB limit
is easily exceeded even on a 16 MB machine.
When it is exceeded, even slightly, SunOS does not behave well at all,
essentially thrashing in the kernel trying to keep the MMU filled.
The symptom of this is high % system time even for your cpu bound
processes.
But it does not show up as paging and swapping activity, so I am not
sure that is the same as Sid's troubles.
To answer his specific question, no, there is no way in 4.0.3 or 4.1 to
limit the amount of memory allocated to files. However, 4.1 is supposed
to do this more intelligently. 4.1 does NOT manage the MMU more
intelligently.
+=========================================================================
Further amplification by Sid which nailed the problem accurately:
+=========================================================================
Date: Tue, 3 Jul 90 09:21:15 EDT
From: sid@think.com
To: sun-managers@eecs.nwu.edu
Subject: SunOS 4.03 Memory Management Problems
I sent a note out about a month ago regarding anomalous behaviour
we were seeing on a Sun 4/370 with 56 Megabytes of memory. The
system started to page heavily while there were still 20 megabytes
(35%) of memory free. I did not get any solutions from this
mailing list (a first I think) but I did pick up something from
comp.sys.sun on the Usenet that has solved the problem. Here is the
note from the Usenet:
From: murthy@algron.cs.cornell.edu (Chet Murthy)
Subject: Re: SparcStation I Memory Leak? (a possible answer)
Date: 14 Jun 90 22:10:45 GMT
X-Refs: Original: v9n211
X-Sun-Spots-Digest: Volume 9, Issue 210, message 12
murthy@algron.cs.cornell.edu (Chet Murthy) writes:
>I have been running a large LISP application on a SparcStation I for a
>while now, and I have noticed some really awful problems with the
>allocation of memory as time goes on.
Well, after some talking with a Sun OS ambassador at a new products
session, I found out some interesting stuff.
The phenomenon is called "pmeg stealing". I'm not sure what's going on,
exactly, but the idea seems to be that somebody in the kernel is stealing
memory from the pool, and not putting it back.
So it looks like there's less and less. The fix, from someone who may
choose to remain anonymous (otherwise, he can raise his hand - I didn't
figure this out myself) is to turn off the swapper, leaving only the pager
running:
To turn off swap:
% su
# adb -wk /vmunix /dev/mem
nosched?W 1
^D
# reboot
And I've gotten conflicting reports as to whether it is fixed in 4.1 or
not. So we'll just have to wait and see...
murthy@cs.cornell.edu
We have applied this "fix" to two of our multiuser systems and it has
had wonderful results. The system starts paging activity at around
500K bytes of free memory instead of 20 Mbytes. We have not seen any
failures yet that can be attributed to the swapper being turned off.
Your mileage may vary....
sid@think.com
+=========================================================================
The Definitive Article by Gordon Irlam at the University of Adelaide:
+=========================================================================
Date: 9 Jul 90 00:09:14 GMT
From: gordoni@chook.ua.oz.au (Gordon Irlam)
Subject: Sun-4 MMU Performance
Sun-Spots Digest: v9n257
A Guide to Sun-4 Virtual Memory Performance
===========================================
Gordon Irlam, Adelaide University.
(gordoni@cs.ua.oz.au or gordoni@chook.ua.oz.au)
Throughput on a Sparcstation drops substantially once the amount of active
virtual memory exceeds 16M, and by the time it reaches 25M the machine can
be running up to 10 times slower than normal. This is the conclusion I
reach from running a simplistic test program on an otherwise idle
Sparcstation.
Note that the limit involves the amount of ACTIVE virtual memory used.
Additional virtual memory may be consumed by processes that remain idle
without incurring any penalty. (SunOS usually steals pages from idle
processes, so strictly speaking such memory is not normally considered to
be part of the virtual memory consumed.) Also note that it is 16M of
active VIRTUAL memory. Dynamically linked libraries, shared text
segments, and copy on write forking mean that the amount of PHYSICAL
memory used could conceivably be as little as half this value. I would
guess that any physical memory that is added to a typical Sparcstation
beyond around 14M will effectively only be used as a disk cache.
This problem exists on all Sun-4 systems. The problem is a result of
poorly designed MMU hardware, and the failure of the operating system to
attempt to minimize the effects of the design. Sun-4's have a fixed
amount of memory that can be used for storing page tables; on
Sparcstations in particular this memory area is far too small.
This posting quantifies to some extent the performance losses resulting
from the Sun-4 memory management subsystem, describes the cause of the
problem, and suggests work-arounds that may be useful in overcoming some
of the worst effects of the problem. This posting is based in part on a
previous posting on the subject and the helpful responses received, many
thanks.
1. Sparcstation Virtual Memory Performance
-------------------------------------------
The following table shows the throughput of a Sparcstation-1 as a function
of active virtual memory. The program used to obtain these figures is
included at the end of this posting. The program forks several times and
each child spends its life sequentially accessing pages of a shared 2M
data segment over and over again. Forking and the use of a shared data
segment allow the test program to be run on a machine with very little
physical memory but otherwise do not significantly affect the results
obtained. The first two columns show a sudden performance drop beyond
16M. The remaining columns contain raw data that can be used to
understand what is happening.
virtual relative elapsed user system translation swap
memory speed time time time faults ins
(Mb) (sec) (sec) (sec)
2 1.00 3.5 2.7 0.8 1224 1
4 1.09 6.4 5.3 1.1 1840 1
6 1.14 9.2 8.1 1.2 2442 0
8 1.15 12.2 10.7 1.4 2729 0
10 1.17 15.0 13.3 1.7 3381 0
12 1.17 18.0 16.1 1.9 4121 0
14 1.12 21.8 19.6 2.1 5275 0
16 1.08 25.9 22.6 3.1 8746 2
18 0.57 55.3 29.1 25.9 98251 6
20 0.40 87.7 34.4 53.0 200296 7
22 0.25 151.3 41.8 109.0 406885 12
24 0.11 388.3 61.9 325.3 1202899 20
26 0.12 371.9 62.6 304.5 1118388 22
28 0.06 764.8 91.8 655.4 2412144 39
30 0.03 1607.1 156.3 1446.2 5316313 56
32 0.02 2601.0 221.5 2373.1 8665839 88
Note that the test program is designed to illustrate the nature of the
virtual memory problem in a simple fashion, not to provide realistic
estimates of expected system performance. Realistic performance estimates
can be much better made after having taken into account the issues raised
in sections 3 and 4 below. In particular, system performance will
probably not degrade as rapidly as shown in the above table.
From this table it can be clearly seen that once the amount of active
virtual memory exceeds 16M the system suddenly finds itself having to
handle an incredibly large number of page faults. This causes a drastic
increase in the amount of system time consumed, which results in a
devastating drop in throughput.
I emphasize here that the machine does not run out of physical memory at
16M. It has plenty of free memory during all of the tests - the free list
is several megabytes in size, and the machine does not page fault to disk.
2. A Few Minor Details
-----------------------
This section can be skipped.
The first few figures show a relative speed slightly greater than 1.00.
This is because the cost of invoking the initial image is amortized over a
greater number of processes.
When the tests were repeated, those that had a very low throughput produced
figures that varied by around 30%. The slightest perturbation of the
machine when it is very heavily loaded is found to significantly alter the
elapsed time. Where a test was run several times, the figures
presented above are those with the smallest elapsed time.
The amount of user time consumed grows at a faster rate beyond 16M of
active virtual memory than below 16M. This may be a result of
inaccuracies in the process accounting subsystem. Alternatively it could
be some sort of user cost resulting from context invalidations.
The swapping figures are not significant. They are the result of a
strange feature of SunOS. Once all the page tables for a process's data
segment have been stolen the process is conceptually swapped. This
involves moving text pages that are not currently shared onto the free
list. In this case no such pages exist. But even if they did no disk
activity would occur because the free list has plenty of space. On a real
system this quirk typically adds significantly to the performance
degradation that occurs once the virtual memory limit has been exceeded.
The possibility that the sudden increase in system time beyond 16M is a
result of context switching can be discounted by running similar tests in
which each process uses 4M instead of 2M. A sudden performance drop will
be observed at around 20M. This figure is slightly higher than 16M
because fewer page tables are wasted mapping less than the maximum
possible amount of memory.
The above figures were obtained under SunOS 4.0.3, however subsequent
measurements have shown that essentially identical results are obtained
under SunOS 4.1.
3. Implications for a Real System
----------------------------------
The amount of active virtual memory at which a sudden drop in throughput
occurs, and the severity of the drop should not be viewed as precise
parameters of the system. In a real system the observed performance will
be heavily dependent on process sizes, memory access patterns, and context
switching patterns. For instance, the elapsed time given above for 32M of
active virtual memory would have been five times larger if every data
access had resulted in a page fault. Alternatively, on a real system
locality of address references could have had the opposite effect and
reduced the elapsed time by a factor of 5. The context switching rate has
a significant effect on the performance obtained when the system is short
of pmegs since it determines how long a process will be given to run
before having its pmegs stolen from it. If the context switching rate is
too high processes will get very little useful work done since they will
be spending all their time faulting on the pages of their resident sets,
and never getting a chance to execute when all the pages of their resident
sets are resident.
Because the performance losses are a function of the amount of virtual
memory used, dynamically linked libraries, shared code pages, and copy on
write forking mean that it is possible for these problems to occur on a
machine with substantially less physical memory than the 16M of virtual
memory at which the problem started to occur.
On the other hand locality of reference will reduce the severity of the
problem. Large scientific applications that don't display much data
reference locality will be an exception.
The impression I have is that virtual memory performance will not normally
be a serious problem on a Sparcstation with less than 16M of physical
memory, with between 16M and 32M it could be a problem depending upon the
job mix, and it will almost certainly be a serious problem on any
Sparcstation with 32M or more. If it isn't a problem on a machine with
32M or more you have almost certainly wasted your money buying the extra
memory as you do not appear to be using it.
[It's a sorry tale to go out and buy lots of memory to stop a system
thrashing, install it, turn the machine on, and find the system still
thrashes, but thanks to the large disk cache you have just installed it is
now able to do so at previously unheard of rates.]
A giveaway indication that the virtual memory system is a problem on a
running system is the presence of swapping, as shown by "vmstat -S 5", but
with a free list of perhaps a megabyte or more in size. This swapping
does not involve real disk traffic. Pages are simply being moved back and
forth onto the free list. Note that if you are only running one or two
large processes this swapping behavior will probably not be observed.
Regardless of whether you see this behavior or not vmstat should also be
showing the system spending most of its time in system mode.
The ratio of user time to system time obtained using vmstat should give
you a rough estimate of the cost associated with the virtual memory
management problems. You can get a more accurate estimate by looking at
the number of translation faults (8.7 million in the previous table), and
the time taken to handle them (2400 seconds). Then compute the time taken
to handle a single fault (280us). Now look at a hatcnt data structure in
the kernel using adb.
# adb -k /vmunix /dev/mem
physmem 17f4
hatcnt/8D
_hatcnt:
_hatcnt: 2129059 2034884 19942909 3173659
2685512 0 0 0
$q
#
The 4th word is the total number of pmeg allocations (see below) since the
system has been booted (3173659), while the 5th word is the number of pmeg
allocations that stole a pmeg from another process (2685512). Estimating,
say, 32 faults per stolen pmeg allocation, you can work out the total time
the system has spent handling these faults (7 hours). This time can then
be compared to the total amount of time the system has been up (48 hours).
On a non-Sparcstation Sun-4 you should estimate around 16 faults per
stolen pmeg allocation, rather than 32.
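[The arithmetic above can also be written down as a short C program.
This is only a sketch: the constants are the example figures quoted in
this posting (the 32M row of the table, the hatcnt output, and 48 hours
of uptime), and should be replaced with the numbers observed on your own
machine. -- ml]
/* pmeg_cost.c, back-of-the-envelope cost of pmeg stealing.
 * Compile: cc -O pmeg_cost.c -o pmeg_cost
 */
#include <stdio.h>
int main(void)
{
    double faults     = 8665839.0;  /* translation faults, 32M row of the table */
    double fault_secs = 2373.1;     /* system time spent handling them (sec)    */
    double per_fault  = fault_secs / faults;   /* roughly 280 microseconds      */
    double stolen     = 2685512.0;  /* 5th word of hatcnt: pmeg steals          */
    double per_steal  = 32.0;       /* ~32 faults per steal on a Sparcstation,
                                     * ~16 on other Sun-4s                      */
    double uptime_hrs = 48.0;       /* from "uptime"                            */
    double lost_hrs = stolen * per_steal * per_fault / 3600.0;
    printf("time per fault:         %.0f us\n", per_fault * 1e6);
    printf("time lost to stealing:  %.1f hours\n", lost_hrs);
    printf("fraction of uptime:     %.0f%%\n", 100.0 * lost_hrs / uptime_hrs);
    return 0;
}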
4. The Sun-4 Memory Management Architecture
-------------------------------------------
The 4/460 has a three-level address translation scheme; all other Sun-4
machines have a two-level scheme. Sparcstations have 4k pages; all other
machines have 8k pages. The level 2 page tables (level 3 tables on the
4/460) are referred to by Sun as page management entry groups, or simply
pmegs. Each pmeg on a Sparcstation contains 64 entries, and since the
pages are 4k in size this means that a single pmeg can map up to 256k of
virtual memory. On all other Sun-4 machines the pmegs contain 32 entries,
but the page size is 8k, so that once again a single pmeg can map up to
256k.
Most systems use high speed static RAM to cache individual page table
entries and hence speed up address translations. This is not done on
Sun-4's. Instead all page tables (pmegs) are permanently stored in high
speed static RAM. This results in address translation hardware that is
both simple and reasonably fast. The downside however is that the number
of pmegs that can be stored is limited by the amount of static RAM
available. On the Sparcstations the static RAM can store up to 128 pmegs,
giving a total mapping of up to 32M. A 4/1xx, or 4/3xx can map up to 64M,
a 4/2xx can map up to 128M, and a 4/4xx can map up to 256M of virtual
memory.
32M is the maximum amount of virtual memory that can be mapped on a
Sparcstation; however, since a pmeg can only be used to map pages within a
single contiguous 256k-aligned range of virtual addresses, the amount of
virtual memory mapped when a machine runs out of pmegs will be
substantially less. This is particularly evident when it is realized that
separate pmegs will be assigned to map the text, data, and stack sections
of each process, and some of these will probably be much smaller than 256k.
Currently under SunOS pmegs are never shared between processes even if
they may map identical virtual addresses to identical physical addresses,
as could be the case with a common text segment. Dynamically linked
libraries are also probably bad in this respect as they will require
several pmegs per process, whereas if the process was statically linked
the number of pmegs consumed would be reduced because pmegs would only be
consumed mapping in the routines that are actually used.
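[To make the 256k granularity concrete, the following illustrative C
program works through the arithmetic: a segment needs one pmeg for every
256k-aligned chunk of virtual address space it touches, however sparsely,
and pmegs are never shared between processes. The addresses and segment
sizes are invented for the example. -- ml]
/* pmegs_needed.c, illustrate the pmeg consumption arithmetic.
 * Compile: cc -O pmegs_needed.c -o pmegs_needed
 */
#include <stdio.h>
#define PMEG_SPAN (256 * 1024UL)    /* virtual memory mapped by one pmeg */
/* Number of pmegs needed to map [start, start+len), len > 0. */
static unsigned long pmegs_needed(unsigned long start, unsigned long len)
{
    unsigned long first = start / PMEG_SPAN;
    unsigned long last  = (start + len - 1) / PMEG_SPAN;
    return last - first + 1;
}
int main(void)
{
    /* A hypothetical process: 48k of text, 120k of data+bss, 16k of stack,
     * and a 600k dynamically linked library mapped high in the address space. */
    unsigned long text  = pmegs_needed(0x00002000UL,  48 * 1024UL);
    unsigned long data  = pmegs_needed(0x00040000UL, 120 * 1024UL);
    unsigned long stack = pmegs_needed(0xf7ffc000UL,  16 * 1024UL);
    unsigned long lib   = pmegs_needed(0xf7700000UL, 600 * 1024UL);
    printf("text %lu, data %lu, stack %lu, library %lu: %lu pmegs per copy\n",
           text, data, stack, lib, text + data + stack + lib);
    printf("a Sparcstation has only 128 pmegs to share among all processes\n");
    return 0;
}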
When a process needs to access a page that is not referenced by any of the
pmegs that are currently being stored, and no free pmegs exist, it steals
a pmeg belonging to another process. When the other process next goes to
access a page contained in this pmeg it will get a translation fault and
also have to steal a pmeg from some other process. When it gets a pmeg
back, however, all the page table entries associated with that pmeg will
have been marked invalid, and thus the process will receive additional
address translation faults when it goes to access each of the 64 pages
that are associated with the pmeg (32 pages on a machine other than a
Sparcstation).
The problem is compounded by SunOS swapping out processes whose resident
set size is zero. If all the pmegs belonging to a process get stolen from
it the kernel determines that the process's resident set size is zero, and
promptly swaps the process out. Fortunately this swapping only involves
moving all of the process's pages onto the free list, and not to disk.
But the CPU load associated with doing this appears to be substantial, and
there is no obvious justification for doing it.
5. Working Around the Problem
------------------------------
Although the problems with the Sun-4 MMU hardware architecture probably
can't be completely overcome by modifying SunOS, a number of actions can
probably be taken to diminish its effect.
Applications that have a large virtual address space and whose working set
is spread out in a sparse manner are problematic for the Sun-4 memory
management architecture, and the only alternatives may be to upgrade to a
more expensive model, or switch to a different make of computer. Large
numerical applications, certain Lisp programs, and large database
applications are the most likely candidates.
A reasonable solution to the problem in many cases would be for SunOS to
keep a copy of all pmegs in software. Alternatively a cache of the active
pmegs could be kept. In either case when a page belonging to a stolen
group is next accessed, the entire pmeg can be loaded instead of each page
in the pmeg causing a fault and being individually loaded. Doing this
would probably involve between 50 and 100 lines of source code.
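[To make the proposed fix concrete, here is a rough, self-contained
sketch of such a software pmeg cache. It is not SunOS source, and the
names, sizes, and the stand-in hardware routine are invented; it only
shows the mechanism: each group keeps a software copy of its
translations, and a fault on a stolen group reloads the whole pmeg in
one operation instead of taking a separate fault for each of its 64
pages. -- ml]
/* pmeg_cache_sketch.c, conceptual sketch of a software pmeg cache.
 * Compile: cc -O pmeg_cache_sketch.c -o pmeg_cache_sketch
 */
#include <stdio.h>
#include <string.h>
#define NPTE_PER_PMEG 64        /* Sparcstation: 64 entries per pmeg */
#define NHW_PMEGS     128       /* Sparcstation: 128 hardware pmegs  */
struct pmeg {                   /* one page table group              */
    unsigned int pte[NPTE_PER_PMEG];  /* software copy of translations */
    int hw_slot;                /* hardware slot holding it, or -1   */
};
static struct pmeg *hw_slot_owner[NHW_PMEGS]; /* group in each hardware slot */
static int next_victim;                       /* trivial round-robin choice  */
/* Stand-in for writing one page table entry into the MMU's static RAM. */
static void hw_write_pte(int slot, int index, unsigned int pte)
{
    (void)slot; (void)index; (void)pte;       /* no real hardware here */
}
/* Fault handler sketch: make sure group g is resident in some hardware
 * slot.  Because a software copy exists, one fault reloads all 64
 * entries; the victim keeps its own software copy for later. */
static void load_pmeg(struct pmeg *g)
{
    int slot, i;
    if (g->hw_slot != -1)
        return;                               /* already loaded        */
    slot = next_victim;                       /* steal a slot          */
    next_victim = (next_victim + 1) % NHW_PMEGS;
    if (hw_slot_owner[slot] != NULL)
        hw_slot_owner[slot]->hw_slot = -1;    /* victim becomes stolen */
    for (i = 0; i < NPTE_PER_PMEG; i++)       /* bulk reload           */
        hw_write_pte(slot, i, g->pte[i]);
    hw_slot_owner[slot] = g;
    g->hw_slot = slot;
}
int main(void)
{
    struct pmeg a, b;
    memset(&a, 0, sizeof a); a.hw_slot = -1;
    memset(&b, 0, sizeof b); b.hw_slot = -1;
    load_pmeg(&a);              /* first touch: whole group loaded at once */
    load_pmeg(&b);
    load_pmeg(&a);              /* still resident: no faults, no work      */
    printf("a in hardware slot %d, b in hardware slot %d\n",
           a.hw_slot, b.hw_slot);
    return 0;
}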
It may also be desirable for pmegs to be shared between processes when
both pmegs map the same virtual addresses to the same physical addresses.
This would be useful for dynamically linked libraries, and shared text
sections. Unfortunately this is probably difficult to do given SunOS's
current virtual memory software architecture.
Until a solution similar to the one proposed above is available a number
of other options can be used as a stop gap measure. None of these
solutions is entirely satisfactory, and depending on your job mix in
certain circumstances they could conceivably make the situation worse.
Preventing "eager swapping" will significantly improve performance in many
cases. Despite this swapping not necessarily involving any disk traffic,
we noticed a significant improvement on our machines when we did this; the
response time during times of peak load probably improved by between a
factor of 5 and 10.
The simplest way to prevent eager swapping is to prevent all swapping.
This is quite acceptable provided you have sufficient physical memory to
keep the working sets of all active processes resident.
# adb -w /vmunix
nosched?W 1
$q
# reboot
Or use "adb -w /sys/sun4c/OBJ/vm_sched.o", to fix any future kernels you
may build.
A better solution is to only prevent eager swapping, although doing this
is slightly more complex. The following example shows how to do this for
a Sparcstation under SunOS 4.0.3. The offset will probably differ
slightly on a machine other than a Sparcstation, or under SunOS 4.1,
although hopefully not by too much.
# adb -w /sys/sun4c/OBJ/vm_sched.o
sched+1f8?5i
_qs+0xf8: orcc %g0, %o0, %g0
bne,a _qs + 0x1b8
ld [%i5 + 0x8], %i5
ldsb [%i5 + 0x1e], %o1
sethi %hi(0x0), %o3
sched+1f8?W 80a02001
_qs+0xf8: 0x80900008 = 0x80a02001
sched+1f8?i
_qs+0xf8: cmp %g0, 0x1
$q
# "rebuild kernel and reboot"
Another important thing to do is try to minimize the number of context
switches that are occurring. How to do this will depend heavily on
the applications you are running. Make sure you consider the effect
of trivial applications such as clocks and load meters. These can
significantly increase the context switching rate, and consume
valuable pmegs. As a rough guide when a machine is short of pmegs, up
to 40 context switches per second will probably be acceptable on a
Sparcstation, while a larger machine should be able to cope with maybe
100 or 200 context switches per second. These figures will depend on
the number of pmegs consumed by the average program, the number of
pmegs the machine has, and whether context switching is occurring
between the same few programs, or amongst different programs. The
above values are based on observations of machines that are mainly
used to run a large number of reasonably small applications. They are
probably not valid for a site that runs a few large applications.
Finally if you are using a Sparcstation to support many copies of a small
number of programs that are dynamically linked to large libraries, you
might want to try building them to use static libraries. For instance
this would be the case if you are running 10 or more xterms, clocks, or
window managers on the one machine. The benefit here is that page
management groups won't be wasted mapping routines into processes that
aren't ever called. The libraries have to be reasonably large for you to
gain any benefit from doing this since only 1 page management group per
process will be saved for every 256k of library code. And you need to be
running multiple copies of the program so that the physical memory cost
incurred by not sharing the library code used by other applications is
amortized amongst the multiple instances of this application.
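[As a rough illustration of this trade-off, the following C fragment
works through the arithmetic with invented numbers (library size, amount
of library code actually used, and instance count); it ignores alignment,
which may cost one extra pmeg per mapping. Substitute figures for your
own applications. -- ml]
/* static_link_savings.c, estimate pmeg savings from static linking.
 * Compile: cc -O static_link_savings.c -o static_link_savings
 */
#include <stdio.h>
#define PMEG_SPAN (256 * 1024L)
static long pmegs_for(long bytes)       /* round up to whole pmegs */
{
    return (bytes + PMEG_SPAN - 1) / PMEG_SPAN;
}
int main(void)
{
    long lib_size  = 1400 * 1024L;  /* shared library mapped into each process */
    long used_code = 150 * 1024L;   /* library code actually pulled in when
                                     * the program is linked statically        */
    long instances = 12;            /* e.g. a dozen xterms on one machine      */
    long dynamic_pmegs = pmegs_for(lib_size) * instances;  /* pmegs not shared */
    long static_pmegs  = pmegs_for(used_code) * instances;
    printf("dynamically linked: %ld pmegs, statically linked: %ld, saved: %ld\n",
           dynamic_pmegs, static_pmegs, dynamic_pmegs - static_pmegs);
    printf("physical memory cost: roughly %ldk, one unshared copy of the used\n",
           used_code / 1024);
    printf("routines, amortized over the %ld instances\n", instances);
    return 0;
}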
6. A Small Test Program
------------------------
The test program below can be run without any problems on any Sun-4 system
with 8M of physical memory or more. Indeed it will probably work on a
Sun-4 system with as little as 4M. The test program is intended to be
used to illustrate in a simple fashion the nature of the problem with the
Sun-4 virtual memory subsystem. It is not intended to be used to measure
the performance of a Sun-4 under typical conditions. It should however
allow you to get a rough feel for the amount of active virtual memory that
you will typically be able to use.
When running the test program make sure no-one else is using the system.
And then run "vmtest 64", and "vmstat -S 5" concurrently for a minute or
two to make sure that no paging to disk is occurring - "fre" should be
greater than 256, and po and sr should both be 0. Note that swapping to
the free list may be occurring due to a previously mentioned quirk of
SunOS. Once the system has settled down kill vmtest and vmstat.
To run the test program use the command "time vmtest n", where n is the
amount of virtual memory to use in megabytes. More detailed information
can be determined using a command similar to the following and comparing
the output of vmstat before and after each test.
$ for mb in 2 4 8 16 32 64
do
(echo "=== $mb megabytes ===";
vmstat -s;
time vmtest $mb;
vmstat -s) >> vm_stat.results
done
$
You may want to alter the process size or context switching rate to see
what sort of effects these have on the results. Larger processes mean
that fewer pmegs are wasted mapping less than a full address range. Hence
the amount of active virtual memory that can be used before problems start
to show up will increase. A faster context switching rate will reduce the
amount of time a process gets to execute before being descheduled. If
there is a pmeg shortage by the time the process is next scheduled many or
all of its pmegs will be gone. Adjusting the context switching rate to a
typical value seen on your system may be informative.
/* vmtest.c, test Sun-4 virtual memory performance.
*
* Gordon Irlam (gordoni@cs.ua.oz.au), June 1990.
*
* Compile: cc -O vmtest.c -o vmtest
* Run: time vmtest n
* (Test performance for n Megabytes of active virtual memory,
* n should be even.)
*/
#include <stdio.h>
#include <sys/wait.h>
#define STEP_SIZE 4096 /* Will step on every page. */
#define MEGABYTE 1048576
#define LOOP_COUNT 5000
#define PROCESS_SIZE (2 * MEGABYTE)
#define CONTEXT_SWITCH_RATE 50
char blank[PROCESS_SIZE]; /* Shared data. */
main(argc, argv)
int argc;
char *argv[];
{
int size, proc_count, pid, proc, count, i;
if (argc != 2 || sscanf(argv[1], "%d", &size) != 1 || size > 500) {
fprintf(stderr, "Usage: %s size\n", argv[0]);
exit(1);
}
/* Touch zero fill pages so that they will be shared upon forking. */
for (i = 0; i < PROCESS_SIZE; i += STEP_SIZE)
blank[i] = 0;
/* Fork several times creating processes that will use the memory.
* Children will go into a loop accessing each of their pages in turn.
*/
proc_count = size * MEGABYTE / PROCESS_SIZE;
for (proc = 0; proc < proc_count; proc++) {
pid = fork();
if (pid == -1) fprintf(stderr, "Fork failed.\n");
if (pid == 0) {
for (count = 0; count < LOOP_COUNT; count++)
for (i = 0; i < PROCESS_SIZE; i += STEP_SIZE)
if (blank[i] != 0) fprintf(stderr, "Optimizer food.\n");
exit(0);
}
}
/* Loop waiting for children to exit. Don't block, instead sleep for
* short periods of time so as to create a realistic context switch rate.
*/
proc = proc_count;
while (proc > 0) {
usleep(2 * (1000000 / CONTEXT_SWITCH_RATE));
if (wait3(0, WNOHANG, 0) != 0) proc--;
}
}
+=========================================================================
The response from Sun:
+=========================================================================
Date: 25 Jul 90 03:15:49 GMT
From: jblind@griffith.eng.sun.com (Joanne Blind-Griffith)
Sun-Spots Digest: v9n274
Subject: Re: Sun-4 MMU Performance
X-Art: Usenet #11
Sun Microsystem's Response to
A Guide to Sun-4 Virtual Memory Performance
Joanne Blind-Griffith, Product Manager, Sun Microsystems.
The recent Sun-Spots posting by Gordon Irlam is essentially accurate in
describing the hardware limitations of the Sun MMU. As he points out,
whether this limitation is encountered on any particular machine depends
on which Sun hardware is involved and what sort of applications are being
used. It is our experience that this limitation is rarely encountered
with applications which show typical locality of reference. Most common
applications and job mixes will never encounter this limit. However, some
very large applications, and some applications which share memory between
many processes, will encounter this limit.
The Sun MMU design results in a very fast MMU with a minimum of hardware.
The Sun MMU is best thought of as a cache for virtual-to-physical
mappings. As with all caches, the cache was designed to be large enough
for the sort of typical applications to be run on the machine. Nearly all
applications achieve a very high hit rate on this cache. However, like
any cache, there are applications that will exceed the capacity of the
cache, greatly lowering the hit rate. Since this cache (i.e., the Sun
MMU) is loaded by software, the cost of a cache miss can be quite
expensive.
We have improved the algorithms that manage the Sun MMU. The improvement
involves adding another level of caching between the MMU management
software and the upper levels of the kernel. This is a classic space/
time tradeoff where a little bit of space for this software cache saves a
lot of time in reloading the MMU for those applications which exceed the
hardware limits of the MMU. In addition, many other changes have been
made to the MMU management software to improve performance in general and
to reduce the effects of some worst case behaviour.
Following are the test results using Gordon's vmtest program run on a 12MB
SPARCstation 1+ with the improved MMU management software:
virtual elapsed user system
memory time time time
(MB) (sec) (sec) (sec)
2 2 2.3 0.6
4 5 4.7 0.8
8 10 9.4 1.1
10 13 11.9 1.2
12 16 14.3 1.4
14 18 16.8 1.5
16 21 19.5 1.7
18 25 22.5 1.9
20 27 25.3 2.0
22 30 27.6 2.2
24 33 30.4 2.5
26 36 33.3 2.5
28 39 35.7 2.7
30 41 38.1 2.9
32 44 40.8 3.1
Note that the performance is essentially linear through 32MB.
This improved MMU management software will be included in the next release
of SunOS. It will be available as a patch for SunOS 4.1 (Sun4c and Sun4
platforms) and 4.1 PSR A at the end of July, and for SunOS 4.0.3c (Sun4c
machines) in early August.
+=========================================================================
Heretofore unpublished article from Gordon Irlam, circa the previous note:
+=========================================================================
In a previous article I discussed how the performance of a Sun4, or
Sparcstation drops substantially once the amount of active virtual
memory exceeds some machine dependent limit. The problem was caused
by a large number of page faults resulting from a shortage of hardware
page tables, or pmegs.
Sun have developed a fix for this problem that involves maintaining a
software cache of the pmegs so that the hardware page tables can be
rapidly reloaded. I believe that this fix will shortly be available
as a patch to SunOS 4.1 for both Sun4 and Sun4c machines.
The patch improves performance to an extent which I would not have
thought possible. The results of running my test program on a
Sparcstation 1 are presented below.
old new old new
virtual relative relative elapsed elapsed
memory speed speed time time
(Mb) (sec) (sec)
2 1.00 1.00 3.5 3.6
4 1.09 1.06 6.4 6.8
6 1.14 1.14 9.2 9.5
8 1.15 1.14 12.2 12.6
10 1.17 1.15 15.0 15.6
12 1.17 1.16 18.0 18.7
14 1.12 1.12 21.8 22.5
16 1.08 1.07 25.9 27.0
18 0.57 1.08 55.3 30.1
20 0.40 1.09 87.7 33.1
22 0.25 1.09 151.3 36.2
24 0.11 1.08 388.3 40.0
26 0.12 1.08 371.9 43.4
28 0.06 1.09 764.8 46.3
30 0.03 1.09 1607.1 49.7
32 0.02 1.11 2601.0 52.1
64 very slow 1.09 very big 105.4
96 time stops 1.08 infinity 159.8
(Note the old elapsed times were under SunOS 4.0.3, the new ones are
under SunOS 4.1, thus the absolute times are not directly comparable.
Furthermore for the new tests the system was not in single user mode,
so a number of daemons etc. will also have been running.)
For 32M of virtual memory the results show a 50 fold improvement in
performance.
The fix has solved the problems we were experiencing completely, and I
imagine it will do likewise for most other sites with similar
problems. It is however plausible that sites that run unusual
programs that consume very large amounts of virtual memory, or use
virtual memory in a very sparse fashion may continue to experience
some performance degradation. But hopefully nothing like what they
are currently experiencing.
Gordon Irlam, Adelaide University.
+=========================================================================
Heretofore unpublished private e-mail Gordon sent to Sun;
explaining why he chose not to post the previous article:
+=========================================================================
Date: Tue, 21 Aug 1990 20:00:42 +0930
From: Gordon Irlam <cs.adelaide.edu.au!gordoni>
Subject: Re: DBE performance report
To: [somebody at Sun whom I choose to leave unnamed -- ml]
[...]
I am currently not planning to post the message I sent you.
1) Most of what I say has been covered by an article that Sun posted.
2) I have found that some problems still exist as far as virtual
memory performance is concerned. These are not shown up by my
test program. These problems are an order of magnitude less
(maybe even two orders less) than the problems we were having
and only relate to processes with very large sparse address spaces.
They are not a problem for normal processes, irrespective of
how many processes are run. This is not a problem for our site;
it would only be relevant to sites running fairly special applications
- Lisp, and some large array handling programs would be the
most likely candidates. I was hoping to do some measurements on
this and include this as part of my posting - specifying both
the nature of such processes that would have problems and the
extent of the problems. I now don't think I will get around to
doing this. These problems are a result of the Sun4 MMU architecture
and cannot be fixed in software.
If you feel my posting of something that corroborates what Sun have said
would help you please let me know. It would however include something
mentioning that some (much less severe) problems may still exist for some
very special applications.
Gordon.
+=========================================================================
Yet more heretofore unpublished private e-mail Gordon sent to Sun;
more technical discussion on the aspects of pmeg stealing:
+=========================================================================
Return-Path: <gordoni@cs.adelaide.edu.au>
Date: Fri, 24 Aug 1990 20:28:48 +0930
From: Gordon Irlam <gordoni@cs.adelaide.edu.au>
To: [somebody at Sun whom I choose to leave unnamed -- ml]
Subject: Re: PMEG stealing fix
> The new implementation of the SUN sun4/sun4c hat layer (AKA pmeg
> patch) DOES IMPROVE the performance of a single application with a
> sparse address space.
I agree totally. I am very impressed by the performance of the new
hat layer. What I was alluding to in the mail I sent to Allan was the
lack of any hardware table walking mechanism. So that if the memory
access pattern is very sparse performance will still be poor. I think
loading a pmeg takes roughly 200us, so you certainly don't want to be
accessing new pmegs that are not stored in the static RAM more frequently
than that.
Whether machines should have table walking hardware is an
interesting question. MIPS justify its omission on the grounds
that doing it in software constitutes on average something like a
3% performance penalty. However for some applications especially
large LISP applications it can be perhaps 80%. I guess this is
part of the trend away from true general purpose computers.
Back to Suns. Gut feeling, no quantitative justification. The
current number of pmegs on a Sparcstation is too small, and by the
time you make it bigger the cost is probably similar to what it
would have cost to provide hardware table-walking instead. I
also think the idea of pme's existing in contiguous groups is a bad
idea; caching pme's would be much more flexible.
Partial justification. In view of the current OO-hype I think it
is reasonable to anticipate large object oriented systems that
consist of large numbers of objects interacting with one another.
I suspect such systems are likely to have considerably less
locality, and be considerably larger than most of the software run
today. I imagine that Sun4's in general and Sparcstations in
particular will be a poor platform for such systems. I think the
problems will be particularly severe if persistent object-oriented
systems become at all popular.
+=========================================================================
Private e-mail responding to attributed queries from myself:
+=========================================================================
Date: Wed, 12 Sep 1990 14:46:52 +0930
From: Gordon Irlam <gordoni@cs.adelaide.edu.au>
To: mark@DRD.Com
Subject: Re: Sun 4 & MMUs: Reprise
mark> For programs dynamically linked (involving shared libraries), are
pmegs allocated for all potential members of the shared library?
Is this the rationale for savings by linking statically?
A pmeg is needed for each 256k of contiguous address space.
Statically linked executables are unlikely to require more than 1
pmeg to address all of the library routines they use since the
total size of the routines they actually use probably will be only
100 or 200k. With shared libraries virtual space is allocated for
the entire library, including the many routines that are not used.
Thus although you might only use a few routines from the library,
they will be scattered over a larger address range, which means
that more pmegs are required to map them. The X libraries caused
us the most problems because they are fairly large.
If you want to you can work out exactly how many pmegs shared
libraries are using, by using the "trace" command and keeping track
of the mmap system calls. If you have never heard of "trace"
before (I hadn't until recently), give it a go; it's fairly impressive
and useful.
mark> The point is made that shared text and data between processes
still involves non-shared pmegs (i.e., pmegs mapping the same
pages aren't shared). Is this also true for multiple processes
attaching to System V shared memory segments? Does each process
have its own set of pmegs mapping the shared memory?
I don't know anything about the system V shared memory stuff. But
based on my general knowledge of the memory management
architecture, I think this is almost certainly the case - each process
has its own set of pmegs. The lowest levels of the memory
management code don't appear to export anything to the higher
levels that would allow any other way of it being implemented.
mark> It is implied that more than 16 megabytes of physical memory is
typically going to be used as a disk cache (rather than for text,
data and what-not). What's the basis for this claim? That the
virtual memory pages mapped by the pmegs are only 50% utilized on
average?
Yes. The pmegs can map up to 32M, and on average have about 50%
occupancy. This figure is only a very coarse approximation to the
true situation. Typically I would guess it could be anything from
10M to 24M, say.
Sun have released a patch to SunOS (see Sun Spots vol. 9, No. 274,
Msg 11). That overcomes the pmeg problems for normal
applications. It works by caching the old pmeg values and loading
the entire pmeg, instead of faulting each pmeg entry in one at a
time. If you are running normal applications this will solve your
problems completely. The CPU cost involved would be no more than
that of a context switch. If you running some large numerical or
artificial intelligence applications, you may continue to have some
problems, although they will be substantially reduced. That is
programs that use large amounts of virtual memory in a very sparse
manner can still have problems. This is because Suns currently do
not have any hardware page table walking mechanism, there is little
that can be done about this using software - if you run Lisp you
might have a look at tuning the behavior of the garbage colector to
reduce the number of faults, but if that fails you will either have
to get a bigger machine, or switch to a different architecture
(stay away from Mips too, since they don't have hardware table
walking). At our site we run fairly standard applications, and the
pmeg patch from Sun has solved our problems completely.
Gordon.
+=========================================================================
Private e-mail responding to my query about existance of 4.0.3 patch:
+=========================================================================
Date: Wed, 12 Sep 90 08:11:25 PDT
From: jblind@Eng.Sun.COM (Joanne Blind-Griffith)
To: mark@DRD.Com
Subject: Re: Sun-4 MMU (Sun-Spots v9i274m11)
Yes, there is a PMEG patch for SunOS 4.0.3c. You can obtain this patch
by calling the Answer Center (1-800-USA-4SUN); however, they may ask you
to verify whether you're experiencing a PMEG thrashing problem by running
'hatstat'.
Joanne Blind-Griffith
Desktop System Software Product Manager
+=========================================================================
Private e-mail responding to my request for permission to post all this:
+=========================================================================
Date: Fri, 14 Sep 1990 10:00:58 +0930
From: Gordon Irlam <gordoni@cs.adelaide.edu.au>
To: mark@DRD.Com
Subject: Re: Sun 4 & MMUs: Reprise
> Would you mind if I posted the note you sent me to sun-spots?
Not at all. Feel free to edit what I sent as appropriate. The important
thing to get across is that the patches will solve the problems completely
for normal applications, and it is only a few very unusual applications
that may still have trouble. These unusual applications are likely to have
similar problems on other machines that do not have hardware page table
walking, such as MIPS-based machines. Do you know if the RS/6000 has
hardware table walking? I imagine it does, but I am not certain.
Gordon Irlam
Adelaide University
(gordoni@cs.adelaide.edu.au)
PS. We don't have any problems now that we have installed the pmeg patch.
=======================================
Final note: the figures given above were for the database accelerator
package; this turned into the pmeg patch, but some further tuning occurred
along the way. The figures are probably not accurate for the final
pmeg patch that Sun produced.
Gordon.