
top - Simple Concrete Examples

Some of you might have read Computer Science at university. I read Physics (and then Maths), which I like to think taught me how to experiment (and, in the case of number theory, how to thrash around with no sense of direction). So while Dijkstra assures us that calling it computer science "is like referring to surgery as 'knife science'", we can nonetheless be a bit experimentalist about this. Practical experience is gold, and starting simple is by no means a bad thing.

As I claimed above, a program that isn't waiting around for anything will show up in top with high user/system values and low idle/waiting values. Time to put my money where my mouth is: let's create programs that deliberately provoke these situations, just to be sure.

User/system is high - idle/waiting are low

// simpleHeavyBurn.cpp
// Burn CPU: increment a value forever.

int main()
{
  unsigned int x = 1;
  while (true)
  {
    x++;
  }
}

Couldn't be simpler. This code has no inputs and no outputs; it just increments a value forever. Bonus question: what happens if one attempts to increment a signed int beyond its maximum value? It's undefined behaviour (see clause 5, paragraph 4 of the C++11 standard, ISO/IEC 14882:2011), but doing the same with an unsigned int is fine: unsigned arithmetic is defined modulo 2^n, so the value simply wraps around to zero.
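That wraparound is easy to see for yourself; here's a trivial demonstration (an aside of mine, not part of the experiment):

// unsignedWrap.cpp - demonstrating that unsigned overflow is well-defined
#include <climits>
#include <iostream>

int main()
{
  unsigned int u = UINT_MAX;
  u++;                         // well-defined: wraps around to 0
  std::cout << u << std::endl; // prints 0

  // int s = INT_MAX;
  // s++;                      // undefined behaviour - don't rely on anything here
}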

I expect this to occupy one core completely (i.e. take up 100% CPU time on one core). Let's build it, run it and then watch it in top. Build with a simple command line call:

g++ simpleHeavyBurn.cpp -o simpleHeavyBurn

and then watch it run.
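If you want to follow along, backgrounding the process and then starting top (here with a one-second refresh via its -d flag, though the exact invocation is a matter of taste) is enough:

./simpleHeavyBurn &
top -d 1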

top running, showing a single core maxing

As expected, we can see the executable thrashing a single core. No waiting for IO, no sleeping; the process simpleHeavyBurn simply churns as fast as it can. In this display format of top there is an apparent inconsistency: the process of interest is consuming 100% CPU (or near enough), yet the %Cpu(s) value on the third line doesn't reach 30. If the process is using all the CPU, how come the summary line indicates that we're spending just over a quarter of CPU time in userspace (us), with the rest idle (id)?

The answer is that the summary line is showing a value averaged over all cores, while the %CPU column is normalised such that 100 represents one core's worth. On this four-core machine, one fully-busy core shows as 100% in the %CPU column, but only 25% on the summary line. This becomes clearer if we switch top to show each core separately, which is done by pressing the "1" key in top. Here's the result:

top running with per-core display, showing a single core maxing

Note that there is now a separate line for each core, showing the time each individual core is spending in userspace, system space, idle and waiting. It's clear that in this example the process of interest is completely occupying %Cpu3, which is spending no time doing anything else.

The immediate performance conclusions from this are clear: there's no waiting; a core is churning as hard as it can; and we've left three cores sitting idle. Confronted with a top like this, the obvious next steps would be to look at improving the efficiency of the code (so that we get more done in the same number of CPU cycles), and at making use of more than one core (i.e. multi-threading or multi-processing; a sketch follows).
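To illustrate the multi-core route, a threaded variant of the burn loop might look something like this (my sketch, not part of the original experiment): one busy-loop thread per hardware core.

// simpleHeavyBurnMT.cpp - hypothetical multi-threaded burn:
// one busy-loop thread per hardware core.
#include <thread>
#include <vector>

void burn()
{
  volatile unsigned int x = 1; // volatile, so the loop isn't optimised away
  while (true)
  {
    x++;
  }
}

int main()
{
  unsigned int cores = std::thread::hardware_concurrency();
  if (cores == 0) cores = 4;   // hardware_concurrency may return 0 if unknown
  std::vector<std::thread> threads;
  for (unsigned int i = 0; i < cores; i++)
  {
    threads.emplace_back(burn);
  }
  for (auto& t : threads)
  {
    t.join();                  // the threads never finish; kill the process to stop it
  }
}

Built with g++ simpleHeavyBurnMT.cpp -o simpleHeavyBurnMT -pthread, this should max out every core in top's per-core view.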

So what happens if we run up two of these processes at the same time? That is, we start the executable simpleHeavyBurn, and then while it's running, start another one? The simple interpretation is that we'd see two of them running in top, and if our operating system knows what it's doing, they will have a core each. Let's experiment. This time, I'm using the default top refresh period of two seconds.
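From a shell, launching the pair is just a matter of backgrounding each one:

./simpleHeavyBurn &
./simpleHeavyBurn &
top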

top running, showing two cores maxing

As expected, we can see both simpleHeavyBurn processes, each running as fast as it can on its own core.

User/system is low - waiting is high

This is something we would expect to see if data can be processed faster than it can be pulled off a disk; the CPU does some work to process data, and then sits waiting for the next batch to arrive. If we had jumped straight into profiling our code, we might never notice that the CPU is mostly sitting idle, and could end up fine-tuning some section of code that makes no difference to the overall performance.

I was once working on an application that had to fetch, process and display a whole lot of data kept in a database. All communication with the database went via a separate executable; all I could do was ask for the data, and then wait. It took an age. Fingers were being pointed, but running up top made it clear what was happening. The process in charge of fetching from the database was spending a few percent of CPU time working in user-space, and almost the entire rest of the CPU time idle-waiting. This indicated that the CPU was ready and willing to do a lot more, a lot faster, but was having to wait for IO to complete. Duly pointed at the source of the problem, we targeted the code that fetched the information from the database file, and pretty soon there was no waiting. Throughput increased by a factor of almost ten, slicing ninety percent off the time taken. All it required was a bit of experimentation and thought to ensure that data was being provided as fast as the CPU could use it.

We could have spent an age tuning the code doing the processing, and even if we'd doubled the speed of that part, there would have been zero difference in the overall performance - the CPU would just spend more time waiting around for IO.

Lesson: Identify the source of the delay and fix that.

I know it sounds obvious, but you'd be surprised.

Let's write some code to deliberately provoke this situation. I'm simply going to use C++ to copy a file, which will encompass reading data in and then writing it out. I've generated a 10GB file of random numbers to do this.
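If you want a similar file to follow along with, dd is one way to conjure one up (here 10240 blocks of 1MB of random bytes, roughly 10GB; not necessarily how I generated mine):

dd if=/dev/urandom of=ranNumbers bs=1M count=10240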

// readWriteFile.cpp
#include <fstream>

int main()
{
  std::ifstream source("ranNumbers", std::ios::binary);
  std::ofstream dest("copyOfranNumbers", std::ios::binary);

  // Stream the entire source file into the destination in one statement.
  dest << source.rdbuf();
}

This being an experimental art, I don't mind admitting it took me a few attempts to get something that would unambiguously demonstrate waiting for IO.
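For comparison, an explicit chunked read/write loop is another way to write the same copy; this is a sketch of that shape (not necessarily one of my actual attempts), with control over how much is requested from the disk per read:

// readWriteFileChunked.cpp - hypothetical variant copying in explicit 1MB chunks
#include <fstream>
#include <vector>

int main()
{
  std::ifstream source("ranNumbers", std::ios::binary);
  std::ofstream dest("copyOfranNumbers", std::ios::binary);

  std::vector<char> buffer(1 << 20); // request 1MB per read
  while (source.read(buffer.data(), buffer.size()) || source.gcount() > 0)
  {
    dest.write(buffer.data(), source.gcount()); // write however much arrived
  }
}

Either way, the rdbuf() one-liner is what I settled on. As before, build with: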

g++ readWriteFile.cpp -o readWriteFile

and then watch it run.

readWriteFile process showing lots of waiting for IO

There is a lot of waiting going on here, but why is it spread over so many cores? We can see that at any one time the process readWriteFile is using only a few percent of one core. The refresh period here is one second; maybe the process is being moved from core to core, and we're seeing the waiting time smeared over all of them. If that were the case, we would expect that increasing the refresh rate of top would show the waiting time becoming more concentrated in a single core (the process would be moved around fewer cores in the reduced period). Let's test; I'll repeat, with top refreshing twice a second (a delay of 0.5 seconds between updates).

readWriteFile process showing lots of waiting for IO, with top refreshing more frequently

Looks a bit more concentrated. Maybe, maybe not. Increasing the refresh rate of top any further pushes it into the realms of too fast for a human to sensibly draw information from by watching. At this point, my options are two-fold: run top in batch mode to output to a file, and then look through it at leisure or process it with a suitable script (there's a sketch of this at the end of the section); or force the process to remain on one core. In Linux, a process can be pinned to a set of cores with the taskset command, which takes a bitmask of allowed cores (0x1 being the first core):

taskset 0x1 ./readWriteFile

which leads to a top showing what we suspected:

readWriteFile process showing lots of waiting for IO, with top refreshing more frequently, now pinned to the first core

With the process pinned to the first core, we can see that the waiting is almost all on that core. As expected, the process is spending a lot of time waiting for IO to complete. I think the surge in waiting on another core at the end is the animated gif of this very recording starting to be written as the capture comes to a close. So there we have it: a simple example in which the process is held up waiting for IO to complete. Obviously, real-world examples are rarely going to be as simple as this, but that's the point of simple examples.
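And, as promised, a sketch of the batch-mode route: ask top to write a fixed number of plain-text snapshots to a file, optionally restricted to the process of interest, and pore over the result afterwards (the figures here are illustrative):

top -b -d 0.5 -n 120 -p $(pgrep readWriteFile) > top.log

That captures a minute's worth of half-second snapshots, which a script (or a patient human) can then trawl for patterns no eye could catch live.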