Skip to content

What is that system call doing?

October 3, 2008

Getting back on topic, here is a nice short bit of Dtrace.

Sometimes by the time I get to see an issue the “where on the object” question is well defined and in two recent cases that came down to “Why is system call X slow?” . The two system calls were not the same in each case but the bit of D to find the answer was almost identical in both cases.

Faced with a system call that is taking a long time you have to understand the three possible reasons this can happen:

  1. It has to do a lot of processing to achieve it’s results.

  2. It blocks for a long time waiting for an asynchronous event to occur.

  3. It blocks for a short time but many times waiting for asynchronous events to occur.

So it would be really nice to be able to see where a system call is spending all it’s time. The starting point for such an investigation is that when in the system call there are only two important states. The thread is either running on a CPU or it is not. Typically when it is not it is because it is blocked for some reason. So using the Dtrace sched provider’s on-cpu and off-cpu probes to see how much time the system call spends blocked and then print out stacks if it is blocked for more than a given amount of time.

Here it is running against a simple mv(1) command:

$ pfexec /usr/sbin/dtrace -s syscall-time.d -c "mv .d .x" dtrace: script ‘syscall-time.d’ matched 17 probes dtrace: pid 26118 has exited CPU     ID                    FUNCTION:NAME   3  79751                     rename:entry rename(.d, .x)   3  21381                    resume:on-cpu Off cpu for: 1980302               genunix`cv_timedwait_sig+0x1c6               rpcmod`clnt_cots_kcallit+0x55d               nfs`nfs4_rfscall+0x3a9               nfs`rfs4call+0xb7               nfs`nfs4rename_persistent_fh+0x1eb               nfs`nfs4rename+0x482               nfs`nfs4_rename+0x89               genunix`fop_rename+0xc2               genunix`vn_renameat+0x2ab               genunix`vn_rename+0x2b    3  79752                    rename:return    on-cpu                                                        value  ————- Distribution ————- count                16384 |                                         0                    32768 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@           3                    65536 |@@@@@@@@@@                               1                   131072 |                                         0            off-cpu                                                       value  ————- Distribution ————- count               131072 |                                         0                   262144 |@@@@@@@@@@@@@@@@@@@@@@@@@@@              2                   524288 |                                         0                  1048576 |@@@@@@@@@@@@@                            1                  2097152 |                                         0          rename times on: 205680 off: 2625604 total: 2831284   $  

From the aggregations at the bottom of the output you can see that the system call went off-cpu three times and one of those occasions it was off CPU for long enough that my limit of 10000000 nano seconds was reached and so a stack trace was printed. It also becomes pretty clear where that system call spent all it’s time. It was a “rename” system call and I’m on an NFS file system so it has to wait for the server to respond and that server is going to have to make sure it has updated some non-volatile storage.

Here is the script:

#!/usr/sbin/dtrace -s /* run using dtrace -p or dtace -c */ syscall::rename:entry / pid == $target / { 	self->traceme = 1; 	self->ts = timestamp; 	self->on_cpu = timestamp; 	self->total_on = 0; 	self->total_off = 0; 	printf("rename(%s, %s)", copyinstr(arg0), copyinstr(arg1)); } sched:::off-cpu / self->traceme == 1 / { 	self->off_cpu = timestamp; 	self->total_on += self->off_cpu – self->on_cpu; } sched:::off-cpu / self->traceme == 1 / { 	 @["on-cpu"] = quantize(self->off_cpu – self->on_cpu); } sched:::on-cpu / self->traceme == 1 / { 	self->on_cpu = timestamp; 	 @["off-cpu"] = quantize(self->on_cpu – self->off_cpu); 	 self->total_off += self->on_cpu – self->off_cpu; } /* if off cpu for more than a millisecond print a stack */ sched:::on-cpu / self->traceme == 1 && timestamp – self->off_cpu > 1000*1000 / { 	printf("Off cpu for: %d", self->on_cpu – self->off_cpu); 	stack(10); } sched:::off-cpu / self->traceme == 1 && timestamp – self->on_cpu > 1000*1000 / { 	printf("On cpu for: %d", self->off_cpu – self->on_cpu); 	stack(10); } syscall::rename:return /self->traceme/ { 	self->traceme = 0; 	self->total_on += timestamp – self->on_cpu; 	@["on-cpu"] = quantize(timestamp – self->on_cpu); 	printa(@); 	printf("%s times on: %d off: %d total: %d\n",probefunc, self->total_on, 	self->total_off, timestamp-self->ts); 	self->on_cpu = 0; 	self->off_cpu = 0; 	self->total_on = 0; 	self->total_off = 0; }

11/9/2009 Updated script to correct the time in the comment.


From → Solaris

  1. Nice one. Is there a reason for having 2 instances of sched:::off-cpu probe with the same predicate but different bodies ?

  2. No there is not. It is just the way the script evolved.

  3. Great, this really helped give a definitive answer to a serious customer problem. For me the mental leap required to get from "examine potential code paths, instrument code paths, run experiment, repeat" to "get a stack at just the right point in time." was a big one.
    There is a lesson to be learned for newbs like me in that: when you are in a position where you think the only way to get the answer with dtrace will be to install probes everywhere it is wiser to try to recast the problem to "how do I take a reading at the critical *time*". Many thanks for that lesson, my troubleshooting will be better for it.

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: