cron not setting roots PATH correctly

January 28, 2008

While my home server continues to run the timezone enabled cron
daemon I have after the last upgrade to build 81 I started getting
mails to root saying:


Your "cron" job on pearson
/tank/fs/local/snapshot minute tank/fs
produced the following output:
/tank/fs/local/snapshot[2]: zfs: not found


Which was odd as the script had worked perfectly for years, well
months. So why did root’s path no longer contain “/usr/sbin”?
Here I made a big mistake. I assumed (always a bad thing) that the
bug was introduced by my code. Needless to say the timing could not
have been worse. I had just put the “final” code
changes for code review
so finding a new bug was a real fly in
the ointment. So finding a bug in that code would just be irritating.
Then to add more confusion to the bug if you used crontab -e to edit
the crontab, for example to add a cron job like this:





* * * * * echo $PATH > /tmp/.root_path


To help debug the problem the problem would go away, at which point
you forget about it until you reboot the system (to help diagnose
6653187)
and when the job runs now it has the wrong PATH.


After a few minutes staring at the code
it is obvious what is wrong. We are using an uninitialized variable
to choose which PATH to use. The question was what had I done to
cause this? Now I spend a few hours staring at the code running under
libumem, running under the debugger, to see how I could have
introduced the bug. I could not see how this could ever have worked.
Finally I decide to check to see if there have been any recent
changes to cron in the hoe that this was not my fault. So it was off
to Martins “Mercurial
for TeamWare users
” page to find how to do this with
mercurial:





changeset:   5581:aa8f6b1ea400
user: basabi
date: Mon Dec 03 14:32:45 2007 -0800
summary: 6636777 *cron* coredumps on NULL home directory
changeset: 5558:0976be4b75d2
user: basabi
date: Thu Nov 29 21:09:22 2007 -0800
summary: 6416652 *cron* suffers from amnesia if name services aren’t there
at boot time
changeset: 1315:45f0335a274a
user: basabi
date: Tue Jan 24 07:11:42 2006 -0800
summary: 6270017 cron/at-jobs log warning about not obtaining latest contrac
t
from popen(3c)


Perhaps one of those last two putbacks introduced the bug. Time to
try the unmodified cron binary (yes the time to try that binary was
hours ago but there is no point in being smart after the event).
Sure enough the bug is there so I did not introduce it. Time to file
the bug an move on.





Bug ID 6655359
Synopsis cron assumes malloc returns zeros memory and then
sets root’s path by luck rather than judgement





Introduced by: 6416652
*cron* suffers from amnesia if name services aren’t there at boot
time


Moral. Always check the putback logs.


Good Morning Build 81

January 27, 2008

After a short delay (things are very busy at the moment) build 81
has hit the Sun Ray server:


: enoexec.eu FSS 1 $; uname -a
SunOS enoexec 5.11 snv_81 sun4u sparc SUNW,Sun-Fire
: enoexec.eu FSS 2 $;


All seems well so far. Again StarOffice decided to take me through
the, “Do you agree with the T&C’s and have you Registered
dialogue” which is irritating that it can’t remember but not
critical.


Not quite going to Maidenhead.

January 27, 2008

Only five riders this morning and a deceptively cold morning it
was too. In one of those strange coincidences I was already thinking
we had not been to Windsor for a while when someone suggested we head
out that way for a flat run to Maidenhead. However somewhere after
Ascot we got all competitive as there was a group of riders up the
road so we missed our left turn and ended up back in Windsor. Once we
arrived at the Cafe I was actually very pleased as I realised just
how cold my hands were. Nothing some Tea, which in the Top Twenty, as
I have mentioned
before
means as much as you can drink and something to eat could
not cure.


Then we returned via Old Windsor, it is easy to forget what a
fabulous place Windsor until you cycle through it on a clear crisp
(and yes cold) Sunday morning. Then home via Datchett, Staines and
Shepperton.


50 miles, very gentle. Not going to get fit enough for the planned
trip to Ventoux
by doing rides like this however much I engjoyed it.


Another January Bonk

January 24, 2008

It
happened again last
night. I did not even get out of Frimley
when I started to feel all the energy drain from my legs and started
to get the horribly familiar sweating that seems to accompany
bonking. Being wise to the tell tale signs I consumed my emergency Go
Bar
and took on fluids which enabled me to climb our of Frimley
and make good progress all the way to Webridge but with just two
miles to go everything just stopped. I still find it hard to believe
how total the lack of energy is and how quickly it takes you over
when you bonk. I freewheeled down past Walton Station and the crawled
the last few yards home.


I suspect a few things came together to get me this time:



  1. I’m not as fit as normal since it is winter, I’ve been
    traveling, I’ve been eating too much and not riding enough.


  2. I did not have a second breakfast when I got to work. (I know
    this sort of contradicts the eating too much above but the important
    thing is to eat the right food at the right time, which is within
    15 minutes of the end of your ride).


  3. Lunch did not have enough carbs in it.


  4. I left work late so Lunch was now 6 hours earlier.



When I finally did get home I then went OTT and ate
much to much. Now looking back at my blog I realise that January does
seem to be the month when I most often bonk all though apparently not
in 2007.


Breaking chains

January 20, 2008

A warm, very warm for this time of year, 52 mile ride with 8
riders. We went out over White Down the descent being unpleasant
thanks to the damp road with signs of oil making the hairpin
after the Pill Box particularly entertaining. Then over the A25
though Holmbury St Mary, Forest Green, Ockley and round to Henfold
Lakes.


The usual return trip was interrupted by a puncture, hardly
newsworthy, then during the quick sprint up the hill into Leatherhead
a snapped chain. Once
again
it was one in the eye for those who laugh about the tool
kit
I carry. So rather than calling for a Taxi we repaired the
broken chain by taking a link out (I was on my old tourer so the
spare chain links I had were for a 5-7 speed chain not an 8-9 speed
chain) and continued on our way. Breaking a chain is a great excuse
reason to not compete on any of the climbs.


Opposite’s in Service

January 17, 2008

When I read the announcement
about Sun’s acquisition of MySQL
on Jonathan’s blog I too
was drawn to the fact that for every employee of Sun Service there
is an equal and opposite employee. Check out the picture:







I’m left wondering what my opposite would be like. Clearly they
would not ride a bike. I hesitate to suggest asking what other
characteristics they would have but no one reads this anyway.


Dear google reader

January 14, 2008

If I have gone to the trouble of logging in then you know who I am and so can I please have the pages in my native tongue, English English, rather than the native tongue of the IP address from which I am connected. Which is this case is Spanish.


Also can someone tell me how to force google search to let me be English when travelling?


Ranmore & Box Hill

January 13, 2008

Despite the forecast of miserable weather this morning we had 7
riders out and were quickly heading out of Cobham when we took an
unexpected left down Cobham
Park Road
. I can see why we’ve never been down there before as it
essentially just does three sides of a rectangle and if we did not
turn we would have just done the forth side. However it was very
picturesque and since the purpose of the ride is just cycling
actually going somewhere is not important.


In that vein after we went through Effingham Junction1
and then Effingham to climb up onto Ranmore we took both lefts so we
ended up going along Hogden
Lane
which was closer to cyclo cross than road riding. Then along
Ranmore Common Road and down Ranmore Road into Dorking. Again the
descent down Ranmore Road was spectacular although on a wet road you
either have to take it easy or trust that you can get round the bends
without braking.


By now we were obviously on for a short ride and now we had the
wind behind us as we flew north on the A24 not using the bike track
as 30mph on the bike track is dangerous and we were heading for the
Zig
Zags
which is more safely done from the main road. Up the Zig
Zags after the first turn I was convinced I actually was a cycling
God as I flew up the hill only to be returned to reality at the
second turn when I turned into the wind that had been blowing me up
the hill. I managed the hill in 8 minutes dead (and I almost was).


Then a more gentle run down over Headley to the Cafe in
Leatherhead. From Leatherhead it was the usual run home only faster
due to the massive tail wind.


One negative thing was the Audi that Hooted us in Esher, for
daring to be on the road and then when he passed “washed his
windscreen” with windscreen washers that seemed to miss his
windscreen altogether. I have a word for people like that but this is
family blog.


The short ride was only 42 miles and an average of 15.4mph.









1
Ah there is a mystical place of my my youth. The train my Dad would
get home from work, the O2 (IIRC) from Victoria, used to terminate
at Effingham Junction. What small boy would not wonder what
wonderful things would be at this place. I’m glad I never managed to
get there as the is actually very little to report from Effingham
Junction, well at least for a small boy, unless I could have blagged
my way into this
shed
.



Latency bubbles in your disk IO

January 11, 2008

The following was written in response to an email from a
customer about monitoring IO in response to my scsi.d
postings
. Tim covers where disk IO requests can be queued in his
posting titled “
Where
can I/O queue up in sd/ssd
which
I would recommend as a starting point.


The disk IO sub-systems are built to provide maximum through put
which is most often the right thing. However the weakness of tuning
for throughput is that occasionally you can get some bizarre
behaviour when it comes to latency. The way that optimum IO bandwidth
is achieved is by sorting each io by logical
block address (LBA)
and then issuing those in order to minimize
head seek. This is documented in the disksort(9F) manual page.


So if you have a sequence of writes to blocks N, N+1, N+2, N-200,
N+3, N+4, N+5,N+6, N+7 in that order and your LUN
as a queue depth and therefore throttle of 2.1
The IO’s will actually be delivered to the LUN in this order N, N+1,
N+2, N+3, N+4, N+5,N+6, N+7, N-200. Hence there will be a significant
latency applied to the IO going to LBA N-200 and in practice it is
possible to have IO requests delayed on the waitq for many seconds (I
have a pathological test case that can hold them there for the time
it takes to perform an IO on nearly every block on the LUN,
literally hours). You better hope that that IO was not your important
one!


This issue only comes into play in the disk driver has reached the
throttle for the device as up until that point each IO can be passed
straight to the LUN for processing.2
Once the driver has reached the throttle for the LUN it begins
queuing IO requests internally and by default will sort them to get
maximum throughput. Clearly the lower the throttle the the sooner you
get into this potential scenario.


Now for the good news. For most disk arrays sorting by LBA does
not make much sense since the LUN will be made up of a number of
drives and there will be a read cache and a write cache. So for these
devices it makes sense to disable disksort and deliver the IO
requests to the LUN in the order in which they are delivered to the
disk driver. If you look in the source for sd.c you will see that we
do this by default for most common arrays. To achieve this there is a
flag, “disable
disksort”,
that can be set in sd.conf or
ssd.conf depending on which driver is in use. See Micheal’s
blog entry
about editing sd.conf. While you are reading that
entry note you can use it to set the throttle for individual LUNS so
you do not have to set [s]sd_max_throttle, which will penalize all
devices rather than just the one you were aiming for. If you have
just one that only has a small queue depth and you will see below why
a small queue depth can be a really bad thing.


So how could you spot these latency bubbles?


It will come as no surprise that the answer is dtrace. Using my
pathological test case, but with it set to run for only 10 minute to
a single spindle, the following D produces a clear indication that
all is not well:


fbt:ssd:ssdstrategy:entry,
fbt:sd:sdstrategy:entry
{
start[(struct buf *)arg0] = timestamp;
}
fbt:ssd:ssdintr:entry,
fbt:sd:sdintr:entry
/ start[(this->buf = (struct buf *)((struct scsi_pkt *)arg0)->pkt_private)] != 0 /
{
this->un = ((struct sd_xbuf *) this->buf->b_private)->xb_un;
@[this->un] = lquantize((timestamp – start[this->buf])/1000000,
60000, 600000, 60000);
@q[this->un] = quantize((timestamp – start[this->buf])/1000000);
start[this->buf] = 0;
}


This produces the following output3,
the times are milliseconds:





dtrace: script ’ssdrwtime.d’ matched 4 probes
^C
6597960853440
value ————- Distribution ————- count
< 60000 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 377204
60000 | 0
120000 | 0
180000 | 0
240000 | 0
300000 | 0
360000 | 0
420000 | 0
480000 | 2
540000 | 300
>= 600000 | 0
6597960853440
value ————- Distribution ————- count
-1 | 0
0 | 40
1 | 9
2 | 6
4 | 17
8 | 23
16 | 6
32 | 36
64 |@@ 15407
128 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 361660
256 | 0
512 | 0
1024 | 0
2048 | 0
4096 | 0
8192 | 0
16384 | 0
32768 | 0
65536 | 0
131072 | 0
262144 | 0
524288 | 302
1048576 | 0


Now recall that my test case is particularly unpleasant but it
demonstrates the point. 300 IO requests took over 9 minutes and they
only actually got to complete as the test case was shutting down.
While the vast majority of the IO requests complete in less than
256ms.





Now lets run the same pathological test
with disksort disabled:


dtrace: script ’ssdrwtime.d’ matched 4 probes
^C
6597960853440
value ————- Distribution ————- count
< 60000 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 543956
60000 | 0
6597960853440
value ————- Distribution ————- count
-1 | 0
0 | 30
1 | 21
2 | 30
4 | 0
8 | 0
16 | 50
32 | 3
64 | 384
128 | 505
256 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 531169
512 |@ 11764
1024 | 0


Notice that now the majority of the IO requests took longer now,
falling in the 256ms bucket rather than the 128ms bucket but none of
the IO requests took many minutes.





Now my test case is pathological but if
you have drives with small queue depths and disk sort is still
enabled you are open to some quite spectacular latency bubbles. To
mitigate this my advice is:



  1. Don’t ever set the global
    [s]sd_max_throttle in /etc/system. Use the [s]sd.conf file to set
    the appropriate throttle for each device.


  2. Consider what is more important to
    you. Throughput or latency. If is is latency or if your LUN is on a
    storage array then turn of disksort using the [s]sd.conf file.


  3. If you have pathological
    applications then understand that the IO subsystem can give you
    throughput or bounded latency, not both. So separate out the IO
    devices that need throughput from those for which latency is more
    important.


  4. Be aware that even “dumb”
    disk drives often implement disk sort internally so in some cases
    they can give a similar issues when they have a queue depth of
    greater than 24.
    In those cases you may find it better to throttle them down to a
    queue depth of 2 and disable disksort in [s]sd to get the most
    predictable latency all be it at the expense of throughput. If this
    is your issue then you can spot this either by using scsi.d directly
    or by modifying it to produce aggregations like those above. I’ll
    leave that as an exercise for the reader.






1The
queue depth of a LUN is the number of commands that it can handle at
the same time. The throttle is usually set to the same number and it
used by the disk driver to prevent it sending more commands than the
device can cope with.




2Now
the LUN itself may then re order the IO if it has more then two IO’s
in it’s internal queue.




3Edited
to remove output for other drives.




4With
a queue depth of 2 the drive can not sort the IO requests. It has to
have one as active and the next one waiting. When the active one
completes the waiting one will be actioned before a new command can
come from the initiator.



Privileges, fdisk and backing up windows xp

January 8, 2008

Over the holiday period I had some fun when the one Windows XP
system in the house failed and would not boot. The reason was that
the disk had a number of sectors that could not be read and had to be
replaced. Anyway since all the user data is not present on the PC
nothing was lost except a few hours of my time. That time did make me
think that even the old Solaris installer was not that bad and that
was before I had to reinstall every program, plus service pack two
and the worlds supply of patches. So I have decided that I should
back up the XP partition so that I don’t have to go through that
again.


To do this I I’ve installed Solaris as a dual boot and then use
dd(1) to copy the XP partition onto the server. The only problem is
that the cmdk(7D) driver does not seem to honour the end of
partitions. So if I just dd the /dev/rdsk/c0d0p1 partition I don’t
just get the XP partition but also get the entire Solaris partition.


Hence I have a little script that used fdisk(1M) to get the disk
geometry and then copy just the XP partition. Now the strange thing
about fdisk is that it has to be run with “all”
privileges. So the script is started with all and then drops any that
it does not really need, ie nearly all of them. So exec_attr and
prof_attr have to have the “XP Backup” profile added so
that they look like this:


# egrep ‘XP Backup’ /etc/security/*_attr
/etc/security/exec_attr:XP Backup:solaris:cmd:::/usr/local/sbin/xp_backup:privs=all
/etc/security/prof_attr:XP Backup:::Back up Windows XP:help=XpBackup.html
#


Then just use ‘usermod -P “XP Backup” cjg’ so I can run
this via ssh and then redirect the output into a file.


Anyway here is the script, I’ve tested that it works on the PC and
on my work laptop which still has an XP partition:


#!/bin/ksh
# CDDL HEADER START
#
# The contents of this file are subject to the terms of the
# Common Development and Distribution License (the "License").
# You may not use this file except in compliance with the License.
#
# You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
# or http://www.opensolaris.org/os/licensing.
# See the License for the specific language governing permissions
# and limitations under the License.
#
# When distributing Covered Code, include this CDDL HEADER in each
# file and include the License file at usr/src/OPENSOLARIS.LICENSE.
# If applicable, add the following below this CDDL HEADER, with the
# fields enclosed by brackets "[]" replaced with your own identifying
# information: Portions Copyright [yyyy] [name of copyright owner]
#
# CDDL HEADER END
#
# copy the Windows partition to standard out.
# This needs to run with the "all" privilege so that fdisk can read the partion
# information. It has to read that as the disk driver does not enforce the end
# of the partitions.
#
# I added the XP Backup profile and then added this command to it with
# the additional priv. Here is teh entry from /etc/security/exec_attr:
# XP Backup:solaris:cmd:::/usr/local/sbin/xp_backup:privs=all
# (obviously not as a comment).
#
# Then there is a degree of paranoia in the script. It drops all the
# privileges that it does not need.
export PATH=/usr/bin:/usr/sbin
device=${1:-c0d0}
device=${device##*/}
set – $(fdisk -W – ${device}p0 | ppriv -e -s A=basic nawk ‘$1 ~ "[0-9]" { x++ } $1 == 7 { print $7, $NF/$7, $NF%$7, x }’)
ppriv -s L=file_dac_read,proc_exec -s EIP=proc_exec,file_dac_read $$
if (( $3 != 0 ))
then
print "Partition is not a whole number of tracks!" >&2
exit 1
fi
bs=$1
count=$2
part=$4
if (( bs == 0 || count == 0 ))
then
print "Block size($bs) and block count($count) can be zero" >&2
exit 1
fi
exec ppriv -e -s EPL-file_dac_read -s I-file_dac_read dd bs=${bs}b count=$count < /dev/rdsk/${device}p${part}

Follow up: It turns out the comment obout cmdk letting you read off the end of the disk is wrong.
Only when using the p0 partition can you read the whole drive (which was my mistake) if you choose the correct partition which in my case is the p1 partition it will only read to the end of the partition. However I still need to use fdisk to find the correct partition so the script remains the same.