Jumpstart supernet bug

#system administration #networking

The Solaris Jumpstart environment (actually the whole network boot subsystem) has a stupid bug in the rcS script that causes a panic in a "supernetted" environment.

The panic looks like this:

Rebooting with command: boot /pci@8,700000/pci@3/SUNW,qfe@0,1
Boot device: /pci@8,700000/pci@3/SUNW,qfe@0,1  File and args:
2ae00
Requesting Internet address for 0:3:ba:34:a3:12
SunOS Release 5.8 Version Generic_108528-13 64-bit
Copyright 1983-2001 Sun Microsystems, Inc.  All rights reserved.
whoami: no domain name
rtioctl: kstr_ioctl failed: error 128
whoami: couldn't add route: error 128.
WARNING: nfsdyn_mountroot: NFS3 mount_root failed: error 128
Cannot mount root on /pci@8,700000/pci@3/SUNW,qfe@0,1 fstype nfsdyn

panic[cpu0]/thread=10408000: vfs_mountroot: cannot mount root

0000000010407970 genunix:vfs_mountroot+70 (10435c00, 0, 0, 10410918, 10, 14)
  %l0-3: 0000000010435c00 0000000010439250 000000007e000000 0000000010435e38
  %l4-7: 0000000000000000 00000000104136b0 00000000000b7798 0000000000001798
0000000010407a20 genunix:main+94 (10410160, 2000, 10407ec0, 10408030, fff2, 1004ec8c)
  %l0-3: 0000000000000001 0000000000000001 0000000000000015 0000000000000e9a
  %l4-7: 0000000010428c38 0000000010462318 00000000000cd4c0 0000000000000540

skipping system dump - no dump device configured
rebooting...

Resetting ...

This bug is registered at Sun as Bug ID 4832595. The last time I checked, its status was "closed, because not a bug". That is not true. Unfortunately, we have a "supernetted" environment, so I had to help myself.

The system panics while adding the default route in /Solaris_8/Tools/Boot/etc/rcS. Look for the line /sbin/hostconfig -p bootparams 2> /dev/null. At this point the network interface is already up, but it is configured with the "classful" netmask. The netmask configuration itself happens a few lines later: the program /sbin/get_netmask obtains the netmask by sending an ICMP type 17 (address mask request) message to the server.

So if your default gateway appears to be in another network under the classful netmask that is on the interface when hostconfig runs, you get the panic. Of course you do! You are trying to add a host as your default gateway without knowing how to reach it, because it is in another network.
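
The failure condition described above can be checked in advance. The following is only a sketch for a POSIX shell on an admin workstation (it uses `$(( ))` arithmetic, so it is not meant for the miniroot's rcS). The function names and the example addresses are my own assumptions, not anything from Solaris.

```shell
#!/bin/sh
# Sketch only: derive the default (classful) netmask for an IPv4 address
# and test whether a gateway lies inside the same classful network.
# All function names here are hypothetical helpers.

classful_mask() {
    # $1 = dotted-quad IPv4 address; prints its classful netmask
    first=${1%%.*}
    if [ "$first" -lt 128 ]; then
        echo 255.0.0.0          # class A
    elif [ "$first" -lt 192 ]; then
        echo 255.255.0.0        # class B
    elif [ "$first" -lt 224 ]; then
        echo 255.255.255.0      # class C
    else
        echo "no classful mask for $1 (class D/E)" >&2
        return 1
    fi
}

ip_to_int() {
    # $1 = dotted quad; prints it as a 32-bit integer
    old_ifs=$IFS
    IFS=.
    set -- $1
    IFS=$old_ifs
    echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
}

same_classful_net() {
    # $1 = client address, $2 = gateway address
    # Returns 0 if both share the client's classful network, 1 if not.
    mask=`classful_mask "$1"` || return 2
    m=`ip_to_int "$mask"`
    a=`ip_to_int "$1"`
    b=`ip_to_int "$2"`
    [ $(( a & m )) -eq $(( b & m )) ]
}
```

With the made-up addresses 192.168.5.20 (client) and 192.168.4.1 (gateway), same_classful_net returns non-zero: the classful mask is 255.255.255.0, so the gateway looks unreachable, although under 0xfffffc00 both addresses sit in the same network.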

Solution? The netmask should be set before the default gateway is configured. Sounds logical, doesn't it?

In theory, you can try to figure out the IP address of the machine to ask for the netmask, either via hostconfig -p bootparams -n -v or by looking at where your root partition is mounted from. I have not tried these "clean" ways. I have a "quick-and-dirty" hack.

The interface configuration part looks like this:

    old_ifs=$IFS
    IFS=":"
    set -- $net_device_list
    for i
    do
            #
            # skip the auto-revarp for the loopback device
            #
            if [ "$i" = "lo0" ]; then
                    continue
            fi
            /sbin/ifconfig $i auto-revarp -trailers >/tmp/dev.$$ 2>&1
            ipaddr=`/sbin/ifconfig $i |grep inet |awk '{print $2;}'`
            if [ "X$ipaddr" != "X0.0.0.0" ] ; then
                    # The interface configured itself correctly
                    echo "Configured interface $i"
                    /sbin/ifconfig $i up
            else
                    echo "Skipping interface $i"
            fi
    done
    IFS=$old_ifs

Let's rewrite it like this:

    old_ifs=$IFS
    IFS=":"
    set -- $net_device_list
    for i
    do
            #
            # skip the auto-revarp for the loopback device
            #
            if [ "$i" = "lo0" ]; then
                    continue
            fi
            /sbin/ifconfig $i auto-revarp -trailers >/tmp/dev.$$ 2>&1
            ipaddr=`/sbin/ifconfig $i |grep inet |awk '{print $2;}'`
            if [ "X$ipaddr" != "X0.0.0.0" ] ; then
                    # The interface configured itself correctly
                    echo "Configured interface $i"
                    #
                    # Netmask workaround: set it up right here!
                    #
                    if [ -f /tmp/._set_supernet ] ; then
                            echo "Supernet workaround is applied on the interface $i"
                            /sbin/ifconfig $i netmask 0xfffffc00 up
                    else
                            /sbin/ifconfig $i up
                    fi
            else
                    echo "Skipping interface $i"
            fi
    done
    IFS=$old_ifs

What happens here? If the semaphore file /tmp/._set_supernet exists, we set the netmask right in the script. You know the netmask of your network; the 0xfffffc00 used above is 255.255.252.0 (a /22), so substitute your own. If the semaphore file doesn't exist, we proceed normally.
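
If you know your network only by its prefix length, the hex form that ifconfig accepts can be computed. This is a small sketch for a POSIX shell on an admin workstation, not for rcS itself; prefix_to_hexmask is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch only: print the ifconfig-style hex netmask for a prefix length.
# prefix_to_hexmask is a hypothetical helper, not part of Solaris.

prefix_to_hexmask() {
    # $1 = prefix length, 0..32
    case $1 in
        ''|*[!0-9]*) echo "bad prefix: $1" >&2; return 1 ;;
    esac
    [ "$1" -le 32 ] || { echo "bad prefix: $1" >&2; return 1; }
    host_bits=$(( 32 - $1 ))
    # Clear the low host bits of an all-ones 32-bit word
    mask=$(( (0xffffffff >> host_bits) << host_bits ))
    printf '0x%08x\n' "$mask"
}
```

For example, prefix_to_hexmask 22 prints 0xfffffc00, the value hard-coded in the workaround.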

Now, where do we create the semaphore file? This is a long topic in itself, and the best source of information is the Blueprints book "JumpStart Technology: Effective Use in the Solaris Operating Environment" by John S. Howard and Alex Noordergraaf. Information about this book is available on the Sun Blueprints pages. I'll just tell you what to do.

When you boot your system, you can pass parameters to the kernel. Normally you would start your Jumpstart installation like this: boot net - install nowin. The kernel doesn't process all of the parameters; the rest are passed on to init and then to the startup scripts. In the same rcS script, look for /sbin/getbootargs. After it, you will see a "case" statement that processes the parameters.

So, you can define your own parameter and add its processing there, like this:

    supernet)
            cat < /dev/null > /tmp/._set_supernet
            shift
            ;;
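
To try such a branch outside the miniroot, a stand-alone imitation of the argument loop can help. This is only a sketch: parse_boot_args and FLAG_DIR are hypothetical names, and I am assuming the real loop walks the positional parameters with shift, as the branch above suggests. The real rcS gets its arguments from /sbin/getbootargs and has more branches.

```shell
#!/bin/sh
# Sketch only: imitate the boot-argument loop so the "supernet" branch
# can be exercised on an ordinary machine.
# FLAG_DIR is a hypothetical override; rcS itself writes to /tmp.

FLAG_DIR=${FLAG_DIR:-/tmp}

parse_boot_args() {
    while [ $# -gt 0 ]; do
        case $1 in
        supernet)
            # Create the empty flag file that the interface loop tests for
            : > "$FLAG_DIR/._set_supernet"
            shift
            ;;
        *)
            # Everything else belongs to the existing branches; skip it here
            shift
            ;;
        esac
    done
}
```

Calling parse_boot_args install nowin supernet creates the flag file; calling it without supernet leaves the directory untouched.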

Now just add one more parameter to your boot command: boot net - install nowin supernet, and that's it! Again, for a detailed discussion of parameter processing, refer to the "Jumpstart" book (ISBN 0-13-062154-4) or to the Sun Blueprints articles.

More fun...

Well... it turns out that this does not solve all the problems.

Let's review how the network boot process works. With snoop you should be able to observe the following traffic:

initial RARP broadcast and response (done by OBP)
TFTP transfer of the inetboot file
second RARP pair - this time done by the kernel
BPARAM WHOAMI - the kernel gets the workstation's parameters
BPARAM GETFILE root - the kernel requests the location of the root fs
MOUNT and NFS traffic
...
somewhere later, BPARAM WHOAMI again - the final configuration, done by hostconfig in rcS

A BPARAM dump looks like this (produced with snoop -v):

BPARAM:  ----- Boot Parameters -----
BPARAM:
BPARAM:  Proc = 1 (Who am I?)
BPARAM:  Client name = client01
BPARAM:  Domain name = my.domain.name
BPARAM:  Router addr = 10.0.0.1
BPARAM:

First, you will have a problem if the router's address doesn't belong to the classful network of the machine being installed. Note that this is the kernel phase, so you cannot work around it by patching rcS. You could remove the default gateway entry on the Jumpstart server (route delete default [ip.of.your.gateway]; check with netstat -rn), but then you will hit a second problem: accessing the root partition.

Be sure that the Jumpstart server can be reached directly; this means it has to belong to the same classful network as your workstation. Alternatively, you can put a "router" in every classful network and then "route" between the classful and classless parts.

Another solution is to separate the Boot and Install servers, as described in the Advanced Installation Guide. This may help; I didn't test it. Finally, you can use DHCP to boot, though I'm not sure how buggy that is.