[PATCH Coda 0/5]

Thu Jul 12 14:31:04 UTC 2007

On Thu, Jul 12, 2007 at 01:11:03PM +0100, Robert Watson wrote:
> On Wed, 11 Jul 2007, Jan Harkes wrote:
> >   tar -xzf rpc2-2.5.tar.gz
> >   ( cd rpc2-2.5 ; ./configure ; gmake ; sudo gmake install )
> 
> rpc2 depends on perl and should probably check for it in configure.

Right, I was looking at my cvs/git tree which doesn't need perl anymore.
In fact the makefile snippet that ran the perl script was afaik the last
gmake dependency in rpc2.

> >   tar -xzf coda-6.9.1.tar.gz
> >   ( cd coda-6.9.1 ; ./configure --prefix=/usr/coda ; gmake ;
> >   sudo gmake client-install )
> 
> Newer gcc revisions get awful cranky about all those string constants being 
> passed around as 'char *'s :-).

Sigh, every new gcc adds a new target for cleanups, and I haven't even
fixed all the aliasing warnings yet (still using gcc-4.1 here). Although
I don't remember if any of these more paranoid warnings have uncovered
bugs, at least they help with cleaning up the code.

> >   # load the coda kernel module, kldload coda or something
> >   /usr/coda/sbin/venus-setup testserver.coda.cs.cmu.edu 100000
> 
> This probably no longer applies on FreeBSD:
> 
>   You need a character device for the Coda kernel module
>   On *BSD systems you probably have to run "mknod /dev/cfs0 c 93 0"
> 
> ...and might lead to confusion.

I think most of what venus-setup tries to do no longer applies: checking
for missing ports in /etc/services, etc. I remember adding a test to
avoid that message on devfs-based systems but I may have never committed
the fix.

> >   /usr/coda/sbin/venus
> 
> I neglected to notice your instructions to explicitly load the kernel 
> module, so ran Venus without it -- the error message wasn't as suggestive 
> as might be hoped.  Perhaps "Load the kernel module, dammit" if consecutive 
> ENOENT's come from trying to open the device nodes would be appropriate?
> 
>   13:45:07 Coda Venus, version 6.9.1
>   13:45:07 Probably another Venus is running! open failed for
>   /dev/cfs0,/dev/coda/0, exiting

Right, when venus is already running we should see something like EBUSY.
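
Something along these lines could tell the two cases apart. This is only
a sketch, not what venus does today: classify_open_errors() and the enum
names are made up for illustration, and it assumes that a missing device
node fails with ENOENT while a node already held by another venus fails
with something like EBUSY.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical outcomes for a failed attempt to open the Coda device. */
enum open_diagnosis {
    DIAG_NO_MODULE,       /* every candidate node was missing */
    DIAG_ALREADY_RUNNING, /* some node exists but is busy */
    DIAG_OTHER            /* anything else, report errno as-is */
};

/*
 * Classify the errno values collected from failed open() calls, one per
 * candidate device node (e.g. /dev/cfs0 and /dev/coda/0).
 */
enum open_diagnosis classify_open_errors(const int *errs, size_t n)
{
    size_t missing = 0;

    assert(errs != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        if (errs[i] == EBUSY)
            return DIAG_ALREADY_RUNNING;
        if (errs[i] == ENOENT)
            missing++;
    }
    /* Only blame the kernel module when *all* nodes were absent. */
    if (n > 0 && missing == n)
        return DIAG_NO_MODULE;
    return DIAG_OTHER;
}

/* Map a diagnosis to a message that is more suggestive than the old one. */
const char *diagnosis_message(enum open_diagnosis d)
{
    switch (d) {
    case DIAG_NO_MODULE:
        return "No Coda device node found; is the kernel module loaded?";
    case DIAG_ALREADY_RUNNING:
        return "Probably another Venus is running!";
    default:
        return "Unable to open the Coda device";
    }
}
```

Venus would record errno after each failed open() and pass the array in,
so the "another Venus" message is only printed when a node was actually
busy.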

> >   ls /coda/testserver.coda.cs.cmu.edu/
> >
> >If everything worked ok you should see 2 directories and a file named
> >'WELCOME'.
> 
> At this point I got an ls stuck in the coda_call wait channel:
> 
>   freebsd-coda# ls /coda
>   freebsd-coda# ls /coda/testserver.coda.cs.cmu.edu/
>   load:0.02  cmd: ls 39084 [coda_call] 0.00u 0.00s 0% 1116k
> 
> I see UDP traffic to telemann.coda.cs.cmu.edu but no apparent forward 
> progress until eventually:
> 
>   ls: /coda/testserver.coda.cs.cmu.edu/: No such file or directory

Probably took about 60 or 90 seconds. I guess the reply packets never
really made it all the way across the firewalls and the remote procedure
call timed out. We can handle some classes of address translation
firewalls better than we used to, since the side-effect and
server->client connections now go over the same UDP port pair, but
there are clearly still situations that we just don't handle.

I've seen firewalls that assume that UDP traffic is single request,
single reply and no state between requests. So they basically drop the
redirection once a reply has been received and send the next request
from a new port. Hey it works fine for DNS...

That is very confusing on the server side: it identifies clients by
their (ip,port) pair, so each new request is assumed to come from a new
client and we can never get a stable connection.
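
To illustrate the failure mode, here is a toy model of a server-side
table keyed on (ip,port). It is not the rpc2 code; the struct and
function names, the table size and the port numbers are all made up for
the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_CLIENTS 8 /* arbitrary size for the sketch */

struct client_key {
    uint32_t ip;
    uint16_t port;
};

struct client_table {
    struct client_key keys[MAX_CLIENTS];
    size_t count;
};

/*
 * Return the slot for the client at (ip, port), creating a new slot when
 * the pair has not been seen before. Returns -1 if the table is full.
 */
int lookup_or_add(struct client_table *t, uint32_t ip, uint16_t port)
{
    assert(t != NULL);

    for (size_t i = 0; i < t->count; i++) {
        if (t->keys[i].ip == ip && t->keys[i].port == port)
            return (int)i; /* known client, existing connection */
    }
    if (t->count == MAX_CLIENTS)
        return -1;

    /* Unknown pair: the server has to treat it as a brand new client. */
    t->keys[t->count].ip = ip;
    t->keys[t->count].port = port;
    t->count++;
    return (int)(t->count - 1);
}
```

A client whose source port stays fixed keeps hitting the same slot. When
the firewall picks a fresh source port for every request, each packet
from the same address lands in a new slot, which is the "new client
every time" behaviour described above.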

> surprising.  I notice also that the clock on the box may be off by an hour, 
> perhaps a problem?

Shouldn't be, can't rely on time being anywhere near consistent in a
distributed system.

> When I killed venus and restarted it, then the system hung:
...
>   13:49:59 starting FSDB scan (4166, 100000) (25, 75, 4)
>   13:49:59 	2 cache files in table (0 blocks)

Hmm, at this point we haven't even tried to mount /coda yet, so I'm kind
of surprised this actually managed to wedge the system. I wonder if this
has to do with that bit of code where we used to pass a NULL vfs mount.

    -/*  cp = make_coda_node(&ctlfid, vfsp, VCHR);
    -    The above code seems to cause a loop in the cnode links.
    -    I don't totally understand when it happens, it is caught
    -    when closing down the system.
    - */
    -    cp = make_coda_node(&ctlfid, 0, VCHR);
    -
    +    cp = make_coda_node(&ctlfid, vfsp, VCHR);

Without debugging I can't tell if your kernel ended up spinning on this
'loop in the cnode links', but I'm pretty sure that passing a NULL mount
is not the correct fix for the issue. If it weren't for the assert or
panic that is triggered in insmntque it would probably just leave the
vnode hanging around without any references.

Jan