NFS: rpcsec_gss with Linux clients

Rick Macklem rmacklem at uoguelph.ca
Sat Sep 1 23:57:58 UTC 2012


Attila Bogar wrote:
> Hi,
> 
> In the wireshark trace I see that, during an NFS mount, Linux opens
> two TCP connections.
> Linux creates the GSS context on one TCP connection and sends a
> DESTROY for the rpcsec context,
> but immediately (without waiting for the DESTROY reply) reuses the
> context on the other TCP connection.
> 
> I don't know whether the BSD or the Linux side (or both) is at fault,
> as I haven't spent time reading the RFCs.
> 
This certainly sounds bogus. I can see an argument for 2 TCP connections
for trunking, but since a security context should only be destroyed
when the client is done with it, doing a DESTROY doesn't make sense?
(There is something in the RPC header called a "handle". It identifies
the security context, and it would be worth checking the wireshark
trace to see if it is the same as the one being used on the other
connection.)
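For reference, the RPCSEC_GSS credential that travels in the RPC call
header looks roughly like this (a sketch following RFC 2203, section 5;
the FreeBSD declaration in sys/rpc may spell the names and types
differently):

	struct rpc_gss_cred {
		uint32_t	gc_version;	/* RPCSEC_GSS version, currently 1 */
		uint32_t	gc_proc;	/* DATA, INIT, CONTINUE_INIT or DESTROY */
		uint32_t	gc_seq;		/* sequence number, for replay detection */
		uint32_t	gc_svc;		/* none, integrity or privacy */
		gss_buffer_desc	gc_handle;	/* opaque server-assigned context handle */
	};

Comparing the bytes of the handle on the two TCP connections in the
trace would show whether the client really is reusing the context it
just destroyed.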

> This is very difficult to reproduce if the server is very fast. You
> have to use an extremely fast client.
> With a Linux virtual machine I couldn't reproduce it. Even printf()s
> in the BSD kernel destroy the balance, and everything suddenly starts
> to work because of the timing. This is a quantum bug.
> 
> Look at /usr/src/sys/rpc/rpcsec_gss/svc_rpcsec_gss.c
> 
> In svc_rpc_gss()
> case RPCSEC_GSS_DESTROY:
> 
> svc_rpc_gss_validate returns FALSE during the DESTROY.
> 
> I don't quite know why, but during the DESTROY, within
> svc_rpc_gss_validate(), gss_verify_mic() returns maj_stat =
> GSS_S_DEFECTIVE_TOKEN, no matter which Heimdal version I use.
> 
That would indicate the encrypted checksum isn't correct. It
might be using an algorithm only supported by the newer RPCSEC_GSS_V3?
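For context, the failing check boils down to something like this (a
simplified sketch of svc_rpc_gss_validate(), pieced together from the
code quoted below; XDR and buffer setup omitted, and the cl_ctx field
name is an assumption):

	static bool_t
	svc_rpc_gss_validate(struct svc_rpc_gss_client *client,
	    struct rpc_msg *msg, gss_qop_t *qop)
	{
		gss_buffer_desc	rpcbuf, checksum;
		OM_uint32	maj_stat, min_stat;
		gss_qop_t	qop_state;

		/* rpcbuf = a serialized copy of the RPC call header;
		 * checksum = the MIC from the verifier the client sent. */

		maj_stat = gss_verify_mic(&min_stat, client->cl_ctx,
		    &rpcbuf, &checksum, &qop_state);
		if (maj_stat != GSS_S_COMPLETE) {
			/* GSS_S_DEFECTIVE_TOKEN lands here: the MIC does
			 * not verify against this context, so the client
			 * is marked stale and later requests that use the
			 * context are rejected. */
			client->cl_state = CLIENT_STALE;
			return (FALSE);
		}
		*qop = qop_state;
		return (TRUE);
	}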

> As a consequence, client->cl_state is marked CLIENT_STALE;
> 
For a DESTROY that fails, I'm not sure marking the
context stale makes sense. (I can see an argument for and against
doing this.)

I've attached a small patch which disables setting client->cl_state
to CLIENT_STALE for this case; you could try it to see if it helps.

> I think client locking should have been used at this point.
> 
> In the meantime, the second TCP connection's NFS PUTROOTFH request is
> being processed in the kernel.
> 
> And this is the point where the problem may or may not happen.
> In svc_rpc_gss(), svc_rpc_gss_timeout_clients() is called at the
> beginning.
> If it is called before svc_rpc_gss_validate() marks cl_state
> CLIENT_STALE, the Linux client survives.
> 
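To make the race concrete (a reconstruction of the interleaving
described above; not actual code):

	conn A (DESTROY)                   conn B (PUTROOTFH, same context)
	----------------                   --------------------------------
	svc_rpc_gss() dispatches DESTROY   svc_rpc_gss() starts and calls
	svc_rpc_gss_validate() runs          svc_rpc_gss_timeout_clients()
	gss_verify_mic() fails with
	  GSS_S_DEFECTIVE_TOKEN
	client->cl_state = CLIENT_STALE

If the timeout scan on connection B runs before the CLIENT_STALE store
on connection A, the client record survives and the mount proceeds; if
it runs after, the record is reaped and connection B's request fails
with an authentication error. Holding a per-client lock across the
check and the state change would at least make the transition atomic,
e.g. (cl_lock is a hypothetical lock name, not a field in the current
structure):

	mtx_lock(&client->cl_lock);
	if (maj_stat != GSS_S_COMPLETE)
		client->cl_state = CLIENT_STALE;
	mtx_unlock(&client->cl_lock);

though locking alone would not stop a DESTROYed context from being
invalidated while another connection still uses it.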
> Here is my patch for review. This is my first ever kernel patch.
> 
> I'm going to open a PR...
> 
I'd suggest contacting the Linux folks first to see whether they are
willing to look at the wireshark trace or know of an issue/fix,
because this really sounds like a Linux client issue.

> Constructive comments are welcome.
> 
> Thanks,
> Attila
> 
> --- /usr/src/sys/rpc/rpcsec_gss/svc_rpcsec_gss.c.orig	2012-08-30 23:34:00.000000000 +0100
> +++ /usr/src/sys/rpc/rpcsec_gss/svc_rpcsec_gss.c	2012-08-31 15:59:40.000000000 +0100
> @@ -565,7 +565,8 @@
>  	 */
>  	client->cl_state = CLIENT_NEW;
>  	client->cl_locked = FALSE;
> -	client->cl_expiration = time_uptime + 5*60;
> +	/* we are now more cautious */
> +	client->cl_expiration = time_uptime + 4*60;
> 
Waiting 4 minutes instead of 5 shouldn't have any real effect,
although it might avoid the problem for your case w.r.t. timing.

>  	return (client);
>  }
> @@ -930,7 +931,11 @@
>  	if (cred_lifetime == GSS_C_INDEFINITE)
>  		cred_lifetime = time_uptime + 24*60*60;
>  
> -	client->cl_expiration = time_uptime + cred_lifetime;
> +	/*
> +	 * we are now more cautious;
> +	 * 12 sec is just an ad-hoc hack value
> +	 */
> +	client->cl_expiration = time_uptime + cred_lifetime - 12;
> 
This time is usually the TGT lifetime (12->24hrs), so subtracting
12 sec from it doesn't really make any sense. (I will note that
the calculation of cred_lifetime for the GSS_C_INDEFINITE case
looks incorrect, since time_uptime gets added twice, but I doubt
that's relevant to your problem, since it is set to more than 24hrs.)
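Concretely, the quoted code above does, in effect:

	if (cred_lifetime == GSS_C_INDEFINITE)
		cred_lifetime = time_uptime + 24*60*60;
	...
	client->cl_expiration = time_uptime + cred_lifetime;

so in the GSS_C_INDEFINITE case time_uptime is folded in twice. The
likely intent (a sketch, not a tested fix) would be to keep
cred_lifetime relative and add time_uptime only once:

	if (cred_lifetime == GSS_C_INDEFINITE)
		cred_lifetime = 24*60*60;	/* 24 hours, relative */

	client->cl_expiration = time_uptime + cred_lifetime;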

>  	/*
>  	 * Fill in cred details in the rawcred structure.
> @@ -990,7 +995,7 @@
>  	gss_buffer_desc	rpcbuf, checksum;
>  	OM_uint32	maj_stat, min_stat;
>  	gss_qop_t	qop_state;
> -	int32_t		rpchdr[128 / sizeof(int32_t)];
> +	int32_t		rpchdr[2048 / sizeof(int32_t)];
>  	int32_t		*buf;
>  
>  	rpc_gss_log_debug("in svc_rpc_gss_validate()");
> @@ -1024,7 +1029,12 @@
>  	if (maj_stat != GSS_S_COMPLETE) {
>  		rpc_gss_log_status("gss_verify_mic", client->cl_mech,
>  		    maj_stat, min_stat);
> -		client->cl_state = CLIENT_STALE;
> +		/*
> +		 * Linux nfs-utils >= 1.2.3 reuses the GSS context on
> +		 * another TCP NFS connection after it DESTROYed it.
> +		 * The garbage collector removes the client at cl_expiration.
> +		 */
> +		/* client->cl_state = CLIENT_STALE; */
>  		return (FALSE);
>  	}
> 
If this helps, please try the attached patch which does the
same thing, but only for the DESTROY case.
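The attachment itself isn't inlined in the archive, but a DESTROY-only
version of the change would presumably look something like this (the
gcproc argument is illustrative; the real patch may plumb the
RPCSEC_GSS procedure number into svc_rpc_gss_validate() differently):

	/* in svc_rpc_gss_validate(), with the procedure from the
	 * decoded credential passed in by svc_rpc_gss(): */
	if (maj_stat != GSS_S_COMPLETE) {
		rpc_gss_log_status("gss_verify_mic", client->cl_mech,
		    maj_stat, min_stat);
		/*
		 * Only mark the context stale for non-DESTROY requests,
		 * so that a bad MIC on a DESTROY cannot kill a context
		 * the client is still using on another connection.
		 */
		if (gcproc != RPCSEC_GSS_DESTROY)
			client->cl_state = CLIENT_STALE;
		return (FALSE);
	}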

rick

-------------- next part --------------
A non-text attachment was scrubbed...
Name: rpcsec-destroy.patch
Type: text/x-patch
Size: 1002 bytes
Desc: not available
Url : http://lists.freebsd.org/pipermail/freebsd-fs/attachments/20120901/ca35dc1a/rpcsec-destroy.bin
