scp: Write Failed: Cannot allocate memory

Wed Jul 6 04:31:32 UTC 2011

Quoting "Jeremy Chadwick" <freebsd at jdc.parodius.com>:

> On Wed, Jul 06, 2011 at 01:54:12PM +1000, Peter Ross wrote:
>> Quoting "Jeremy Chadwick" <freebsd at jdc.parodius.com>:
>>
>> >On Wed, Jul 06, 2011 at 01:07:53PM +1000, Peter Ross wrote:
>> >>Quoting "Jeremy Chadwick" <freebsd at jdc.parodius.com>:
>> >>
>> >>>On Wed, Jul 06, 2011 at 12:23:39PM +1000, Peter Ross wrote:
>> >>>>Quoting "Jeremy Chadwick" <freebsd at jdc.parodius.com>:
>> >>>>
>> >>>>>On Tue, Jul 05, 2011 at 01:03:20PM -0400, Scott Sipe wrote:
>> >>>>>>I'm running virtualbox 3.2.12_1 if that has anything to do with it.
>> >>>>>>
>> >>>>>>sysctl vfs.zfs.arc_max: 6200000000
>> >>>>>>
>> >>>>>>While I'm trying to scp, kstat.zfs.misc.arcstats.size is
>> >>>>>>hovering right around that value, sometimes above, sometimes
>> >>>>>>below (that's as it should be, right?). I don't think that it
>> >>>>>>dies when crossing over arc_max. I can run the same scp 10 times
>> >>>>>>and it might fail 1-3 times, with no correlation to the
>> >>>>>>arcstats.size being above/below arc_max that I can see.
>> >>>>>>
>> >>>>>>Scott
>> >>>>>>
>> >>>>>>On Jul 5, 2011, at 3:00 AM, Peter Ross wrote:
>> >>>>>>
>> >>>>>>>Hi all,
>> >>>>>>>
>> >>>>>>>just as an addition: an upgrade to last Friday's
>> >>>>>>>FreeBSD-Stable and to VirtualBox 4.0.8 does not fix the
>> >>>>>>>problem.
>> >>>>>>>
>> >>>>>>>I will experiment a bit more tomorrow after hours and grab
>> >>>>some statistics.
>> >>>>>>>
>> >>>>>>>Regards
>> >>>>>>>Peter
>> >>>>>>>
>> >>>>>>>Quoting "Peter Ross" <Peter.Ross at bogen.in-berlin.de>:
>> >>>>>>>
>> >>>>>>>>Hi all,
>> >>>>>>>>
>> >>>>>>>>I noticed a similar problem last week. It is also very
>> >>>>>>>>similar to one reported last year:
>> >>>>>>>>
>> >>>>>>>>http://lists.freebsd.org/pipermail/freebsd-stable/2010-September/058708.html
>> >>>>>>>>
>> >>>>>>>>My server is a Dell T410 server with the same bge card (the
>> >>>>>>>>same pciconf -lvc output as described by Mahlon:
>> >>>>>>>>
>> >>>>>>>>http://lists.freebsd.org/pipermail/freebsd-stable/2010-September/058711.html
>> >>>>>>>>
>> >>>>>>>>Yours, Scott, is a em(4)..
>> >>>>>>>>
>> >>>>>>>>Another similarity: In all cases we are using VirtualBox. I
>> >>>>>>>>just want to mention it, in case it matters. I am still
>> >>>>>>>>running VirtualBox 3.2.
>> >>>>>>>>
>> >>>>>>>>Most of the time kstat.zfs.misc.arcstats.size was reaching
>> >>>>>>>>vfs.zfs.arc_max then, but I could catch one or two cases
>> >>>>>>>>then the value was still below.
>> >>>>>>>>
>> >>>>>>>>I added vfs.zfs.prefetch_disable=1 to sysctl.conf but it
>> >>does not help.
>> >>>>>>>>
>> >>>>>>>>BTW: It looks as ARC only gives back the memory when I
>> >>>>>>>>destroy the ZFS (a cloned snapshot containing virtual
>> >>>>>>>>machines). Even if nothing happens for hours the buffer
>> >>>>>>>>isn't released..
>> >>>>>>>>
>> >>>>>>>>My machine was still running 8.2-PRERELEASE so I am upgrading.
>> >>>>>>>>
>> >>>>>>>>I am happy to give information gathered on old/new kernel  
>> if it helps.
>> >>>>>>>>
>> >>>>>>>>Regards
>> >>>>>>>>Peter
>> >>>>>>>>
>> >>>>>>>>Quoting "Scott Sipe" <cscotts at gmail.com>:
>> >>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>>On Jul 2, 2011, at 12:54 AM, jhell wrote:
>> >>>>>>>>>
>> >>>>>>>>>>On Fri, Jul 01, 2011 at 03:22:32PM -0700, Jeremy Chadwick wrote:
>> >>>>>>>>>>>On Fri, Jul 01, 2011 at 03:13:17PM -0400, Scott Sipe wrote:
>> >>>>>>>>>>>>I'm running 8.2-RELEASE and am having new problems
>> >>>>>>>>>>>>with scp. When scping
>> >>>>>>>>>>>>files to a ZFS directory on the FreeBSD server --
>> >>>>>>>>>>>>most notably large files
>> >>>>>>>>>>>>-- the transfer frequently dies after just a few
>> >>>>>>>>>>>>seconds. In my last test, I
>> >>>>>>>>>>>>tried to scp an 800mb file to the FreeBSD system and
>> >>>>>>>>>>>>the transfer died after
>> >>>>>>>>>>>>200mb. It completely copied the next 4 times I
>> >>>>>>>>>>>>tried, and then died again on
>> >>>>>>>>>>>>the next attempt.
>> >>>>>>>>>>>>
>> >>>>>>>>>>>>On the client side:
>> >>>>>>>>>>>>
>> >>>>>>>>>>>>"Connection to home closed by remote host.
>> >>>>>>>>>>>>lost connection"
>> >>>>>>>>>>>>
>> >>>>>>>>>>>>In /var/log/auth.log:
>> >>>>>>>>>>>>
>> >>>>>>>>>>>>Jul  1 14:54:42 freebsd sshd[18955]: fatal: Write
>> >>>>>>>>>>>>failed: Cannot allocate
>> >>>>>>>>>>>>memory
>> >>>>>>>>>>>>
>> >>>>>>>>>>>>I've never seen this before and have used scp before
>> >>>>>>>>>>>>to transfer large files
>> >>>>>>>>>>>>without problems. This computer has been used in
>> >>>>>>>>>>>>production for months and
>> >>>>>>>>>>>>has a current uptime of 36 days. I have not been
>> >>>>>>>>>>>>able to notice any problems
>> >>>>>>>>>>>>copying files to the server via samba or netatalk, or
>> >>>>any problems in
>> >>>>>>>>>>>>apache.
>> >>>>>>>>>>>>
>> >>>>>>>>>>>>Uname:
>> >>>>>>>>>>>>
>> >>>>>>>>>>>>FreeBSD xeon 8.2-RELEASE FreeBSD 8.2-RELEASE #0: Sat
>> >>>>>>>>>>>>Feb 19 01:02:54 EST
>> >>>>>>>>>>>>2011     root at xeon:/usr/obj/usr/src/sys/GENERIC  amd64
>> >>>>>>>>>>>>
>> >>>>>>>>>>>>I've attached my dmesg and output of vmstat -z.
>> >>>>>>>>>>>>
>> >>>>>>>>>>>>I have not restarted the sshd daemon or rebooted the computer.
>> >>>>>>>>>>>>
>> >>>>>>>>>>>>Am glad to provide any other information or test anything else.
>> >>>>>>>>>>>>
>> >>>>>>>>>>>>{snip vmstat -z and dmesg}
>> >>>>>>>>>>>
>> >>>>>>>>>>>You didn't provide details about your networking setup (rc.conf,
>> >>>>>>>>>>>ifconfig -a, etc.).  netstat -m would be useful too.
>> >>>>>>>>>>>
>> >>>>>>>>>>>Next, please see this thread circa September 2010,  
>> titled "Network
>> >>>>>>>>>>>memory allocation failures":
>> >>>>>>>>>>>
>> >>>>>>>>>>>http://lists.freebsd.org/pipermail/freebsd-stable/2010-September/thread.html#58708
>> >>>>>>>>>>>
>> >>>>>>>>>>>The user in that thread is using rsync, which relies on
>> >>>>scp by default.
>> >>>>>>>>>>>I believe this problem is similar, if not identical, to yours.
>> >>>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>>Please also provide your output of ( /usr/bin/limits -a )
>> >>>>for the server
>> >>>>>>>>>>end and the client.
>> >>>>>>>>>>
>> >>>>>>>>>>I am not quite sure I agree with the need for ifconfig -a but some
>> >>>>>>>>>>information about the networking driver your using for  
>> the interface
>> >>>>>>>>>>would be helpful, uptime of the boxes. And configuration
>> >>of the pool.
>> >>>>>>>>>>e.g. ( zpool status -a ;zfs get all <poolname> ) You  
>> should probably
>> >>>>>>>>>>prop this information up somewhere so you can reference by
>> >>>>URL whenever
>> >>>>>>>>>>needed.
>> >>>>>>>>>>
>> >>>>>>>>>>rsync(1) does not rely on scp(1) whatsoever but rsync(1)
>> >>>>can be made to
>> >>>>>>>>>>use ssh(1) instead of rsh(1) and I believe that is what Jeremy is
>> >>>>>>>>>>stating here but correct me if I am wrong. It does use ssh(1) by
>> >>>>>>>>>>default.
>> >>>>>>>>>>
>> >>>>>>>>>>Its a possiblity as well that if using tmpfs(5) or  
>> mdmfs(8) for /tmp
>> >>>>>>>>>>type filesystems that rsync(1) may be just filling up your
>> >>>>temp ram area
>> >>>>>>>>>>and causing the connection abort which would be
>> >>>>>>>>>>expected. ( df -h ) would
>> >>>>>>>>>>help here.
>> >>>>>>>>>
>> >>>>>>>>>Hello,
>> >>>>>>>>>
>> >>>>>>>>>I'm not using tmpfs/mdmfs at all. The clients yesterday
>> >>>>>>>>>were 3 different OSX computers (over gigabit). The FreeBSD
>> >>>>>>>>>server has 12gb of ram and no bce adapter. For what it's
>> >>>>>>>>>worth, the server is backed up remotely every night with
>> >>>>>>>>>rsync (remote FreeBSD uses rsync to pull) to an offsite
>> >>>>>>>>>(slow cable connection) FreeBSD computer, and I have not
>> >>>>>>>>>seen any errors in the nightly rsync.
>> >>>>>>>>>
>> >>>>>>>>>Sorry for the omission of networking info, here's the
>> >>>>>>>>>output of the requested commands and some that popped up
>> >>>>>>>>>in the other thread:
>> >>>>>>>>>
>> >>>>>>>>>http://www.cap-press.com/misc/
>> >>>>>>>>>
>> >>>>>>>>>In rc.conf:  ifconfig_em1="inet 10.1.1.1 netmask 255.255.0.0"
>> >>>>>>>>>
>> >>>>>>>>>Scott
>> >>>>>
>> >>>>>Just to make it crystal clear to everyone:
>> >>>>>
>> >>>>>There is no correlation between this problem and use of ZFS.   
>> People are
>> >>>>>attempting to correlate "cannot allocate memory" messages with  
>> "anything
>> >>>>>on the system that uses memory".  The VM is much more complex  
>> than that.
>> >>>>>
>> >>>>>Given the nature of this problem, it's much more likely the issue is
>> >>>>>"somewhere" within a networking layer within FreeBSD, whether it be
>> >>>>>driver-level or some sort of intermediary layer.
>> >>>>>
>> >>>>>Two people who have this issue in this thread are both using  
>> VirtualBox.
>> >>>>>Can one, or both, of you remove VirtualBox from the configuration
>> >>>>>entirely (kernel, etc. -- not sure what is required) and then  
>> see if the
>> >>>>>issue goes away?
>> >>>>
>> >>>>On the machine in question I only can do it after hours so I will do
>> >>>>it tonight.
>> >>>>
>> >>>>I was _successfully_ sending the file over the loopback interface using
>> >>>>
>> >>>>cat /zpool/temp/zimbra_oldroot.vdi | ssh localhost "cat > /dev/null"
>> >>>>
>> >>>>I did it, btw, with the IPv6 localhost address first (accidently),
>> >>>>and then using IPv4. Both worked.
>> >>>>
>> >>>>It always fails if I am sending it through the bce(4) interface,
>> >>>>even if my target is the VirtualBox bridged to the bce card (so it
>> >>>>does not "leave" the computer physically).
>> >>>>
>> >>>>Below the uname -a, ifconfig -a, netstat -rn, pciconf -lv and
>> >>>>kldstat output.
>> >>>>
>> >>>>I have another box where I do not see that problem. It copies files
>> >>>>happily over the net using ssh.
>> >>>>
>> >>>>It is an an older HP ML 150 with 3GB RAM only but with a bge(4)
>> >>>>driver instead. It runs the same last week's RELENG_8. I installed
>> >>>>VirtualBox and enabled vboxnet (so it loads the kernel modules). But
>> >>>>I do not run VirtualBox on it (because it hasn't enough RAM).
>> >>>>
>> >>>>Regards
>> >>>>Peter
>> >>>>
>> >>>>DellT410one# uname -a
>> >>>>FreeBSD DellT410one.vv.fda 8.2-STABLE FreeBSD 8.2-STABLE #1: Thu Jun
>> >>>>30 17:07:18 EST 2011
>> >>>>root at DellT410one.vv.fda:/usr/obj/usr/src/sys/GENERIC  amd64
>> >>>>DellT410one# ifconfig -a
>> >>>>bce0: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST>
>> >>>>metric 0 mtu 1500
>> >>>>	options=c01bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,VLAN_HWTSO,LINKSTATE>
>> >>>>	ether 84:2b:2b:68:64:e4
>> >>>>	inet 192.168.50.220 netmask 0xffffff00 broadcast 192.168.50.255
>> >>>>	inet 192.168.50.221 netmask 0xffffff00 broadcast 192.168.50.255
>> >>>>	inet 192.168.50.223 netmask 0xffffff00 broadcast 192.168.50.255
>> >>>>	inet 192.168.50.224 netmask 0xffffff00 broadcast 192.168.50.255
>> >>>>	inet 192.168.50.225 netmask 0xffffff00 broadcast 192.168.50.255
>> >>>>	inet 192.168.50.226 netmask 0xffffff00 broadcast 192.168.50.255
>> >>>>	inet 192.168.50.227 netmask 0xffffff00 broadcast 192.168.50.255
>> >>>>	inet 192.168.50.219 netmask 0xffffff00 broadcast 192.168.50.255
>> >>>>	media: Ethernet autoselect (1000baseT <full-duplex>)
>> >>>>	status: active
>> >>>>bce1: flags=8802<BROADCAST,SIMPLEX,MULTICAST> metric 0 mtu 1500
>> >>>>	options=c01bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,VLAN_HWTSO,LINKSTATE>
>> >>>>	ether 84:2b:2b:68:64:e5
>> >>>>	media: Ethernet autoselect
>> >>>>lo0: flags=8049<UP,LOOPBACK,RUNNING,MULTICAST> metric 0 mtu 16384
>> >>>>	options=3<RXCSUM,TXCSUM>
>> >>>>	inet6 fe80::1%lo0 prefixlen 64 scopeid 0xb
>> >>>>	inet6 ::1 prefixlen 128
>> >>>>	inet 127.0.0.1 netmask 0xff000000
>> >>>>	nd6 options=3<PERFORMNUD,ACCEPT_RTADV>
>> >>>>vboxnet0: flags=8802<BROADCAST,SIMPLEX,MULTICAST> metric 0 mtu 1500
>> >>>>	ether 0a:00:27:00:00:00
>> >>>>DellT410one# netstat -rn
>> >>>>Routing tables
>> >>>>
>> >>>>Internet:
>> >>>>Destination        Gateway            Flags    Refs      Use   
>> Netif Expire
>> >>>>default            192.168.50.201     UGS         0    52195   bce0
>> >>>>127.0.0.1          link#11            UH          0        6    lo0
>> >>>>192.168.50.0/24    link#1             U           0  1118212   bce0
>> >>>>192.168.50.219     link#1             UHS         0     9670    lo0
>> >>>>192.168.50.220     link#1             UHS         0     8347    lo0
>> >>>>192.168.50.221     link#1             UHS         0   103024    lo0
>> >>>>192.168.50.223     link#1             UHS         0    43614    lo0
>> >>>>192.168.50.224     link#1             UHS         0     8358    lo0
>> >>>>192.168.50.225     link#1             UHS         0     8438    lo0
>> >>>>192.168.50.226     link#1             UHS         0     8338    lo0
>> >>>>192.168.50.227     link#1             UHS         0     8333    lo0
>> >>>>192.168.165.0/24   192.168.50.200     UGS         0     3311   bce0
>> >>>>192.168.166.0/24   192.168.50.200     UGS         0      699   bce0
>> >>>>192.168.167.0/24   192.168.50.200     UGS         0     3012   bce0
>> >>>>192.168.168.0/24   192.168.50.200     UGS         0      552   bce0
>> >>>>
>> >>>>Internet6:
>> >>>>Destination                       Gateway
>> >>>>Flags      Netif Expire
>> >>>>::1                               ::1                           UH
>> >>>>lo0
>> >>>>fe80::%lo0/64                     link#11                       U
>> >>>>lo0
>> >>>>fe80::1%lo0                       link#11                       UHS
>> >>>>lo0
>> >>>>ff01::%lo0/32                     fe80::1%lo0                   U
>> >>>>lo0
>> >>>>ff02::%lo0/32                     fe80::1%lo0                   U
>> >>>>lo0
>> >>>>DellT410one# kldstat
>> >>>>Id Refs Address            Size     Name
>> >>>> 1   19 0xffffffff80100000 dbf5d0   kernel
>> >>>> 2    3 0xffffffff80ec0000 4c358    vboxdrv.ko
>> >>>> 3    1 0xffffffff81012000 131998   zfs.ko
>> >>>> 4    1 0xffffffff81144000 1ff1     opensolaris.ko
>> >>>> 5    2 0xffffffff81146000 2940     vboxnetflt.ko
>> >>>> 6    2 0xffffffff81149000 8e38     netgraph.ko
>> >>>> 7    1 0xffffffff81152000 153c     ng_ether.ko
>> >>>> 8    1 0xffffffff81154000 e70      vboxnetadp.ko
>> >>>>DellT410one# pciconf -lv
>> >>>>..
>> >>>>bce0 at pci0:1:0:0:        class=0x020000 card=0x028d1028
>> >>>>chip=0x163b14e4 rev=0x20 hdr=0x00
>> >>>>    vendor     = 'Broadcom Corporation'
>> >>>>    class      = network
>> >>>>    subclass   = ethernet
>> >>>>bce1 at pci0:1:0:1:        class=0x020000 card=0x028d1028
>> >>>>chip=0x163b14e4 rev=0x20 hdr=0x00
>> >>>>    vendor     = 'Broadcom Corporation'
>> >>>>    class      = network
>> >>>>    subclass   = ethernet
>> >>>
>> >>>Could you please provide "pciconf -lvcb" output instead, specific to the
>> >>>bce chips?  Thanks.
>> >>
>> >>Her it is:
>> >>
>> >>bce0 at pci0:1:0:0:        class=0x020000 card=0x028d1028
>> >>chip=0x163b14e4 rev=0x20 hdr=0x00
>> >>    vendor     = 'Broadcom Corporation'
>> >>    class      = network
>> >>    subclass   = ethernet
>> >>    bar   [10] = type Memory, range 64, base 0xda000000, size
>> >>33554432, enabled
>> >>    cap 01[48] = powerspec 3  supports D0 D3  current D0
>> >>    cap 03[50] = VPD
>> >>    cap 05[58] = MSI supports 16 messages, 64 bit enabled with 1 message
>> >>    cap 11[a0] = MSI-X supports 9 messages in map 0x10
>> >>    cap 10[ac] = PCI-Express 2 endpoint max data 256(512) link x4(x4)
>> >>ecap 0003[100] = Serial 1 842b2bfffe6864e4
>> >>ecap 0001[110] = AER 1 0 fatal 0 non-fatal 1 corrected
>> >>ecap 0004[150] = unknown 1
>> >>ecap 0002[160] = VC 1 max VC0
>> >
>> >Thanks Peter.
>> >
>> >Adding Yong-Hyeon and David to the discussion, since they've both worked
>> >on the bce(4) driver in recent months (most of the changes made recently
>> >are only in HEAD), and also adding Jack Vogel of Intel who maintains
>> >em(4).  Brief history for the devs:
>> >
>> >The issue is described "Network memory allocation failures" and was
>> >reported last year, but two users recently (Scott and Peter) have
>> >reported the issue again:
>> >
>> >http://lists.freebsd.org/pipermail/freebsd-stable/2010-September/thread.html#58708
>> >
>> >And was mentioned again by Scott here, which also contains some
>> >technical details:
>> >
>> >http://lists.freebsd.org/pipermail/freebsd-stable/2011-July/063172.html
>> >
>> >What's interesting is that Scott's issue is identical in form but he's
>> >using em(4), which isn't known to behave like this.  Both individuals
>> >are using VirtualBox, though we're not sure at this point if that is the
>> >piece which is causing the anomaly.
>> >
>> >Relevant details of Scott's system (em-based):
>> >
>> >http://www.cap-press.com/misc/
>> >
>> >Relevant details of Peter's system (bce-based):
>> >
>> >http://lists.freebsd.org/pipermail/freebsd-stable/2011-July/063221.html
>> >http://lists.freebsd.org/pipermail/freebsd-stable/2011-July/063223.html
>> >
>> >I think the biggest complexity right now is figuring out how/why scp
>> >fails intermittently in this nature.  The errno probably "trickles down"
>> >to userland from the kernel, but the condition regarding why it happens
>> >is unknown.
>>
>> BTW: I also saw 2 of the errors coming from a BIND9 running in a
>> jail on that box.
>>
>> DellT410one# fgrep -i allocate /jails/bind/20110315/var/log/messages
>> Apr 13 05:17:41 bind named[23534]: internal_send:
>> 192.168.50.145#65176: Cannot allocate memory
>> Jun 21 23:30:44 bind named[39864]: internal_send:
>> 192.168.50.251#36155: Cannot allocate memory
>> Jun 24 15:28:00 bind named[39864]: internal_send:
>> 192.168.50.251#28651: Cannot allocate memory
>> Jun 28 12:57:52 bind named[2462]: internal_send:
>> 192.168.165.154#1201: Cannot allocate memory
>>
>> My initial guess: it happens sooner or later somehow - whether it is
>> a lot of traffic in one go (ssh/scp copies of virtual disks) or a
>> lot of traffic over a longer period (a nameserver gets asked again
>> and again).
>
> Scott, are you also using jails?  If both of you are: is there any
> possibility you can remove use of those?  I'm not sure how VirtualBox
> fits into the picture (jails + VirtualBox that is), but I can imagine
> jails having different environmental constraints that might cause this.
>
> Basically the troubleshooting process here is to remove pieces of the
> puzzle until you figure out which piece is causing the issue.  I don't
> want to get the NIC driver devs all spun up for something that, for
> example, might be an issue with the jail implementation.

I understand this. As said, I do some afterhours debugging tonight.

The scp/ssh problems are happening _outside_ the jails. The bind runs  
_inside_ the jail.

I wanted to use the _host_ system to send VirtualBox virtual disks and  
  filesystems used by jails to archive them and/or having them  
available on other FreeBSD systems (as a cold standby solution).

Regards
Peter