Still unresolved - Re: one virtualbox vm disrupts all vms and entire network

Steve Tuts yiz5hwi at gmail.com
Mon Jul 9 13:11:43 UTC 2012


On Tue, Jun 12, 2012 at 6:24 PM, Gary Palmer <gpalmer at freebsd.org> wrote:

> On Thu, Jun 07, 2012 at 03:56:22PM -0400, Steve Tuts wrote:
> > On Thu, Jun 7, 2012 at 3:54 AM, Steve Tuts <yiz5hwi at gmail.com> wrote:
> >
> > >
> > >
> > > On Thu, Jun 7, 2012 at 2:58 AM, Bernhard Fröhlich <decke at bluelife.at> wrote:
> > >
> > >> On Thu, 7 Jun 2012 01:07:52 CEST, Kevin Oberman <kob6558 at gmail.com> wrote:
> > >>
> > >> > On Wed, Jun 6, 2012 at 3:46 PM, Steve Tuts <yiz5hwi at gmail.com> wrote:
> > >> > > On Wed, Jun 6, 2012 at 3:50 AM, Bernhard Froehlich <decke at freebsd.org> wrote:
> > >> > >
> > >> > > > On 05.06.2012 20:16, Bernhard Froehlich wrote:
> > >> > > >
> > >> > > > > On 05.06.2012 19:05, Steve Tuts wrote:
> > >> > > > >
> > >> > > > > > On Mon, Jun 4, 2012 at 4:11 PM, Rusty Nejdl
> > >> > > > > > <rnejdl at ringofsaturn.com> wrote:
> > >> > > > > >
> > >> > > > > >  On 2012-06-02 12:16, Steve Tuts wrote:
> > >> > > > > > >
> > >> > > > > > > > Hi, we have a Dell PowerEdge server with a dozen
> > >> > > > > > > > interfaces.  It hosts a few web app and email server
> > >> > > > > > > > guests with VirtualBox-4.0.14.  The host and all guests
> > >> > > > > > > > are FreeBSD 9.0 64bit.  Each guest is bridged to a
> > >> > > > > > > > distinct interface.  The host and all guests are on the
> > >> > > > > > > > 10.0.0.0 network, NAT'ed to a Cisco router.
> > >> > > > > > > >
> > >> > > > > > > > This ran well for a couple of months, until we added a
> > >> > > > > > > > new guest recently.  Every few hours, none of the guests
> > >> > > > > > > > can be connected to.  We can only connect to the host
> > >> > > > > > > > from outside the router.  We can also go to the console
> > >> > > > > > > > of the guests (except the new guest), but from there we
> > >> > > > > > > > can't ping the gateway 10.0.0.1 any more.  The new guest
> > >> > > > > > > > just froze.
> > >> > > > > > > >
> > >> > > > > > > > Furthermore, on the host we can see a vboxheadless
> > >> > > > > > > > process for each guest, including the new guest.  But we
> > >> > > > > > > > cannot kill it, not even with "kill -9".  We looked
> > >> > > > > > > > around the web and someone suggested we should use
> > >> > > > > > > > "kill -SIGCONT" first, since the "ps" output shows the
> > >> > > > > > > > "T" flag for the vboxheadless process of that new guest,
> > >> > > > > > > > but that doesn't help.  We also tried all the VBoxManage
> > >> > > > > > > > commands to poweroff/reset etc. that new guest, but they
> > >> > > > > > > > all failed, complaining that the vm is in the Aborted
> > >> > > > > > > > state.  We also tried the VBoxManage command to
> > >> > > > > > > > disconnect the network cable of that new guest; it
> > >> > > > > > > > didn't complain, but there was no effect.
> > >> > > > > > > >
> > >> > > > > > > > A couple of times, on the host we disabled the interface
> > >> > > > > > > > bridging that new guest, and then the vboxheadless
> > >> > > > > > > > process for that new guest disappeared (we had attempted
> > >> > > > > > > > to kill it before that).  Immediately all other vms
> > >> > > > > > > > regained their connections and returned to normal.
> > >> > > > > > > >
> > >> > > > > > > > But one time even that didn't help - the vboxheadless
> > >> > > > > > > > process for that new guest stubbornly remained, and we
> > >> > > > > > > > had to reboot the host.
> > >> > > > > > > >
> > >> > > > > > > > This is already a production server, so we can't upgrade
> > >> > > > > > > > virtualbox to the latest version until we obtain a test
> > >> > > > > > > > server.
> > >> > > > > > > >
> > >> > > > > > > > Would you advise:
> > >> > > > > > > >
> > >> > > > > > > > 1. is there any other way to kill that new guest instead
> > >> > > > > > > >    of rebooting?
> > >> > > > > > > > 2. what might cause the problem?
> > >> > > > > > > > 3. what settings and tests can we use to analyze this
> > >> > > > > > > >    problem?
> > >> > > > > > > >
> > >> > > > > > > >
> > >> > > > > > > I haven't seen any comments on this and don't want you
> > >> > > > > > > to think you are being ignored, but I haven't seen this
> > >> > > > > > > problem either.  Also, the 4.0 branch was buggier for me
> > >> > > > > > > than the 4.1 releases, so upgrading is probably what you
> > >> > > > > > > are looking at.
> > >> > > > > > >
> > >> > > > > > > Rusty Nejdl
> > >> > > > > > >
> > >> > > > > > >
> > >> > > > > > sorry, just realized my reply yesterday didn't go to the
> > >> > > > > > list, so am re-sending with some updates.
> > >> > > > > >
> > >> > > > > > Yes, we upgraded all ports and fortunately everything went
> > >> > > > > > back to normal, and in particular all vms have run
> > >> > > > > > peacefully for two days now.  So upgrading to the latest
> > >> > > > > > virtualbox 4.1.16 solved that problem.
> > >> > > > > >
> > >> > > > > > But now we have a new problem with this new version of
> > >> > > > > > virtualbox: whenever we try to vnc to any vm, that vm goes
> > >> > > > > > to the Aborted state immediately.  Actually, merely
> > >> > > > > > telnetting from within the host to the vnc port of that vm
> > >> > > > > > will immediately Abort it.  This prevents us from adding
> > >> > > > > > new vms.  Also, when starting a vm with a vnc port, we get
> > >> > > > > > this message:
> > >> > > > > >
> > >> > > > > > rfbListenOnTCP6Port: error in bind IPv6 socket: Address
> > >> > > > > > already in use
> > >> > > > > >
> > >> > > > > > for which we found that someone else provided a patch at
> > >> > > > > > http://permalink.gmane.org/gmane.os.freebsd.devel.emulation/10237
> > >> > > > > >
> > >> > > > > > So it looks like multiple vms on an ipv6 system (we have
> > >> > > > > > 64bit FreeBSD 9.0) will get this problem.
> > >> > > > > >
> > >> > > > >
> > >> > > > > Glad to hear that 4.1.16 helps with the networking problem.
> > >> > > > > The VNC problem is also a known one, but the mentioned patch
> > >> > > > > does not work, at least for a few people.  It seems the bug
> > >> > > > > is somewhere in libvncserver, so downgrading net/libvncserver
> > >> > > > > to an earlier version (and rebuilding virtualbox) should help
> > >> > > > > until we come up with a proper fix.
> > >> > > > >
> > >> > > >
> > >> > > > You are right about the "Address already in use" problem and
> > >> > > > the patch for it, so I will commit the fix in a few moments.
> > >> > > >
> > >> > > > I have also tried to reproduce the VNC crash but I couldn't,
> > >> > > > probably because my system is IPv6 enabled.  flo@ has seen the
> > >> > > > same crash and has no IPv6 in his kernel, which led him to find
> > >> > > > this commit in libvncserver:
> > >> > > >
> > >> > > > commit 66282f58000c8863e104666c30cb67b1d5cbdee3
> > >> > > > Author: Kyle J. McKay <mackyle at gmail.com>
> > >> > > > Date:   Fri May 18 00:30:11 2012 -0700
> > >> > > >     libvncserver/sockets.c: do not segfault when listenSock/listen6Sock == -1
> > >> > > >
> > >> > > > http://libvncserver.git.sourceforge.net/git/gitweb.cgi?p=libvncserver/libvncserver;a=commit;h=66282f5
> > >> > > >
> > >> > > > It looks promising, so please test this patch if you can
> > >> > > > reproduce the crash.
> > >> > > >
> > >> > > >
> > >> > > > --
> > >> > > > Bernhard Froehlich
> > >> > > > http://www.bluelife.at/
> > >> > > >
> > >> > >
> > >> > > Sorry, I tried to apply this patch, but couldn't figure out how
> > >> > > to do that.  I use ports to compile everything, and can see the
> > >> > > file is at
> > >> > > /usr/ports/net/libvncserver/work/LibVNCServer-0.9.9/libvncserver/sockets.c.
> > >> > > However, if I edit this file and do "make clean", my edit is
> > >> > > wiped out before I can do "make".  How do I apply this patch in
> > >> > > the ports tree?
> > >> >
> > >> > To apply patches to ports:
> > >> > # make clean
> > >> > # make patch
> > >> > <Apply patch>
> > >> > # make
> > >> > # make deinstall
> > >> > # make reinstall
> > >> >
> > >> > Note that the final two steps assume a version of the port is
> > >> > already installed.  If not: 'make install'
> > >> > If you use portmaster, after applying the patch:
> > >> > 'portmaster -C net/libvncserver'
> > >> > --
> > >>
> > >> flo has already committed the patch to net/libvncserver, so I guess
> > >> it fixes the problem.  Please update your ports tree and verify that
> > >> it works fine.
> > >>
> > >
> > > I can confirm: after upgrading all ports (noticing that libvncserver
> > > was upgraded to 0.9.9_1) and rebooting, I can vnc to the vms now.
> > > Also, starting vms with vnc no longer produces that error; instead it
> > > prints the following info, so all problems are solved.
> > >
> > > 07/06/2012 03:49:14 Listening for VNC connections on TCP port 5903
> > > 07/06/2012 03:49:14 Listening for VNC connections on TCP6 port 5903
> > >
> > > Thanks everyone for your great help!
> > >
> >
> > Unfortunately, it seems the original problem of one vm disrupting all
> > vms and the entire network remains, albeit on a smaller scale.  After
> > running on virtualbox-ose-4.1.16_1 and libvncserver-0.9.9_1 for 12
> > hours, all vms lost their connections again.  Also, phpvirtualbox
> > stopped responding, attempts to restart vboxwebsrv hung, and trying to
> > kill (-9) the vboxwebsrv process didn't work.  The following was the
> > output of "ps aux | grep -i box" at that time:
> >
> > root 3322  78.7 16.9 4482936 4248180  ??  Is    3:42AM   126:00.53 /usr/local/bin/VBoxHeadless --startvm vm1
> > root 3377   0.2  4.3 1286200 1078728  ??  Is    3:42AM    15:39.40 /usr/local/bin/VBoxHeadless --startvm vm2
> > root 3388   0.1  4.3 1297592 1084676  ??  Is    3:42AM    15:06.97 /usr/local/bin/VBoxHeadless --startvm vm7 -n -m 5907 -o jtlgjkrfyh9tpgjklfds
> > root 2453   0.0  0.0 141684   7156  ??  Ts    3:38AM     4:14.09 /usr/local/bin/vboxwebsrv
> > root 2478   0.0  0.0  45288   2528  ??  S     3:38AM     1:29.99 /usr/local/lib/virtualbox/VBoxXPCOMIPCD
> > root 2494   0.0  0.0 121848   5380  ??  S     3:38AM     3:13.96 /usr/local/lib/virtualbox/VBoxSVC --auto-shutdown
> > root 3333   0.0  4.3 1294712 1079608  ??  Is    3:42AM    19:35.09 /usr/local/bin/VBoxHeadless --startvm vm3
> > root 3355   0.0  4.3 1290424 1079332  ??  Is    3:42AM    16:43.05 /usr/local/bin/VBoxHeadless --startvm vm5
> > root 3366   0.0  8.5 2351436 2140076  ??  Is    3:42AM    17:32.35 /usr/local/bin/VBoxHeadless --startvm vm6
> > root 3598   0.0  4.3 1294520 1078664  ??  Ds    3:50AM    15:01.04 /usr/local/bin/VBoxHeadless --startvm vm4 -n -m 5904 -o u679y0uojlkdfsgkjtfds
> >
> > You can see the vboxwebsrv process has the "T" flag there, and the
> > vboxheadless process for vm4 has the "D" flag.  I could never kill
> > either of these processes, not even with "kill -9".  So on the host I
> > disabled the interface bridged to vm4 and restarted the network, and
> > fortunately both the vm4 and vboxwebsrv processes disappeared.  At that
> > point all other vms regained network connectivity.
> >
> > One hopeful sign is that the "troublemaker" may be limited to one of
> > the vms started with vnc, although there was no vnc connection at that
> > time, and the other vm with vnc was fine.  But this is just a guess.
> >
> > Also, I found no log or error message related to virtualbox in any log
> > file.  VBoxSVC.log only had some information from startup and nothing
> > since.
>
> If this is still a problem then
>
> ps alxww | grep -i box
>
> may be more helpful as it will show the wait channel of processes stuck
> in the kernel.
>
> Gary
>

We had been avoiding this problem by running all vms without vnc.  But
yesterday we forgot about it and left one vm with vnc running alongside the
other few running vms, and we hit this problem again on virtualbox 4.1.16.
Only the old trick of turning off the host interface corresponding to the
vm with vnc and then restarting the host network got us out of the problem.
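
For the record, the "old trick" amounts to roughly the following, run as
root on the host.  This is only a sketch: em3 is a stand-in for whichever
host interface is bridged to the offending vm, so substitute your own
interface name.

# ifconfig em3 down
# service netif restart
# service routing restart

The routing restart is there because restarting netif drops the default
route on FreeBSD.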

We then upgraded virtualbox to 4.1.18: we shut down all vms, waited until
"ps aux | grep -i box" reported nothing, upgraded, and then started all vms
again, leaving no vm with vnc running.
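
In case it is useful for reproducing this, the shutdown-and-restart
sequence we use is roughly the following (a sketch only; the vm names are
examples, the parentheticals are notes rather than part of the commands,
and the while loop just waits until no vbox process is left):

# VBoxManage list runningvms
# VBoxManage controlvm vm1 acpipowerbutton    (repeat for each running vm)
# while ps aux | grep -qi '[v]box'; do sleep 5; done
# VBoxManage startvm vm1 --type headless      (repeat for each vm, no vnc)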

Still, the problem hit us again.  Here is the output of
"ps alxww | grep -i box" as you suggested:

1011    42725    1    0    20    0    1289796    1081064    IPRT S    Is    ??    30:53.24    VBoxHeadless --startvm vm5

after "kill -9 42725", the line changed to

1011    42725    1    0    20    0    1289796    1081064    keglim    Ts    ??    30:53.24    VBoxHeadless --startvm vm5

after "kill -9" for another vm, the line changed to something like

1011    42754    1    0    20    0    1289796    1081064    -    Ts    ??    30:53.24    VBoxHeadless --startvm vm7

And the controlvm commands don't work; the commands themselves get stuck.
This is how the stuck commands show up in ps:

0    89572    79180    0    21    0    44708    1644    select    I+    v6    0:00.01    VBoxManage controlvm projects_outside acpipowerbutton
0    89605    89586    0    21    0    44708    2196    select    I+    v7    0:00.01    VBoxManage controlvm projects_outside poweroff
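
For what it's worth, the column just before the state flags in "ps alxww"
is the wait channel; if we read it correctly, "keglim" means the process is
stuck in the kernel waiting on a UMA keg (kernel memory zone) limit.  If it
helps the analysis, the kernel stacks of such a stuck process can be dumped
with procstat, e.g. for the vm5 process above:

# procstat -kk 42725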

We have now rebooted the host and left no vm with vnc running.

