Improving gcore

Wed Mar 21 23:35:14 UTC 2012

Sometimes I have trouble capturing the "correct" state of a multithreaded process using gcore. That is, it looks like the target process may have done some work between the time the command was issued and the time the core file was generated.

Looking at the code, gcore calls ptrace(PT_ATTACH, ...), which internally posts SIGSTOP, and then calls waitpid() to wait until the process stops. So it's quite possible that threads that are not sleeping interruptibly will continue to run until the process notices the signal. Signals are only checked when a thread that is tagged to handle the signal crosses the user boundary (return from syscall, trap). When that thread finally handles SIGSTOP, it needs to stop all the other threads, which is done by setting a flag bit in each thread. This bit is checked as each thread crosses the user boundary. So there will always be some state change in the target process between the time SIGSTOP is posted and the time all threads are actually stopped.
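
For reference, the attach-and-wait sequence described above looks roughly like the sketch below. This is not the gcore source, just a minimal illustration; attach_and_wait() and detach_and_resume() are hypothetical helper names, and error handling is reduced to returning -1.

```c
#include <sys/types.h>
#include <sys/ptrace.h>
#include <sys/wait.h>
#include <signal.h>
#include <unistd.h>

/*
 * Attach to pid and block until the kernel reports it as stopped.
 * Returns the stop signal (normally SIGSTOP) or -1 on failure.
 * Threads of the target may keep running between the attach request
 * and the moment waitpid() returns.
 */
int
attach_and_wait(pid_t pid)
{
	int status;

	if (ptrace(PT_ATTACH, pid, NULL, 0) == -1)
		return (-1);
	if (waitpid(pid, &status, 0) == -1)
		return (-1);
	return (WIFSTOPPED(status) ? WSTOPSIG(status) : -1);
}

/*
 * Detach and let the target continue from where it stopped.
 * An address of 1 means "resume at the current location"; a data
 * value of 0 means no signal is delivered on resume.
 */
int
detach_and_resume(pid_t pid)
{
	return (ptrace(PT_DETACH, pid, (char *)1, 0) == -1 ? -1 : 0);
}
```

The core would be written between the two calls, so the window I am worried about is the one inside attach_and_wait().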

I was wondering if I could improve this a bit by calling PT_SUSPEND on each thread instead of posting SIGSTOP and waiting for all threads to stop, and then unsuspending all threads once the core is generated. As with SIGSTOP, individual threads will only notice the suspension as they cross the user boundary. But there is no overhead of tagging a thread to handle the signal and having that thread do the suspension. The idea is to generate a core file that reflects the running state of the process as closely as possible.
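
A rough sketch of the per-thread part of this idea, using the FreeBSD PT_GETNUMLWPS, PT_GETLWPLIST, PT_SUSPEND and PT_RESUME requests. The helper name set_all_threads_suspended() is hypothetical. One assumption to call out: as far as I can tell these requests only work on a process that is already being traced, so the sketch assumes the caller has attached first. On non-FreeBSD systems it compiles to a stub that fails with ENOSYS.

```c
#include <sys/types.h>
#include <sys/ptrace.h>
#include <errno.h>
#include <stdlib.h>

#if defined(__FreeBSD__)
/*
 * Suspend (suspend != 0) or resume (suspend == 0) every thread of an
 * already-traced process. Returns the number of threads successfully
 * changed, or -1 if the thread list could not be obtained.
 */
int
set_all_threads_suspended(pid_t pid, int suspend)
{
	lwpid_t *tids;
	int i, n, done;

	n = ptrace(PT_GETNUMLWPS, pid, NULL, 0);
	if (n <= 0)
		return (-1);
	tids = calloc((size_t)n, sizeof(*tids));
	if (tids == NULL)
		return (-1);
	n = ptrace(PT_GETLWPLIST, pid, (caddr_t)tids, n);
	done = -1;
	if (n > 0) {
		done = 0;
		for (i = 0; i < n; i++) {
			/* For these requests the pid argument is a thread (LWP) id. */
			if (ptrace(suspend ? PT_SUSPEND : PT_RESUME,
			    tids[i], NULL, 0) == 0)
				done++;
		}
	}
	free(tids);
	return (done);
}
#else
/* Stub for systems without the FreeBSD per-thread requests. */
int
set_all_threads_suspended(pid_t pid, int suspend)
{
	(void)pid;
	(void)suspend;
	errno = ENOSYS;
	return (-1);
}
#endif
```

The intended use would be set_all_threads_suspended(pid, 1) before writing the core and set_all_threads_suspended(pid, 0) afterwards.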

Does this sound reasonable?

Thanks,
Sushanth