Question
We had several view processes running for hours. They had to be killed on the web server. This TWiki was installed about a month prior.
> > Subject: Processes
> >
> > Combined, these processes were eating up all the memory (2 gigs)
> > and swap (9 gigs) on NWSWWW.
> >
> > I killed them all around 9:40 last night.
> >
> > perf_dat 19774 23252 1 18:27:40 ? 106:44 /usr/bin/perl -wT view.pl
> > perf_dat 27836   683 1 14:55:14 ? 210:30 /usr/bin/perl -wT view.pl
> > perf_dat 20040   755 1 20:04:59 ?  49:27 /usr/bin/perl -wT view.pl
> > perf_dat 23023  1138 1 14:41:20 ? 208:40 /usr/bin/perl -wT view.pl
I tried to work out from the log file which topics were being viewed, but did not see any obvious recursion or anything else in those topics that might explain this behavior.
We have had at least one other TWiki in use on the same server for the past few years, and have not seen this problem there.
Environment
--
JohnBlevin - 05 Jan 2004
Answer
TWikiRelease01Feb2003 has no known recursive loop. However, I just fixed the CalendarPlugin, which had one under certain circumstances involving the including and the included topic.
To debug, try to disable all Plugins (see TWiki.cfg). Check also if Apache's error log has something unusual.
--
PeterThoeny - 06 Jan 2004
More Info
I found out later that one of our users had closed the browser while the view
script was still executing. They did this several times during the course of
the day. Could there be a problem with view processes continuing to run and
chewing up resources, if the browser is terminated before view script completes?
--
JohnBlevin - 09 Jan 2004
I imagine this is possible - Apache does have a feature to limit the CPU used by processes, but I would have thought it would just kill the child process when closing its file descriptors (which should be part of finishing the HTTP transaction). However, researching this sounds worthwhile!
--
RichardDonkin - 10 Jan 2004
This happened yesterday and again this morning with both bin/view and bin/rdiff on TWiki version 10 Mar 2004. My sysadmin is getting unhappy, as it is noticeably impairing the other sites being hosted.
I went through data/log*.txt, viewing and rdiffing some of the same pages, and couldn't reproduce the problem at that time.
The Apache error log contains lines like: "Premature end of script headers: rdiff"
LATER:
I went through the log and opened all the pages viewed since 00:00 today. Nothing untoward happened until I opened the last ~15 pages at once and then closed the browser before they finished loading. The Apache error log has some "Premature end of script headers" entries for this morning, but no errors for the 3 runaway rdiff processes I just kicked off. I've renamed rdiff to .rdiff so it can't be called.
LATER STILL: The "Premature end of script headers" message occurs when the admin kills the runaway process, so it is a result of the problem, not the cause.
--
MattWilkie - 26 Mar 2004
Hey Matt, is this running with RcsLite or RcsWrap?
--
SvenDowideit - 27 Mar 2004
RcsLite.
Renaming rdiff has helped, but not much, as it still happens with view.
--
MattWilkie - 27 Mar 2004
This isn't much help in debugging this problem, but on Apache servers where you have admin access, you can use the CPU resource limits available in Apache 1.2 or higher to kill runaway processes automatically.
--
RichardDonkin - 27 Mar 2004
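The CPU resource limits mentioned above are set with the RLimitCPU directive (RLimitMEM is its memory counterpart). The fragment below is only a sketch: the numbers are illustrative assumptions, not tuned values, and it assumes you can edit the server configuration. The limits apply to processes forked by Apache children, such as CGI scripts, not to Apache itself. On most Unix systems a process that exceeds its soft CPU limit is sent SIGXCPU, which terminates it unless the script catches the signal, so the limit should stop a runaway script rather than merely slow it down.

```apache
# Illustrative values only; adjust for your own server.
# Soft limit of 60 CPU-seconds and hard limit of 120 CPU-seconds
# for each process started on behalf of a request (e.g. bin/view).
RLimitCPU 60 120

# Soft and hard memory limits in bytes per process (256 MB / 512 MB here).
RLimitMEM 268435456 536870912
```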
#!/bin/env /usr/bin/perl
# ugly hack of a script to kill CGI's which
# have gone out of control
#
# 1) run top, get PID's of hung processes
# 2) put them in the kill line below
# 3) point browser at bin/kill-procs
# 4) edit again and comment out the kill line, remove the PID's <-- DON'T SKIP THIS STEP!
# `kill -9 ### ###`
## for some reason this doesn't work
## (probably the missing blank line between the CGI header and the body,
## added below):
#print <<THIS;
#Content-Type: text/html
#
#<html><body>Killed processes</body></html>
#THIS
Thanks to MS for helping by giving me a band-aid. On Monday I'll follow up with the sysadmin and see if the CPU limit approach is feasible. (Does that actually kill processes, or just slow them down? The description is not clear.)
--
MattWilkie - 28 Mar 2004
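Step 1 of the script above (finding the PIDs of hung processes) can be sketched in Perl against a process listing like the one in the original question. This is only an illustration: the function name runaway_pids, the 60-minute threshold, and the assumed column layout (UID PID PPID C STIME TTY TIME CMD, with TIME written as minutes:seconds) are assumptions, and the sketch only prints candidate PIDs; it does not kill anything.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample lines copied from the ps listing in the question at the top of this thread.
my @ps_lines = (
    'perf_dat 19774 23252 1 18:27:40 ? 106:44 /usr/bin/perl -wT view.pl',
    'perf_dat 27836 683 1 14:55:14 ? 210:30 /usr/bin/perl -wT view.pl',
    'perf_dat 20040 755 1 20:04:59 ? 49:27 /usr/bin/perl -wT view.pl',
    'perf_dat 23023 1138 1 14:41:20 ? 208:40 /usr/bin/perl -wT view.pl',
);

# Hypothetical helper: return the PIDs of processes whose command matches
# $pattern and whose accumulated CPU time exceeds $limit_minutes.
# Assumes eight whitespace-separated columns with TIME as minutes:seconds.
sub runaway_pids {
    my ( $pattern, $limit_minutes, @lines ) = @_;
    my @pids;
    for my $line (@lines) {
        my ( $user, $pid, $ppid, $c, $stime, $tty, $time, $cmd ) =
            split ' ', $line, 8;
        next unless defined $cmd && $cmd =~ $pattern;
        next unless $time =~ /^(\d+):(\d\d)$/;    # skip unexpected TIME formats
        my $minutes = $1 + $2 / 60;
        push @pids, $pid if $minutes > $limit_minutes;
    }
    return @pids;
}

my @pids = runaway_pids( qr/view/, 60, @ps_lines );
print "Candidates to kill: @pids\n";
```

With the sample lines, the process at 49:27 stays under the threshold and the other three are reported. Reviewing the list by hand before running kill keeps a mis-parsed line from taking down the wrong process.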
SvenDowideit may have squashed the bug over the weekend. In RcsLite.pm, change $version == $target to $version <= $target. Going on 1836 hours now with no hung CGIs!
Index: lib/TWiki/Store/RcsLite.pm
===================================================================
RCS file: /cvsroot/twiki/twiki/lib/TWiki/Store/RcsLite.pm,v
retrieving revision 1.11
diff -r1.11 RcsLite.pm
834a835
>
838c839
< if( $version == $target ) {
---
> if( $version <= $target ) {
Okay I'm pretty sure we can say this bug is now squashed.
--
MattWilkie - 29,30 Mar 2004
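Why a change from == to <= can cure a runaway process is easier to see in a small sketch. The code below is not the RcsLite code, which this thread does not show beyond the one-line diff; walk_back and its step guard are hypothetical stand-ins for any loop that counts a version number down until a termination test succeeds. If the walk ever starts below the target, an equality test can never become true, while <= stops at once.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-in for walking back through revisions; NOT the real
# RcsLite code. $done is the termination test, passed in so that both
# variants can be compared. $max_steps is a guard so that this
# demonstration cannot itself run away.
sub walk_back {
    my ( $version, $target, $done, $max_steps ) = @_;
    my $steps = 0;
    while ( !$done->( $version, $target ) ) {
        return undef if ++$steps > $max_steps;    # would have looped forever
        $version--;
    }
    return $version;
}

my %tests = (
    '==' => sub { $_[0] == $_[1] },
    '<=' => sub { $_[0] <= $_[1] },
);

for my $case ( [ 5, 3 ], [ 3, 5 ] ) {
    my ( $start, $target ) = @$case;
    for my $op ( '==', '<=' ) {
        my $end = walk_back( $start, $target, $tests{$op}, 1000 );
        printf "start=%d target=%d test '%s': %s\n", $start, $target, $op,
            defined $end ? "stopped at $end" : 'never stopped (guard tripped)';
    }
}
```

Starting at 5 with target 3, both tests stop at 3. Starting at 3 with target 5, only the <= test stops; the == test runs until the guard trips. Whether this is exactly the case that hung bin/view and bin/rdiff is an assumption.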
If you do not have root access on your hosted site, see QuickAndDirtyExecUtilityForHostedSites.
--
PeterThoeny - 01 Apr 2004
It's great that you found a fix for the problem! I was really psyched, until I found that our configuration is using RcsWrap, not RcsLite. So the recursion problem in RcsLite is apparently not what is causing the problem in our case.
Is there a similar bug in RcsWrap? I took a look through the file, but didn't find similar recursion...
The only other thing I found is that there is another TWiki site on our web server, which has CGI scripts identical to ours, except that theirs have as the first line
#!/opt/exp/bin/perl -wT [version 5.6.1]
but ours is
#!/usr/bin/perl -wT [version 5.005_03]
The two TWikis also have a different $safeEnvPath and $rcsDir (though the RCS version in both $rcsDir's is 5.7) ...
--
JohnBlevin - 22 Apr 2004
I would try changing the first line of your scripts to match the newer perl version.
The RcsLite bug just exacerbated a bug already present in Apache2. There is a workaround; see TWikiOnApache2dot0Hangs, Owiki:ApacheTwoHangs, TimeOutSavingTWikiPreferences.
--
MattWilkie - 23 Apr 2004
Our web administrator says we are running Apache 1.3.26. I guess that rules out our problem being related to Apache 2.
I'm going to update our scripts to point to perl 5.6.1; other than that, I have no leads at this point. Any ideas or suggestions?
--
JohnBlevin - 26 Apr 2004
After nearly 10 months of problem-free operation, this one bit us again today.
  PID USERNAME THR PRI NICE  SIZE   RES STATE   TIME    CPU COMMAND
18136 perf_dat   1  10    0  366M  343M cpu1   17:43 27.90% view.pl
17964 perf_dat   1  10    0  366M  343M cpu1   17:45 27.61% view.pl
20369 perf_dat   1   0    0  325M  302M run    15:46 27.49% view.pl
  263 root       6  58    0 7448K 5456K sleep  51:48  0.37% automountd
 2117 hosu       1  58    0 1776K 1456K cpu3    0:00  0.24% top-sun4u-5.8
We're still not using Apache 2 and not using RcsLite. If this happens again, they might pull the plug on our TWiki. Does anybody have any new info on this?
--
JohnBlevin - 27 Oct 2004