librelist archives

« back to archive

Workers stop responding to requests

Workers stop responding to requests

From:
Mason Jones
Date:
2010-05-17 @ 20:49
I'm hoping someone can help with suggestions of how to diagnose
this... I've got a server running with 4 workers hanging out waiting
to be given a job to do. I just found that they'd stopped responding
-- I'd go through my web interface to kick off a job, and it would
sort of vanish. In resque-web I could see the four workers, and I
could see that nothing changed when I kicked off the job, no jobs
showed in the queue, and I still saw 0 of 4 workers working.

I killed the four workers, restarted them, and everything went back to
working just fine.

I suspect that if this happened once, it'll happen again. Anyone have
suggestions for what to look at when it does, in order to try to track
down the cause? I don't have logging turned on right now, but I'll do
that of course.

One of the things that surprised me is that I wouldn't have expected
restarting the workers to fix this, because it seemed as though the
jobs weren't even getting put on the queue. I expected to see the jobs
sitting on the queue waiting for workers, but that wasn't the case.

Thanks for any thoughts.

Workers stop responding to requests

From:
Mason Jones
Date:
2010-05-21 @ 18:16
Following up on my previous messages about this... I restarted my
Resque workers yesterday, and although they're still running, as of
this morning they've stopped handling requests. I've been checking out
the stats and other information, and it's interesting. I'm hoping some
of this might make someone realize "Oh, it's because of foo!" because
it's definitely puzzling me.

- Requests are getting to Resque: I triggered a new job and I saw the
"Processed" number increment in resque-web stats
- I have VERBOSE turned on, and I do see one of the workers output the
expected "got Job"/"done Job"
- There have been 5 jobs processed, but I only see "got Job"/"done
Job" shown 4 times, which seems interesting
- I have 4 workers running; 3 of the jobs were processed by worker #3,
1 of them by worker #2
- The workers are running in the background with stdout/stderr
redirected to individual files
- I haven't yet gotten logging to work in production mode from my code
in the worker unfortunately, so I'm not sure what's happening inside
there
- In resque-web > Stats > Keys, I see:

queues	set	1
stat:processed	string	1
stat:processed:domU-12-31-39-0A-61-81:13622:load_patent	string	1
stat:processed:domU-12-31-39-0A-61-81:13633:load_patent	string	1
worker:domU-12-31-39-0A-61-81:13615:load_patent:started	string	30
worker:domU-12-31-39-0A-61-81:13622:load_patent:started	string	30
worker:domU-12-31-39-0A-61-81:13633:load_patent:started	string	30
worker:domU-12-31-39-0A-61-81:13642:load_patent:started	string	30
workers	set	4

Which seems right, I think. I triggered another job which I saw go to
worker #1, which then added the "stat:processed" key for that worker
since it hadn't gotten a job yet. So the queue processing itself is
working, I believe.

But I can't tell if the jobs are being pulled off the queue but the
worker isn't invoking my code, or what; but it's definitely odd. The
workers were happy yesterday before I restarted them and the code
hasn't changed (I just restarted them to test backgrounding vs
foregrounding).

I can leave them in this state for the moment, in case anyone has a
great idea for debugging this further while they're "stuck". Any ideas
are welcome!

And further thoughts about how to get my log.info() output to show up
from the workers in production mode would be fantastic too.

Thanks...

Re: Workers stop responding to requests

From:
Mason Jones
Date:
2010-05-23 @ 00:26
I've thankfully been able to get logging working and verified that the
problem was, not surprisingly, the "MySQL server has gone away" issue.
I know there's the gist for setting up an initializer to retry, but
given the "reconnect: true" option for Rails 2.3+, is there an
advantage to using the initializer? Certainly the reconnect option is
simpler, so I'm just curious what others have decided to use.

Thx.

On Fri, May 21, 2010 at 11:16 AM, Mason Jones <masonoise@gmail.com> wrote:
> Following up on my previous messages about this... I restarted my
> Resque workers yesterday, and although they're still running, as of
> this morning they've stopped handling requests. I've been checking out
> the stats and other information, and it's interesting. I'm hoping some
> of this might make someone realize "Oh, it's because of foo!" because
> it's definitely puzzling me.
>
> - Requests are getting to Resque: I triggered a new job and I saw the
> "Processed" number increment in resque-web stats
> - I have VERBOSE turned on, and I do see one of the workers output the
> expected "got Job"/"done Job"
> - There have been 5 jobs processed, but I only see "got Job"/"done
> Job" shown 4 times, which seems interesting
> - I have 4 workers running; 3 of the jobs were processed by worker #3,
> 1 of them by worker #2
> - The workers are running in the background with stdout/stderr
> redirected to individual files
> - I haven't yet gotten logging to work in production mode from my code
> in the worker unfortunately, so I'm not sure what's happening inside
> there
> - In resque-web > Stats > Keys, I see:
>
> queues  set     1
> stat:processed  string  1
> stat:processed:domU-12-31-39-0A-61-81:13622:load_patent string  1
> stat:processed:domU-12-31-39-0A-61-81:13633:load_patent string  1
> worker:domU-12-31-39-0A-61-81:13615:load_patent:started string  30
> worker:domU-12-31-39-0A-61-81:13622:load_patent:started string  30
> worker:domU-12-31-39-0A-61-81:13633:load_patent:started string  30
> worker:domU-12-31-39-0A-61-81:13642:load_patent:started string  30
> workers set     4
>
> Which seems right, I think. I triggered another job which I saw go to
> worker #1, which then added the "stat:processed" key for that worker
> since it hadn't gotten a job yet. So the queue processing itself is
> working, I believe.
>
> But I can't tell if the jobs are being pulled off the queue but the
> worker isn't invoking my code, or what; but it's definitely odd. The
> workers were happy yesterday before I restarted them and the code
> hasn't changed (I just restarted them to test backgrounding vs
> foregrounding).
>
> I can leave them in this state for the moment, in case anyone has a
> great idea for debugging this further while they're "stuck". Any ideas
> are welcome!
>
> And further thoughts about how to get my log.info() output to show up
> from the workers in production mode would be fantastic too.
>
> Thanks...
>

Re: [resque] Re: Workers stop responding to requests

From:
Tony Arcieri
Date:
2010-05-23 @ 22:43
On Sat, May 22, 2010 at 6:26 PM, Mason Jones <masonoise@gmail.com> wrote:

> I've thankfully been able to get logging working and verified that the
> problem was, not surprisingly, the "MySQL server has gone away" issue.
> I know there's the gist for setting up an initializer to retry, but
> given the "reconnect: true" option for Rails 2.3+, is there an
> advantage to using the initializer? Certainly the reconnect option is
> simpler, so I'm just curious what others have decided to use.
>

I decided to use JRuby, which goes through JDBC for database connectivity
and the JDBC drivers are a lot more robust.

-- 
Tony Arcieri
Medioh! A Kudelski Brand

Re: [resque] Re: Workers stop responding to requests

From:
Mason Jones
Date:
2010-05-23 @ 22:57
On Sun, May 23, 2010 at 3:43 PM, Tony Arcieri <tony.arcieri@medioh.com> wrote:
> On Sat, May 22, 2010 at 6:26 PM, Mason Jones <masonoise@gmail.com> wrote:
>>
>> I've thankfully been able to get logging working and verified that the
>> problem was, not surprisingly, the "MySQL server has gone away" issue.
>> I know there's the gist for setting up an initializer to retry, but
>> given the "reconnect: true" option for Rails 2.3+, is there an
>> advantage to using the initializer? Certainly the reconnect option is
>> simpler, so I'm just curious what others have decided to use.
>
> I decided to use JRuby, which goes through JDBC for database connectivity
> and the JDBC drivers are a lot more robust.

That's an interesting point, thanks. At the moment we're running a
pretty vanilla setup, but it's a new app. I've considered looking into
JRuby, as well as Rubinius given the new release. I hadn't considered
that JRuby would use JDBC, but that makes sense. After many many years
working with Java I sort of felt like I was done dealing with the JVM,
but...perhaps not.

Re: [resque] Re: Workers stop responding to requests

From:
Luke Antins
Date:
2010-05-23 @ 10:07
Mason Jones wrote:
> I've thankfully been able to get logging working and verified that the
> problem was, not surprisingly, the "MySQL server has gone away" issue.

Grate! Glad you got the problem tracked down, hopefully your posts will help 
others in the future!

> I know there's the gist for setting up an initializer to retry, but
> given the "reconnect: true" option for Rails 2.3+, is there an
> advantage to using the initializer? Certainly the reconnect option is
> simpler, so I'm just curious what others have decided to use.

I use Sequel myself, and I don't use Rails so I could be very wrong here..
But I'm sure the initilizer stuff was mostly done because there wasn't a 
reconnect option previously?

I'd suggest giving it a try with `reconnect: true`, if it doesn't solve the 
problem give the initilizer route a go.

-- 
Luke Antins
http://lividpenguin.com

Re: [resque] Re: Workers stop responding to requests

From:
Mason Jones
Date:
2010-05-23 @ 20:44
On Sun, May 23, 2010 at 3:07 AM, Luke Antins <luke@lividpenguin.com> wrote:
> Mason Jones wrote:
>> I've thankfully been able to get logging working and verified that the
>> problem was, not surprisingly, the "MySQL server has gone away" issue.
>
> Grate! Glad you got the problem tracked down, hopefully your posts will help
> others in the future!

Yes, I hope so. The issue with logging not showing up in production,
I'm sad to say, was me being an idiot. The buffered logging in Rails
production got me, because of course the work is spread across
multiple workers and none of them were reaching the buffer size before
they died. A simple logger.flush in the right places made all the
difference. Sigh.

>> I know there's the gist for setting up an initializer to retry, but
>> given the "reconnect: true" option for Rails 2.3+, is there an
>> advantage to using the initializer? Certainly the reconnect option is
>> simpler, so I'm just curious what others have decided to use.
>
> I use Sequel myself, and I don't use Rails so I could be very wrong here..
> But I'm sure the initilizer stuff was mostly done because there wasn't a
> reconnect option previously?
>
> I'd suggest giving it a try with `reconnect: true`, if it doesn't solve the
> problem give the initilizer route a go.

That was my thinking, but since the Resque page on github refers to a
gist I figured I would check. The reconnect appears to be doing the
trick so far.

Thanks for the thoughts.

Re: [resque] Workers stop responding to requests

From:
Chris Wanstrath
Date:
2010-05-17 @ 21:23
On Mon, May 17, 2010 at 1:49 PM, Mason Jones <masonoise@gmail.com> wrote:

> I'm hoping someone can help with suggestions of how to diagnose
> this... I've got a server running with 4 workers hanging out waiting
> to be given a job to do. I just found that they'd stopped responding
> -- I'd go through my web interface to kick off a job, and it would
> sort of vanish. In resque-web I could see the four workers, and I
> could see that nothing changed when I kicked off the job, no jobs
> showed in the queue, and I still saw 0 of 4 workers working.

Check the "processed" stat on the "/stats/resque" resque-web page.
Does it go up?

> I suspect that if this happened once, it'll happen again. Anyone have
> suggestions for what to look at when it does, in order to try to track
> down the cause? I don't have logging turned on right now, but I'll do
> that of course.

Logging with VVERBOSE might help too.

Re: [resque] Workers stop responding to requests

From:
Mason Jones
Date:
2010-05-17 @ 23:09
On Mon, May 17, 2010 at 2:23 PM, Chris Wanstrath <chris@ozmm.org> wrote:
> On Mon, May 17, 2010 at 1:49 PM, Mason Jones <masonoise@gmail.com> wrote:
>
>> I'm hoping someone can help with suggestions of how to diagnose
>> this... I've got a server running with 4 workers hanging out waiting
>> to be given a job to do. I just found that they'd stopped responding
>> -- I'd go through my web interface to kick off a job, and it would
>> sort of vanish. In resque-web I could see the four workers, and I
>> could see that nothing changed when I kicked off the job, no jobs
>> showed in the queue, and I still saw 0 of 4 workers working.
>
> Check the "processed" stat on the "/stats/resque" resque-web page.
> Does it go up?

Interesting question. I'll check that out if/when this happens again.

Right now I just got an alert letting me know that all 4 workers were
gone, which they were, with no info about how/why. I now have logging
running so we'll see what I can learn from that. Very odd stuff. They
weren't processing any jobs when they vanished, but that's all I can
tell right now.

>> I suspect that if this happened once, it'll happen again. Anyone have
>> suggestions for what to look at when it does, in order to try to track
>> down the cause? I don't have logging turned on right now, but I'll do
>> that of course.
>
> Logging with VVERBOSE might help too.

Sounds good. I noticed that the regular VERBOSE logging doesn't
include timestamps... It would be nice to have that to know when a job
was processed and completed. I can look into adding that unless
there's a reason why it wasn't done.

Thanks.

Re: [resque] Workers stop responding to requests

From:
Chris Wanstrath
Date:
2010-05-18 @ 21:31
On Mon, May 17, 2010 at 4:09 PM, Mason Jones <masonoise@gmail.com> wrote:

> Sounds good. I noticed that the regular VERBOSE logging doesn't
> include timestamps... It would be nice to have that to know when a job
> was processed and completed. I can look into adding that unless
> there's a reason why it wasn't done.

Good point. I'd like to improve the logging in 2.0, this seems like a
good candidate.

Re: [resque] Workers stop responding to requests

From:
Mason Jones
Date:
2010-05-18 @ 22:55
On Tue, May 18, 2010 at 2:31 PM, Chris Wanstrath <chris@ozmm.org> wrote:
> On Mon, May 17, 2010 at 4:09 PM, Mason Jones <masonoise@gmail.com> wrote:
>
>> Sounds good. I noticed that the regular VERBOSE logging doesn't
>> include timestamps... It would be nice to have that to know when a job
>> was processed and completed. I can look into adding that unless
>> there's a reason why it wasn't done.
>
> Good point. I'd like to improve the logging in 2.0, this seems like a
> good candidate.

Another quick note is that in the VVERBOSE logging, it appears that
the timestamps are being shown in 12-hour format instead of 24-hour?
I'm seeing lines from 21:15 which are showing as 09:15.

I'm still struggling with workers simply vanishing -- just happened
again last night and then again an hour ago. I had VVERBOSE on, but it
doesn't show anything too useful, unfortunately. Just seems like the
processes disappear for no reason, while they're not even working, and
the last thing I see in the logging output is the usual:

** [09:15:21 2010-05-18] 11091: Checking load_task
** [09:15:21 2010-05-18] 11091: Sleeping for 5
** [09:15:21 2010-05-18] 11091: resque-1.8.1: Waiting for load_task

This was using the same versions of everything I have on my Mac, where
I've had a worker running for well over a week, but for the hell of it
I just updated resque to 1.8.5, redis-namespace to 0.4.4, and redis-rb
to 1.0.6 to see if it makes any difference (running Redis 1.3.8).
We'll see how it goes now. I have them running in separate terminals
going to stdout now so I can see exactly when they die, even if not
necessarily how.

Ideas are welcome! It's a weird situation that may force me to move to
something other than resque until we can figure it out, but I don't
want to...

Re: [resque] Workers stop responding to requests

From:
Luke Antins
Date:
2010-05-18 @ 23:54
Mason Jones wrote:

> Ideas are welcome! It's a weird situation that may force me to move to
> something other than resque until we can figure it out, but I don't
> want to...

Curious...
Are the workers and the Redis server on the same host? or separate machines?

And just to clarify...
The actually worker processes are still running? Just not processing any jobs 
from the queue or displaying in the web interface? (ie, they still show up in a 
`ps aux` or `top`)

Or do the worker processes die?

-- 
Luke Antins
http://lividpenguin.com

Re: [resque] Workers stop responding to requests

From:
Mason Jones
Date:
2010-05-19 @ 05:13
On Tue, May 18, 2010 at 4:54 PM, Luke Antins <luke@lividpenguin.com> wrote:
> Mason Jones wrote:
>
>> Ideas are welcome! It's a weird situation that may force me to move to
>> something other than resque until we can figure it out, but I don't
>> want to...
>
> Curious...
> Are the workers and the Redis server on the same host? or separate machines?

They're on the same machine right now. The Redis server is currently
being used only by Resque, as well.

> And just to clarify...
> The actually worker processes are still running? Just not processing any jobs
> from the queue or displaying in the web interface? (ie, they still show up in a
> `ps aux` or `top`)
>
> Or do the worker processes die?

They actually die -- gone from the web interface, gone from Redis (not
appearing in the worker list), and gone from a 'ps' list. It's almost
as if they're being sent a "kill -9". I checked /var/log/messages for
anything from the out-of-memory killer just in case, but nothing (as
expected, since there seems to be plenty of memory remaining at any
given time).

Re: [resque] Workers stop responding to requests

From:
Luke Antins
Date:
2010-05-19 @ 05:47
Mason Jones wrote:
> On Tue, May 18, 2010 at 4:54 PM, Luke Antins<luke@lividpenguin.com>  wrote:
>> Mason Jones wrote:
>>
>>> Ideas are welcome! It's a weird situation that may force me to move to
>>> something other than resque until we can figure it out, but I don't
>>> want to...
>> Curious...
>> Are the workers and the Redis server on the same host? or separate machines?
>
> They're on the same machine right now. The Redis server is currently
> being used only by Resque, as well.

Okay, shouldn't be an issue here if its all local.

>> And just to clarify...
>> The actually worker processes are still running? Just not processing any jobs
>> from the queue or displaying in the web interface? (ie, they still show up in a
>> `ps aux` or `top`)
>>
>> Or do the worker processes die?
>
> They actually die -- gone from the web interface, gone from Redis (not
> appearing in the worker list), and gone from a 'ps' list. It's almost
> as if they're being sent a "kill -9". I checked /var/log/messages for
> anything from the out-of-memory killer just in case, but nothing (as
> expected, since there seems to be plenty of memory remaining at any
> given time).

Have you seen the problem since firing up the workers in another terminal?
If so, did you get a backtrace/error this time round?

I've usually had my workers running under `screen`, so if things do go sour, I 
can then re-attach to the session and see all the scrollback.

This isn't a 'fix', but you could always use monit/god to restart the workers 
should they die?

-- 
Luke Antins
http://lividpenguin.com

Re: [resque] Workers stop responding to requests

From:
Mason Jones
Date:
2010-05-19 @ 16:35
On Tue, May 18, 2010 at 10:47 PM, Luke Antins <luke@lividpenguin.com> wrote:
> Mason Jones wrote:
>>> And just to clarify...
>>> The actually worker processes are still running? Just not processing any jobs
>>> from the queue or displaying in the web interface? (ie, they still 
show up in a
>>> `ps aux` or `top`)
>>>
>>> Or do the worker processes die?
>>
>> They actually die -- gone from the web interface, gone from Redis (not
>> appearing in the worker list), and gone from a 'ps' list. It's almost
>> as if they're being sent a "kill -9". I checked /var/log/messages for
>> anything from the out-of-memory killer just in case, but nothing (as
>> expected, since there seems to be plenty of memory remaining at any
>> given time).
>
> Have you seen the problem since firing up the workers in another terminal?
> If so, did you get a backtrace/error this time round?

It's interesting, I haven't seen the problem since yesterday -- my
workers are still running happily, which is longer than they've lived
so far on this server.

> I've usually had my workers running under `screen`, so if things do go sour, I
> can then re-attach to the session and see all the scrollback.

Funny you should say that -- that's exactly what I did yesterday, so I
could catch things better. It's obviously easier and clearer than
having the workers all running in the background logging to files.

> This isn't a 'fix', but you could always use monit/god to restart the workers
> should they die?

I definitely intend to have that setup shortly anyway, but...like you
say, it's of course not a fix. If I can figure out what's going on
here, I can hopefully provide the info for anyone else to avoid it,
too.

We'll see what happens today, if they keep running or not. If they do,
then I'll have to wonder why! Could be the combination of versions of
the various gems, potentially (resque, redis-rb, redis-namespace, and
redis itself).

Re: [resque] Workers stop responding to requests

From:
Chris Wanstrath
Date:
2010-05-19 @ 17:07
On Tue, May 18, 2010 at 10:47 PM, Luke Antins <luke@lividpenguin.com> wrote:

> Have you seen the problem since firing up the workers in another terminal?
> If so, did you get a backtrace/error this time round?
>
> I've usually had my workers running under `screen`, so if things do go sour, I
> can then re-attach to the session and see all the scrollback.

+1 to this. It's also my preferred way of tracking down worker issues.