librelist archives

dead worker pruning problem when migrating to a later version of Resque

From: Andrew W. Gibbs
Date: 2010-06-27 @ 20:40
Resque Savants,

I started a sub-project for a client using Resque-1.2.3 back in
December and got a decent system up and running.  More recently we
have been attempting to migrate to Resque-1.9, but we have come across
an issue in which starting up a new worker renders Resque unaware of
all previously registered workers.  The problem seems to center on
Resque's use of the "ps" command and on assumptions about how workers
are started that are easy to violate.  The seemingly relevant code is
in lib/resque/worker.rb...

##########

def prune_dead_workers
  all_workers = Worker.all
  known_workers = worker_pids unless all_workers.empty?
  all_workers.each do |worker|
    host, pid, queues = worker.id.split(':')
    next unless host == hostname         # only prune workers on this host
    next if known_workers.include?(pid)  # skip workers "ps" says are alive
    log! "Pruning dead worker: #{worker}"
    worker.unregister_worker
  end
end

def worker_pids
  # "comm" only shows the on-disk executable name, never a title the
  # process has set for itself at runtime
  `ps -A -o pid,comm | grep [r]esque`.split("\n").map do |line|
    line.split(' ')[0]
  end
end

##########

I can see very plainly why I'm running into trouble.  I have a script
called process_jobs that does some initial configuration and then
invokes Worker.work.  If this script gets invoked multiple times (say,
by Monit) then Resque (per the UI rendered by resque-web) shows only
the most recently launched worker as being registered.

I just fired up my script with the following command (ignore the dummy
UUID first arg; second arg is the queue list)...

#############

RAILS_ENV=production ./script/process_jobs abcdefghij unfetched_cases

#############

... and in the resque-web UI noted a PID of 23848.  Then I went to the
command line ...

#############

41 ~ $ ps -ef | grep 23848 | grep -v grep
1000     23848 20887  1 12:35 pts/3    00:00:04 resque-1.9.4: Waiting for unfetched_cases
242 ~ $ ps -A -o pid,comm | grep 23848
23848 process_jobs
243 ~ $ ps -A -o pid,args | grep 23848
23848 resque-1.9.4: Waiting for unfetched_cases

#############

As you can see, the "ps -A -o pid,comm" command, invoked just like
Worker.worker_pids invokes it, is not going to return anything that
would survive a "grep resque".  Consequently, when a new worker starts
up and invokes prune_dead_workers, the previously registered workers
are going to get whacked.

Per the "ps" man page, with the "comm" format "modifications to the
command name will not be shown" whereas they will with the "args"
format.
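
Just to make the contrast concrete, here is how I would expect
worker_pids to look using the "args" format instead; this is my own
sketch, not code from any Resque release...

##########

def worker_pids
  # "args" reflects the title the worker sets for itself at runtime
  # (e.g. "resque-1.9.4: Waiting for unfetched_cases"), so the grep
  # matches no matter what the launching script was called
  `ps -A -o pid,args | grep [r]esque`.split("\n").map do |line|
    line.split(' ')[0]
  end
end

##########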

I came up with a simple but inelegant solution for my own problems.  I
renamed the "process_jobs" script to "process_resque_jobs".  That
said, this is a pretty brittle fix, and I suspect other users of
Resque are going to get bitten like I did.  Is there a better way to
manage this stuff than "ps"?  Even the "worker" Rake task that comes
stock with Resque exhibits the problem since the "comm" ps format
shows "rake" instead of "resque".

What I still find confusing is why this issue does not manifest in
Resque-1.2.3.  Did process management/naming change between 1.2.3 and
1.9?  The same "ps" idiom was in use in resque-1.2.3 but I just
verified that I don't have this problem under 1.2.3.

I am also wondering if the following is somehow related as I don't
understand its origin...

##################

248 ~ $ ps -ef | grep term | grep resque
1000     22514 20887  0 11:23 pts/3    00:00:00 /home/racetrack/.tmp/set_term_title_temp/resque-1.9.4 10
1000     22569 20887  0 11:26 pts/3    00:00:00 /home/racetrack/.tmp/set_term_title_temp/resque-1.9.4 10
1000     22592 20887  0 11:27 pts/3    00:00:00 /home/racetrack/.tmp/set_term_title_temp/resque-1.9.4 10
1000     23394 20887  0 12:13 pts/3    00:00:00 /home/racetrack/.tmp/set_term_title_temp/resque-1.9.4 10
1000     23400 20887  0 12:13 pts/3    00:00:00 /home/racetrack/.tmp/set_term_title_temp/resque-1.9.4 10
1000     23405 20887  0 12:14 pts/3    00:00:00 /home/racetrack/.tmp/set_term_title_temp/resque-1.9.4 10

###################

Any insight you have would be much appreciated...

  -- AWG

Re: [resque] dead worker pruning problem when migrating to a later version of Resque

From: Chris Wanstrath
Date: 2010-06-28 @ 17:36
The newest version of Resque should be using command, not comm. There was 
a patch that changed this behavior but it has since been reverted.

http://github.com/defunkt/resque/blob/master/lib/resque/worker.rb
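
In one-line terms, the difference between the two idioms is just the
format keyword (paraphrasing here; the exact code is at the link
above):

##########

# comm-based idiom (the patched behavior; misses retitled workers):
`ps -A -o pid,comm | grep [r]esque`

# command-based idiom ("command" is an alias for "args", so the
# rewritten process title survives the grep):
`ps -A -o pid,command | grep [r]esque`

##########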

What version of Resque are you using?


Re: [resque] dead worker pruning problem when migrating to a later version of Resque

From: Andrew W. Gibbs
Date: 2010-06-30 @ 02:09
We recently started testing against 1.9.4, which appears to be the
latest release that was still using "comm" instead of "command".  It
was just bad luck that we didn't procrastinate a couple of days and
snag the 1.9.5 release.  We were using 1.2.3 prior to 1.9.4, which
apparently used "command" instead of "comm".  I guess we should
grab 1.9.6 and work with that.

  -- AWG
