librelist archives

« back to archive

Reliable fetch double execution on Heroku

Reliable fetch double execution on Heroku

From:
Felix Holmgren
Date:
2014-05-08 @ 13:29
Hi!


We’ve noticed that Sidekiq Pro’s reliable fetch functionality very 
frequently causes the same job to be run repeatedly if dynos are shut down
during job execution. Here’s a minimal app illustrating the problem:
https://github.com/Felixyz/reliable-fetch-heroku-bug




Running on Heroku here:
http://sidekiq-hk-bug.herokuapp.com/




The procedure for reproducing the bug is:


1) Trigger job enqueuing (by hitting the button)
2) Within a few seconds, restart the dynos (using Heroku’s restart command)


Without reliable fetch activated, the problem doesn’t seem to be triggered
(or at least very rarely). This seems odd to me, since Sidekiq tries to 
enqueue jobs even without reliable fetch.



Here are a few lines excerpted from the above app’s log:



2014-05-08T12:39:00.884320+00:00 app[sidekiq.1]: **DEBUG** 
jid:1ae95404f033f34e3154950a uuid:37bedb91-c794-4cf9-bfac-e81bb571af85   
dyno: sidekiq.1 - START
2014-05-08T12:39:10.117901+00:00 app[sidekiq.1]: 2014-05-08T12:39:10Z 2 
TID-owch50gww INFO: Shutting down
2014-05-08T12:39:10.122992+00:00 app[sidekiq.1]: 2014-05-08T12:39:10Z 2 
TID-owch68c0o INFO: Shutting down 0 quiet workers
2014-05-08T12:39:10.123088+00:00 app[sidekiq.1]: 2014-05-08T12:39:10Z 2 
TID-owch68c0o INFO: Pausing up to 8 seconds to allow workers to finish...
2014-05-08T12:39:11.088661+00:00 app[sidekiq.1]: 2014-05-08T12:39:11Z 2 
TID-otwu7vpm0 WARN: ReliableFetch: recovering work on 9 messages
=====> 2014-05-08T12:39:11.141967+00:00 app[sidekiq.1]: **DEBUG** 
jid:1ae95404f033f34e3154950a uuid:37bedb91-c794-4cf9-bfac-e81bb571af85   
dyno: sidekiq.1 - START  <=====
2014-05-08T12:39:15.900254+00:00 app[sidekiq.1]: **DEBUG** 
jid:1ae95404f033f34e3154950a uuid:37bedb91-c794-4cf9-bfac-e81bb571af85   
dyno: sidekiq.1 - DONE
2014-05-08T12:39:18.124198+00:00 app[sidekiq.1] - 'Work still in progress'
2014-05-08T12:39:21.145543+00:00 app[sidekiq.1]: **DEBUG** 
jid:1ae95404f033f34e3154950a uuid:37bedb91-c794-4cf9-bfac-e81bb571af85   
dyno: sidekiq.1 - !!! --This job has already been executed-- !!!
2014-05-08T12:39:21.146694+00:00 app[sidekiq.1]: **DEBUG** 
jid:1ae95404f033f34e3154950a uuid:37bedb91-c794-4cf9-bfac-e81bb571af85   
dyno: sidekiq.1 - DONE



In the line marked with arrows (=====>) the job is restarted while the 
previous worker is still running. (Pardon me if I’m using dated 
terminology.)


It seems to me that there are two possibilities:


1) Heroku has started a new dyno with the same name (’sidekiq.1’) before 
the old one has yet been fully shut down. This is not my understanding of 
how Heroku’s platform works, but I could be wrong.
2) Sidekiq has returned the job to the queue, but the ’sidekiq.1’ process 
(dyno) immediately picks it up again, thus running two workers processing 
the same job simultaneously. If this is the case, I assume it would count 
as a bug in Sidekiq.


But even if the job wasn’t picked up by the same dyno, the fact that 
Sidekiq can’t prevent jobs from being run multiple times makes our worker 
code considerably more complex. I understand this is considered correct 
according to Sidkiq’s contract 
(https://github.com/mperham/sidekiq/blob/master/lib/sidekiq/manager.rb#L211-216
), but it’s quite burdensome to check for possible job duplication, and 
there are (in our case) possible race conditions that we need to guard 
against with database locking. I suspect this would be the case for many 
apps.


In short, this is pretty unexpected and annoying. The problem occurs 
almost every time the app is restarted while jobs are running.



Have I missed something? Are there any obvious ways of handling this?


All best
/Felix


Sent with Unibox

Re: [sidekiq] Reliable fetch double execution on Heroku

From:
Mike Perham
Date:
2014-05-08 @ 15:36
It looks like option 1 is what is happening.  The old process is still
working on the job (and therefore the job is still in the private queue in
Redis) while the new process starts up and looks for jobs to recover within
the same private queue.  You might want to confirm with Heroku support that
new and old processes can co-exist at the same time but that wouldn't
surprise me at all.

When shutting down, Sidekiq Pro will attempt to move jobs in the private
queue back onto the public queue so that they can be restarted by any
process but this isn't done until the default 8 sec job timeout has passed.



On Thu, May 8, 2014 at 6:29 AM, Felix Holmgren <felix.holmgren@topvisible.se
> wrote:

>  Hi!
>
> We’ve noticed that Sidekiq Pro’s reliable fetch functionality very
> frequently causes the same job to be run repeatedly if dynos are shut down
> during job execution. Here’s a minimal app illustrating the problem:
> https://github.com/Felixyz/reliable-fetch-heroku-bug
>
> Running on Heroku here:
> http://sidekiq-hk-bug.herokuapp.com/
>
> The procedure for reproducing the bug is:
>
> 1) Trigger job enqueuing (by hitting the button)
> 2) Within a few seconds, restart the dynos (using Heroku’s restart
> command)
>
> Without reliable fetch activated, the problem doesn’t seem to be triggered
> (or at least very rarely). This seems odd to me, since Sidekiq tries to
> enqueue jobs even without reliable fetch.
>
> Here are a few lines excerpted from the above app’s log:
>
> 2014-05-08T12:39:00.884320+00:00 app[sidekiq.1]: **DEBUG**
> jid:1ae95404f033f34e3154950a uuid:37bedb91-c794-4cf9-bfac-e81bb571af85
> dyno: sidekiq.1 - START
> 2014-05-08T12:39:10.117901+00:00 app[sidekiq.1]: 2014-05-08T12:39:10Z 2
> TID-owch50gww INFO: Shutting down
> 2014-05-08T12:39:10.122992+00:00 app[sidekiq.1]: 2014-05-08T12:39:10Z 2
> TID-owch68c0o INFO: Shutting down 0 quiet workers
> 2014-05-08T12:39:10.123088+00:00 app[sidekiq.1]: 2014-05-08T12:39:10Z 2
> TID-owch68c0o INFO: Pausing up to 8 seconds to allow workers to finish...
> 2014-05-08T12:39:11.088661+00:00 app[sidekiq.1]: 2014-05-08T12:39:11Z 2
> TID-otwu7vpm0 WARN: ReliableFetch: recovering work on 9 messages
> =====> 2014-05-08T12:39:11.1419 67+00:00 app[sidekiq.1]: **DEBUG**
> jid:1ae95404f033f34e3154950a uuid:37bedb91-c794-4cf9-bfac-e81bb571af85
> dyno: sidekiq.1 - START  <=====
> 2014-05-08T12:39:15.900254+00:00 app[sidekiq.1]: **DEBUG**
> jid:1ae95404f033f34e3154950a uuid:37bedb91-c794-4cf9-bfac-e81bb571af85
> dyno: sidekiq.1 - DONE
> 2014-05-08T12:39:18.124198+00:00 app[sidekiq.1] - 'Work still in progress'
> 2014-05-08T12:39:21.145543+00:00 app[sidekiq.1]: **DEBUG**
> jid:1ae95404f033f34e3154950a uuid:37bedb91-c794-4cf9-bfac-e81bb571af85
> dyno: sidekiq.1 - !!! --This job has already been executed-- !!!
> 2014-05-08T12:39:21.146694+00:00 app[sidekiq.1]: **DEBUG**
> jid:1ae95404f033f34e3154950a uuid:37bedb91-c794-4cf9-bfac-e81bb571af85
> dyno: sidekiq.1 - DONE
>
> In the line marked with arrows (=====> ) the job is restarted while the
> previous worker is still running. (Pardon me if I’m using dated
> terminology.)
>
> It seems to me that there are two possibilities:
>
> 1) Heroku has started a new dyno with the same name (’sidekiq.1’) before
> the old one has yet been fully shut down. This is not my understanding of
> how Heroku’s platform works, but I could be wrong.
> 2) Sidekiq has returned the job to the queue, but the ’sidekiq.1’ process
> (dyno) immediately picks it up again, thus running two workers processing
> the same job simultaneously. If this is the case, I assume it would count
> as a bug in Sidekiq.
>
> But even if the job wasn’t picked up by the *same* dyno, the fact that
> Sidekiq can’t prevent jobs from being run multiple times makes our worker
> code considerably more complex. I understand this is considered correct
> according to Sidkiq’s contract (
> https://github.com/mperham/sidekiq/blob/master/lib/sidekiq/manager.rb#L211-216),
> but it’s quite burdensome to check for possible job duplication, and there
> are (in our case) possible race conditions that we need to guard against
> with database locking. I suspect this would be the case for many apps.
>
> In short, this is pretty unexpected and annoying. The problem occurs
> almost *every time* the app is restarted while jobs are running.
>
> Have I missed something? Are there any obvious ways of handling this?
>
> All best
> /Felix
>
> Sent with Unibox <http://www.uniboxapp.com/t/sig>
>



-- 
Mike Perham | mike@sidekiq.org