librelist archives

Do you treat Redis as stateful in your Resque deployment?

From:
Tony Arcieri
Date:
2010-03-22 @ 22:23
I'm writing a script to re-enqueue failed jobs.  The jobs are failing
largely due to bugs in the workers or other network infrastructure problems
we're still resolving.  Whenever we resolve any of these issues, I'd like
for the failed jobs to automatically get retried.

I'm wondering about the best way to handle this.  I'd almost like to track
the entire lifecycle of jobs within the database.  An alternative would be
to depend on Redis for this persistence.  However, I tend to think of my
message queues (as well as the job workers) as being stateless, so that I
can retry jobs based completely on the database state, rather than depending
on the queue or the workers to handle errors.

Beyond that, we're working in a stateless environment (EC2).  We can set up
persistent storage for Redis, but do we really have to?  We already have
mechanisms in place to persist, monitor, and backup the database state.  Do
we really need to do this all for Redis as well?

My inclination is to treat Redis as a stateless job dispatcher and manage
the state for the job lifecycle in a database.  Does anyone else use this
approach?

-- 
Tony Arcieri
Medioh! A Kudelski Brand

Re: [resque] Do you treat Redis as stateful in your Resque deployment?

From:
Brian O'Rourke
Date:
2010-03-22 @ 22:38
Hi Tony,

At CrowdFlower we extended Resque's failure subsystem to handle this.
Basically we wrap the failed jobs of certain workers with a new class which
tracks the number of times a job has been attempted, and we use the
excellent resque_scheduler plugin to retry the job again in the future.

Have a look, see if this helps you to solve your problem:
http://gist.github.com/340622
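In case the gist is unavailable, the shape of the idea is roughly this. This is a standalone sketch, not the CrowdFlower code: the resque_scheduler call (`Resque.enqueue_in`) is replaced with a recorded delay so the example runs on its own, and all names are illustrative.

```ruby
# Sketch of a failure handler that counts attempts and re-enqueues with a
# delay. In a real setup, `@rescheduled << delay` would instead be
# resque_scheduler's Resque.enqueue_in(delay, job_class, *args).
class RetriedFailure
  MAX_ATTEMPTS = 3

  attr_reader :payload, :rescheduled

  def initialize(payload)
    @payload = payload
    @rescheduled = []
  end

  # Called by Resque's failure subsystem when a job raises.
  def save
    attempts = payload.fetch('attempts', 0) + 1
    payload['attempts'] = attempts
    if attempts < MAX_ATTEMPTS
      delay = 60 * (2**(attempts - 1)) # simple exponential backoff
      @rescheduled << delay            # stands in for Resque.enqueue_in
    end
  end
end

failure = RetriedFailure.new('class' => 'ImportJob', 'args' => [42])
3.times { failure.save }
failure.payload['attempts']  # 3 attempts recorded; the third is not rescheduled
```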

--bpo



On Mon, Mar 22, 2010 at 3:23 PM, Tony Arcieri <tony@medioh.com> wrote:

> I'm writing a script to re-enqueue failed jobs.  The jobs are failing
> largely due to bugs in the workers or other network infrastructure problems
> we're still resolving.  Whenever we resolve any of these issues, I'd like
> for the failed jobs to automatically get retried.
>
> I'm wondering about the best way to handle this.  I'd almost like to track
> the entire lifecycle of jobs within the database.  An alternative would be
> to depend on Redis for this persistence.  However, I tend to think of my
> message queues (as well as the job workers) as being stateless, so that I
> can retry jobs based completely on the database state, rather than depending
> on the queue or the workers to handle errors.
>
> Beyond that, we're working in a stateless environment (EC2).  We can set up
> persistent storage for Redis, but do we really have to?  We already have
> mechanisms in place to persist, monitor, and backup the database state.  Do
> we really need to do this all for Redis as well?
>
> My inclination is to treat Redis as a stateless job dispatcher and manage
> the state for the job lifecycle in a database.  Does anyone else use this
> approach?
>
> --
> Tony Arcieri
> Medioh! A Kudelski Brand
>

Re: [resque] Do you treat Redis as stateful in your Resque deployment?

From:
Tony Arcieri
Date:
2010-03-22 @ 23:11
On Mon, Mar 22, 2010 at 4:38 PM, Brian O'Rourke <brian@crowdflower.com> wrote:

> Hi Tony,
>
> At CrowdFlower we extended Resque's failure subsystem to handle this.
> Basically we wrap the failed jobs of certain workers with a new class which
> tracks the number of times a job has been attempted, and we use the
> excellent resque_scheduler plugin to retry the job again in the future.
>

What failure modes is this equipped to handle?  I suppose I'm still
uncertain of exactly how Resque handles failed jobs, but what if one of the
workers consumes a job and then a total node failure occurs while the job is
being processed?  Will it somehow wind up as a failed job?  Or will the job
be lost?

For our use case we cannot stand to lose any jobs.  If jobs fail in any
manner they must be retried, and we need to cover all failure modes.




-- 
Tony Arcieri
Medioh! A Kudelski Brand

Re: [resque] Do you treat Redis as stateful in your Resque deployment?

From:
Brian O'Rourke
Date:
2010-03-22 @ 23:41
Tony,

The code provided is equipped to handle "normal" failures, the sort that
will be handled by this code:
http://github.com/defunkt/resque/blob/master/lib/resque/worker.rb#L142

There are catastrophic failure conditions, like total node failure or
failure to fork, that would not be handled by the failure subsystem.
Handling every conceivable error may be challenging.

--bpo


On Mon, Mar 22, 2010 at 4:11 PM, Tony Arcieri <tony@medioh.com> wrote:

> On Mon, Mar 22, 2010 at 4:38 PM, Brian O'Rourke <brian@crowdflower.com> wrote:
>
>> Hi Tony,
>>
>> At CrowdFlower we extended Resque's failure subsystem to handle this.
>> Basically we wrap the failed jobs of certain workers with a new class which
>> tracks the number of times a job has been attempted, and we use the
>> excellent resque_scheduler plugin to retry the job again in the future.
>>
>
> What failure modes is this equipped to handle?  I suppose I'm still
> uncertain of exactly how Resque handles failed jobs, but what if one of the
> workers consumes a job and then a total node failure occurs while the job is
> being processed?  Will it somehow wind up as a failed job?  Or will the job
> be lost?
>
> For our use case we cannot stand to lose any jobs.  If jobs fail in any
> manner they must be retried, and we need to cover all failure modes.
>
>
>
>
> --
> Tony Arcieri
> Medioh! A Kudelski Brand
>

Re: [resque] Do you treat Redis as stateful in your Resque deployment?

From:
Tony Arcieri
Date:
2010-03-23 @ 00:12
On Mon, Mar 22, 2010 at 5:41 PM, Brian O'Rourke <brian@crowdflower.com> wrote:

> There are catastrophic failure conditions like the total node failure,
> failure to fork, etc, that would not be handled by the failure
> subsystem. Handling every conceivable error may be challenging.
>

I need to recover from catastrophic failures.  It sounds like my best bet is
to track the job lifecycle in the database, as we have already done the
necessary work to recover the database from catastrophic failures.

So I guess that brings me back to one of my original questions: does anyone
else track the job lifecycle in their database and only use Resque for job
dispatch?

-- 
Tony Arcieri
Medioh! A Kudelski Brand

Re: [resque] Do you treat Redis as stateful in your Resque deployment?

From:
Ryan Carver
Date:
2010-03-23 @ 16:54
>
> I need to recover from catastrophic failures.  It sounds like my best bet
> is to track the job lifecycle in the database, as we have already done the
> necessary work to recover the database from catastrophic failures.
>
> So I guess that brings me back to one of my original questions: does anyone
> else track the job lifecycle in their database and only use Resque for job
> dispatch?



Hi Tony, yes we also have jobs that must be performed. In our case we have a
db model that tracks the state. In the case of a failure, we store the
exception and mark the state as complete. Then, we make a copy of the record
and requeue it. This way we have a complete history of the status.

Also, this is why for us, simply retrying a job with the same arguments
won't work.



>
>
> --
> Tony Arcieri
> Medioh! A Kudelski Brand
>

Re: [resque] Do you treat Redis as stateful in your Resque deployment?

From:
Brian O'Rourke
Date:
2010-03-23 @ 17:19
On Tue, Mar 23, 2010 at 9:54 AM, Ryan Carver <ryan@fivesevensix.com> wrote:

> I need to recover from catastrophic failures.  It sounds like my best bet
>> is to track the job lifecycle in the database, as we have already done the
>> necessary work to recover the database from catastrophic failures.
>>
>> So I guess that brings me back to one of my original questions: does
>> anyone else track the job lifecycle in their database and only use Resque
>> for job dispatch?
>
>
>
> Hi Tony, yes we also have jobs that must be performed. In our case we have
> a db model that tracks the state. In the case of a failure, we store the
> exception and mark the state as complete. Then, we make a copy of the record
> and requeue it. This way we have a complete history of the status.
>
> Also, this is why for us, simply retrying a job with the same arguments
> won't work.
>

Out of curiosity, how do you handle the catastrophic failures Tony
mentioned?

My own quick brainstorm:
* You could find out when most major failures occur by adding hooks to where
the worker cleans itself up on startup.
* Resque uses a list pop operation to grab jobs off the queue. Failures can
occur after the job is popped and before it's saved as being "worked on" by
a worker. So if the node fails directly after a job is popped (due to
failure to fork, for example), your job is lost. To avoid this, you either
need to poll resque from your separate management system ("has my job run
yet?", "do you still know about my job?"), or change Resque to move jobs
atomically from the "queued" state to a "running" state.
* One way to make the transition from a "queued" state to a "running" state
in resque might be using the new multi/exec functionality in Redis.

2 cents
--bpo


>
>
>
>>
>>
>> --
>> Tony Arcieri
>> Medioh! A Kudelski Brand
>>
>
>

Re: [resque] Do you treat Redis as stateful in your Resque deployment?

From:
Brandon Mitchell
Date:
2010-03-23 @ 17:26
On Mar 23, 2010, at 10:19 AM, Brian O'Rourke wrote:

> * Resque uses a list pop operation to grab jobs off the queue. Failures
> can occur after the job is popped and before it's saved as being "worked
> on" by a worker. So if the node fails directly after a job is popped (due
> to failure to fork, for example), your job is lost. To avoid this, you
> either need to poll resque from your separate management system ("has my
> job run yet?", "do you still know about my job?"), or change Resque to
> move jobs atomically from the "queued" state to a "running" state.

Redis supports this exact operation via the RPOPLPUSH command: atomically
pop from one list and push onto another. Resque should probably have a
'running' queue, in addition to the 'pending' (user-defined) and 'failed'
queues.
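With redis-rb this is a single call (`redis.rpoplpush("queue:pending", "queue:running")`); the pattern it enables looks roughly like this. An in-memory stand-in is used here so the sketch runs without a Redis server, and the queue names are illustrative.

```ruby
# In-memory illustration of the RPOPLPUSH pattern: the job is always in
# *some* list, so a crashed worker's job can be recovered from 'running'.
# With redis-rb, the two list operations below collapse into the single
# atomic call: redis.rpoplpush("queue:pending", "queue:running").
pending = %w[job1 job2 job3]
running = []

# Worker picks up a job: move it from pending to running (atomic in Redis).
job = pending.pop      # the RPOP half
running.unshift(job)   # the LPUSH half

# On success the worker removes it from 'running'; on a crash, a reaper
# process can push anything left in 'running' back onto 'pending'.
running.delete(job)
```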

Brandon Mitchell

Re: [resque] Do you treat Redis as stateful in your Resque deployment?

From:
Will Schenk
Date:
2010-03-23 @ 20:24
> So I guess that brings me back to one of my original questions: does anyone
> else track the job lifecycle in their database and only use Resque for job
> dispatch?

That's exactly what we do at benchcoach.  We use the workflow gem to
keep state in the database, and when the workers start a job it
updates it throughout the lifecycle.  Some of the jobs have
occasionally run away (we're scoring various permutations on fantasy
leagues and the head to head leagues especially are a combinatorial
nightmare.)

A separate job goes through and looks for jobs that have been in a state
longer than they reasonably should be.  At that point we can either
requeue the job if it's a temporary thing, or whatever.
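That sweeper can be little more than a query for records stuck in one state past a deadline. A plain-Ruby sketch of the check (a Struct stands in for the ActiveRecord model, and every name is illustrative; the real version would be a scope like `League.where(state: 'syncing').where('updated_at < ?', cutoff)`):

```ruby
# Minimal stale-job sweeper: find records that have sat in a transient
# state longer than allowed, so they can be requeued or flagged.
Job = Struct.new(:id, :state, :updated_at)

def stale_jobs(jobs, max_age_seconds, now = Time.now)
  cutoff = now - max_age_seconds
  jobs.select { |j| j.state == 'syncing' && j.updated_at < cutoff }
end

now = Time.now
jobs = [
  Job.new(1, 'syncing', now - 7200),  # stuck for two hours
  Job.new(2, 'syncing', now - 60),    # recently started, leave it alone
  Job.new(3, 'done',    now - 7200),  # finished, not our problem
]
stuck = stale_jobs(jobs, 3600, now)
stuck.map(&:id)  # => [1], the jobs to requeue or investigate
```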

What's also nice is that we set up the resque queuing stuff as a state
transition inside the workflow, so it's sure to happen right away.
i.e.

class FantasyLeague < ActiveRecord::Base
  include Workflow

  workflow do
    state :new do
      event :queue, :transitions_to => :queued
      event :sync, :transitions_to => :syncing
    end

    state :queued do
      event :sync, :transitions_to => :syncing

      on_entry do
        Resque.enqueue( LeagueSyncer, self.id )
      end
    end

    state :syncing do
      event :bad_login, :transitions_to => :bad_credentials
      event :no_credentials, :transitions_to => :bad_credentials
      event :done, :transitions_to => :queue_scoring
    end

    state :queue_scoring do
      event :project_current_rosters, :transitions_to => :waiver_wire

      on_entry do
        Resque.enqueue( Fangelica, self.id )
      end
    end

# many more states
  end
end

-w

-- 
Will Schenk
http://www.sublimeguile.com

Re: [resque] Do you treat Redis as stateful in your Resque deployment?

From:
Tony Arcieri
Date:
2010-03-23 @ 21:07
On Tue, Mar 23, 2010 at 3:24 PM, Will Schenk <wschenk@gmail.com> wrote:

> That's exactly what we do at benchcoach.  We use the workflow gem to
> keep state in the database, and when the workers start a job it
> updates it throughout the lifecycle.  Some of the jobs have
> occasionally run away (we're scoring various permutations on fantasy
> leagues and the head to head leagues especially are a combinatorial
> nightmare.)
>
> A separate job goes through and looks for jobs that are in states for
> longer than they reasonably should.  At that point we can either
> requeue if it's a temporary thing or whatever.
>
> What's also nice is that we set up the resque queuing stuff as a state
>  transition inside the workflow, so it's sure to happen right away.


This sounds like a great architecture.  Thanks for pointing out the workflow
gem.  I'd never heard of it.

-- 
Tony Arcieri
Medioh! A Kudelski Brand

Re: [resque] Do you treat Redis as stateful in your Resque deployment?

From:
Thibaut Barrère
Date:
2010-03-23 @ 10:03
Hi,

interesting discussion!

> So I guess that brings me back to one of my original questions: does
> anyone else track the job lifecycle in their database and only use Resque
> for job dispatch?

In my case I just happen to serialize the job information on disk before
creating a Resque job, in case I need to recreate it afterwards.
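Sketched out, that belt-and-braces step might look like this (illustrative names only; presumably the real code wraps `Resque.enqueue` in a similar way):

```ruby
require 'json'
require 'tmpdir'

# Illustrative sketch: persist the job description to disk *before* handing
# it to Resque, so it can be re-created if Redis loses it.
def enqueue_with_backup(dir, klass_name, *args)
  record = { 'class' => klass_name, 'args' => args,
             'queued_at' => Time.now.to_i }
  path = File.join(dir, "job-#{Time.now.to_f}-#{rand(10_000)}.json")
  File.write(path, JSON.generate(record))
  # Resque.enqueue(Object.const_get(klass_name), *args) would go here.
  path
end

Dir.mktmpdir do |dir|
  path = enqueue_with_backup(dir, 'ImportJob', 42)
  JSON.parse(File.read(path))  # the job can be reconstructed from this
end
```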

I put my Redis instance in append-only mode to minimize the risk of losing
a job too. It's slower but fast enough in my case.

What I'd need here is some way to "reenqueue all (or one) failed jobs" on
demand manually.

-- Thibaut

Re: [resque] Do you treat Redis as stateful in your Resque deployment?

From:
Jason
Date:
2010-03-23 @ 14:11
If you're just wanting to ensure that there is a connection, or retry until
there is a connection, it seems like it might be a waste to deal with
rescheduling.   Couldn't you just have a begin / rescue block in your actor
class that catches the exceptions, and based on the exception have a retry
inside the rescue block?

Off the top of my head, using Net::SMTP as an example:

http://pastie.org/882672
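In case the pastie disappears, the general shape of the idea is something like this. It's a self-contained sketch: the flaky `deliver` method below stands in for a real Net::SMTP call, and the retry bound is an assumed value.

```ruby
MAX_RETRIES = 3

# Stand-in for a Net::SMTP delivery that fails twice with a transient
# connection error, then succeeds (so the sketch runs on its own).
def deliver(attempt)
  raise Errno::ECONNREFUSED if attempt < 3
  'sent'
end

# Retry inside the job itself: rescue the transient error and retry a
# bounded number of times before letting the failure propagate.
def deliver_with_retry
  attempts = 0
  begin
    attempts += 1
    deliver(attempts)
  rescue Errno::ECONNREFUSED => e
    # A handler method would log or inspect `e` here instead of puts.
    retry if attempts < MAX_RETRIES
    raise
  end
end

deliver_with_retry  # succeeds on the third attempt
```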

Obviously you would want to have a handler method
in where those rescue puts statements are to do something sensible..  But it
seems to me that would handle a connection issue or a network blip better
than rescheduling?   Anyone see an issue with this approach?

Just my thoughts...




On Tue, Mar 23, 2010 at 3:03 AM, Thibaut Barrère
<thibaut.barrere@gmail.com> wrote:

> Hi,
>
> interesting discussion!
>
>
> > So I guess that brings me back to one of my original questions: does
> > anyone else track the job lifecycle in their database and only use
> > Resque for job dispatch?
>
> In my case I just happen to serialize the job information on disk before
> creating a Resque job, in case I need to recreate it afterwards.
>
> I put my Redis instance in append-only mode to minimize the risk of losing
> a job too. It's slower but fast enough in my case.
>
> What I'd need here is some way to "reenqueue all (or one) failed jobs" on
> demand manually.
>
> -- Thibaut
>



-- 
Most people are born and years later die without really having lived at all.
They play it safe and tiptoe through life with no aspiration other than to
arrive at death safely. -- Tony Campolo, "Carpe Diem"

Re: [resque] Do you treat Redis as stateful in your Resque deployment?

From:
Michael Russo
Date:
2010-03-23 @ 14:04
On Tue, Mar 23, 2010 at 6:03 AM, Thibaut Barrère
<thibaut.barrere@gmail.com> wrote:

>
> What I'd need here is some way to "reenqueue all (or one) failed jobs" on
> demand manually.
>
>
Thibaut, support is being added to the Python implementation (pyres) to
re-queue failed jobs manually.  Here's a commit (in a fork) with the feature
working:
http://github.com/aezell/pyres/commit/4391047978d4a5a3234811f8297520a29a8a1a04

Personally, I don't like the idea of automatically re-trying failed jobs
because not every job will be idempotent.

Re: [resque] Do you treat Redis as stateful in your Resque deployment?

From:
Thibaut Barrère
Date:
2010-03-23 @ 15:01
> Thibaut, support is being added to the Python implementation (pyres) to
> re-queue failed jobs manually.  Here's a commit (in a fork) with the
> feature working:
> http://github.com/aezell/pyres/commit/4391047978d4a5a3234811f8297520a29a8a1a04

Thanks - I will see how to adapt this to the Ruby version.

> Personally, I don't like the idea of automatically re-trying failed jobs
> because not every job will be idempotent.

Well I guess it depends on the context: in my case if there's a failure, it
translates into "we need to provide a dev fix to make this work", most of
the time. So an automatic retry wouldn't help much here.

But a base resque failure-backend that can be derived from or configured
to define the retry strategy would still be useful; I could use it later
on as well.

-- Thibaut

Re: [resque] Do you treat Redis as stateful in your Resque deployment?

From:
Matt George
Date:
2010-03-23 @ 15:13
One thing of note that we did in the python port of resque is the  
addition of the retry_every and retry_timeout class attributes to  
jobs. This was put in place alongside an implementation of the resque- 
scheduler plugin in pyres. (A sidenote here, in pyres we added the  
scheduling directly to pyres because of the difficulty with a plugin  
architecture). So now, when a job fails, the job class calls retry on  
itself, which if successful will schedule it for a future retry. The  
retry only happens on jobs that have the class attributes I mentioned  
above. You can see the commit that does this below:
http://github.com/binarydud/pyres/commit/1b692b209d536fe87fa44acacb4a4b7cdd69a6ed


Matt George
Software Developer | Emma®
matt.george@myemma.com

Emma helps organizations everywhere communicate & market in style.
Visit us online at http://www.myemma.com

On Mar 23, 2010, at 10:01 AM, Thibaut Barrère wrote:

> > Thibaut, support is being added to the Python implementation  
> (pyres) to re-queue failed jobs manually.  Here's a
> > commit (in a fork) with the feature working:
> > http://github.com/aezell/pyres/commit/4391047978d4a5a3234811f8297520a29a8a1a04
>
> Thanks - I will see how to adapt this to the Ruby version.
>
> Personally, I don't like the idea of automatically re-trying failed  
> jobs because not every job will be idempotent.
>
> Well I guess it depends on the context: in my case if there's a
> failure, it translates into "we need to provide a dev fix to make  
> this work", most of the time. So an automatic retry wouldn't help  
> much here.
>
> But a base resque failure-backend that can be either derived or  
> configured to define the retry strategy is still useful though; I  
> could use this as well later on.
>
> -- Thibaut
>
>
>

Re: [resque] Do you treat Redis as stateful in your Resque deployment?

From:
Ryan Carver
Date:
2010-03-23 @ 16:59
I'll put in another plug for the job hooks proposal:

If you define on_failure(exception, *args) in your job, it will be called
when perform throws an exception. Maybe I'm off on this, but defining how
jobs are retried globally doesn't feel right - I want different types of
retry for each job class.

http://github.com/rcarver/resque-lock-retry/blob/master/lib/resque/job_hooks.rb

This can easily be used to implement all of the awesome retry logic you guys
are talking about (retry_every, retry_timeout, etc)


On Tue, Mar 23, 2010 at 8:13 AM, Matt George <matt.george@myemma.com> wrote:

> One thing of note that we did in the python port of resque is the addition
> of the retry_every and retry_timeout class attributes to jobs. This was put
> in place alongside an implementation of the resque-scheduler plugin in
> pyres. (A sidenote here, in pyres we added the scheduling directly to pyres
> because of the difficulty with a plugin architecture). So now, when a job
> fails, the job class calls retry on itself, which if successful will
> schedule it for a future retry. The retry only happens on jobs that have the
> class attributes I mentioned above. You can see the commit that does this
> below:
>
> 
http://github.com/binarydud/pyres/commit/1b692b209d536fe87fa44acacb4a4b7cdd69a6ed
>
>
> Matt George
> Software Developer | *Emma®*
> matt.george@myemma.com
>
> Emma helps organizations everywhere communicate & market in style.
> Visit us online at http://www.myemma.com
>
> On Mar 23, 2010, at 10:01 AM, Thibaut Barrère wrote:
>
> > Thibaut, support is being added to the Python implementation (pyres) to
> re-queue failed jobs manually.  Here's a
> > commit (in a fork) with the feature working:
> >
> http://github.com/aezell/pyres/commit/4391047978d4a5a3234811f8297520a29a8a1a04
>
> Thanks - I will see how to adapt this to the Ruby version.
>
>  Personally, I don't like the idea of automatically re-trying failed jobs
>> because not every job will be idempotent.
>>
>
> Well I guess it depends on the context: in my case if there's a failure, it
> translates into "we need to provide a dev fix to make this work", most of
> the time. So an automatic retry wouldn't help much here.
>
> But a base resque failure-backend that can be either derived or configured
> to define the retry strategy is still useful though; I could use this as
> well later on.
>
> -- Thibaut
>
>
>
>
>

Re: [resque] Do you treat Redis as stateful in your Resque deployment?

From:
Tony Arcieri
Date:
2010-03-23 @ 17:00
Yeah, hooks are definitely awesome... what's nice is that, because Resque
is so simple, this stuff is fairly easy to layer on myself.  I just wanted
to get a sense of how others were using it.

On Tue, Mar 23, 2010 at 11:59 AM, Ryan Carver <ryan@fivesevensix.com> wrote:

> I'll put in another plug for the job hooks proposal:
>
> If you define on_failure(exception, *args) in your job, it will be called
> when perform throws an exception. Maybe I'm off on this, but defining how
> jobs are retried globally doesn't feel right - I want different types of
> retry for each job class.
>
>
> http://github.com/rcarver/resque-lock-retry/blob/master/lib/resque/job_hooks.rb
>
> This can easily be used to implement all of the awesome retry logic you
> guys are talking about (retry_every, retry_timeout, etc)
>
>
> On Tue, Mar 23, 2010 at 8:13 AM, Matt George <matt.george@myemma.com> wrote:
>
>> One thing of note that we did in the python port of resque is the addition
>> of the retry_every and retry_timeout class attributes to jobs. This was put
>> in place alongside an implementation of the resque-scheduler plugin in
>> pyres. (A sidenote here, in pyres we added the scheduling directly to pyres
>> because of the difficulty with a plugin architecture). So now, when a job
>> fails, the job class calls retry on itself, which if successful will
>> schedule it for a future retry. The retry only happens on jobs that have the
>> class attributes I mentioned above. You can see the commit that does this
>> below:
>>
>> 
http://github.com/binarydud/pyres/commit/1b692b209d536fe87fa44acacb4a4b7cdd69a6ed
>>
>>
>>  Matt George
>> Software Developer | *Emma®*
>> matt.george@myemma.com
>>
>> Emma helps organizations everywhere communicate & market in style.
>>  Visit us online at http://www.myemma.com
>>
>> On Mar 23, 2010, at 10:01 AM, Thibaut Barrère wrote:
>>
>> > Thibaut, support is being added to the Python implementation (pyres) to
>> re-queue failed jobs manually.  Here's a
>> > commit (in a fork) with the feature working:
>> >
>> http://github.com/aezell/pyres/commit/4391047978d4a5a3234811f8297520a29a8a1a04
>>
>> Thanks - I will see how to adapt this to the Ruby version.
>>
>>  Personally, I don't like the idea of automatically re-trying failed jobs
>>> because not every job will be idempotent.
>>>
>>
>> Well I guess it depends on the context: in my case if there's a failure,
>> it translates into "we need to provide a dev fix to make this work", most of
>> the time. So an automatic retry wouldn't help much here.
>>
>> But a base resque failure-backend that can be either derived or configured
>> to define the retry strategy is still useful though; I could use this as
>> well later on.
>>
>> -- Thibaut
>>
>>
>>
>>
>>
>


-- 
Tony Arcieri
Medioh! A Kudelski Brand

Re: [resque] Do you treat Redis as stateful in your Resque deployment?

From:
Tony Arcieri
Date:
2010-03-23 @ 16:58
On Tue, Mar 23, 2010 at 9:04 AM, Michael Russo <mjrusso@gmail.com> wrote:

>  Personally, I don't like the idea of automatically re-trying failed jobs
> because not every job will be idempotent.
>

They can be if you have a base class for all your jobs that ensures
idempotence, at least in regard to how job "completions" are handled (i.e.
you implement some mechanism to check whether the job already completed
and, if so, discard the duplicate results).  Of course, if any of your jobs
cause side effects elsewhere in your system (e.g. we create objects on S3),
you'll still need something to clean those up.
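A minimal version of that completion check might look like this (illustrative sketch only; a real base class would record completions in the database, ideally behind a unique index, rather than in an in-memory Set):

```ruby
require 'set'

# Idempotent-completion guard: the first completion for a job id wins, and
# a retried duplicate is discarded. The in-memory Set stands in for a
# database table of completed job ids.
class IdempotentJob
  @completed = Set.new
  class << self; attr_reader :completed; end

  def self.perform(job_id)
    # Set#add? returns nil when the id was already present.
    return :duplicate unless completed.add?(job_id)
    do_work(job_id)
    :done
  end

  def self.do_work(job_id)
    # the actual side effects live here
  end
end

IdempotentJob.perform(7)  # => :done
IdempotentJob.perform(7)  # => :duplicate (retried job, result discarded)
```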

-- 
Tony Arcieri
Medioh! A Kudelski Brand

Re: [resque] Do you treat Redis as stateful in your Resque deployment?

From:
Chris Wanstrath
Date:
2010-03-23 @ 20:58
On Mon, Mar 22, 2010 at 4:11 PM, Tony Arcieri <tony@medioh.com> wrote:

> For our use case we cannot stand to lose any jobs.  If jobs fail in any
> manner they must be retried, and we need to cover all failure modes.

This is not one of Resque's design goals. I would recommend you seek a
job processing system which guarantees jobs will not be lost.

For Resque to make this guarantee, Redis would also need to make this
guarantee. As far as I know, Redis can have pretty great persistence
but I'm not sure it promises to provide complete, fail-proof
persistence yet.

But even if Redis did guarantee this (just to make the point that I am in
no way blaming Redis or think it's even a bad thing), Resque doesn't
either.  So there are potentially two roadblocks to promising jobs are
never lost.

Making sure jobs are never lost is non-trivial and if my website
depended on that feature, I'd personally seek a system in which a
great deal of time has been invested into solving that problem.

You might want to read the Resque introductory blog post if you have
not already to get a feel for what problems Resque solves:

http://github.com/blog/542-introducing-resque

--
Chris Wanstrath
http://github.com/defunkt

Re: [resque] Do you treat Redis as stateful in your Resque deployment?

From:
Tony Arcieri
Date:
2010-03-23 @ 21:05
On Tue, Mar 23, 2010 at 3:58 PM, Chris Wanstrath <chris@ozmm.org> wrote:

> On Mon, Mar 22, 2010 at 4:11 PM, Tony Arcieri <tony@medioh.com> wrote:
>
> > For our use case we cannot stand to lose any jobs.  If jobs fail in any
> > manner they must be retried, and we need to cover all failure modes.
>
> This is not one of Resque's design goals. I would recommend you seek a
> job processing system which guarantees jobs will not be lost.
>
> For Resque to make this guarantee, Redis would also need to make this
>  guarantee.


That's not true at all.  It is certainly possible to have a hybrid
stateful/stateless system that is still robust and ensures jobs are never
lost.  However, it might require a slightly different definition of a "job"
than the one that exists under Resque.  You can think of a job dispatched
to Resque as an "attempted job" or something, with the job itself having a
complete lifecycle whose state is tracked somewhere secure and
fault-tolerant (i.e. a database).

There are several approaches to consider.  The one that has always made
sense to me is assigning jobs timeouts and retrying them if they do not
complete after a prescribed amount of time.  In that case all state is
stored in the database and all other pieces of the system such as the actual
job dispatcher (Resque in this case) and the workers can be treated as
stateless, which was my original question (i.e. do you depend on Resque
being stateful?).  As was noted earlier, such a system would require all
jobs to be idempotent, so if they accidentally complete twice everything is
still fine aside from the fact you wasted resources running a job twice.

You also seem to be suggesting that I want Redis to provide this all out of
the gate, and I certainly don't.  I'm perfectly willing to build the
necessary code to retry failed jobs.  I also don't know of a better starting
point in Ruby for the sort of distributed job queue than Resque, short of
building my own from scratch, which I don't have time to do.

-- 
Tony Arcieri
Medioh! A Kudelski Brand

Re: [resque] Do you treat Redis as stateful in your Resque deployment?

From:
Chris Wanstrath
Date:
2010-03-23 @ 22:45
On Tue, Mar 23, 2010 at 2:05 PM, Tony Arcieri <tony@medioh.com> wrote:

> You also seem to be suggesting that I want Redis to provide this all out of
> the gate, and I certainly don't.  I'm perfectly willing to build the
> necessary code to retry failed jobs.  I also don't know of a better starting
> point in Ruby for the sort of distributed job queue than Resque, short of
> building my own from scratch, which I don't have time to do.

I am suggesting no such thing. I am suggesting you seek a library with
the functionality you desire, of which there are plenty, rather than
adding to Resque.

-- 
Chris Wanstrath
http://github.com/defunkt

Re: [resque] Do you treat Redis as stateful in your Resque deployment?

From:
Tony Arcieri
Date:
2010-03-23 @ 23:06
On Tue, Mar 23, 2010 at 4:45 PM, Chris Wanstrath <chris@ozmm.org> wrote:

> I am suggesting no such thing. I am suggesting you seek a library with
> the functionality you desire, of which there are plenty, rather than
> adding to Resque.
>

What would you suggest I use instead?  The workflow gem looks awesome.  Any
other recommendations?

-- 
Tony Arcieri
Medioh! A Kudelski Brand

Re: [resque] Do you treat Redis as stateful in your Resque deployment?

From:
Chris Wanstrath
Date:
2010-03-23 @ 23:20
On Tue, Mar 23, 2010 at 4:06 PM, Tony Arcieri <tony@medioh.com> wrote:

>> I am suggesting no such thing. I am suggesting you seek a library with
>> the functionality you desire, of which there are plenty, rather than
>> adding to Resque.
>
> What would you suggest I use instead?  The workflow gem looks awesome.  Any
> other recommendations?

You might want to look at Kestrel, Twitter's message queue system. I
know it can handle a large number of jobs and makes a point to never
lose any. AMQP also seems to be quite popular with a number of Ruby
libraries.

I haven't used either personally, but I would start with one of those.

-- 
Chris Wanstrath
http://github.com/defunkt

Re: [resque] Do you treat Redis as stateful in your Resque deployment?

From:
Will Schenk
Date:
2010-03-23 @ 23:33
On Tue, Mar 23, 2010 at 7:20 PM, Chris Wanstrath <chris@ozmm.org> wrote:
> You might want to look at Kestrel, Twitter's message queue system. I
> know it can handle a large number of jobs and makes a point to never
> lose any. AMQP also seems to be quite popular with a number of Ruby
> libraries.
>
> I haven't used either, personally, but I would start at either.

If you are looking for reliable messaging -- which I agree isn't the
problem that resque set out to solve -- personally I'd look toward
RabbitMQ or ActiveMQ.  Kestrel seems too experimental for my tastes
(less than a year old, one production instance, Scala is a new
language, etc.)  And AMQP is the protocol that RabbitMQ and others
speak rather than a server itself.  (Or you could get a real one like
tibco or something, but, uh, that solution probably won't fit your
constraints.)

But if your goal is to make sure that something you are already
tracking in the database has been "processed" or whatever, and the
result is written back to that same row when it's done anyway, I don't
think reliable messaging is really what you are looking for.  With
the workflow stuff we're doing at benchcoach, we're using resque
solely to distribute the work rather than relying upon the semantics
of the messaging system.  It's basically a worker coordination and
distribution system for us.  If a job gets lost, or "gets scheduled
twice", it doesn't matter, because it isn't the "job" that's running;
rather, the object is inspected and moved to the correct state.
(i.e. who gives a shit if it's idempotent or not.)  It divorces the
concept of "scheduling the job" from "performing the next necessary
action".
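A minimal sketch of this "inspect and advance" idea, with a hypothetical Video class standing in for whatever benchcoach actually models (none of these names come from their code):

```ruby
# Sketch: the queued job carries no meaning of its own -- a worker just
# loads the object and moves it to the next correct state based on what
# the database says. A lost job only delays progress; a duplicate job
# simply performs the next necessary action (or nothing at all).

class Video
  # Each state maps to the state that follows it; terminal states map nowhere.
  STEPS = { uploaded: :encoded, encoded: :published }.freeze

  attr_reader :state

  def initialize
    @state = :uploaded
  end

  # Idempotent in spirit: inspect the current state and perform only the
  # next necessary action. Once :published, extra jobs are no-ops.
  def advance!
    @state = STEPS.fetch(@state, @state)
  end
end
```

Because `advance!` re-reads the state each time, the Resque payload only needs to carry an object ID; replayed or duplicated jobs are harmless.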

Re: [resque] Do you treat Redis as stateful in your Resque deployment?

From:
Tony Arcieri
Date:
2010-03-23 @ 23:41
On Tue, Mar 23, 2010 at 5:30 PM, Aaron Quint <aaron@quirkey.com> wrote:

> If durability is your angle - definitely look at RabbitMQ. I've used
> it on a couple of projects and its main selling points are speed,
> stability, and guarantees about messages.
>

On Tue, Mar 23, 2010 at 5:33 PM, Will Schenk <wschenk@gmail.com> wrote:

> If you are looking for reliable messaging -- which I agree isn't the
> problem that resque set out to solve -- personally I'd look toward
> RabbitMQ or ActiveMQ.
>

Well apparently I'm doing a terrible job of explaining myself, because this
is the exact opposite of what I want.  I'm sure this is largely due to me
saying I want persistent "jobs", which makes people assume that if the jobs
live in Redis/RabbitMQ/Kestrel or whatever, they must somehow persist
there.  That isn't the case.

So instead, how about I talk about workflows, where a workflow is composed
of many jobs that may or may not fail.

I want:

   - The lifecycles of workflows tracked end-to-end, with a workflow's
   constituent jobs automatically retried if they fail
   - As few stateful components in the system as possible.  I already need a
   database, and we already have a backup solution for the database in place.
   I'd rather not manage any additional stateful components unless absolutely
   necessary.

So no, I absolutely do not want to have to depend on any kind of disk-logged
message queue.  I would rather the message queue die in a fire, and the
jobs be retried from the state of the workflow in the database, than have to
worry about fault-tolerance of the message queue.

Given all that, something like the workflow gem + Resque seems like a great
combination.
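That division of labor can be sketched roughly like this: the database is the only authority on workflow state, and a periodic sweeper re-enqueues anything that looks lost in flight, so the queue itself can die and be rebuilt. Job, STALE_AFTER, sweep, and the array standing in for the queue are all illustrative assumptions, not Resque or workflow-gem API:

```ruby
# Hypothetical sketch: retry purely from database state. Job stands in
# for an ORM model; `queue` stands in for the (disposable) dispatcher.

Job = Struct.new(:id, :state, :started_at)

STALE_AFTER = 600 # seconds a job may run before we assume it was lost

# Scan the DB rows and re-enqueue anything that appears lost in flight.
# Safe to run repeatedly: a duplicate just gets inspected again by a worker.
def sweep(jobs, queue, now: Time.now)
  jobs.each do |job|
    next unless job.state == :running && now - job.started_at > STALE_AFTER
    job.state = :pending # reset the authoritative record first...
    queue << job.id      # ...then hand it back to the dispatcher
  end
  queue
end
```

Run from cron or a loop, this keeps the queue stateless: losing Redis loses only some latency, never the workflow.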

-- 
Tony Arcieri
Medioh! A Kudelski Brand

Re: [resque] Do you treat Redis as stateful in your Resque deployment?

From:
Dusty Doris
Date:
2010-03-24 @ 04:00
I've been following this thread with some interest and I'd like to
share what I'm doing right now.

At first I used something like the workflow concept, where I would
store state in the actual database itself.  This worked great: I was
using a home-brewed state machine to take care of job state inside the
model.  However, I am using this mainly for sending SMS messages, so I
came to a few conclusions over time.

First, I do want to retry failed attempts automatically.  This is
because I am hitting a remote API.  Sometimes I might have a network
problem that prevents it from completing.  Or, perhaps an issue on the
remote API.  So, in those cases, I want to retry.  But, only so many
times.  After so many tries, my data is too stale and I'm not
interested in trying again.

However, I still want to know about failures in case something like
the API is completely down.  So, I only want to retry so many times.
I also want to create a bit of a configurable delay on the retries.

Finally, I am going for raw speed here because I may need to process
thousands of messages at once.  So, I didn't want the overhead of
updating state in my database (mongodb) on each attempt.  I wanted to
take care of state on the worker itself and then record the results
when I've reached a conclusion.

The link to this gist below shows the state machine that I'm using in
the actual resque jobs.  I'm passing the job a hash to work with,
instead of an ID to an established object.  What I do is try to
process the job so many times.  Upon success or failure, I will then
write to the database to record what happened.  Again, I'm OK with
losing some jobs because of the nature of the queue I'm working with.

The example below is for email, but it is very similar to the SMS
workflow I am using.  I'm just posting this because the SMS workflow
contains some very specific information about our aggregator.  But,
you can see what I'm doing here with email, which I ended up using the
same pattern on.

Basically, I control state inside the resque job.  I keep track of the
job by re-queuing it when it fails and incrementing the attempts
count.  Each time I run the job, I check the attempts count in a
rescue clause.  If I exceed my max attempts, then I consider it a
failure.

If my job finishes, I write to the db.  If it fails too many times, I
also write to the db, but then I raise an exception.  That way resque
pops it into the failed queue and my monitoring system picks it up and
can alert me if I have X number of failed jobs.  This lets me know
if there is a major problem.  (e.g. I'm OK with < X failures, but more
than that and I want to know what is going on.)
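A hedged, self-contained sketch of that pattern (the real version is in the gist linked below; QUEUE, DB, deliver, and MAX_ATTEMPTS here are stand-ins, not Resque or MongoDB API):

```ruby
# Sketch: the attempts count rides inside the job payload itself. On
# failure the job re-enqueues itself with the count bumped; once
# MAX_ATTEMPTS is reached it records the failure and re-raises so the
# failed queue (and any monitoring on top of it) can see it.

MAX_ATTEMPTS = 3
QUEUE = [] # stand-in for Resque.enqueue
DB    = [] # stand-in for the results collection

def deliver(payload)
  raise "remote API unreachable" # placeholder for the real SMS/email call
end

def perform(payload)
  deliver(payload)
  DB << { id: payload[:id], status: :sent }
rescue RuntimeError
  attempts = payload.fetch(:attempts, 0) + 1
  if attempts >= MAX_ATTEMPTS
    DB << { id: payload[:id], status: :failed }
    raise # let the failure land in Resque's failed queue
  end
  QUEUE << payload.merge(attempts: attempts) # retry with the bumped count
end
```

No per-attempt database write happens; the db sees only the final outcome, which is the speed trade-off described above.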

Take a look, if you like it perhaps it will fill your needs.  If not,
then I think managing the state in your database object itself is a
good alternative.

Let me know if you have any suggestions.

http://gist.github.com/341963

Take care

- Dusty Doris

BTW - resque is kicking some ass for me right now, thanks!

Re: [resque] Do you treat Redis as stateful in your Resque deployment?

From:
Tony Arcieri
Date:
2010-03-24 @ 16:55
On Tue, Mar 23, 2010 at 10:00 PM, Dusty Doris <dusty@doris.name> wrote:

> Finally, I am going for raw speed here because I may need to process
> thousands of messages at once.  So, I didn't want the overhead of
> updating state in my database (mongodb) on each attempt.  I wanted to
> take care of state on the worker itself and then record the results
> when I've reached a conclusion.
>

I'm in the opposite boat: I have a smaller number of relatively heavy jobs
(it's a video encoding workflow).

-- 
Tony Arcieri
Medioh! A Kudelski Brand

Re: [resque] Do you treat Redis as stateful in your Resque deployment?

From:
Tony Arcieri
Date:
2010-03-23 @ 23:30
On Tue, Mar 23, 2010 at 5:20 PM, Chris Wanstrath <chris@ozmm.org> wrote:

> AMQP also seems to be quite popular with a number of Ruby
> libraries.
>

I'd probably build my own system around RabbitMQ if Resque didn't exist.  I
almost considered using Nanite.  But it's nice having an
already-developed web UI for examining the state of the system :)

-- 
Tony Arcieri
Medioh! A Kudelski Brand

Re: [resque] Do you treat Redis as stateful in your Resque deployment?

From:
Tony Arcieri
Date:
2010-03-23 @ 23:28
On Tue, Mar 23, 2010 at 5:20 PM, Chris Wanstrath <chris@ozmm.org> wrote:

> You might want to look at Kestrel, Twitter's message queue system. I
> know it can handle a large number of jobs and makes a point to never
> lose any. AMQP also seems to be quite popular with a number of Ruby
> libraries.
>

I really don't like the Kestrel architecture, precisely because it does
exactly what I complained about at the beginning of the thread: it adds more
stateful components to the system which we need to ensure are reliable and
fault tolerant.  I'm trying to keep as much of the system as stateless as
possible, preferably relying on the database to be the only stateful
component.  I'm not worried about losing jobs-in-flight.  I just want to
ensure that if a job is lost in flight, it's eventually retried.

-- 
Tony Arcieri
Medioh! A Kudelski Brand

Re: [resque] Do you treat Redis as stateful in your Resque deployment?

From:
Aaron Quint
Date:
2010-03-23 @ 23:30
On Tue, Mar 23, 2010 at 7:20 PM, Chris Wanstrath <chris@ozmm.org> wrote:
> On Tue, Mar 23, 2010 at 4:06 PM, Tony Arcieri <tony@medioh.com> wrote:
>
>>> I am suggesting no such thing. I am suggesting you seek a library with
>>> the functionality you desire, of which there are plenty, rather than
>>> adding to Resque.
>>
>> What would you suggest I use instead?  The workflow gem looks awesome.  Any
>> other recommendations?
>
> You might want to look at Kestrel, Twitter's message queue system. I
> know it can handle a large number of jobs and makes a point to never
> lose any. AMQP also seems to be quite popular with a number of Ruby
> libraries.
>
> I haven't used either, personally, but I would start at either.

If durability is your angle - definitely look at RabbitMQ. I've used
it on a couple of projects and its main selling points are speed,
stability, and guarantees about messages. Personally, since I started
using Resque and Redis I haven't really looked back. What Redis +
Resque lacks in guarantees, it makes up for in its awesome API and
ease of use.

--AQ

Re: [resque] Do you treat Redis as stateful in your Resque deployment?

From:
Ryan Carver
Date:
2010-03-22 @ 22:54
Very cool, and the max attempts stuff in redis is quite nice.

In our scenario, for network exceptions I want to retry as soon as
reasonably possible. What's the minimum time you can get with
resque_scheduler?

Fwiw, my attempt at this stuff is in
http://github.com/rcarver/resque-lock-retry. To retry, it currently sleeps
for 1s and then requeues the job. Obviously not good, but the simplest way I
could get a quick retry without running another process.



On Mon, Mar 22, 2010 at 3:38 PM, Brian O'Rourke <brian@crowdflower.com> wrote:

> Hi Tony,
>
> At CrowdFlower we extended Resque's failure subsystem to handle this.
> Basically we wrap the failed jobs of certain workers with a new class which
> tracks the number of times a job has been attempted, and we use the
> excellent resque_scheduler plugin to retry the job again in the future.
>
> Have a look, see if this helps you to solve your problem:
> http://gist.github.com/340622
>
> --bpo
>
>
>
> On Mon, Mar 22, 2010 at 3:23 PM, Tony Arcieri <tony@medioh.com> wrote:
>
>> I'm writing a script to re-enqueue failed jobs.  The jobs are failing
>> largely due to bugs in the workers or other network infrastructure problems
>> we're still resolving.  Whenever we resolve any of these issues, I'd like
>> for the failed jobs to automatically get retried.
>>
>> I'm wondering about the best way to handle this.  I'd almost like to track
>> the entire lifecycle of jobs within the database.  An alternative would be
>> to depend on Redis for this persistance.  However, I tend to think of my
>> message queues (as well as the job workers) as being stateless, so that I
>> can retry jobs based completely on the database state, rather than depending
>> on the queue or the workers to handle errors.
>>
>> Beyond that, we're working in a stateless environment (EC2).  We can set
>> up persistent storage for Redis, but do we really have to?  We already have
>> mechanisms in place to persist, monitor, and backup the database state.  Do
>> we really need to do this all for Redis as well?
>>
>> My inclination is to treat Redis as a stateless job dispatcher and manage
>> the state for the job lifecycle in a database.  Does anyone else use this
>> approach?
>>
>> --
>> Tony Arcieri
>> Medioh! A Kudelski Brand
>>
>
>

Re: [resque] Do you treat Redis as stateful in your Resque deployment?

From:
Brian O'Rourke
Date:
2010-03-22 @ 23:38
Ryan,

resque_scheduler polls with a 5-second sleep. You could make that more
granular, or if you want zero delay, just call Resque.enqueue instead of
Resque.enqueue_at on line 75 of the gist below.


On Mon, Mar 22, 2010 at 3:54 PM, Ryan Carver <ryan@fivesevensix.com> wrote:

> Very cool, and the max attempts stuff in redis is quite nice.
>
> In our scenario, for network exceptions I want to retry as soon as
> reasonably possible. What's the minimum time you can get with
> resque_scheduler?
>
> Fwiw, my attempt at this stuff is in
> http://github.com/rcarver/resque-lock-retry. To retry, it currently sleeps
> for 1s and then requeues the job. Obviously not good, but the simplest way I
> could get a quick retry without running another process.
>
>
>
> On Mon, Mar 22, 2010 at 3:38 PM, Brian O'Rourke <brian@crowdflower.com> wrote:
>
>> Hi Tony,
>>
>> At CrowdFlower we extended Resque's failure subsystem to handle this.
>> Basically we wrap the failed jobs of certain workers with a new class which
>> tracks the number of times a job has been attempted, and we use the
>> excellent resque_scheduler plugin to retry the job again in the future.
>>
>> Have a look, see if this helps you to solve your problem:
>> http://gist.github.com/340622
>>
>> --bpo
>>
>>
>>
>> On Mon, Mar 22, 2010 at 3:23 PM, Tony Arcieri <tony@medioh.com> wrote:
>>
>>> I'm writing a script to re-enqueue failed jobs.  The jobs are failing
>>> largely due to bugs in the workers or other network infrastructure problems
>>> we're still resolving.  Whenever we resolve any of these issues, I'd like
>>> for the failed jobs to automatically get retried.
>>>
>>> I'm wondering about the best way to handle this.  I'd almost like to
>>> track the entire lifecycle of jobs within the database.  An alternative
>>> would be to depend on Redis for this persistance.  However, I tend to think
>>> of my message queues (as well as the job workers) as being stateless, so
>>> that I can retry jobs based completely on the database state, rather than
>>> depending on the queue or the workers to handle errors.
>>>
>>> Beyond that, we're working in a stateless environment (EC2).  We can set
>>> up persistent storage for Redis, but do we really have to?  We already have
>>> mechanisms in place to persist, monitor, and backup the database state.  Do
>>> we really need to do this all for Redis as well?
>>>
>>> My inclination is to treat Redis as a stateless job dispatcher and manage
>>> the state for the job lifecycle in a database.  Does anyone else use this
>>> approach?
>>>
>>> --
>>> Tony Arcieri
>>> Medioh! A Kudelski Brand
>>>
>>
>>
>