I’d love to hear more from others on their experience with recursive workflows, particularly as it can be such a benefit for certain types of apps. In fact, we built an app that is completely dependent on recursive WFs, so related issues (potential bugs, performance efficiencies) have been ongoing ‘concerns’ for us. There also seem to be various ways to approach this, so I’d love to learn more about what works best for others.
Nigel, you mentioned a worry about how to monitor whether a recursive WF is working. If it helps, we’ve done a few things to help alleviate our worries:
⦁ Add a ‘Failsafe’ step at the end of all recursive workflows that fires ‘only when’ the recursive Schedule API step is empty (i.e., when the recursive WF didn’t reschedule itself). The failsafe sends an email to us with details on the project error.
⦁ Add a ‘count’ field in the relevant data type that increases or decreases by 1 every time the recursive WF fires. For example, we send emails to project participants on a regular basis, so have an API first count the number of participants (partCount=100) and then another resursive WF add 1 to ‘partEmailsCount’ field for that project whenever an email is sent. We can then check to see if partCount=partEmailsCount.
⦁ Create a SuperUser dashboard with a RG that lists any project errors that would have occurred due to a failed recursive WF. For example, projects where partCount doesn’t equal partEmailsCount. We then have a button that allows us to restart that recursive WF for that project (and passes partEmailsCount as a parameter so that we can start where we left off).
⦁ Before a recursive WF launches, we schedule an ‘error check’ API a few hours later that emails us with a list of all project-critical information (e.g., whether there are any null fields that should contain important data). We did this because an early version of the app had a bug that passed a blank parameter forward in the recursive WF that was required to create records successfully.
⦁ Increased the interval before the recursive WF schedules itself (e.g., current date + 10 seconds), to ensure the whole WF processes before it recurs. In other words, ensuring that WFs don’t ‘pile up’ and cause a timeout or other problem that would end the recursive WF.
This may be overkill but it helps a worrier like me sleep a little better. 
Speaking of which, we recently landed a client who wants to use the app for 1,000-1,500 employees (launching in June), including bulk uploading participants, creating monthly surveys for them, and sending a set of survey invitations and reminders every few weeks - all using recursive WFs (the app provides a way to get a regular ‘pulse’ on employee performance, stress, and engagement). So, any war stories or advice from others in a similar situation would be appreciated! 