Reducing Failed Scheduled Jobs – My Soap Box

If you’ve heard me present on Sherlock® lately, you’ve heard me talk about scheduled job failures and what they do within your environment. Wherever I go, I take some tools with me to monitor and help to tune the percentage of scheduled job failures within an environment. I’d really like to share some of that with you.

There are a host of reasons why you should care about failed job volume within an SAP Business Objects deployment.

  1. Wasted time in your BOBJ environment is expensive both in terms of resources wasted in batch and in cycles that may cut into the online day processing for users (really, does batch always finish on time?).
  2. Wasted time in the database. Take a report that fails to send to a destination for example. It runs from the BOBJ side. It runs to completion on the database. When that all finishes, the job bombs because it can’t send to a file system, etc.
  3. It’s a bad best practice. With such a simple activity to monitor, it’s a simple application control to put in place to observe failure rates.

One of the core components of the Sherlock® Executive Dashboard is the Batch Performance tab (shown here). This dashboard strives to provide management and administrators with insight into what the failure rates are, month over month, for a six month period for performance analysis.

The great part is this data is already in Auditor. You don’t need a third party tool or dashboard to tell you what your failure rates are. Let’s start with a pretty standard Auditor query rolled up a bit. It’s written for MySQL, but can easily be adjusted for the other database platforms Auditor deploys on. It’s kind of long, so rather than display it in line here, I’m uploading it as a .sql (plain text) file for you to download if you wish.

text
failed_schedules.sql.txt

This query does a few simple things.

  1. The Derived Table – This inner query does the math to get an aggregated number of jobs, the number of jobs that failed, and the number of successful jobs.
  2. The Outer Query – With the results of the inner query in hand, this mashes those metrics up against the actual month the event occurred in.

That is exactly what gives this dashboard its juice. Setting a simple constant in the dashboard, we’re able to calculate the failure rate over the total number of jobs and monitor our performance month over month. Do you roll off your Auditor history? Many don’t. That’s probably a blog post for another day. But today, you can probably limit this query to a rolling six month metric (or less if you prefer).

Well that’s great. So now what? You know that you now have N% of your jobs failing. The next step is to put together a detail report to identify all the reports that are failing day over day. One beef I have with Auditor is not actually capturing the error message. For that, you’ll have to dig into the CMC, report by report, or use Sherlock® to capture this information in bulk. What does that query look like? It’s pretty simple actually

SELECT * FROM AUDIT_EVENT WHERE EVENT_TYPE_ID IN (327682,327683)

I may be over-simplifying a bit. But consider that this query simply tells you all the events by user that failed at a detail level. You can further limit it by a range of START_TIMESTAMP dates, etc. to get the details on who is scheduling reports that are failing. You do need to do a bit more work to get the actual report name from Auditor (it’s in the AUDIT_DETAIL table), but I’m trying to keep this post brief 🙂 When in doubt, use the Activity universe to get the logic for getting document names.

I think it’s also about user education and proper rights. I’m not sure every user in an environment should have schedule rights. Many will fire off a job and due to the complexities in InfoView of actually viewing those jobs, novice users will generally fire and forget. Again, another blog post for another day. If it’s running every day and failing every day, it’s a waste.

So that’s it. I’m hoping you see the relative ease in monitoring and reporting on failed jobs to improve the health of your deployment.

Leave a Reply