MJ09

Need to talk with Batch Apex Product Manager

I’m the developer of a package that has a heavy dependence on a scheduled Batch Apex job. The package currently runs in a dozen or so orgs, some of which have fairly large amounts of data. One org in particular has over 3 million records that are processed by the Batch Apex job.

 

Over the past 3 months, we’ve been encountering a lot of stability problems with Batch Apex. We’ve opened cases for several of these issues, and they’ve been escalated to Tier 3 Support, but it consistently takes 2 weeks or more to get a case escalated, and then it can take several more weeks to get a meaningful reply from Tier 3.

 

We really need to talk with the Product Manager responsible for Batch Apex. We asked Tier 3 to make that introduction, but they said they couldn’t. We’re trying to work with Sales to set up a discussion with a Product Manager, but so far, we haven’t had any luck there either. We’re hoping that a Product Manager might see this post and get in touch with us. We need to find out whether Batch Apex is a reliable-enough platform for our application.

 

Here are a few examples of the problems we’ve been having:

 

  • The batch job aborts in the start() method. Tier 3 Support told us that the batch job was occasionally timing out because its initial  query was too complex. We simplified the query (at this point, there are no WHERE or ORDER BY clauses), but we occasionally see timeouts or near timeouts. However, from what we can observe in the Debug Logs, actually executing the query (creating the QueryLocator) takes only a few seconds, but then it can take many minutes for the rest of the start() method to complete. This seems inconsistent with the “query is too complex” timeout scenario that Tier 3 support described.  (Case 04274732.)
  • We get the “Unable to write to ACS Stores” problem. We first saw this error last Fall, and once it was eventually fixed, Support assured us that the situation would be monitored so it couldn’t happen again. Then we saw it happen in January, and once it was eventually fixed, Support assured us (again) that the situation would be monitored so it couldn’t happen again. However, having seen this problem twice, we have no confidence that it won’t arise again. (Case 04788905.)
  • In one run of our job, we got errors that seemed to imply that the execute() method was being called multiple times concurrently. Is that possible? If so, (a) the documentation should say so, and (b) it seems odd that after over 6 months of running this batch job in a dozen different orgs, it suddenly became a problem.

 

  • We just got an error saying, “First error: SQLException [java.sql.SQLException: ORA-00028: your session has been killed. SQLException while executing plsql statement: {?=call cApiCursor.mark_used_auto(?)}(01g3000000HZSMW)] thrown but connection was canceled.” We aborted the job and ran it again, and the error didn’t happen again.
  • We recently got an error saying, “Unable to access query cursor data; too many cursors are in use.” We got the error at a time when the only process running on behalf of that user was the Batch Apex process itself. (Perhaps this is symptomatic of the “concurrent execution” issue, but if the platform is calling our execute() method multiple times at once, shouldn’t it manage cursor usage better?)
  • We have a second Batch Apex job that uses an Iterable rather than a QueryLocator. When Spring '11 was released, that Batch Apex job suddenly began to run without calling the execute() method even once. Apparently, some support for the way we were creating the Iterable changed, and even though we didn’t change the API version of our Apex class, that change caused our Batch Apex job to stop working. (Case 04788905.)
  • We just got a new error, "All attempts to execute message failed, message was put on dead message queue."

 

We really need to talk with a Product Manager responsible for Batch Apex. We need to determine whether Batch Apex is sufficiently stable and reliable for our needs. If not, we’ll have to find a more reliable platform, re-implement our package, and move our dozen or more customers off of Salesforce altogether.

 

If you’re responsible for Batch Apex or you know who is, please send me a private message so we can make contact. Thank you!

 

tmatthiesen

Thank you for posting your concerns.  To answer the below questions:

 

1) "The batch job aborts in the start() method."  This may be a symptom of a start() method taking over 20 minutes to complete.  There are a few tricks to optimize your start query: remove joins and/or fields and perform these operations in the execute() calls.  This isn't as efficient, but it lowers the amount of work in start() and spreads the work across the executes.

2) "Unable to write to ACS."  This is a trust issue on our side.  We've been working on this and believe it should be resolved by the end of the month.  ACS is our transient data store.  This is where we store the cursors during a batch job.  We've had a few situations where the ACS servers were effectively overloaded and the ACS reads/writes were timing out.

3) "Errors that seemed to imply multiple concurrent executes"  Our framework guarantees that only one execute for a given job can run at a time.  You might want to check to see if you have parallel jobs running.

4) "Your session has been killed"  This message occurs when a start method goes over 20 minutes or an individual execute goes over 10 minutes.  We have sweepers that block any jobs that exceed these thresholds.

5) "Unable to access cursor data"  This is the same as #2.

6) "Iterable changed behavior"  I'll need to look at this case.  I'm not aware of anything that has changed between releases.

7) "All attempts to execute message failed"  A single execute in your batch job is generating an internal exception.  Our framework will try up to three times before placing the execute on the dead message queue.
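
For point 3, one way to check for parallel runs is to query AsyncApexJob from anonymous Apex. This is only a rough sketch: the class name MyBatchJob is a hypothetical placeholder, not something from this thread, and the list of "active" statuses is an assumption about what you would count as running.

```apex
// Sketch only: 'MyBatchJob' is a hypothetical class name.
// Lists batch jobs for that class that are still waiting or running.
List<AsyncApexJob> active = [
    SELECT Id, Status, JobItemsProcessed, TotalJobItems, CreatedDate
    FROM AsyncApexJob
    WHERE JobType = 'BatchApex'
      AND ApexClass.Name = 'MyBatchJob'
      AND Status IN ('Holding', 'Queued', 'Preparing', 'Processing')
    ORDER BY CreatedDate
];

// More than one row means two runs of the same job overlap.
if (active.size() > 1) {
    System.debug(LoggingLevel.WARN, active.size() + ' overlapping runs of MyBatchJob found');
}
for (AsyncApexJob job : active) {
    System.debug(job.Id + ' ' + job.Status + ' ' + job.JobItemsProcessed + '/' + job.TotalJobItems);
}
```

If this shows more than one row while the errors occur, the "concurrent execute" symptoms probably come from two separate jobs rather than from one job running execute() twice at once.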

 

I'm happy to review all these points with you over phone/email.  Please send me a follow up email to: tmatthiesen@salesforce.com.

 

I appreciate your frustration with Batch and my team will work with you to remedy this situation. 

 

regards,

 

Taggart

vck01

We are also facing the issue mentioned in the 7th point: "All attempts to execute message failed, message was put on dead message queue".

Is there any solution for this? We are sure the code is not generating any exception, as it worked until yesterday. We are now processing nearly 11 million records, and all of a sudden we get this exception for a larger number of records. Please let us know the solution for this. Also, as pointed out by "Kahn", we depend heavily on Batch Apex for certain tasks, and it is very important for us to make sure we don't face the issues described above.

tmatthiesen

This is usually caused by the start method query taking longer than 2 minutes.  I would recommend leveraging indexed fields in the where clause of the query.

Bing@put

We're facing "All attempts to execute message failed, message was put on dead message queue" too.  The query in Start() is using LastModifiedDate as the selector, which as I understand it is indexed because it's an audit field.  It would have to return hundreds of thousands of rows though (from Task), with lots of fields.  Achieving a 2-minute return time is going to be problematic...

vck01

The Salesforce team should give Batch Apex high priority and solve all the issues listed in the first post. It seems that the platform faces many issues when the data is at an enterprise level, where we need to process millions of records. Some batches are processed and then you receive some strange error. There is no provision to roll back the data already processed, and now the whole data set is messed up. This has become really problematic and is testing our patience. It also calls the credibility and scalability of the platform into question.

 

 

We work on a product, and we cannot change the code all the time (the code is packaged). We have our own design pattern defined, and data should be processed in bulk (all at once). If any of the above issues crops up, what can we do? A one-time workaround would be fine, and a second time is also OK, but what if this happens all the time? Please come up with a permanent solution for this, or else suggest a strong, reliable, practical approach. We can handle the exceptions that are in our hands, like heap size, the number of fields sent to the query, and the batch size, but the issues posted here are not within our reach. Salesforce has to sort this out. Hope to receive a positive reply!

Chellappa

We are also facing same issue.

 

All attempts to execute message failed, message was put on dead message queue

 

I have a simple batch with 60,000 records, and even that hits the same issue.

 

Hope to find a positive reply soon..

 

Thanks,

Chellappa

Here-n-now

I think the time needed to return the scope is the most critical factor.  If the start() query takes a long time to run (for various reasons, but mostly because it filters on non-indexed fields or because of the sheer number of records in the object), the dreaded error message becomes a lot more likely.

rsoese@systecs

A functioning Batch Apex is also crucial to our work on a big ISV product. We are currently developing a data aggregation facility that by design must trigger a huge GROUP BY query in the start method. Almost every time, our first batch is aborted with the message:


First error: connection was cancelled here

 

The start() method returns the result of a huge GROUP BY query in the form of an Iterable<AggregateResult>. By huge, I mean calling this query on 10,000 records:

 

SELECT group1, ..., group32, SUM(aggregate1) aggregate1, ..., SUM(aggregate68) aggregate68
FROM RawData__c
GROUP BY group1, ..., group32

As this may result in many thousands of AggregateResults, I need to perform this in a batch and cannot chunk or split it as described.

 

Sometimes it also aborts with UNABLE_TO_LOCK_ROW, unable to obtain exclusive access to this record: [].

 

Whom should we contact to get help?!

 

 Best regards

 

Robert


 

rsoese@systecs

BTW: Our GROUP BY query just looks complex. Run in SoqlExplorer with a LIMIT 2000, it takes just a second to execute. Calling it on 10,000 records cannot make a batch's start() take 20 minutes.

Here-n-now

The limit for the Start( ) call is actually 2 minutes, not 20, which is the limit for each execute( ) call.  You'll also have to consider the impact of caching - any query that is run again within a short time will likely return fairly quickly, because some of the result set is still in the cache and will not require an index read and/or table scan.  The first cold run, i.e., when all relevant results have expired from the cache, may perform very differently and be orders of magnitude slower.

 

The frustrating thing is that it's hard to predict and profile the performance, and the general rules for caching are unknown as well.  So you won't know if your Start( ) is a cold run for the query.

 

If you hit the issue often enough, you may want to change your architecture for the job.  Instead of asking the system to do your "group by" and aggregate, which could potentially take a while, only use Start( ) to gather the scopes, i.e., retrieve the IDs involved in the operation.  That is usually very quick - I have jobs retrieving 800,000+ Ids from a 4 million+ record object without a problem.  Once you have the Ids, have the Execute( ) query the scope for field details, then use a "running total" type of algorithm to accumulate your aggregates.  You're going to need the batch class to be Stateful, so you can keep your interim results.  Then finally in the Finish( ) call, do whatever you need with the grand aggregate results.

rsoese@systecs

Hi Here-n-now,

 

I appreciate your answer and your tricks to find workarounds. But the general problem doesn't always go away with that. Please take a second look at my query performed in start()

 

SELECT group1, ..., group32, SUM(aggregate1) aggregate1, ..., SUM(aggregate68) aggregate68
FROM RawData__c
GROUP BY group1, ..., group32

There is no way to move the grouping work inside the execute, because to find the right groups there needs to be a single operation performed on all 10,000 records. How would I be able to chunk that with your proposed ID solution?

 

Could you please give an example in code or soql?

 

Best regards,

 

Robert

 

 

Here-n-now

Hi Robert,

 

Yes you can almost always chunk an aggregated query in a batchable scenario.  I can give a high level overview...  see here's where I usually ask for payment first.  :-)

 

The trick is to have a batchable class level data structure to store your "running total", and of course your Batchable needs to be Stateful, so the data in the structure will persist (and accumulate) from Execute () to Execute ().  For instance, for your purpose you can have a Map to handle that.  Assuming your group fields are strings and the aggregate fields are plain Decimals, the data structure in your Batchable is simply this:

 

 Map<string,Map<string,Decimal>> myAggregates=new Map<string,Map<string,Decimal>>();

 

The complexity here is that you have so many levels of grouping and aggregates, so you need your map key to be the concatenation of all the group fields (in order), and a map to hold all your aggregates.

 

In your Start ()  the querylocator just needs

return Database.getQueryLocator('SELECT Id FROM rawData__c');

 

Then in your Execute () you do the following:

  1. Query the details of the current scope, for instance,
    RawData__c[] rawDetails=[SELECT group1, group2, ... group32, aggregate1, aggregate2, ... aggregate68 FROM RawData__c WHERE Id IN :scopeIDs];
  2. Loop through your details, and add in the running aggregates - if the Map already has the group value, simply add the aggregate value; otherwise create the new key and put in the value.  Sample code:
    for (RawData__c r : rawDetails) {
       // Build the map key from all group fields, in order. You may need to pick your separator depending on the data, and deal with null values.
       String key=r.group1+'-'+r.group2+'-'+r.group3...+r.group32;
       if(!myAggregates.containsKey(key)) {
          myAggregates.put(key,new Map<string,Decimal>{'aggregate1'=>0,'aggregate2'=>0,...});
       }
       // Add this record's values to the running totals for its group.
       myAggregates.get(key).put('aggregate1',myAggregates.get(key).get('aggregate1')+r.aggregate1);
       myAggregates.get(key).put('aggregate2',myAggregates.get(key).get('aggregate2')+r.aggregate2);
       ...
    }

 

Then, finally, in your Finish( ) call, just parse the Map and do whatever you want with the aggregates.

Of course I omitted some needed supporting code but I trust you can figure them out.  As you can see, it's not simple, but it's a structure that can reduce a long complex query at Start() to crunching numbers at the Execute () level.  A necessary evil to work around the whiny Start () on SFDC's batchable platform.
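
For anyone who wants the pieces above in one place, here is a rough sketch of how the whole class might look. It is an illustration under assumptions, not tested against Robert's data: RawData__c is from the thread, but the class name RawDataAggregator and the fields Group1__c, Group2__c, Amount1__c and Amount2__c are hypothetical stand-ins for the 32 group fields and 68 aggregate fields, and the group fields are assumed to be text.

```apex
// Sketch only. Group1__c, Group2__c, Amount1__c and Amount2__c are
// hypothetical field names standing in for the real group/aggregate fields.
global class RawDataAggregator implements Database.Batchable<sObject>, Database.Stateful {

    // Kept between execute() calls because the class is Database.Stateful.
    global Map<String, Map<String, Decimal>> myAggregates =
        new Map<String, Map<String, Decimal>>();

    global Database.QueryLocator start(Database.BatchableContext bc) {
        // Keep start() cheap: fetch Ids only, no grouping or aggregation.
        return Database.getQueryLocator('SELECT Id FROM RawData__c');
    }

    global void execute(Database.BatchableContext bc, List<sObject> scope) {
        // Re-query the field details for this chunk only.
        List<RawData__c> rawDetails = [
            SELECT Group1__c, Group2__c, Amount1__c, Amount2__c
            FROM RawData__c
            WHERE Id IN :scope
        ];
        for (RawData__c r : rawDetails) {
            // The key is the concatenation of all group fields, in order.
            String key = keyPart(r.Group1__c) + '|' + keyPart(r.Group2__c);
            Map<String, Decimal> totals = myAggregates.get(key);
            if (totals == null) {
                totals = new Map<String, Decimal>{ 'Amount1' => 0, 'Amount2' => 0 };
                myAggregates.put(key, totals);
            }
            // Running totals: add this record's values to its group.
            totals.put('Amount1', totals.get('Amount1') + nz(r.Amount1__c));
            totals.put('Amount2', totals.get('Amount2') + nz(r.Amount2__c));
        }
    }

    global void finish(Database.BatchableContext bc) {
        // Placeholder: replace with whatever you need to do per group.
        for (String key : myAggregates.keySet()) {
            System.debug(key + ' => ' + myAggregates.get(key));
        }
    }

    // Null-safe helpers so empty fields do not break the key or the sums.
    private static String keyPart(String value) { return value == null ? '<null>' : value; }
    private static Decimal nz(Decimal value) { return value == null ? 0 : value; }
}
```

You would launch it with something like Database.executeBatch(new RawDataAggregator(), 200); the separator and the null marker are arbitrary choices and need to be picked so they cannot collide with real group values.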

rsoesemann

Thanks for sharing this for free ;-)

 

I see your point and will test this out.

But I cannot close this case, as its main topic is Batch Apex stopping work without any meaningful error or reason.

 

Batch being unreliable remains for me a major showstopper in creating enterprise architecture on Force.com.

 

Best regards,

 

Robert

MJ09

As the author of the original post, I can report that Batch Apex has gotten significantly more reliable. I haven't seen any major problems, even when working with a few million records, in the last several months.

Here-n-now

I agree the batch platform has gotten more robust, which is certainly a great thing.  But I'd also argue, in agreement with Robert, that it can still use some improvements.  For one, if the Start ( ) call failed for whatever reason, either the "dead message queue"/"unable to write to ACS store" one, or something else of the platform nature, you don't really get any notification, and it's certainly not a catchable exception.  I can give one such example.  In the last few months we've encountered several times that jobs failed to start, or failed half way with this kind of status message:

 

First error: SQLException [java.sql.SQLException: ORA-01013: user requested cancel of current operation 
: select /*ApexBatch.Class.Activity_stager_phone_outbound.start: line 116*/ * 
from (select "Id" 
from (select /*+ index(t ieactivity_sysmod) */ 
t.las...

 

We were informed that it's the result of some sort of platform glitch, such as a maintenance server reset.  That's a really troublesome type of silent death to deal with because of the lack of notification and the uncatchable nature.  Because of this we've been building an extensive monitoring framework, using integration tools to pick up status messages from the jobs and raise an alarm if expected messages are missing.  I suggest having such measures as the final safety net: you may have built perfect Apex code within all the limits imaginable and still get hit by traps like that at the most inconvenient time.
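
Our framework sits outside the platform, but a much smaller on-platform version of the same idea can be sketched with a Schedulable that looks at AsyncApexJob. Treat this as an assumption-laden illustration rather than what we actually run: the class name BatchJobWatchdog, the 24-hour window and the recipient address are all hypothetical. It only catches jobs that recorded a failure; a job that never started at all still needs a separate "expected message is missing" check.

```apex
// Sketch only: BatchJobWatchdog and admin@example.com are hypothetical.
global class BatchJobWatchdog implements Schedulable {

    global void execute(SchedulableContext sc) {
        Datetime since = Datetime.now().addHours(-24);

        // Batch jobs from the last 24 hours that failed, were aborted,
        // or reported errors in one or more chunks.
        List<AsyncApexJob> troubled = [
            SELECT Id, ApexClass.Name, Status, NumberOfErrors, ExtendedStatus, CompletedDate
            FROM AsyncApexJob
            WHERE JobType = 'BatchApex'
              AND CreatedDate >= :since
              AND (Status IN ('Failed', 'Aborted') OR NumberOfErrors > 0)
        ];
        if (troubled.isEmpty()) {
            return;
        }

        // ExtendedStatus carries the "First error: ..." text seen in this thread.
        String body = '';
        for (AsyncApexJob job : troubled) {
            body += job.ApexClass.Name + ' ' + job.Status
                + ' errors=' + job.NumberOfErrors
                + ' ' + job.ExtendedStatus + '\n';
        }

        Messaging.SingleEmailMessage mail = new Messaging.SingleEmailMessage();
        mail.setToAddresses(new String[] { 'admin@example.com' });
        mail.setSubject(troubled.size() + ' Batch Apex job(s) need attention');
        mail.setPlainTextBody(body);
        Messaging.sendEmail(new Messaging.SingleEmailMessage[] { mail });
    }
}
```

It could be scheduled daily with something like System.schedule('Batch watchdog', '0 0 6 * * ?', new BatchJobWatchdog());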

 


 


rsoese@systecs

Hi MJ Kahn,

 

Great to hear that you see an improvement. Could you please share what the Salesforce.com architects who replied to your original post worked out with you and your team?

 

Best regards,

 

Robert

rsoese@systecs

Hi Here-n-now,

 

Thanks for your supporting words ;-) Glad to hear that there is at least one other user out there who shares my definition of an enterprise-ready platform ;-)

 

Regarding your suggestion to chunk an aggregate query: don't you run into governor limits when you do the grouping inside an Apex Map? You were talking about millions of records, which could possibly result in 100,000 or more Map keys. How does this magic work?

 

Another reason why this won't work for me is that aggregation is not the sole purpose of my Batch. Actually I am performing a lot on each AggregateResult. With your solution I only have those after finish() was called. So I would have to chain Batch jobs. Something which probably would not work.

 

Regards,


Robert

Here-n-now

Hi Robert,

 

The size of the map probably doesn't matter... I don't think there's a limit on that anymore (I haven't personally tested though).  What matters is the heap size - as long as your data structures don't consume more than the 6 MB allowed you should be fine.  My sense is that it'll be fine, but of course all the common estimates and study necessary to ensure that should be done.
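
If you want to keep an eye on heap growth while the map accumulates, the Limits class reports the current usage and the limit that applies to the running context. A small sketch that could go at the end of execute(); the 80% threshold is an arbitrary assumption on my part:

```apex
// Sketch only: the 80% warning threshold is an arbitrary choice.
Integer used = Limits.getHeapSize();         // bytes of heap used so far in this transaction
Integer allowed = Limits.getLimitHeapSize(); // heap limit for the current context
if (used > allowed * 0.8) {
    System.debug(LoggingLevel.WARN, 'Heap at ' + used + ' of ' + allowed + ' bytes');
}
```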

 

I did take my eyes off the big picture a little though...  If you do need to do a lot with your AggregateResults, it may seem unbalanced to do the heavy lifting in the Finish( ) call, possibly running afoul of the limits.  However, the truth is, since you have to use Iterable with an aggregated scope, the original sin has already been committed - an Iterable based Batchable is subject to the normal row limit, not the batch limit; only QueryLocator based Batchables enjoy the special treatment.  So if your results have more than 10,000 groups, you're dead in the water anyway before even getting to the Finish( ) call.

 

As for chaining jobs, I usually use an Inbound Email service class to relay the chain passed on from Finish( ).  You're going to have to store the group results in a temp object, as you prob. can't easily serialize the intermediate AggregateResults and pass to the next job that way.

 

Cheers,

B

 

 

MJ09

Robert asked what Salesforce did to improve our Batch Apex situation.

 

Many of our issues occurred while the batch job was starting up. Our start() method returned a Query Locator that pointed to millions of records. Sometimes it took many hours for the start() method to run, and sometimes it would fail, either displaying some kind of internal error message or getting into some kind of blocked state.

 

According to Taggart Matthiesen, who was extremely helpful, his team had been working on some behind-the-scenes Batch Apex improvements, focusing on how the platform marshalls the results of the start() method's Query Locator so they can be passed into the execute() method invocations. They had a trial version of those improvements available, and they switched our org to use that version. It's been over a year since then, and while I haven't asked, I assume that the new version is now generally available.

 

Taggart also suggested we simplify the query in our start() method. Originally, our query used an ORDER BY that wasn't really necessary. Sorting millions of records when you don't have to is unnecessarily time-consuming -- removing that ORDER BY helped our batch job start faster and made it less subject to hitting a time-out. We were fortunate that we could remove the ORDER BY without impacting our logic -- you may not be that lucky.

MJ09

When I need to chain batch jobs, I schedule a job to run a minute later, where the scheduled job launches the next batch job. I find that easier to manage than using an inbound email service, where the email address changes when you spin up a sandbox. Also, with the scheduled  job, you can more easily pass data from the first batch job into the second.

Here-n-now

@MJ re: chaining jobs

Interesting approach.  Do you just call system.schedule in Finish( )?  Why would that be easier to pass data from batch to batch (other than primitive types)?

MJ09

Imagine you've got 3 Apex classes:

 

  • Batch Job 1 - implements Batchable
  • Schedulable Job - implements Schedulable
  • Batch Job 2 - implements Batchable

 

Batch Job 1's finish() method instantiates Schedulable Job, and can set its instance variables to whatever values you want to pass into Batch Job 2. The method then schedules the instantiated Schedulable Job to run just over a minute later.

 

Schedulable Job instantiates Batch Job 2, and sets Batch Job 2's instance variables to reflect what was passed into it. Schedulable Job then launches Batch Job 2.

 

Batch Job 2's start() method runs, and has access to the data that Schedulable Job fed into it.

 

Of course, you can put both Schedulable Job and Batch Job 2 into the same Apex class, and have it implement both Schedulable AND Batchable.
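
A rough sketch of that combined class, to make the hand-off concrete. The names BatchJobTwo, payload and scheduleSoon are hypothetical, the query against Account is just a placeholder, and the 70-second delay is my reading of "just over a minute". Batch Job 1's finish() would call BatchJobTwo.scheduleSoon(...) with whatever it wants to pass along.

```apex
// Sketch only: BatchJobTwo, payload and scheduleSoon are hypothetical names.
global class BatchJobTwo implements Database.Batchable<sObject>, Schedulable {

    // Data handed over from the first batch job.
    global String payload;

    global BatchJobTwo(String payload) {
        this.payload = payload;
    }

    // Call this from Batch Job 1's finish() method.
    global static void scheduleSoon(String payload) {
        Datetime runAt = Datetime.now().addSeconds(70);
        // Cron format: seconds minutes hours day-of-month month day-of-week year
        String cron = '' + runAt.second() + ' ' + runAt.minute() + ' ' + runAt.hour()
            + ' ' + runAt.day() + ' ' + runAt.month() + ' ? ' + runAt.year();
        // Job names must be unique, so append a timestamp.
        System.schedule('Chain to BatchJobTwo ' + runAt.getTime(), cron, new BatchJobTwo(payload));
    }

    // Schedulable entry point: launch the batch, then remove the one-off schedule.
    global void execute(SchedulableContext sc) {
        Database.executeBatch(new BatchJobTwo(payload));
        System.abortJob(sc.getTriggerId());
    }

    // Batchable entry points.
    global Database.QueryLocator start(Database.BatchableContext bc) {
        System.debug('Received from job 1: ' + payload);
        return Database.getQueryLocator('SELECT Id FROM Account'); // placeholder query
    }

    global void execute(Database.BatchableContext bc, List<sObject> scope) {
        // Per-chunk work for the second job goes here.
    }

    global void finish(Database.BatchableContext bc) {
        // To run in cycles, this could call scheduleSoon(...) again.
    }
}
```

The instance variable travels with the scheduled object, which is why passing data this way is simple compared to relaying it through an email service.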

 

Here-n-now

Thanks MJ... good approach.  I sometimes need to chain the same job to have it run in cycles - with a little adaption I think this should work too.

Scott.M

We are running into similar issues with batch jobs being aborted. This is the error we are getting:

 

SQLException [java.sql.SQLException: ORA-01013: user requested cancel of current operation. 

 

Is this the error that gets thrown if the start method times out?

 

Cheers,

Scott

ckarimanoor29

Hi,

 

You are right.

This is what you get, when the Batch Jobs get aborted.

 

This is mostly because your initial query timed out, due to multiple conditions.

If the query runs for more than 2 mins, it will time out.

 

Hope this answers your query.

 

Regards,

Chellappa

Scott.M

It's not super helpful since, as other people have stated, the queries themselves don't appear to be what's taking a long time to execute. It seems like we can have Apex jobs abort simply because there's too much load on Salesforce servers. Is that possible?

Here-n-now

Yes, your query could run slower than you expected and hit the 2 min limit.  I can't say definitively that this is a result of high load on their servers, but it does feel like a possibility.  So optimizing your start query as much as possible is certainly a good thing - you'd want as much margin as you can spare.

 

It's unfortunate that the batch platform is a best effort platform, instead of a guaranteed service one.  It's also unfortunate that when it does fail for its own internal reasons, there's no useful feedback to build an automatic recovery around.  You'll have to set up your own monitoring-response system.

dp_derek

hello everyone.

 

I am calling web services in Java using SOAP.

 

But an error occurred."connection was cancelled here".

 

Can you help me....

Here-n-now

@dp_derek: this might not be the best place for your question.  Try the "Java Development" discussion board (link at left nav bar).

rajarak

We are facing this issue even now.

 

"First error: SQLException [java.sql.SQLException: ORA-01013: user requested cancel of current operation
: select........"

The issue does not happen every day, but it happens a few times a week.

 

 

We have approx. 400K records in the object the batch process works on, and a sufficient filter in the query to narrow the result set to a few thousand records.

 

 

Bathula Naveenkumar
Hi Rajarak,

Did you find any solution for the above? I am also facing the same problem. We are dealing with 400K records daily, but for the last few days the batch job has been terminating with this error:
"First error: SQLException [common.exception.SfdcSqlException: ORA-01013: user requested cancel of current operation
select /*ApexBatch.Class.PlanDeleteBatch.start: line 33*/ *
from (select *
from (select t.custom_entity_data_id "Id"
from (select /*+ ..."

Can you/any one  help me.........
Camille Cody
The team of admins at our organization has been getting a daily error message from Salesforce about Batch Apex jobs, but none of us knows what it means, what it's affecting, how to fix it, or even how to discontinue the daily, irritating email. The error given us is as shown:

 Error #1:

Error Type: Batch Apex error
Error Date: 2020-02-12 03:01:21
Message: "First error: Update failed. First exception on row 0 with id a08G000000qj0YrIAI; first error: FIELD_CUSTOM_VALIDATION_EXCEPTION, You can specify either an Organization or Contact for a Recurring Donation, but not both.: []"
Context: npsp__RD_RecurringDonations_BATCH

Stack Trace:
 null

Can someone tell me what to do here?