
Query all rows using Bulk API

We have a lot of Salesforce data, and I would like to automate a weekly export. The native Data Export UI is too clunky with 30 GB of data, and it's difficult to automate. It's still an option, but I'm also exploring using the Bulk API to query for the data.


There are some limitations with the Bulk API: one batch can return at most 10 result files of 1 GB each, which means I'll need to break my queries into smaller sets. In testing, I tried to download 1 million Cases and received errors about exceeding the 10-retry limit: "InternalServerError : Retried more than 10 times." I think it was simply too slow to return all of the data, since my query specifies every field on the Case object. So I'm trying to break my batches into sets of 100,000 records.
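A quick sanity check on the batch count, using the figures above (both numbers are from the post; the arithmetic is just a sketch):

```python
import math

# Figures from the post: roughly 1,000,000 Cases, batches of 100,000 rows.
TOTAL_CASES = 1_000_000
BATCH_SIZE = 100_000

num_batches = math.ceil(TOTAL_CASES / BATCH_SIZE)
print(num_batches)  # 10 separate Bulk API queries
```

So at ~1M rows this plan means on the order of ten separate queries, which at least stays within reason even if the row count grows.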


In order to do this, I basically need to paginate through the Case rows. Unfortunately, OFFSET is not available in Bulk API queries, and it has a limit of 2,000 rows anyway, so it wouldn't work. My plan is to do some up-front (non-Bulk API) queries: order the Cases by CreatedDate and limit my Bulk API query to the first 100k records. Then, knowing the maximum CreatedDate of the first 100k records, I would do a second Bulk API query for the next 100k records where CreatedDate > previousBatchMaxCreatedDate. And so on....
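The keyset-pagination loop described above can be sketched like this. It is a minimal simulation against an in-memory list, not real Salesforce calls: `fetch_batch` is a hypothetical stand-in for submitting one Bulk API query of the form `WHERE CreatedDate > :boundary ORDER BY CreatedDate ASC LIMIT :batch_size`, and the tiny batch size is just for illustration.

```python
from datetime import datetime, timedelta

BATCH_SIZE = 3  # would be 100,000 in practice; small here for illustration

# Simulated Case CreatedDate values (stand-in for the real org's data).
cases = sorted(datetime(2014, 1, 1) + timedelta(hours=i) for i in range(8))

def fetch_batch(rows, after):
    """Hypothetical stand-in for one Bulk API query:
    SELECT ... FROM Case WHERE CreatedDate > :after
    ORDER BY CreatedDate ASC LIMIT :BATCH_SIZE"""
    eligible = [r for r in rows if after is None or r > after]
    return eligible[:BATCH_SIZE]

batches = []
boundary = None  # max CreatedDate of the previous batch
while True:
    batch = fetch_batch(cases, boundary)
    if not batch:
        break
    batches.append(batch)
    boundary = max(batch)  # becomes previousBatchMaxCreatedDate

print(len(batches))  # 3 batches (sizes 3, 3, 2)
```

One caveat with the strict `>` comparison: if several Cases share the exact same CreatedDate and a batch boundary falls between them, rows can be skipped. Cutting boundaries on whole-second (or whole-day) values, or paginating on Id instead, avoids that edge case.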


It sounds a little clunky, but I thought it was doable. The problem is, I can't find the SOQL query that will tell me the max CreatedDate in an ordered list of Cases. This does not work: SELECT MAX(CreatedDate) FROM Case ORDER BY CreatedDate ASC LIMIT 100000. I receive an error: "Ordered field must be grouped or aggregated: CreatedDate."


Is there an easy way around this problem? I don't want to bring back all of the records; I just need the max CreatedDate in each set of 100k records. Then I'll use that for my Bulk API query. Thoughts?
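One possible workaround (my assumption, not something from the post): skip the aggregate entirely. Query only the single CreatedDate field, ordered ascending with LIMIT 100000, through the regular API, and take the last value client-side; since the rows come back sorted ascending, the last CreatedDate *is* the max of that set. Simulated here with an in-memory list standing in for the query result:

```python
from datetime import datetime, timedelta

# Simulated result of:
#   SELECT CreatedDate FROM Case ORDER BY CreatedDate ASC LIMIT 100000
# (tiny stand-in dataset; the real query returns up to 100k one-field rows)
created_dates = sorted(datetime(2014, 1, 1) + timedelta(minutes=i) for i in range(5))

# Rows arrive in ascending order, so the boundary is simply the last one.
boundary = created_dates[-1]
print(boundary)  # 2014-01-01 00:04:00
```

Pulling 100k single-field rows is far cheaper than pulling whole Case records, and the resulting boundary plugs straight into the next Bulk API query's WHERE clause (as a SOQL datetime literal like `2014-01-01T00:04:00Z`).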


I'm also open to other ideas about how to retrieve this data reliably. We use DBAmp, but it seems to have issues and takes ~2 days to complete when calling Replicate_All. Alternatively, we could schedule Data Exports and then write some code to authenticate into Salesforce and scan the Export page for all of the zip files. The program would follow those links and download the files. That might be the easiest solution in the end, but scraping HTML just seems brittle.


Your thoughts are appreciated,