
Querying over 1,000,000 records looking for best approach

I am developing a Java application which will query tables that may hold over 1,000,000 records.  I have tried everything I can to be as efficient as possible, but I am only able to achieve on average about 5,000 records a minute, with a maximum of 10,000 at one point.  I have tried reverse engineering the Data Loader, and my code seems to be very similar, but still no luck.


Is threading a viable solution here? I have tried it, but with very minimal results.


I have been reading and have applied everything possible, it seems (compressing requests/responses, threads, etc.), but I cannot achieve Data Loader-like speeds.


To note, the queryMore method seems to be the bottleneck.


Does anyone have any code samples or experiences they can share to steer me in the right direction?




Relational Junction has employed a technique successfully since November 2006 that involves querying variable-length time intervals, stepping through time since the org was created a few days at a time, and varying the interval depending on how many records were returned. This technique makes Relational Junction the most scalable replication product available.
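A minimal, self-contained sketch of that adaptive time-window idea, under stated assumptions: `fetchWindow()` is a hypothetical stand-in for a SOQL query filtered on `CreatedDate >= start AND CreatedDate < end` (here it just simulates a constant per-day record count), and the target batch size and interval bounds are illustrative, not Relational Junction's actual parameters:

```java
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;
import java.util.ArrayList;
import java.util.List;

public class IntervalStepper {
    static final int TARGET = 10_000;   // illustrative: records we aim for per window

    // Hypothetical stand-in for a date-bounded SOQL query;
    // pretends ~1,500 records are created per day.
    static int fetchWindow(LocalDate start, LocalDate end) {
        return (int) (ChronoUnit.DAYS.between(start, end) * 1500);
    }

    // Steps from org creation to today a few days at a time, widening the
    // window when a batch comes back light and narrowing it when heavy.
    // Returns (windowDays, recordCount) pairs for each window queried.
    public static List<int[]> replicate(LocalDate orgCreated, LocalDate today) {
        List<int[]> batches = new ArrayList<>();
        int days = 4;                       // initial window: a few days
        LocalDate cursor = orgCreated;
        while (cursor.isBefore(today)) {
            LocalDate end = cursor.plusDays(days);
            if (end.isAfter(today)) end = today;
            int n = fetchWindow(cursor, end);
            batches.add(new int[]{days, n});
            if (n < TARGET / 2) days = Math.min(days * 2, 365);
            else if (n > TARGET) days = Math.max(days / 2, 1);
            cursor = end;
        }
        return batches;
    }

    public static void main(String[] args) {
        List<int[]> batches = replicate(LocalDate.of(2006, 1, 1), LocalDate.of(2006, 3, 1));
        System.out.println("windows queried: " + batches.size());
    }
}
```

The appeal of this pattern is that each window's result set stays small enough to avoid long queryMore chains, at the cost of issuing more queries.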


Sesame Software filed a patent application in 2007, which we expect to be approved in 2010. We strongly suggest employing some other technique, such as "striping" the data with a queryable field other than time stamps for the initial load of your

Message Edited by Rick.Banister on 02-10-2010 07:32 AM

IIRC, the data loader does a 2-threaded approach: one thread calls queryMore for the next batch while the other thread is writing out the results of the previous batch. (I'm assuming you've already made sure you're doing keep-alives, compression, etc.) Are you using WSC, Axis, or some other SOAP toolkit?
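A minimal sketch of that two-thread pattern, assuming a producer/consumer hand-off via a bounded queue. Everything here is a hypothetical stand-in: the fetcher loop stands in for repeated `binding.queryMore(queryLocator)` calls, and the writer just counts records where the real data loader would write CSV rows:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

public class QueryMorePipeline {
    static final String[] POISON = new String[0];   // end-of-stream marker

    public static int run(int totalBatches, int batchSize) {
        // Capacity 2 keeps at most one batch in flight while one is being written.
        BlockingQueue<String[]> queue = new ArrayBlockingQueue<>(2);
        AtomicInteger written = new AtomicInteger();

        Thread fetcher = new Thread(() -> {
            try {
                for (int b = 0; b < totalBatches; b++) {
                    String[] batch = new String[batchSize]; // stand-in for queryMore()
                    queue.put(batch);                       // blocks if writer is behind
                }
                queue.put(POISON);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        Thread writer = new Thread(() -> {
            try {
                for (String[] batch = queue.take(); batch != POISON; batch = queue.take()) {
                    written.addAndGet(batch.length);        // stand-in for writing records out
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        fetcher.start();
        writer.start();
        try {
            fetcher.join();
            writer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return written.get();
    }

    public static void main(String[] args) {
        System.out.println(run(5, 2000));   // 5 batches of 2,000 records
    }
}
```

The point of overlapping the two is that the network round-trip for the next queryMore happens while the previous batch is still being written to disk, so neither side sits idle.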




Hey Simon, thanks for the reply.


I'm not sure I'm doing keep-alives. Can you point me to an example in the Data Loader where this is being done, so I can take a look for myself and make sure I implement it?

As for compression, I believe I am using it via the SforceServiceLocatorCompressed class everyone seems to be using, which extends the base SforceServiceLocator class.  I had started by setting the properties as follows:



call.setProperty(HTTPConstants.MC_ACCEPT_GZIP, Boolean.TRUE);
call.setProperty(HTTPConstants.MC_GZIP_REQUEST, Boolean.TRUE);


I noticed the Data Loader does it a little differently, so I changed mine to mimic theirs:

call.setProperty("sfdc.acceptGzip", Boolean.TRUE);
call.setProperty("sfdc.sendGzip", Boolean.TRUE);



Using both approaches, as well as not using the compressed service locator class at all, did not yield any changes, so I am unsure whether I am setting this up correctly.  Any guidance would be greatly appreciated.  Here is where I enable it:


try {
    binding = (SoapBindingStub) new SforceServiceLocatorCompressed().getSoap();
} catch (ServiceException e) {
    e.printStackTrace();
}
QueryOptions qo = new QueryOptions();
qo.setBatchSize(2000);
binding.setHeader(new SforceServiceLocator().getServiceName().getNamespaceURI(), "QueryOptions", qo);
binding.setHeader(new SforceServiceLocatorCompressed().getServiceName().getNamespaceURI(), "QueryOptions", qo);
...



I have been going around in circles trying to match the time the Data Loader takes to retrieve the same number of records, but even with seemingly identical code to theirs I am still only achieving about 10,000 records a minute.


Any help/insight would be greatly appreciated.


Thanks again for your reply.


Depends on which version of the data loader you're looking at, but the ones that use Axis use a custom HTTP client class that you have to update your Axis config to use instead of the Axis default. Current versions of the data loader don't use Axis at all, and you'll never match the current data loader's performance using Axis 1.x.

I'm not exactly sure which version I am running.  I don't think this version is threaded; from what I can see, it seems to have a thread only to print out the stats and progress of the current operation.


With regard to my issue, do I have to do something similar to what the individual in this thread has done? http://community.salesforce.com/sforce/board/message?board.id=JAVA_development&message.id=5668


I am using Axis 1.4 like he is, and I have added commons-codec and commons-httpclient to my libraries, but I have never dealt with a client-config.wsdd file before.  I have been looking for where it is located but cannot seem to find one, or specifically point one out via Eclipse.  Do you know where this file resides, and would it need to be edited before compression actually works?


Thanks again for the assistance.

commons-httpclient won't be used until you update the WSDD file to tell Axis to use it instead. There's a default one in the Axis JAR; you can put your own higher up in the classpath to override it.

Hi Simon, 


Thanks for the help. I think I'm getting pretty close, as I have edited the client-config.wsdd file directly, changing the transport entry and the deployment name.  But now I am getting an error in Java:


Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/commons/httpclient/UsernamePasswordCredentials


this is being thrown on the following line:


loginResult = binding.login(USERNAME, PASSWORD);



Also the following is my new client-config.wsdd:



<?xml version="1.0" encoding="UTF-8"?>
<deployment name="commonsHTTPConfig"
            xmlns="http://xml.apache.org/axis/wsdd/"
            xmlns:java="http://xml.apache.org/axis/wsdd/providers/java">
  <globalConfiguration>
    <parameter name="disablePrettyXML" value="true"/>
    <parameter name="enableNamespacePrefixOptimization" value="false"/>
  </globalConfiguration>
  <transport name="http" pivot="java:org.apache.axis.transport.http.CommonsHTTPSender"/>
  <transport name="local" pivot="java:org.apache.axis.transport.local.LocalSender"/>
  <transport name="java" pivot="java:org.apache.axis.transport.java.JavaSender"/>
</deployment>



I should also mention I have tried both with and without passing the VM argument -Daxis.ClientConfigFile=client-config.wsdd.


Thanks again



I had the wrong HTTP client downloaded.  It needed to be 3.1, and I had downloaded 4.1, so that rectified the error.  However, I am now going just as fast as before, so I'm not sure what else there is left for me to try.
In my experience, query (and queryMore) are quite fast as long as all you query is the Id. Once you have the Id, you can parallelize getting the rest of the fields by executing retrieves in other threads.
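A minimal sketch of that Id-then-retrieve split, under stated assumptions: `retrieveChunk()` is a hypothetical stand-in for `binding.retrieve(fieldList, objectType, idArray)` (the real retrieve call accepts a limited number of Ids per request, which is why the Ids are chunked), and the chunk size and thread count are illustrative:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelRetrieve {
    // Hypothetical stand-in for binding.retrieve(...); pretends each Id
    // resolves to one full record.
    static List<String> retrieveChunk(List<String> ids) {
        return ids;
    }

    // Splits the Id list (obtained from a fast Id-only query) into chunks
    // and fetches the remaining fields for each chunk on a thread pool.
    public static int fetchAll(List<String> allIds, int chunkSize, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Future<List<String>>> futures = new ArrayList<>();
        for (int i = 0; i < allIds.size(); i += chunkSize) {
            List<String> chunk = allIds.subList(i, Math.min(i + chunkSize, allIds.size()));
            futures.add(pool.submit(() -> retrieveChunk(chunk)));
        }
        int records = 0;
        try {
            for (Future<List<String>> f : futures) {
                records += f.get().size();
            }
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
        return records;
    }

    public static void main(String[] args) {
        List<String> ids = new ArrayList<>();
        for (int i = 0; i < 10_000; i++) ids.add("001" + i);
        System.out.println(fetchAll(ids, 2000, 4));
    }
}
```

The design idea is that the serial queryMore chain only has to move narrow Id rows over the wire, while the wide field payloads come back over several connections in parallel.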



I am also developing a utility which talks to Salesforce, and it has to scale to over 100,000 records on retrieve.


I think you have solved the issue.  Can you tell me how you managed to do this?


Awaiting your reply.