AlwaysConfused

Consuming the cloud from .NET

Hey guys,

I'm trying to figure out how to do this type of thing ...

http://blogs.msdn.com/pedram/archive/2008/04/21/how-to-consume-rest-services-with-wcf.aspx

 

for Salesforce, but I can't seem to figure out the URL I need to build prior to making the request.
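
For what it's worth, with the plain SOAP API the endpoint isn't a URL you build by hand: the login call returns it. Here is a minimal sketch, assuming a Visual Studio web reference generated from the partner WSDL (proxy class SforceService; the "sforce" namespace and the credentials are my placeholders):

using System;

class LoginDemo
{
    static void Main()
    {
        var binding = new sforce.SforceService();

        // login() goes to the default endpoint baked into the WSDL ...
        sforce.LoginResult lr = binding.login("user@example.com", "passwordPlusToken");

        // ... and the result tells you which server to talk to from then on.
        binding.Url = lr.serverUrl;
        binding.SessionHeaderValue = new sforce.SessionHeader();
        binding.SessionHeaderValue.sessionId = lr.sessionId;

        Console.WriteLine("Endpoint: " + binding.Url);
    }
}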

 

I have built some async code based on creating the web reference and making standard web service calls, but I can't seem to match the performance of the Apex Data Loader.

 

I've looked around, and from the Data Loader's code (what little of the Java I actually understand) and some forum threads on here, it seems the Data Loader effectively uses two threads, one to pull the data and a second to save it locally. I can see that in the code:

http://sourceforge.net/projects/sforce-app-dl/files/

 

I'm trying to do the same thing, but I'm dumping the data into a SQL Server database at my end instead of a CSV file.

Whilst my code works, it's nowhere near as efficient as the Data Loader, and I can't figure out why.

 

Essentially my code works by doing the following (see the sketch after the list):

1. Make an async query.

2. When the response comes back, call queryMoreAsync (recursive until the query completes).

3. On another thread, save the results.
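
A minimal sketch of those three steps, assuming the event-based async methods (queryAsync/queryMoreAsync) that a .NET web reference generates from the partner WSDL; the generated type names may differ slightly in your project, and Save is a placeholder:

using System.Threading;

class QueryPump
{
    private readonly sforce.SforceService binding;

    public QueryPump(sforce.SforceService binding)
    {
        this.binding = binding;
        binding.queryCompleted += (s, e) => HandleBatch(e.Result);     // step 1 response
        binding.queryMoreCompleted += (s, e) => HandleBatch(e.Result); // step 2 responses
    }

    public void Start(string soql)
    {
        binding.queryAsync(soql);                                      // step 1
    }

    private void HandleBatch(sforce.QueryResult qr)
    {
        ThreadPool.QueueUserWorkItem(state => Save(qr.records));       // step 3
        if (!qr.done)
            binding.queryMoreAsync(qr.queryLocator);                   // step 2, "recursive"
    }

    private void Save(sforce.sObject[] records)
    {
        // write the batch to SQL Server here (placeholder)
    }
}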

 

I figured this would at least come close to the Data Loader, but it's somehow about 5 times slower :(

 

When I dug a bit deeper I found that the Data Loader actually requests and then handles the data as a stream. I'm not handling the stream at the moment, in favour of relying on .NET's generated classes from my web reference, which could explain some of the performance hit.
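
For comparison, handling the response as a stream looks roughly like this. This is only a sketch: it assumes you build the SOAP envelope yourself, and "records" is the element name in the partner API's query response:

using System.IO;
using System.Net;
using System.Text;
using System.Xml;

class StreamingQuery
{
    // POSTs a SOAP envelope and walks the response with XmlReader, so records
    // stream past once instead of being materialized as generated objects.
    public static int CountRecords(string serverUrl, string soapEnvelope)
    {
        var req = (HttpWebRequest)WebRequest.Create(serverUrl);
        req.Method = "POST";
        req.ContentType = "text/xml; charset=utf-8";
        req.Headers.Add("SOAPAction", "\"\"");
        req.AutomaticDecompression = DecompressionMethods.GZip;  // accept gzip

        byte[] body = Encoding.UTF8.GetBytes(soapEnvelope);
        req.ContentLength = body.Length;
        using (Stream s = req.GetRequestStream())
            s.Write(body, 0, body.Length);

        int count = 0;
        using (WebResponse resp = req.GetResponse())
        using (XmlReader reader = XmlReader.Create(resp.GetResponseStream()))
        {
            while (reader.ReadToFollowing("records"))
                count++;  // handle each record here as it streams past
        }
        return count;
    }
}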

 

Anyone else trying to do this?

Any ideas?

Jon Mountjoy_

Hi AlwaysConfused

 

I moved the thread here, as this is a pretty advanced question! It's also more likely to get an answer here.

 

I'm not quite sure what you're trying to do with the API, but have you looked at the Bulk Data API?

 

Jon 

AlwaysConfused

Bulk data API ???

 

... Oh ...

 

Time to do some digging then ...

AlwaysConfused

OK, I looked into the Bulk API and it comes with some limitations which make it unsuitable for my problem.

 

1. Only 500 batches per rolling 24-hour period can be sent to the server (I have a lot more than that when you consider the limitation in point 2).

2. Max batch size is 10,000 records ... hmmm ... how does 2.7 million records sound ???

3. Batch data can be no larger than 10MB ... hmmm ... I'm looking at about 1.6GB of data, whoops !!!

 

The sample provided uploads data, but I'm looking to query data. I haven't found many samples at all for using .NET, and I don't recall seeing a single "large result set query" sample anywhere for .NET.

 

So I started from scratch ...

 

Essentially I wrapped up the normal async stuff in the API, and this gives me back a QueryResult that contains a query locator.

From this point I simply call queryMore until I hit the end of the result set.

 

After a while, though, the query seems to just fall over, so my current problem is getting back to the nth record using a new query ...

http://community.salesforce.com/sforce/board/message?board.id=NET_development&thread.id=8088
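
For reference, one common workaround for a cursor that dies partway through (not something settled in this thread, just a standard pattern) is to ORDER BY Id and restart with a filter on the last Id you saved. A sketch, with the SOQL fields purely illustrative:

class ResumableQuery
{
    // lastId is the Id of the last record already saved, or null on a fresh start.
    // ORDER BY Id keeps the result order stable, so a dead cursor can be replaced
    // by a fresh query that skips everything already pulled.
    public static string BuildSoql(string lastId)
    {
        string baseQuery = "SELECT Id, Name FROM Contact";
        if (lastId == null)
            return baseQuery + " ORDER BY Id";
        return baseQuery + " WHERE Id > '" + lastId + "' ORDER BY Id";
    }
}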

 

 

And all those Java boffins out there tell me how great Java is ... I've done this loads in a .NET-only environment; in fact I've seen colleagues process 3GB XML and CSV files without a problem across web services (with compression, of course).

 

I've basically used the same mechanism, and after an hour or so the server just seems to give up talking to me.

A bit odd ...

 

Why would my code work for the first million records then at some random point in the query just give up?

It can't be the code or the first batch would fail. 

DevAngel

Hi Paul,

 

A couple of points of clarification around the limits.

 

The max amount of data in the rolling 24-hour period is:

 

500 batches * 10,000 batch size = 5,000,000 records.

500 batches * 10MB batch data size = ~4.9 gigs.

 

The main problem here is that this is all good if you are importing TO salesforce.com, and if I read your post correctly, you are trying to pull the data to SQL Server.

 

 

So that means using the core API and the query method. The query method will allow you to retrieve up to 2,000 records. If you have more than 2,000, as you do, then you follow up the query call with a queryMore, passing it the queryLocator (essentially a cursor) that you receive from the query, and you continue calling queryMore with the queryLocator from each prior call until all the records have been pulled.
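
The synchronous shape of that loop, as a sketch against the .NET proxy generated from the partner WSDL (Process stands in for your own handler):

using System;

class QueryLoop
{
    public static void PullAll(sforce.SforceService binding, string soql)
    {
        sforce.QueryResult qr = binding.query(soql);    // first page of records
        while (true)
        {
            if (qr.records != null)
                Process(qr.records);
            if (qr.done)                                // no more pages
                break;
            qr = binding.queryMore(qr.queryLocator);    // cursor to the next page
        }
    }

    private static void Process(sforce.sObject[] records)
    {
        Console.WriteLine("got " + records.Length + " records");
    }
}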

 

For 2.7m records you would end up with 1,350 query/queryMore calls. If you single-thread this and assume 30 seconds for each 2k records, I reckon you are looking at about 11 1/4 hours of processing; 15 seconds per 2k is about 5 1/2 hours.

 

I think with those kinds of processing times I'd look at the data export tools in the app.

 

Also, I took a look at the article from your original post. Is there something about WCF that prevents using the WSDL to generate your SOAP client?

 

Cheers,

 

jesperfj

Bulk API doesn't do query. It currently only works for loading or deleting records (using insert, upsert, update or delete operations). For query, these tips might be obvious, but just in case:

 

- Make sure you set the batch size to max (2000). Each call to query/queryMore will then retrieve 2000 records (see the sketch after these tips)

- Make sure you have optimized all the HTTP stuff, use keep-alive and use compression

- The performance of your XML parser will matter quite a bit. Data Loader uses WSC, which we custom-wrote to improve performance over Axis and other toolkits. I am not sure how the .NET libraries compare
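
In .NET web-reference terms, the first two tips look roughly like this; EnableDecompression and the QueryOptions header come from the generated proxy, while the GetWebRequest override is a common trick for reaching the underlying HttpWebRequest (SforceService as the proxy name is an assumption):

using System.Net;

class TunedSforceService : sforce.SforceService
{
    public TunedSforceService()
    {
        EnableDecompression = true;                     // gzip on responses

        QueryOptionsValue = new sforce.QueryOptions();  // batch size SOAP header
        QueryOptionsValue.batchSize = 2000;             // the maximum
        QueryOptionsValue.batchSizeSpecified = true;
    }

    protected override WebRequest GetWebRequest(System.Uri uri)
    {
        var req = (HttpWebRequest)base.GetWebRequest(uri);
        req.KeepAlive = true;                           // reuse the connection
        return req;
    }
}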

 

AlwaysConfused

Thanks for the feedback guys ...

 

@DevAngel :

 

Yeah, I clicked shortly after my post that I had confused myself on the math a little there, but as you confirmed, the Bulk API is only for pushing data, not for pulling.

 

The time estimates seem reasonable to me, but in reality they don't hold up: I seem to get issues with memory usage over long periods, and then the query simply stops because it has nowhere left to put the data it pulls.

 

I think my problem here is the logic in my code, though. It works something like this (see the note after the list):

 

  1. Query SF (async).
  2. On response, queryMore (async).
  3. Raise the data-received event (threaded call) to save the data.
  4. The app then waits for the next response to come back ...
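
One way to keep step 3 from stacking up RAM is to make the hand-off between the fetching side and the saving side a bounded queue, so the fetcher blocks instead of buffering without limit. A sketch using plain Monitor (all names here are illustrative):

using System.Collections.Generic;
using System.Threading;

class BoundedBatchQueue<T>
{
    private readonly Queue<T> queue = new Queue<T>();
    private readonly int capacity;

    public BoundedBatchQueue(int capacity) { this.capacity = capacity; }

    public void Add(T batch)
    {
        lock (queue)
        {
            while (queue.Count >= capacity)
                Monitor.Wait(queue);   // fetcher waits; memory use stays capped
            queue.Enqueue(batch);
            Monitor.PulseAll(queue);
        }
    }

    public T Take()
    {
        lock (queue)
        {
            while (queue.Count == 0)
                Monitor.Wait(queue);   // writer waits for the next batch
            T batch = queue.Dequeue();
            Monitor.PulseAll(queue);
            return batch;
        }
    }
}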

The SF server is pretty quick, and typically I can save about 20k contacts per minute based on pulling around 30 fields.

My problem is that when the query is bigger the results are also bigger, and smaller batch sizes mean more requests.

For some reason I seem to be stacking up a lot of RAM usage, and I can't figure out why.

 

The basic principle is this:

Through the Data Loader, the total data pulled using the same query is about 1.6GB.

Through my app, after receiving and processing (saving to SQL) about 200k records, I have about 1GB of RAM in use.

What's all that about!!!

 

Actual time taken on this multithreaded approach is roughly 2 hours for a basic query based on 30 or so fields.

 

My actual query contains about 100 fields, so in theory (although I don't have any definite figures yet) I'm talking a little over 4.5GB of data; with the GZip compression used on the service, that comes back to something sort of reasonable, under 2GB of network traffic, I would assume.

 

The API does, however, drop my batch size from 1,500 (I found this to be the most efficient with my code) to 1,000, which for some reason costs my smaller query about 3k to 5k records per minute; odd, considering it's a multithreaded approach.

 

Another interesting point: if I run the query, fetch only up to the first (say) 200k records, then drop the query and re-run it immediately, carrying on where the previous one left off, I can actually avoid this peculiar memory issue and pull records at a ridiculous rate. My only limitation then seems to be a combination of concurrency issues with the number of connections I make to my SQL Server at this end, and the bandwidth available between my code and the SF API.

 

At the moment I create a new connection (I created a separate tool class for this) for each batch of records that comes back, insert the records, then drop the connection and repeat. For some odd reason SQL Server doesn't seem to like having connections held open for an hour or two ... it cries and cuts me off ... not very polite, but it's only trying to clean up, I guess ... right ??? ...
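
For what it's worth, that connection-per-batch loop pairs naturally with SqlBulkCopy. A sketch, assuming each batch has already been shaped into a DataTable (connection string and table name are placeholders):

using System.Data;
using System.Data.SqlClient;

class BatchWriter
{
    private const string ConnString = "Server=.;Database=Staging;Integrated Security=true";

    public static void WriteBatch(DataTable batch)
    {
        // one short-lived connection per batch, dropped as soon as the insert is done
        using (var conn = new SqlConnection(ConnString))
        {
            conn.Open();
            using (var bulk = new SqlBulkCopy(conn))
            {
                bulk.DestinationTableName = "dbo.Contacts";
                bulk.BatchSize = batch.Rows.Count;
                bulk.WriteToServer(batch);
            }
        }
    }
}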

 

If anyone is interested in a C# generic wrapper for Salesforce and in sharing a bit of knowledge here, I'd be happy to show you what I've got in return for some input or feedback on how I might improve it.

AlwaysConfused

OK, a bit of an update ...

 

I managed to figure out that the bottleneck was a combination of traffic on the local network and some post-download processing, so I updated the wrapper to pull everything in a single run instead of dropping the query after every 200k or so.

 

With a little bit of fine-tuning I seem to be able to pull records faster than the Data Loader ... I get roughly 50k to 70k records per minute, which is about 20k to 30k faster than the Data Loader.

 

Just need to figure out my upsert bugs now :)