function readOnly(count){ }
Starting November 20, the site will be set to read-only. On December 4, 2023,
forum discussions will move to the Trailblazer Community.
+ Start a Discussion
crop1645crop1645 

Inbound email handler, UTF-8, CSVs

Note - this post is more along the lines of providing a reference to others rather than a request for specific help

 

Business use case:

  1. Excel files are emailed to dozens of external users around the world. Data source is SFDC Sobject data. Recipients are not SFDC users.
  2. Users update the Excel files, save as CSV and email to SFDC inbound Email Service.
  3. Inbound Email Services parses CSV and updates SObjects 

Problem(s):

  1. Any accented characters such as ã as in São Paulo will update the SFDC SObject as S�o Paulo or
  2.  'BLOB is not a valid UTF-8 string' error is thrown in the Inbound Email Handler

Why does this happen?

  1. SFDC uses UTF-8 encoding for characters
  2. Excel can happily open .xlsx files created from UTF-8 data, and, if you have the right fonts on your system,, will happily display the characters as you would expect
  3. HOWEVER, Excel can not export to CSV in UTF-8. It just can't.
  4. The CSV file is encoded using your machine's default character set, perhaps Windows-1252 or ISO-8859-1 or something else
  5. When you attach the CSV to an outbound email, it will be received by the SFDC Inbound Email Handler as either a text or binary attachment (I believe this depends on the email client)
  6. If received as a text attachment, a character encoded in Windows-1252 or ISO 8859-1 (1 byte) will be thought of by the Inbound Email Handler APEX as UTF-8
  • Example: the character ž is hex 9E or, in bits 10011110.  However, if you look at the UTF-8 spec (summarized) you see that any byte starting with a '1' bit always means that UTF-8 thinks it will be a multi-byte description of the character.  Even worse, there is no UTF-8 valid encoding for 1001... so, SFDC replaces the 9E with the UTF-8 fallback code U+FFFD - and this displays as �.

7. If received as a binary attachment, an attempt to use the method toString() on the attachment's Blob body fails for the same reason as #6 above -  'BLOB is not a valid UTF-8 string'

 

Now, how to work around this problem?

 

A. There are various ways to get a CSV into UTF-8. Unfortunately, they all require the user to do unnatural acts:

  • Excel > CSV, then open CSV in Notepad++ and encode in UTF-8, then Save
  • Copy/paste values from Excel into a Google Docs or Open Office spreadsheet and use those tools to save as CSV, encoded to UTF-8
  • Install an Excel addin that can save as CSV to UTF-8

B. What most definitely does not work is to do a Save As in Excel (as of Excel 2010) and use the Tools dialog on the Save As window - Web Options | Encoding | Save this document as Unicode (UTF-8). Excel ignores this for CSV saves.

 

What also doesn't work in Excel is to Save as Unicode Text as this is UTF-16.

 

C. Various redesign options:

  • Only process updates for fields that can't take accented characters (like, say, changing Stage or some other picklist-driven field)
  • Use try-catch around the Messaging.Inboundemail.BinaryAttachment toString() method, and, if UTF-8 error, send a message back to the sender with detailed instructions on how to create a UTF-8 CSV (or route to the admin)
  • Don't use CSV files for update payloads; build a Sites application, use SFDC-SFDC, SFDC partner portal
  • Open up your SFDC org to the external users with very tight security (relevant especially for updates to Std Objects) - this avoids the whole CSV update path altogether
  • Redo your InboundEmailHandler to accept .xlsx files. Use an APEX callout to a third party service that can parse the Excel .xlsx binary and send it back as a UTF-8 CSV.  I have not tried this.
  • Have your recipients install an Excel addin that creates UTF-8 output such as http://jaimonmathew.wordpress.com/2011/08/23/excel_addin_to_work_with_unicode_csv/ (I have not tried this)
  • Train your recipients to use the Google Docs/Open Office route mentioned above under (A).
  • Have your recipients in Excel do a Save As 'Xml Spreadsheet 2003' - this will in fact preserve your accented characters (I haven't checked Asian languages but European languages seem to work OK). Then change your inbound email parser to parse XML files. This is a monolithic text file (unlike the zipped OOXML files MSFT adopted in Office 2007 which will be useless to the Apex code in the Inbound email handler).

 

  • Example using Save As XML Spreadsheet 2003 - the character ž appears in the output XML file (seen through a hex editor) as C5BE - the UTF-8 ž character!

 

I'm curious what other solutions folks have come up with

JeffreyStevensJeffreyStevens
Nice post - good information to know.

I know that IolitePro (ProCore Resources) - they have an external processor that will read the XLSX files.