KSusan Cox

Fetch the content of a Word file as a string


How can I fetch the content of a Word file as a string? Right now I get all the content as a Blob, but I cannot convert it to a string.
I can read the full content of a text file, but the same code does not work for Word or Excel files. I tried Blob-to-String conversion code, but it does not work.

I also tried encoding and decoding the file content, but that does not work either: the decoded output does not exactly match the original content.

Basically, I want to read the content of all MS Office files posted on Chatter.

Thanks in advance.
Reading Word documents is not natively supported in Apex. A Word file is a zip file (that is, a binary file containing text files), and Apex has no native support for unzipping files to get at the contents inside.

However, there is a blog that could help you out:

If this helps, please mark it as best answer to help others :)
KSusan Cox
Hi Vinit

Thanks for the reply. 

Are there any alternatives for solving this issue, such as encoding and decoding, or anything else that could help me?


Like I said, this is not supported as of now, so I am afraid there is nothing you can use to read the Word body.

The other workaround I can think of is creating an Attachment or Document record inside Salesforce and then reading the Body.

Hope that helps!!
KSusan Cox

I am close to a solution for reading the content of files such as doc, xls, ppt, and pdf, and it works fine.
However, it consumes a lot of CPU time, so I am still working on that.

Here is the solution that works for me:

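// Decodes a Blob into a String using the given charset (for example 'UTF-8').
public static String blobToString(Blob input, String inCharset){
    // Hex-encode the Blob: two hex characters per byte.
    String hex = EncodingUtil.convertToHex(input);
    System.assertEquals(0, hex.length() & 1);
    final Integer bytesCount = hex.length() >> 1;
    // An empty Blob would produce a lone '%', which cannot be URL-decoded.
    if (bytesCount == 0) return '';
    String[] bytes = new String[bytesCount];
    // Split the hex string into one two-character entry per byte.
    for(Integer i = 0; i < bytesCount; ++i)
       bytes[i] =  hex.mid(i << 1, 2);
    // Build a percent-encoded string (%xx%yy...) and let urlDecode apply the charset.
    return EncodingUtil.urlDecode('%' + String.join(bytes, '%'), inCharset);
}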
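
For anyone who wants to try the hex-and-urlDecode technique from the method above in isolation, here is a minimal sketch. The class name BlobTextDemo and the method names decode and demo are made up for illustration; only the EncodingUtil, Blob and String calls are standard Apex. It also shows why the charset argument matters: the same bytes decoded with a different charset give different text, which may be part of why binary Office files do not come back as readable strings.

```apex
// Hypothetical demo class (the name is an assumption, not from the thread).
public class BlobTextDemo {
    // Same technique as blobToString: hex-encode the Blob, prefix each
    // byte with '%', then URL-decode the result with the requested charset.
    public static String decode(Blob input, String charset) {
        String hex = EncodingUtil.convertToHex(input);
        if (hex.length() == 0) {
            return ''; // nothing to decode
        }
        List<String> bytes = new List<String>();
        for (Integer i = 0; i < hex.length(); i += 2) {
            bytes.add(hex.substring(i, i + 2));
        }
        return EncodingUtil.urlDecode('%' + String.join(bytes, '%'), charset);
    }

    public static void demo() {
        // Blob.valueOf stores the string as UTF-8 bytes.
        Blob utf8Bytes = Blob.valueOf('héllo');
        // Decoding with the matching charset round-trips the text.
        System.assertEquals('héllo', decode(utf8Bytes, 'UTF-8'));
        // Decoding the same bytes with a different charset does not.
        System.assertNotEquals('héllo', decode(utf8Bytes, 'ISO-8859-1'));
    }
}
```

Calling BlobTextDemo.demo() from anonymous Apex should pass both assertions if the assumptions above hold.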

Thanks for sharing, KSusan Cox!
Pedro I Dal Col
The content of Word (.docx) documents is compressed using the Zip format. To extract the text, you need to uncompress a file named 'document.xml' that is embedded in every such Word file. For that you can use the open-source Zippex library.

After you install Zippex, use this code to get the content as plain text:
//wordFileBlob is a Blob that contains the Word document
Zippex myZip = new Zippex(wordFileBlob);
//Uncompress data
String wordDoc = myZip.getFile('word/document.xml').toString();
//Remove XML tags
String plainText = wordDoc.stripHtmlTags();
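
If it helps, the snippet above could be wrapped in a small helper along these lines. This is only a sketch under a few assumptions: the class name WordTextReader and its method names are made up, it assumes Zippex exposes getFileNames() in addition to the getFile() call used above, and it assumes the Chatter file is stored as a ContentVersion whose VersionData holds the Word Blob.

```apex
// Hypothetical helper class; assumes the open-source Zippex class is installed.
public class WordTextReader {
    // Returns the plain text of a .docx Blob, or null when the archive
    // has no 'word/document.xml' entry.
    public static String extractText(Blob wordFileBlob) {
        Zippex archive = new Zippex(wordFileBlob);
        // Assumption: getFileNames() returns the entry names in the archive.
        if (!archive.getFileNames().contains('word/document.xml')) {
            return null;
        }
        String xml = archive.getFile('word/document.xml').toString();
        // Paragraphs are <w:p> elements; add a space after each closing tag
        // so adjacent paragraphs do not run together once tags are removed.
        xml = xml.replace('</w:p>', '</w:p> ');
        return xml.stripHtmlTags();
    }

    // Reads the latest version of a file, given its ContentDocument Id.
    public static String extractTextFromDocument(Id contentDocumentId) {
        ContentVersion cv = [
            SELECT VersionData
            FROM ContentVersion
            WHERE ContentDocumentId = :contentDocumentId
            AND IsLatestVersion = true
            LIMIT 1
        ];
        return extractText(cv.VersionData);
    }
}
```

Large documents may still run into heap or CPU limits, since the whole file is unzipped and processed in memory.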