I was reading the CMS bible the other night, and I read something about sorting content into binary and "normal" files. Binary files are "all 1's and 0's", like pictures and media files, but also files in a closed format. A Word file or an Excel workbook is a binary file. In the knowledge management course this spring, I learned that information whose content is already at an atomic level (there is no easy way to pull individual elements out of a Word document) is called unstructured data. An email, whose content contains no tags outside the header of the mail (subject, recipient, sender, etc.), is also unstructured. Unstructured data is bad for content management because you can't tell from the outside what the content is about.
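To make the email point concrete, here is a minimal sketch using Python's standard `email` module. The message text and the addresses are made up for illustration. The header fields come out as named values, while the body is just one string with no tags to tell you what it is about.

```python
from email import message_from_string

# A made-up message, used only for illustration.
raw = (
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "Subject: Budget figures\n"
    "\n"
    "Hi Bob, the numbers we discussed are attached. Call me Tuesday.\n"
)

msg = message_from_string(raw)

# The header is structured: each field has a name we can ask for.
headers = {name: msg[name] for name in ("From", "To", "Subject")}

# The body is unstructured: a single blob of text with no tags.
body = msg.get_payload()

print(headers)
print(repr(body))
```

A CMS could index the header fields directly, but it would have to guess at the meaning of the body.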
But back to the proprietary formats like Word: it occurred to me that the risk of a file or document being unstructured depends on whether or not it is based on an open standard. XML (if you use it right) is structured. PDF is structured.
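As a small illustration of what "structured" buys you, the sketch below parses a tiny XML document with Python's standard `xml.etree.ElementTree`. The element names (`article`, `title`, `author`, `body`) are my own invention for the example, not part of any standard.

```python
import xml.etree.ElementTree as ET

# A made-up document; the tag names are hypothetical.
doc = """<article>
  <title>Sorting content</title>
  <author>Jane Doe</author>
  <body>Binary files are hard to aggregate.</body>
</article>"""

root = ET.fromstring(doc)

# Because the content is tagged, individual elements can be picked out by name.
title = root.findtext("title")
author = root.findtext("author")

print(title, "by", author)
```

This only works because the author of the XML tagged the parts separately. Putting everything inside one big element would be valid XML, but no more structured than a plain text file.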
I'm sure Word has some sort of structure as well, but it's hard for the outside world to get at and use this structure since it is a closed format (there are some projects working on reading Microsoft's formats; I have personally used Apache POI to read Excel workbooks). The POI team doesn't seem too pleased with reverse engineering and reading closed formats, judging by package names like:
- HSSF - Horrible Spreadsheet Format
- HPSF - Horrible Property Set Format
- DDF - Dreadful Drawing Format
Closed formats automatically fall into the bag of unstructured/binary data in a CMS.
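A rough sketch of how a CMS might apply this rule when a file comes in is shown below. The `classify` function and its three labels are hypothetical and a big simplification. It only looks at the first bytes of the file: the OLE2 container signature used by the legacy Word and Excel formats, the `%PDF-` marker, and an XML declaration.

```python
# Signature of the OLE2 compound file container used by legacy .doc and .xls files.
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def classify(data: bytes) -> str:
    """Hypothetical helper: guess which bag a file falls into from its first bytes."""
    if data.startswith(OLE2_MAGIC):
        return "binary"        # closed format, treated as opaque
    if data.startswith(b"%PDF-"):
        return "structured"    # open, documented format
    if data.lstrip().startswith(b"<?xml"):
        return "structured"    # XML with a declaration
    return "unknown"           # anything else needs a closer look


print(classify(OLE2_MAGIC + b"\x00" * 8))
print(classify(b"%PDF-1.4\n"))
print(classify(b"<?xml version='1.0'?><note/>"))
```

A real system would need many more rules, and a file in an open format can still be badly structured inside, so this is a first guess at best.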
Enjoy your summer!