Today I'm taking the train back down south for my summer vacation. Since it's been a while since my last post, I'll share a few more thoughts here before I'm off. I'm planning to spend a good part of the summer reading through Bob Boiko's CMS bible, as well as putting together a scheduler application with JSF, and perhaps gluing it together with Magnolia. I'll also be making some much-needed custom Magnolia templates.
I was reading the CMS bible the other night, and I came across a passage about sorting content into binary and "normal" files. Binary files are "all 1's and 0's", like pictures and media files, but the category also includes files in a closed format: a Word file or an Excel workbook is a binary file. In my knowledge management course this spring, I learned that information whose content is already at an atomic level (there is no easy way to pull individual elements out of a Word document) is called unstructured data. An email, whose content carries no tags outside the mail header (subject, recipient, sender, etc.), is also unstructured. Unstructured data is bad for content management because you can't tell from the outside what the content is about.
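To make the email case concrete, here is a minimal Java sketch of what a program can and cannot pull out of a raw mail message. The class name `EmailHeaderDemo` and the sample message are made up for illustration, and the parsing is deliberately simplified (it ignores folded header lines, for instance). The point is only that the header fields come out as labelled values, while the body stays one opaque string.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical demo class, not part of any library.
class EmailHeaderDemo {

    // Collects "Name: value" pairs until the first blank line,
    // which marks the end of the header section.
    static Map<String, String> parseHeaders(String rawMessage) {
        Map<String, String> headers = new LinkedHashMap<>();
        for (String line : rawMessage.split("\r?\n")) {
            if (line.isEmpty()) {
                break; // everything after this is the unstructured body
            }
            int colon = line.indexOf(':');
            if (colon > 0) {
                String name = line.substring(0, colon).trim();
                String value = line.substring(colon + 1).trim();
                headers.put(name, value);
            }
        }
        return headers;
    }

    public static void main(String[] args) {
        String raw = "From: alice@example.com\n"
                + "To: bob@example.com\n"
                + "Subject: Summer plans\n"
                + "\n"
                + "Hi Bob, I'm off on vacation next week.";
        Map<String, String> headers = parseHeaders(raw);
        System.out.println(headers.get("Subject")); // prints "Summer plans"
    }
}
```

Nothing comparable is possible for the body without guessing at its meaning, which is exactly what makes it unstructured.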
But back to proprietary formats like Word: it occurred to me that the risk of a file or document being unstructured is connected to whether it is based on an open standard or not. XML (provided you use it right) is structured. PDF is structured.
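As a rough illustration of why an open, tagged format counts as structured, the sketch below uses only the XML parser that ships with the JDK to pull a single field out of a document. The class `StructuredXmlDemo`, the helper `firstElementText`, and the `<article>` sample are all invented for this example; a real CMS would have its own schema.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Hypothetical demo class, not part of any library.
class StructuredXmlDemo {

    // Returns the text of the first element with the given tag name,
    // or null if the document has no such element.
    static String firstElementText(String xml, String tag) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList nodes = doc.getElementsByTagName(tag);
            return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent();
        } catch (Exception e) {
            // Malformed XML or parser setup failure.
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<article><title>Summer plans</title>"
                + "<author>me</author><body>Taking the train south.</body></article>";
        System.out.println(firstElementText(xml, "title")); // prints "Summer plans"
    }
}
```

Anyone can write this against an open standard without knowing which program produced the file, which is the property a closed binary format lacks.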
I'm sure Word has some sort of structure as well, but it's hard for the outside world to get at and use this structure since it is a closed format (there are some projects working on reading MS formats; I have personally been using Apache POI for reading Excel workbooks). The POI team doesn't seem too pleased about reverse engineering and reading closed formats, judging by package names like:
- HSSF - Horrible Spreadsheet Format
- HPSF - Horrible Property Set Format
- DDF - Dreadful Drawing Format
Closed formats automatically fall into the bag of unstructured/binary data in a CMS.
Enjoy your summer!