Friday, July 31, 2009

Removing duplicate comments from a word document

As I wrote before, I am working on a Malevich-like system (http://malevich.codeplex.com) for reviewing specs the same way we review code.

This work is based on Eric White's excellent blog post about merging comments from two identical files (http://blogs.msdn.com/ericwhite/archive/2009/07/28/merging-comments-from-multiple-open-xml-documents-into-a-single-document.aspx).

The idea is to have a web site where one uploads a Word document. The reviewers then download a locked copy of it, which only allows adding comments. They use Word to comment and upload the files back. The server merges all comments (using Eric's code) back into the master copy. Every person who downloads the document afterwards gets the comments from all previous reviewers.

While working on this system, I had to add two things in terms of comment management.

First, I had to lock files so only adding comments is allowed. The code for this is here: http://1-800-magic.blogspot.com/2009/07/lockingunlocking-word-doc-files.html.

Second, Eric's code merges the comments by adding all comments from one document to the other. Unfortunately what this means is that after the very first reviewer has added his or her comments, every time someone else downloads the copy with these comments, adds more, and uploads the document back, the original set of comments gets duplicated. So I had to write code that cleans up this duplication.

The comments in a Word file live in a special section accessible through MainDocumentPart.WordprocessingCommentsPart.Comments of the WordprocessingDocument class. They can be enumerated as follows:

WordprocessingDocument doc = WordprocessingDocument.Open(args[0], true);
foreach (Comment c in doc.MainDocumentPart.WordprocessingCommentsPart.Comments)
    Console.WriteLine("{0} {1}:{2}", c.Id, c.Author, c.InnerText);


This section contains the comments themselves, but it does not have any information as to where the comments attach to the actual text in the Word document. Instead, the comments attach via commentRangeStart, commentRangeEnd, and commentReference elements that are interspersed in the text of the paragraph:

<w:p>
  <w:r>
    <w:t xml:space="preserve">This is a test</w:t>
  </w:r>
  <w:commentRangeStart w:id="0" />
  <w:commentRangeStart w:id="2" />
  <w:commentRangeStart w:id="4" />
  <w:r>
    <w:t>document</w:t>
  </w:r>
  <w:r>
    <w:rPr>
      <w:rStyle w:val="CommentReference" />
    </w:rPr>
    <w:commentReference w:id="0" />
  </w:r>
  <w:commentRangeEnd w:id="0" />
  <w:r>
    <w:rPr>
      <w:rStyle w:val="CommentReference" />
    </w:rPr>
    <w:commentReference w:id="2" />
  </w:r>
  <w:commentRangeEnd w:id="2" />
  <w:r>
    <w:rPr>
      <w:rStyle w:val="CommentReference" />
    </w:rPr>
    <w:commentReference w:id="4" />
  </w:r>
  <w:commentRangeEnd w:id="4" />
  <w:r>
    <w:t>.</w:t>
  </w:r>
</w:p>
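
For readers who want to poke at this markup without the SDK, here is a minimal sketch that pulls the anchor ids out of a paragraph with plain LINQ to XML. The class and method names (CommentAnchors, Ids) are made up for illustration; the namespace URI is the standard WordprocessingML one that the "w:" prefix is bound to in document.xml.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper, not part of the OpenXML SDK.
static class CommentAnchors
{
    // WordprocessingML main namespace (the "w:" prefix in document.xml).
    static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Returns the w:id values found on descendant elements with the given
    // local name (commentRangeStart, commentRangeEnd or commentReference),
    // in document order.
    public static List<int> Ids(XElement root, string localName)
    {
        return root.Descendants(W + localName)
                   .Select(e => (int)e.Attribute(W + "id"))
                   .ToList();
    }
}
```

Given the paragraph above loaded into an XElement, CommentAnchors.Ids(p, "commentRangeStart") would list the start markers in the order they appear, which is the order the duplicate detection further down depends on.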


To the developer, these elements are accessible from the root element of the Word document's MainDocumentPart:

foreach (CommentReference cRef in
    doc.MainDocumentPart.RootElement.Descendants<CommentReference>())
    Console.WriteLine("Found reference for {0}", cRef.Id);

foreach (CommentRangeStart baseRs in
    doc.MainDocumentPart.RootElement.Descendants<CommentRangeStart>())
    Console.WriteLine("Found range start for {0}", baseRs.Id);

foreach (CommentRangeEnd baseRe in
    doc.MainDocumentPart.RootElement.Descendants<CommentRangeEnd>())
    Console.WriteLine("Found range end for {0}", baseRe.Id);


Unlike the beauty of almost Lisp-like functional code that Eric wrote to merge comments, the code below goes through some contortions trying to determine that comments that have the same text and author really do start and end in the same place of the Word document. Location is important in determining the equivalence of comments because it is easy to imagine a whole bunch of separate, different comments with the same text, for example, "Here, too.", that would otherwise be considered equal.

To compile the code, you need to get and install Microsoft's OpenXML SDK 2.0 from here: http://www.microsoft.com/downloads/details.aspx?FamilyId=C6E744E5-36E9-45F5-8D8C-331DF206E0D0&displaylang=en, and add a reference to the DocumentFormat.OpenXml assembly, which the SDK installer puts in the GAC.

Here's the code. It is rather self-explanatory: it collects all the relevant elements from the document (comments, ranges, and comment reference points), determines which ones are duplicates, then removes the dupes.

There is a subtlety that this code relies upon, which appears to be true but technically does not have to be: for comments attached to the same location, the commentRangeStart and commentRangeEnd elements appear in the same sequence. That is, if comment A's commentRangeStart precedes comment B's commentRangeStart, then comment A's commentRangeEnd precedes comment B's commentRangeEnd. While this seems to be true for Word, if you are adapting this code for general-purpose OpenXML, I would recommend changing the logic to remove this dependency.
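
One way to drop the ordering dependency, sketched here under assumptions rather than taken from the program below, is to reduce each comment to a key made of author, text, and the positions of its start and end, then group by that key. The AnchoredComment type and its StartRun/EndRun fields are hypothetical: they assume you have already resolved each range marker to the index of the text run it is attached to.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical flattened view of one comment. StartRun and EndRun are
// assumed to be the indices of the text runs where the range begins and ends.
sealed class AnchoredComment
{
    public int Id;
    public string Author;
    public string Text;
    public int StartRun;
    public int EndRun;
}

static class DuplicateFinder
{
    // Returns the ids of comments that repeat an earlier comment with the
    // same author, text, start and end. The lowest id in each group is kept.
    public static List<int> DuplicateIds(IEnumerable<AnchoredComment> comments)
    {
        return comments
            // Anonymous types compare by value, so this groups exact matches
            // regardless of the order the markers appear in the XML.
            .GroupBy(c => new { c.Author, c.Text, c.StartRun, c.EndRun })
            .SelectMany(g => g.OrderBy(c => c.Id).Skip(1))
            .Select(c => c.Id)
            .OrderBy(id => id)
            .ToList();
    }
}
```

Two "Here, too." comments by the same author at different places get different keys and both survive; only comments that agree on all four fields are reported as duplicates.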


//-----------------------------------------------------------------------
// <copyright>
// Copyright (C) Sergey Solyanik.
//
// This file is subject to the terms and conditions of the Microsoft Public License (MS-PL).
// See http://www.microsoft.com/opensource/licenses.mspx#Ms-PL for more details.
// </copyright>
//-----------------------------------------------------------------------
using System;
using System.Collections.Generic;
using System.Xml.Linq;

using DocumentFormat.OpenXml;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;

namespace RemoveDuplicateComments
{
    /// <summary>
    /// Removes duplicate comments in an OpenXML document.
    /// </summary>
    class Program
    {
        /// <summary>
        /// Removes duplicate comments in an OpenXML document.
        /// </summary>
        /// <param name="args"> Command line arguments (file name). </param>
        static void Main(string[] args)
        {
            if (args.Length != 1)
            {
                Console.WriteLine("Usage: removeduplicatecomments filename");
                return;
            }

            // Everything below is keyed by comment id.
            Dictionary<int, Comment> comments =
                new Dictionary<int, Comment>();
            Dictionary<int, string> commentTexts =
                new Dictionary<int, string>();
            Dictionary<int, CommentRangeStart> commentRangeStarts =
                new Dictionary<int, CommentRangeStart>();
            Dictionary<int, CommentRangeEnd> commentRangeEnds =
                new Dictionary<int, CommentRangeEnd>();
            Dictionary<int, OpenXmlElement> commentReferenceParents =
                new Dictionary<int, OpenXmlElement>();
            HashSet<OpenXmlElement> commentReferenceParentsSet =
                new HashSet<OpenXmlElement>();
            HashSet<int> idsOfIdenticalStarts = new HashSet<int>();
            HashSet<int> idsOfIdenticalEnds = new HashSet<int>();

            WordprocessingDocument doc = WordprocessingDocument.Open(args[0], true);

            // Collect the comments. Author plus text is the identity used
            // to compare two comments.
            foreach (Comment c in
                doc.MainDocumentPart.WordprocessingCommentsPart.Comments)
            {
                Console.WriteLine("{0} {1}:{2}", c.Id, c.Author, c.InnerText);
                int id = int.Parse(c.Id);
                comments.Add(id, c);
                commentTexts.Add(id, c.Author + " : " + c.InnerText);
            }

            // Remember the run that wraps each comment reference, so it can
            // be skipped when walking range ends and removed later.
            foreach (CommentReference cRef in
                doc.MainDocumentPart.RootElement.Descendants<CommentReference>())
            {
                Console.WriteLine("Found reference for {0}", cRef.Id);
                commentReferenceParents.Add(int.Parse(cRef.Id), cRef.Parent);
                commentReferenceParentsSet.Add(cRef.Parent);
            }

            // A range start that immediately follows another range start
            // with the same author and text begins at the same place.
            foreach (CommentRangeStart baseRs in
                doc.MainDocumentPart.RootElement.Descendants<CommentRangeStart>())
            {
                Console.WriteLine("Found range start for {0}", baseRs.Id);

                int baseId = int.Parse(baseRs.Id);

                commentRangeStarts[baseId] = baseRs;

                string baseCommentText = commentTexts[baseId];

                CommentRangeStart rs = baseRs;
                for (; ; )
                {
                    CommentRangeStart next = rs.NextSibling() as CommentRangeStart;
                    if (next == null)
                        break;

                    rs = next;

                    int rsId = int.Parse(rs.Id);
                    if (baseCommentText == commentTexts[rsId])
                        idsOfIdenticalStarts.Add(rsId);
                }
            }

            // Same for range ends, except that the runs holding comment
            // references sit between consecutive ends and must be skipped.
            foreach (CommentRangeEnd baseRe in
                doc.MainDocumentPart.RootElement.Descendants<CommentRangeEnd>())
            {
                Console.WriteLine("Found range end for {0}", baseRe.Id);

                int baseId = int.Parse(baseRe.Id);

                commentRangeEnds[baseId] = baseRe;

                string baseCommentText = commentTexts[baseId];

                CommentRangeEnd re = baseRe;
                for (; ; )
                {
                    OpenXmlElement nextEl = re.NextSibling();
                    while (nextEl != null && commentReferenceParentsSet.Contains(nextEl))
                        nextEl = nextEl.NextSibling();

                    re = nextEl as CommentRangeEnd;
                    if (re == null)
                        break;

                    int reId = int.Parse(re.Id);
                    if (baseCommentText == commentTexts[reId])
                        idsOfIdenticalEnds.Add(reId);
                }
            }

            // A comment is a duplicate only if both its start and its end
            // coincide with those of an identical comment.
            foreach (int id in idsOfIdenticalStarts)
            {
                if (idsOfIdenticalEnds.Contains(id))
                {
                    Console.WriteLine("Eliminating comment {0}", id);
                    commentRangeStarts[id].Remove();
                    commentRangeEnds[id].Remove();
                    commentReferenceParents[id].Remove();
                    comments[id].Remove();
                }
            }

            doc.MainDocumentPart.RootElement.Save();
            doc.MainDocumentPart.WordprocessingCommentsPart.RootElement.Save();

            doc.Close();

            Console.WriteLine("All done!");
        }
    }
}

Apple is replacing Microsoft as a company Linux advocates love to hate

Of course, there's still plenty of hate for everyone... still, so much fun to watch!

http://www.defectivebydesign.org/blog/jailbreaking-apple-iphone

Monday, July 27, 2009

Locking/unlocking Word doc files programmatically

My team is going through a planning milestone again, and this means reading, reviewing, and approving a lot of specs and design documents.

So for this weekend I was toying with the idea of setting up a clone of Malevich (http://malevich.codeplex.com) for document reviews.

Malevich is of course the tool we (and now a whole bunch of other teams inside and outside Microsoft) are using for code reviews. Its main target is to make commenting easy - you simply click on a line of source code, an edit box opens, you type your comment for that line, and that's it. You can read more about Malevich's inspirations and aspirations here: http://1-800-magic.blogspot.com/2009/01/malevich-introduction.html.

Over the last 7 months Malevich has proven to be a big success. It streamlined the code review process in the development team, involved many people in code reviews who otherwise would not be participating, and did wonders for the quality of our code base.

All this made me start thinking about introducing a similar process for spec reviews. After all, a review is a review, right?

The biggest problem with the spec reviews turns out to be the file format. Malevich operates on text files, and so rendering these files on the screen, showing a difference between the two versions of a file, and associating comments with the line turns out to be very simple. Specs (at Microsoft) are traditionally written as Microsoft Word documents.

Word turns out to have a very nice commenting mechanism, but rendering documents on a web page is not nearly as straightforward, and diffing them... that's a whole other project!

While pondering this idea, I ran into this blog post by Eric White: http://blogs.msdn.com/ericwhite/archive/2009/07/05/comparing-two-open-xml-documents-using-the-zip-extension-method.aspx which describes how to determine if two Word documents are the same (modulo comments). The post served as my first introduction to OpenXML, which is the format behind the Word document. Also, I read that Eric was planning a blog post about merging comments from two documents, and this led me to the following design for the spec review site.

I am going to put together a system very similar to Malevich (let's call it Black Square for now), but instead of text files, it would hold Word documents. To create a review request, a reviewee would upload a document to the server via a web site. Upon upload, the server will lock the Word file in a way that would prevent all modifications to it other than the comments. It will then make the document available for reviewers to download.

To perform a review, the reviewer downloads the document, comments on it using Office's review functionality, and uploads it back to the server. The server will then merge the comments back into the master document, making comments from everybody available to all subsequent reviewers as well as the reviewee.

I shot Eric an email, and as it turned out, he had already largely completed his merger, and he gave me a preliminary copy to beta test (the final version is now here: http://blogs.msdn.com/ericwhite/archive/2009/07/28/merging-comments-from-multiple-open-xml-documents-into-a-single-document.aspx).

Then I spent part of the weekend coding. After a few hours I had a skeleton web site and needed to code the first meaningful action - locking a Word document so only comments could be added.

When I have to deal with large new API sets, I tend to program by Google - search for a code snippet that best illustrates the use of the API. The Internet is a great resource for that (with one caveat - reading is fine, copying code with unclear copyright into commercial products is not!), and Windows source is even better (although I cannot use that for open source projects, for similar reasons).

Well, as it turned out, there is a dearth of samples when it comes to OpenXML programming. Unlike most .NET APIs, this one has no usage examples in its MSDN API documentation. There are a few "How to" samples that solve an end-to-end problem, but they primarily focus on processing the text, not on configuration options of the Word file. And the rest of the Internet is pretty much silent on the subject.

To make matters worse, the API is based on XML with a bunch of types derived from base XML elements, so IntelliSense often does not help.

After some struggle (and help from Eric) I was able to make sense of the programming model. Here's what's going on here.

The document has a bunch of sections. You can look at them by changing the docx extension of the file to zip, and then opening it in your favorite archiver. You will find that the file is just a zipped archive of a bunch of XML files. To figure out which elements need to change to lock the file, I made a copy of the file, expanded it, then locked the file, expanded the result, and diffed the two.
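
The same inspection can be done in code. Below is a small sketch that lists the parts inside a docx package using System.IO.Compression; that namespace shipped with .NET Framework 4.5, so it is an assumption on top of the toolset used in this post, and the PackageParts class name is made up.

```csharp
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Linq;

// Hypothetical helper: a .docx file is an ordinary zip archive,
// so the generic zip reader is enough to see what is inside.
static class PackageParts
{
    // Returns the full names of all entries in the package,
    // in the order the archive lists them.
    public static List<string> List(Stream docx)
    {
        // leaveOpen: true keeps the caller's stream usable afterwards.
        using (var zip = new ZipArchive(docx, ZipArchiveMode.Read, true))
        {
            return zip.Entries.Select(e => e.FullName).ToList();
        }
    }
}
```

Calling PackageParts.List(File.OpenRead("spec.docx")) on a typical Word file should show entries such as word/document.xml, word/settings.xml and docProps/app.xml, though the exact set depends on the document. Dumping the list before and after locking is a quick way to see which parts to diff.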

This led me to two elements: documentSecurity in the properties of ExtendedFilePropertiesPart, and documentProtection. The first one was easy: it has a counterpart in the object model, "doc.ExtendedFilePropertiesPart.Properties.DocumentSecurity", and setting it is straightforward:

WordprocessingDocument doc = WordprocessingDocument.Open(args[1], true);
doc.ExtendedFilePropertiesPart.Properties.DocumentSecurity =
    new DocumentFormat.OpenXml.ExtendedProperties.DocumentSecurity(isLock ? "8" : "0");
doc.ExtendedFilePropertiesPart.Properties.Save();
doc.Close();


The second was a setting in MainDocumentPart. The hiccup for me (a very novice XML developer - remember, most of my life was spent deep in the guts of the OS, and I had not touched managed code and all the attendant goo until a few months ago!) was that the settings are a collection of OpenXML elements, and DocumentProtection, despite the existence of the type, is not addressable directly as a property of the settings. Instead, it has to be looked up by type among the settings' child elements:

DocumentProtection dp =
    doc.MainDocumentPart.DocumentSettingsPart.Settings
        .ChildElements.First<DocumentProtection>();
if (dp != null)
    dp.Remove();

if (isLock)
{
    dp = new DocumentProtection();
    dp.Edit = DocumentProtectionValues.Comments;
    dp.Enforcement = DocumentFormat.OpenXml.Wordprocessing.BooleanValues.One;

    doc.MainDocumentPart.DocumentSettingsPart.Settings.AppendChild(dp);
}

doc.MainDocumentPart.DocumentSettingsPart.Settings.Save();
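
To double-check the result without the SDK, the raw settings XML can be inspected with LINQ to XML. This is only a sketch: the LockCheck class is hypothetical, and it assumes the protection ends up as a w:documentProtection child of w:settings with w:edit="comments" and a truthy w:enforcement attribute, which is what the code above should serialize to.

```csharp
using System.Xml.Linq;

// Hypothetical helper, not part of the OpenXML SDK.
static class LockCheck
{
    // WordprocessingML main namespace (the "w:" prefix in settings.xml).
    static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // True when the settings root carries an enforced comments-only
    // documentProtection element.
    public static bool IsCommentsOnly(XElement settings)
    {
        XElement dp = settings.Element(W + "documentProtection");
        if (dp == null)
            return false;

        string edit = (string)dp.Attribute(W + "edit");
        string enforcement = (string)dp.Attribute(W + "enforcement");

        // The on/off attribute may be written as "1", "true" or "on".
        return edit == "comments" &&
               (enforcement == "1" || enforcement == "true" || enforcement == "on");
    }
}
```

Feeding it the root element of word/settings.xml from a locked and an unlocked copy of the same file is an easy sanity test for the utility below.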


So here's a full code snippet. It gives you a command line utility to lock and unlock Word files (unlocking the file will - I think - also remove the password protection, although I did not try this).

You need OpenXML Format SDK 2.0 to run this, available here: http://www.microsoft.com/downloads/details.aspx?FamilyId=C6E744E5-36E9-45F5-8D8C-331DF206E0D0&displaylang=en, and a reference to DocumentFormat.OpenXml in your project.


//-----------------------------------------------------------------------
// <copyright>
// Copyright (C) Sergey Solyanik.
//
// This file is subject to the terms and conditions of the Microsoft Public License (MS-PL).
// See http://www.microsoft.com/opensource/licenses.mspx#Ms-PL for more details.
// </copyright>
//-----------------------------------------------------------------------
using System;
using System.Xml.Linq;

using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;

namespace LockDoc
{
    /// <summary>
    /// Manipulates modification permissions of an OpenXML document.
    /// </summary>
    class Program
    {
        /// <summary>
        /// Locks/Unlocks an OpenXML document.
        /// </summary>
        /// <param name="args"></param>
        static void Main(string[] args)
        {
            if (args.Length != 2)
            {
                Console.WriteLine("Usage: lockdoc lock|unlock filename.docx");
                return;
            }

            bool isLock = false;
            if (args[0].Equals("lock", StringComparison.OrdinalIgnoreCase))
            {
                isLock = true;
            }
            else if (!args[0].Equals("unlock", StringComparison.OrdinalIgnoreCase))
            {
                Console.Error.WriteLine("Wrong action!");
                return;
            }

            WordprocessingDocument doc = WordprocessingDocument.Open(args[1], true);
            doc.ExtendedFilePropertiesPart.Properties.DocumentSecurity =
                new DocumentFormat.OpenXml.ExtendedProperties.DocumentSecurity
                    (isLock ? "8" : "0");
            doc.ExtendedFilePropertiesPart.Properties.Save();

            DocumentProtection dp =
                doc.MainDocumentPart.DocumentSettingsPart
                    .Settings.ChildElements.First<DocumentProtection>();
            if (dp != null)
            {
                dp.Remove();
            }

            if (isLock)
            {
                dp = new DocumentProtection();
                dp.Edit = DocumentProtectionValues.Comments;
                dp.Enforcement = DocumentFormat.OpenXml.Wordprocessing.BooleanValues.One;

                doc.MainDocumentPart.DocumentSettingsPart.Settings.AppendChild(dp);
            }

            doc.MainDocumentPart.DocumentSettingsPart.Settings.Save();

            doc.Close();
        }
    }
}


BTW, for the not faint-of-heart, here's the documentation for OpenXML format: http://www.ecma-international.org/publications/standards/Ecma-376.htm

And here are the Microsoft SDK docs: http://msdn.microsoft.com/en-us/library/bb448854(office.14).aspx

Wednesday, July 15, 2009

Freedom and the Bible

"Romans 13:1-7 (NLT): Everyone must submit to governing authorities. For all authority comes from God, and those in positions of authority have been placed there by God. 2 So anyone who rebels against authority is rebelling against what God has instituted, and they will be punished. 3 For the authorities do not strike fear in people who are doing right, but in those who are doing wrong. Would you like to live without fear of the authorities? Do what is right, and they will honor you. 4 The authorities are God’s servants, sent for your good. But if you are doing wrong, of course you should be afraid, for they have the power to punish you. They are God’s servants, sent for the very purpose of punishing those who do what is wrong. 5 So you must submit to them, not only to avoid punishment, but also to keep a clear conscience. 6 Pay your taxes, too, for these same reasons. For government workers need to be paid. They are serving God in what they do. 7 Give to everyone what you owe them: Pay your taxes and government fees to those who collect them, and give respect and honor to those who are in authority."

Wednesday, July 8, 2009

Among all the idiocy printed today about Chrome OS

...finally, the voice of reason! Ladies and Gentlemen, I give you... fake Steve Jobs!

http://fakesteve.blogspot.com/2009/07/lets-all-take-deep-breath-and-get-some.html

The mother of all bull...

"Google Drops A Nuclear Bomb On Microsoft. And It’s Made of Chrome."

http://www.techcrunch.com/2009/07/07/google-drops-a-nuclear-bomb-on-microsoft-and-its-made-of-chrome/

The idiots in the press are at it again, cooking a sensation by blowing up an interesting tidbit of information way out of proportion.

Let me point out two obvious facts.

(1) The entire consumer market is rather small as a share of Microsoft revenue (10%?). The netbooks most likely represent less than 1% of the company's revenue stream. You cannot possibly call a "nuclear bomb" something that targets so little money.

(2) The smart phone market will always be much bigger than the netbook market. So if the "nuclear bomb" metaphor made any sense, Apple dropped it years ago with the iPhone.

Here's another stupid quote of the day:

'"One of Google's major goals is to take Microsoft out, to systematically destroy their hold on the market," said Mr Enderle.

"Google wants to eliminate Microsoft and it's a unique battle. The strategy is good. The big question is, will it work?"'

http://news.bbc.co.uk/2/hi/technology/8139711.stm

When I was at Google, the last thing people there were thinking about was Microsoft. I may have heard Microsoft mentioned a grand total of 10 times in my year plus there. What Googlers do care about is building cool things that attract attention and make customers come to their sites. THAT strategy clearly works. Destroying Microsoft - not so much (Netscape tried that approach).

My own take on this - thank you, Google! Windows 8/IE 9 will be better for your efforts. It often takes a competitor to persuade us that a segment of a market is important (unfortunate, but true). With this announcement Google did just that.

Do you have health insurance?

Don't be so sure. You might lose it when you actually need it. Apparently, insurance companies slap a $1M surcharge on corporate policies that carry expensive patients. The companies then face a choice of whether to essentially pay you a $1M+ salary or...

http://www.dailykos.com/storyonly/2009/7/7/751100/-How-I-lost-my-health-insurance-at-the-hairstylists

Incidentally, in 3/4 of all medical bankruptcies (which are half of all bankruptcies in the US) people had health insurance.

http://1-800-magic.blogspot.com/2009/05/us-healthcare-by-numbers.html

Monday, July 6, 2009

BMI is bogus... because it embarrasses USA

It was making sense up to the point where the author claimed that 200 years ago most people led sedentary lifestyles, although I had to ignore his quip on "if the formula does not describe the data, rig the formula" (this, of course, is what science - at least theoretical physics - is all about).

But when I got to the end, it was this: BMI does not make sense because...

"10. It embarrasses the U.S.

It is embarrassing for one of the most scientifically, technologically and medicinally advanced nations in the world to base advice on how to prevent one of the leading causes of poor health and premature death (obesity) on a 200-year-old numerical hack developed by a mathematician who was not even an expert in what little was known about the human body back then."

http://www.npr.org/templates/story/story.php?storyId=106268439&sc=fb&cc=fp

Come to think about it, an even more ridiculous fact is that our entire space program is based on a 300-year-old formula developed by a theologian!



This pearl of logical reasoning comes to you directly from a Stanford (!) Professor (!) of Mathematics (!) Keith Devlin...

http://www.stanford.edu/~kdevlin/

P.S. The author of this blog takes no position on the validity of BMI as a measure of human obesity, only on the validity of the argument against it referenced above.