Friday, July 31, 2009

Removing duplicate comments from a Word document

As I wrote before, I am working on a Malevich-like system (http://malevich.codeplex.com) for reviewing specs the same way we review code.

This work is based on Eric White's excellent blog post about merging comments from two identical files (http://blogs.msdn.com/ericwhite/archive/2009/07/28/merging-comments-from-multiple-open-xml-documents-into-a-single-document.aspx).

The idea is to have a web site where one uploads a Word document. The reviewers then download a locked copy of it, which only allows adding comments. They use Word to comment, and upload the files back. The server merges all comments (using Eric's code) back into the master copy. Every person who downloads the document afterwards gets the comments from all previous reviewers.

While working on this system, I had to add two things in terms of comment management.

First, I had to lock files so only adding comments is allowed. The code for this is here: http://1-800-magic.blogspot.com/2009/07/lockingunlocking-word-doc-files.html.

Second, Eric's code merges the comments by adding all comments from one document to the other. Unfortunately, this means that once the very first reviewer has added his or her comments, every time someone else downloads the copy with these comments, adds more, and uploads the document back, the original set of comments gets duplicated. So I had to write code that cleans up this duplication.

The comments in Word files live in a special section accessible through MainDocumentPart.WordprocessingCommentsPart.Comments of the WordprocessingDocument class. They can be enumerated as follows:

WordprocessingDocument doc = WordprocessingDocument.Open(args[0], true);
foreach (Comment c in doc.MainDocumentPart.WordprocessingCommentsPart.Comments)
    Console.WriteLine("{0} {1}:{2}", c.Id, c.Author, c.InnerText);


This section contains the comments themselves, but it has no information about where the comments attach to the actual text in the Word document. Instead, the comments attach via commentRangeStart, commentRangeEnd, and commentReference elements that are interspersed in the text of the paragraph:

<w:p>
  <w:r>
    <w:t xml:space="preserve">This is a test</w:t>
  </w:r>
  <w:commentRangeStart w:id="0" />
  <w:commentRangeStart w:id="2" />
  <w:commentRangeStart w:id="4" />
  <w:r>
    <w:t>document</w:t>
  </w:r>
  <w:r>
    <w:rPr>
      <w:rStyle w:val="CommentReference" />
    </w:rPr>
    <w:commentReference w:id="0" />
  </w:r>
  <w:commentRangeEnd w:id="0" />
  <w:r>
    <w:rPr>
      <w:rStyle w:val="CommentReference" />
    </w:rPr>
    <w:commentReference w:id="2" />
  </w:r>
  <w:commentRangeEnd w:id="2" />
  <w:r>
    <w:rPr>
      <w:rStyle w:val="CommentReference" />
    </w:rPr>
    <w:commentReference w:id="4" />
  </w:r>
  <w:commentRangeEnd w:id="4" />
  <w:r>
    <w:t>.</w:t>
  </w:r>
</w:p>


To the developer, these elements are accessible from the root element of the Word document's MainDocumentPart:

foreach (CommentReference cRef in
    doc.MainDocumentPart.RootElement.Descendants<CommentReference>())
    Console.WriteLine("Found reference for {0}", cRef.Id);

foreach (CommentRangeStart baseRs in
    doc.MainDocumentPart.RootElement.Descendants<CommentRangeStart>())
    Console.WriteLine("Found range start for {0}", baseRs.Id);

foreach (CommentRangeEnd baseRe in
    doc.MainDocumentPart.RootElement.Descendants<CommentRangeEnd>())
    Console.WriteLine("Found range end for {0}", baseRe.Id);


Unlike the beautiful, almost Lisp-like functional code that Eric wrote to merge comments, the code below goes through some contortions to determine that comments with the same text and author really do start and end in the same place in the Word document. Location matters when deciding whether two comments are equivalent, because it is easy to imagine several separate comments with the same text (for example, "Here, too.") that would otherwise be considered equal.
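
To make the equivalence rule concrete, here is a minimal sketch of it on plain data rather than on OpenXML elements. The ReviewComment class and its AnchorStart/AnchorEnd offsets are hypothetical stand-ins invented for illustration; the real program has no such offsets and has to infer location from neighboring elements in the document.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical, simplified stand-in for a Word comment. The anchor offsets
// are illustrative only and are not part of the OpenXML SDK.
sealed class ReviewComment
{
    public int Id;
    public string Author;
    public string Text;
    public int AnchorStart;
    public int AnchorEnd;
}

static class DuplicateFinder
{
    // Returns the ids of comments that repeat another comment with the same
    // author, text, and anchor. The lowest id in each group is kept.
    public static List<int> FindDuplicateIds(IEnumerable<ReviewComment> comments)
    {
        return comments
            // Anonymous types compare by value, so this groups on all four fields.
            .GroupBy(c => new { c.Author, c.Text, c.AnchorStart, c.AnchorEnd })
            .SelectMany(g => g.OrderBy(c => c.Id).Skip(1))
            .Select(c => c.Id)
            .OrderBy(id => id)
            .ToList();
    }
}

static class Demo
{
    static void Main()
    {
        var comments = new List<ReviewComment>
        {
            new ReviewComment { Id = 0, Author = "Ann", Text = "Here, too.", AnchorStart = 10, AnchorEnd = 18 },
            new ReviewComment { Id = 1, Author = "Ann", Text = "Here, too.", AnchorStart = 40, AnchorEnd = 52 },
            new ReviewComment { Id = 2, Author = "Ann", Text = "Here, too.", AnchorStart = 10, AnchorEnd = 18 },
        };

        foreach (int id in DuplicateFinder.FindDuplicateIds(comments))
            Console.WriteLine("Duplicate comment {0}", id);
    }
}
```

With this sample data only comment 2 is reported. Comment 1 has the same author and text as comment 0 but is attached elsewhere, so it survives.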

To compile the code, download and install Microsoft's OpenXML SDK 2.0 from http://www.microsoft.com/downloads/details.aspx?FamilyId=C6E744E5-36E9-45F5-8D8C-331DF206E0D0&displaylang=en, then add a reference to the DocumentFormat.OpenXml assembly, which the SDK installer puts in the GAC.

Here's the code. It is rather self-explanatory: it collects all the relevant elements from the document (comments, ranges, and comment reference points), determines which ones are duplicates, then removes the dupes.

There is a subtlety that this code relies upon which appears to be true but technically does not have to be: for comments attached to the same location, the commentRangeStart and commentRangeEnd elements appear in the same sequence. For example, if comment A's commentRangeStart precedes comment B's commentRangeStart, then comment A's commentRangeEnd must precede comment B's commentRangeEnd. While this seems to be true for Word, if you are adapting this code for general-purpose OpenXML, I would recommend changing the logic to remove this dependency.
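
If you do want to remove that dependency, one possible approach (a sketch under assumptions, not what the program in this post does) is to stop comparing neighbors in order and instead record which run of adjacent markers each comment's start and end fall into. Comments that share a start run, an end run, and the same author-plus-text key are then duplicates regardless of marker order. The OrderIndependentMatcher class, the int[][] run representation, and the sample data are all hypothetical; building the runs from real commentRangeStart and commentRangeEnd elements is left out.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class OrderIndependentMatcher
{
    // startRuns / endRuns: each inner array holds the ids of range markers
    // that sit next to each other in the document, in whatever order they
    // happen to appear. keys maps a comment id to its "author : text" string.
    public static List<int> FindDuplicateIds(
        int[][] startRuns,
        int[][] endRuns,
        IDictionary<int, string> keys)
    {
        Dictionary<int, int> startRunOf = IndexRuns(startRuns);
        Dictionary<int, int> endRunOf = IndexRuns(endRuns);

        return keys.Keys
            // Ignore comments that lack either marker.
            .Where(id => startRunOf.ContainsKey(id) && endRunOf.ContainsKey(id))
            .GroupBy(id => new { Start = startRunOf[id], End = endRunOf[id], Key = keys[id] })
            // Keep the lowest id in each group and report the rest.
            .SelectMany(g => g.OrderBy(id => id).Skip(1))
            .OrderBy(id => id)
            .ToList();
    }

    // Maps each marker id to the index of the run it belongs to.
    private static Dictionary<int, int> IndexRuns(int[][] runs)
    {
        var runOf = new Dictionary<int, int>();
        for (int runIndex = 0; runIndex < runs.Length; runIndex++)
        {
            foreach (int id in runs[runIndex])
                runOf[id] = runIndex;
        }
        return runOf;
    }
}

static class MatcherDemo
{
    static void Main()
    {
        // The ends of comments 0, 2, and 4 are deliberately out of order.
        int[][] startRuns = { new[] { 0, 2, 4 }, new[] { 5 } };
        int[][] endRuns = { new[] { 4, 0, 2 }, new[] { 5 } };
        var keys = new Dictionary<int, string>
        {
            { 0, "Ann : Typo" }, { 2, "Ann : Typo" },
            { 4, "Ann : Typo" }, { 5, "Ann : Typo" },
        };

        foreach (int id in OrderIndependentMatcher.FindDuplicateIds(startRuns, endRuns, keys))
            Console.WriteLine("Duplicate comment {0}", id);
    }
}
```

In this sample, comments 2 and 4 are reported as duplicates of comment 0 even though their end markers are shuffled, while comment 5 is kept because it sits in different runs.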


//-----------------------------------------------------------------------
// <copyright>
// Copyright (C) Sergey Solyanik.
//
// This file is subject to the terms and conditions of the Microsoft Public License (MS-PL).
// See http://www.microsoft.com/opensource/licenses.mspx#Ms-PL for more details.
// </copyright>
//-----------------------------------------------------------------------
using System;
using System.Collections.Generic;
using System.Xml.Linq;

using DocumentFormat.OpenXml;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;

namespace RemoveDuplicateComments
{
    /// <summary>
    /// Removes duplicate comments in an OpenXML document.
    /// </summary>
    class Program
    {
        /// <summary>
        /// Removes duplicate comments in an OpenXML document.
        /// </summary>
        /// <param name="args"> Command line arguments (file name). </param>
        static void Main(string[] args)
        {
            if (args.Length != 1)
            {
                Console.WriteLine("Usage: removeduplicatecomments filename");
                return;
            }

            // All lookups below are keyed by comment id.
            Dictionary<int, Comment> comments =
                new Dictionary<int, Comment>();
            Dictionary<int, string> commentTexts =
                new Dictionary<int, string>();
            Dictionary<int, CommentRangeStart> commentRangeStarts =
                new Dictionary<int, CommentRangeStart>();
            Dictionary<int, CommentRangeEnd> commentRangeEnds =
                new Dictionary<int, CommentRangeEnd>();
            Dictionary<int, OpenXmlElement> commentReferenceParents =
                new Dictionary<int, OpenXmlElement>();
            HashSet<OpenXmlElement> commentReferenceParentsSet =
                new HashSet<OpenXmlElement>();
            HashSet<int> idsOfIdenticalStarts = new HashSet<int>();
            HashSet<int> idsOfIdenticalEnds = new HashSet<int>();

            // The using block closes the document even if an exception is thrown.
            using (WordprocessingDocument doc =
                WordprocessingDocument.Open(args[0], true))
            {
                WordprocessingCommentsPart commentsPart =
                    doc.MainDocumentPart.WordprocessingCommentsPart;
                if (commentsPart == null)
                {
                    // A document without comments has no comments part at all.
                    Console.WriteLine("The document has no comments.");
                    return;
                }

                // Collect the comments, plus an "author : text" key for each.
                foreach (Comment c in commentsPart.Comments)
                {
                    Console.WriteLine("{0} {1}:{2}", c.Id, c.Author, c.InnerText);
                    int id = int.Parse(c.Id);
                    comments.Add(id, c);
                    commentTexts.Add(id, c.Author + " : " + c.InnerText);
                }

                // Collect the runs that hold the comment reference marks.
                foreach (CommentReference cRef in
                    doc.MainDocumentPart.RootElement.Descendants<CommentReference>())
                {
                    Console.WriteLine("Found reference for {0}", cRef.Id);
                    commentReferenceParents.Add(int.Parse(cRef.Id), cRef.Parent);
                    commentReferenceParentsSet.Add(cRef.Parent);
                }

                // A range start that immediately follows another range start
                // with the same author and text is a duplicate candidate.
                foreach (CommentRangeStart baseRs in
                    doc.MainDocumentPart.RootElement.Descendants<CommentRangeStart>())
                {
                    Console.WriteLine("Found range start for {0}", baseRs.Id);

                    int baseId = int.Parse(baseRs.Id);

                    commentRangeStarts[baseId] = baseRs;

                    string baseCommentText = commentTexts[baseId];

                    CommentRangeStart rs = baseRs;
                    for (; ; )
                    {
                        CommentRangeStart next = rs.NextSibling() as CommentRangeStart;
                        if (next == null)
                            break;

                        rs = next;

                        int rsId = int.Parse(rs.Id);
                        if (baseCommentText == commentTexts[rsId])
                            idsOfIdenticalStarts.Add(rsId);
                    }
                }

                // Same for range ends, except that the runs holding comment
                // references sit between the ends and have to be skipped.
                foreach (CommentRangeEnd baseRe in
                    doc.MainDocumentPart.RootElement.Descendants<CommentRangeEnd>())
                {
                    Console.WriteLine("Found range end for {0}", baseRe.Id);

                    int baseId = int.Parse(baseRe.Id);

                    commentRangeEnds[baseId] = baseRe;

                    string baseCommentText = commentTexts[baseId];

                    CommentRangeEnd re = baseRe;
                    for (; ; )
                    {
                        OpenXmlElement nextEl = re.NextSibling();
                        while (nextEl != null && commentReferenceParentsSet.Contains(nextEl))
                            nextEl = nextEl.NextSibling();

                        re = nextEl as CommentRangeEnd;
                        if (re == null)
                            break;

                        int reId = int.Parse(re.Id);
                        if (baseCommentText == commentTexts[reId])
                            idsOfIdenticalEnds.Add(reId);
                    }
                }

                // A comment is a duplicate only if both its start and its end
                // match an earlier comment with the same author and text.
                foreach (int id in idsOfIdenticalStarts)
                {
                    if (idsOfIdenticalEnds.Contains(id))
                    {
                        Console.WriteLine("Eliminating comment {0}", id);
                        commentRangeStarts[id].Remove();
                        commentRangeEnds[id].Remove();
                        commentReferenceParents[id].Remove();
                        comments[id].Remove();
                    }
                }

                doc.MainDocumentPart.RootElement.Save();
                commentsPart.RootElement.Save();
            }

            Console.WriteLine("All done!");
        }
    }
}
