DocBook to PDF for MoinMoin wiki

MoinMoin is a wiki engine that I frequently use and recommend, mostly because it is so easy to set up. It requires Python and that’s it – a simple web server is built-in, and it uses regular files to store data.

There are a couple of downsides to wikis, however:

  • Web access is needed to read the content.
  • If issued as a reference (e.g. standards document), the content might change at any time.

Fortunately, Moin can render wiki pages in DocBook format, which can then be transformed into PDF documents. Over a few hours in the early morning, I cobbled together a set of scripts and XSL transformations that generate reasonable-looking PDF versions of the wiki content.

The basic steps:

  1. A Perl script crawls the wiki (using wget), looking for pages to include in the PDF.
  2. Each page we want to keep is run through an XSLT process that converts the individual DocBook <article> into a <section>.
  3. All the modified DocBook instances are concatenated and polished into a single DocBook <book>, with each top-level section promoted to a <chapter>.
  4. dblatex, an open source tool, converts the resulting DocBook book into a PDF.
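
As a rough illustration of step 1, the sketch below shows how a page's DocBook export URL can be assembled from the wiki's base URL, using the same query string that the script further down relies on. The sub name docbook_url and the host wiki.example.org are made up for this example and are not part of the original script.

```perl
use strict;
use warnings;

# Query string that asks MoinMoin to render a page as DocBook
# (the same suffix used by moinToPdf.pl in this post).
my $URL_SUFFIX = "?action=format&mimetype=xml/docbook";

# Hypothetical helper: join base URL, page path and export suffix.
sub docbook_url {
    my ($base_url, $page) = @_;
    $base_url =~ s{/+$}{};                   # drop trailing slashes on the base
    $page = "/$page" unless $page =~ m{^/};  # page paths start with a slash
    return $base_url . $page . $URL_SUFFIX;
}

# wiki.example.org is a placeholder host, not a real wiki.
print docbook_url("http://wiki.example.org", "/FrontPage"), "\n";
```

Normalizing the slashes in one place means the caller can pass the base URL with or without a trailing slash.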

What I learned:

  • The Perl XML::XSLT module doesn’t support substring() (or any?) functions, so I ended up using Saxon instead.
  • A lot of minor tweaks were necessary to get the content into the right place for dblatex to render a reasonable-looking PDF, e.g. stripping leading slashes, moving the page title from one XML subtree to another, and changing <ulink> tags to <link> tags.
  • MoinMoin rejects requests whose User-Agent belongs to a known automated spidering tool, so I had to set the User-Agent that wget advertises to a value Moin didn’t recognize.
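
The User-Agent workaround can be sketched as follows. This is only an illustration: the sub name wget_args, the download directory and the example URL are assumptions, while -q, -U and -P are the same wget options the script below uses.

```perl
use strict;
use warnings;

# Build the wget argument list for fetching a batch of URLs.
# -q: quiet, -U: User-Agent header to send, -P: download directory.
sub wget_args {
    my ($download_dir, $user_agent, @urls) = @_;
    return ("wget", "-q", "-U", $user_agent, "-P", $download_dir, @urls);
}

# Placeholder directory and URL; "foobar" is the agent string used in the post.
my @cmd = wget_args("/tmp/moinToPdf-demo", "foobar",
    "http://wiki.example.org/FrontPage?action=format&mimetype=xml/docbook");

# Passing a list to system() bypasses the shell, so the '?' and '&' in the
# URL need no quoting. Left commented out here to avoid network access.
# system(@cmd) == 0 or warn "wget exited with status $?\n";
print join(" ", @cmd), "\n";
```

Building the command as a list rather than one interpolated string is a design choice of this sketch, not something the original script does.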

Below are the scripts I put together (you need to download Saxon separately).

moinToPdf.pl

#!/usr/bin/perl

use strict;
use English;

die "Specify URL and name of book." if @ARGV < 2;

my $BASE_URL=$ARGV[0];
my $BOOK_NAME=$ARGV[1];
my %VISITED_URLS;
my $URL_SUFFIX = "?action=format&mimetype=xml/docbook";
my @SPIDER_URLS;
push @SPIDER_URLS, "/FrontPage";

my $TMP_DIR=`mktemp -d /tmp/moinToPdf-XXXXX`;
chomp $TMP_DIR;
#print STDOUT "$TMP_DIR\n";
#print STDOUT "---\n";

while (scalar @SPIDER_URLS > 0) {
  my @RELATIVE_URLS;
  my $WGET_STRING = "";

  while (scalar @SPIDER_URLS > 0) {
    my $RELATIVE_URL = pop @SPIDER_URLS;
    my $FULL_URL=${BASE_URL} . ${RELATIVE_URL} . $URL_SUFFIX;
    push @RELATIVE_URLS, $RELATIVE_URL;
    if (length $WGET_STRING > 0) {
      $WGET_STRING = $WGET_STRING . " ";
    }
    $WGET_STRING = $WGET_STRING . "\'${FULL_URL}\'";
#print STDOUT "$BASE_URL$RELATIVE_URL\n";
  }
#print STDOUT "---\n";

  `wget -P $TMP_DIR -q -U foobar ${WGET_STRING}`;

  while (scalar @RELATIVE_URLS > 0) {
    my $RELATIVE_URL = pop @RELATIVE_URLS;
    my $BASE_RELATIVE_URL = `basename "$RELATIVE_URL"`;
    chomp $BASE_RELATIVE_URL;
    $RELATIVE_URL = "/" . $BASE_RELATIVE_URL;

    my $TMP_FILE = $TMP_DIR . $RELATIVE_URL . $URL_SUFFIX;
    $TMP_FILE=~s/xml\/docbook/xml%2Fdocbook/;
#print STDOUT "$TMP_FILE\n";

    # Skip pages that wget could not fetch (missing or empty file).
    if (! -s $TMP_FILE) {
#print STDOUT "SKIP: $TMP_FILE\n";
      next;
    }

    my $output=`java -jar saxon8.jar \'$TMP_FILE\' getWikiNames.xsl`;

    my $DOCBOOK;
    if ($RELATIVE_URL) {
      $DOCBOOK=substr($RELATIVE_URL,1) . ".xml";
    } else {
      $DOCBOOK="FrontPage.xml";
    }

    my $DIR=`dirname "$DOCBOOK"`;
    chomp $DIR;
    `mkdir -p "$DIR"`;

    `java -jar saxon8.jar -o "$DOCBOOK" "$TMP_FILE" transformArticle.xsl`;

    foreach (split '\n', $output) {
      chomp $_;
      /.*url="([^"]+)".*/;
      my $NEW_URL=$1;
      /<ulink[^>]*>(.*)<\/ulink>/;
      my $NAME=$1;
      if ($NEW_URL=~/action=AttachFile.*do=get.*/) {
        my $FILE_URL = "${BASE_URL}${NEW_URL}";
        my $FILE_NAME = "$FILE_URL";
        $FILE_NAME=~s/.*target=(.*)/$1/;
        `wget -q -U foobar -O "$FILE_NAME" "$BASE_URL$NEW_URL"`;
      } elsif (
        $NEW_URL=~/^\/.*/
        and not defined $VISITED_URLS{$NEW_URL}
        and not $NEW_URL=~/^\/OtherUser/
        and not $NEW_URL=~/^\/HelpOn/
        and not $NEW_URL=~/^\/Category/
        and not $NEW_URL=~/^\/SystemPages/
        and not $NEW_URL=~/^\/MoinMoin/
        and not $NEW_URL=~/^\/WhyWikiWorks/
        and not $NEW_URL=~/^\/RecentChanges/
        and not $NEW_URL=~/^\/WikiCourse/
        and not $NEW_URL=~/^\/AutoAdminGroup/
        and not $NEW_URL=~/^\/HelpContents/
        and not $NEW_URL=~/^\/HelpMiscellaneous/
        and not $NEW_URL=~/^\/WikiWikiWeb/
        and not $NEW_URL=~/^\/SiteNavigation/
        and not $NEW_URL=~/^\/RandomPage/
        and not $NEW_URL=~/^\/WantedPages/
        and not $NEW_URL=~/^\/WordIndex/
        and not $NEW_URL=~/^\/FindPage/
        and not $NEW_URL=~/^\/WikiName/
        and not $NEW_URL=~/^\/InterWiki/
        and not $NEW_URL=~/^\/TitleIndex/
        and not $NEW_URL=~/^\/SyntaxReference/
        and not $NEW_URL=~/^\/HelpIndex/
        and not $NEW_URL=~/^\/HelpForBeginners/
        and not $NEW_URL=~/^\/WikiSandBox/
        and not $NEW_URL=~/.*action=AttachFile.*/
      ) {
        $VISITED_URLS{$NEW_URL} = 1;
        push @SPIDER_URLS, $NEW_URL;
      }
    }
  }
}

#`rm -rf $TMP_DIR`;

my $BOOK_DATE=`date +%Y%m%d%H%M`;
chomp $BOOK_DATE;
my $BOOK_TMP="${BOOK_NAME}.${BOOK_DATE}.tmp";
my $BOOK="${BOOK_NAME}.${BOOK_DATE}.xml";

#print STDOUT "BookDate: $BOOK_DATE";
#print STDOUT "BookInterim: $BOOK_TMP";
#print STDOUT "Book: $BOOK";

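# Wrap every converted page in a single <sections> root element,
# which is what polishBook.xsl expects as its input.
open FILE, ">$BOOK_TMP" or die "Cannot write $BOOK_TMP: $!";
print FILE '<?xml version="1.0"?>';
print FILE '<sections>';
foreach (`find . -name '*.xml'`) {
  chomp $_;
  open FILE2, "$_" or die "Cannot read $_: $!";
  while (<FILE2>) {
    # Drop each page's own XML declaration; only one is allowed per document.
    s/<\?xml[^>]+\?>//;
    print FILE $_;
  }
  close FILE2;
}
print FILE '</sections>';
close FILE;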

`java -jar saxon8.jar -o $BOOK $BOOK_TMP polishBook.xsl`;
`dblatex $BOOK`;
`gzip $BOOK`;
`rm -f $BOOK_TMP`;
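
The long chain of "and not" tests in the script could also be expressed as a list of page-name prefixes. The sketch below shows one possible way to do that; the names @SKIP_PREFIXES and should_spider are invented for this example, and the list is only a subset of the prefixes used above.

```perl
use strict;
use warnings;

# Page-name prefixes that should not be crawled (a subset of the list
# in moinToPdf.pl; extend as needed).
my @SKIP_PREFIXES = qw(
    HelpOn Category SystemPages MoinMoin RecentChanges
    TitleIndex WordIndex FindPage WikiSandBox
);
my $SKIP_RE = do {
    my $alternatives = join "|", map { quotemeta($_) } @SKIP_PREFIXES;
    qr{^/(?:$alternatives)};
};

# Hypothetical helper: true if $url is a local, unvisited, non-system page.
sub should_spider {
    my ($url, $visited) = @_;
    return 0 unless $url =~ m{^/};            # only wiki-relative links
    return 0 if $visited->{$url};             # already queued or fetched
    return 0 if $url =~ $SKIP_RE;             # built-in/system pages
    return 0 if $url =~ /action=AttachFile/;  # attachments are handled separately
    return 1;
}

my %visited = ("/ProjectPlan" => 1);
for my $url ("/ProjectPlan", "/HelpOnEditing", "/Design/Overview", "http://example.org/") {
    printf "%-22s %s\n", $url, should_spider($url, \%visited) ? "crawl" : "skip";
}
```

Adding a new excluded page then means adding one word to the list rather than another condition.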

getWikiNames.xsl

<?xml version="1.0" ?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
  <!-- Stylesheet finds the WikiName links and spits them out, one per line, for the Perl script to determine the next URLs. -->
  <xsl:template match="ulink">
    <xsl:copy-of select="."/><xsl:text>
</xsl:text>
  </xsl:template>
  <xsl:template match="/">
    <xsl:apply-templates select="//ulink"/>
  </xsl:template>
</xsl:stylesheet>
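
On the Perl side, each line emitted by this stylesheet is one serialized <ulink> element. A minimal sketch of pulling the URL and link text out of such a line might look like this; the sub name parse_ulink and the sample input are assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: extract the url attribute and the link text from
# one serialized <ulink> element. Capturing in list context returns undef
# on a failed match instead of reusing a stale $1.
sub parse_ulink {
    my ($line) = @_;
    my ($url)  = $line =~ /url="([^"]+)"/;
    my ($name) = $line =~ m{<ulink[^>]*>(.*)</ulink>};
    return ($url, $name);
}

# Sample input, not taken from a real wiki.
my ($url, $name) = parse_ulink('<ulink url="/ProjectPlan">ProjectPlan</ulink>');
print "$url => $name\n";
```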

transformArticle.xsl

<?xml version="1.0" ?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
  <xsl:template match="/">
    <xsl:apply-templates/>
  </xsl:template>

  <!-- Default copy rules. -->
  <xsl:template match="text()">
    <xsl:value-of select="."/>
  </xsl:template>
  <xsl:template match="@*">
    <xsl:copy-of select="."/>
  </xsl:template>
  <xsl:template match="*">
    <xsl:copy>
      <xsl:apply-templates/>
    </xsl:copy>
  </xsl:template>

  <!-- Converts articles into sections. -->
  <xsl:template match="/article">
    <section>
      <xsl:attribute name="id">
        <xsl:value-of select="articleinfo/title"/>
      </xsl:attribute>
      <xsl:apply-templates/>
    </section>
  </xsl:template>
  <xsl:template match="/article/section">
    <xsl:apply-templates/>
  </xsl:template>

  <!-- Strip out <articleInfo/> as it's not needed for <section/> or <chapter/> -->
  <xsl:template match="/article/articleinfo"/>

  <!-- Tables require IDs, and won't render to PDF properly without them.  So use informal tables instead. -->
  <xsl:template match="table">
    <informaltable>
      <xsl:copy-of select="@*"/>
      <xsl:for-each select="*">
        <xsl:choose>
          <!-- Sets the column count for tables, to avoid dblatex warning messages. -->
          <xsl:when test="name() = 'tgroup'">
            <tgroup>
              <xsl:attribute name="cols" select="count(colspec)"/>
              <xsl:copy-of select="@*"/>
              <xsl:for-each select="*">
                <xsl:copy>
                  <xsl:apply-templates/>
                </xsl:copy>
              </xsl:for-each>
            </tgroup>
          </xsl:when>
          <xsl:otherwise>
            <xsl:copy>
              <xsl:apply-templates/>
            </xsl:copy>
          </xsl:otherwise>
        </xsl:choose>
      </xsl:for-each>
    </informaltable>
  </xsl:template>

  <!-- Fix up any WikiName references to be rid of the leading slash. -->
  <xsl:template match="ulink">
    <xsl:choose>
      <xsl:when test="inlinemediaobject/imageobject/imagedata">
        <xsl:variable name="filename" select='replace(@url,".*target=","")'/>
        <inlinemediaobject>
          <imageobject>
            <imagedata>
              <xsl:attribute name="fileref">
                <xsl:value-of select="$filename"/>
              </xsl:attribute>
            </imagedata>
          </imageobject>
          <textobject>
            <phrase>
              <xsl:value-of select="$filename"/>
            </phrase>
          </textobject>
        </inlinemediaobject>
      </xsl:when>
      <xsl:otherwise>
        <xsl:choose>
          <xsl:when test="substring(@url,1,1) = '/'">
            <link>
              <xsl:attribute name="linkend">
                <xsl:value-of select="substring(@url,2)"/>
              </xsl:attribute>
              <xsl:apply-templates/>
            </link>
          </xsl:when>
          <xsl:otherwise>
            <xsl:copy>
              <xsl:copy-of select="@*"/>
              <!--<xsl:value-of select="@url"/>-->
              <xsl:apply-templates/>
            </xsl:copy>
          </xsl:otherwise>
        </xsl:choose>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>

</xsl:stylesheet>
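
For readers more at home in Perl than XSLT, the link handling in the ulink template above boils down to three cases. The sketch below restates them; the sub name classify_link and its return convention are invented here and do not appear in the original code.

```perl
use strict;
use warnings;

# Hypothetical helper mirroring the rules of the ulink template:
#   image attachment -> local file name (everything after "target=")
#   "/PageName"      -> internal link target "PageName"
#   anything else    -> kept as an external URL
sub classify_link {
    my ($url, $is_image) = @_;
    if ($is_image) {
        (my $file = $url) =~ s/.*target=//;
        return (image => $file);
    }
    if (substr($url, 0, 1) eq "/") {
        return (link => substr($url, 1));
    }
    return (ulink => $url);
}

# Sample URLs, chosen for illustration only.
for my $case (["/ProjectPlan", 0],
              ["/Page?action=AttachFile&do=get&target=diagram.png", 1],
              ["http://example.org/", 0]) {
    my ($kind, $value) = classify_link(@$case);
    print "$kind: $value\n";
}
```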

polishBook.xsl

<?xml version="1.0" ?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
  <xsl:template match="/">
    <xsl:apply-templates/>
  </xsl:template>
  <xsl:template match="/sections">
    <book>
      <xsl:apply-templates>
        <xsl:sort select="attribute::id" />
      </xsl:apply-templates>
    </book>
  </xsl:template>

  <!-- Converts each section into a chapter. -->
  <!-- Uses existence of '/' in @id as an indicator of nesting -->
  <xsl:template match="/sections/section[not(contains(@id,'/'))]">
    <xsl:variable name='sectionId' select='@id'/>
    <chapter>
      <xsl:copy-of select="@*"/>
      <xsl:for-each select="*">
        <xsl:copy>
          <xsl:apply-templates/>
        </xsl:copy>
      </xsl:for-each>
      <xsl:for-each
      select="/sections/section[starts-with(@id,concat($sectionId, '/'))]">
        <section>
          <xsl:copy-of select="@*"/>
          <xsl:for-each select="*">
            <xsl:copy>
              <xsl:apply-templates/>
            </xsl:copy>
          </xsl:for-each>
        </section>
      </xsl:for-each>
    </chapter>
  </xsl:template>

  <!-- These three templates are the default copy rules. -->
  <xsl:template match="text()">
    <xsl:value-of select="."/>
  </xsl:template>
  <xsl:template match="@*">
    <xsl:copy-of select="."/>
  </xsl:template>
  <xsl:template match="*">
    <xsl:copy>
      <xsl:apply-templates select="*|@*|text()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Strips out any of the Category* pages. -->
  <xsl:template match="section">
    <xsl:choose>
      <xsl:when test="substring(@id,1,8) = 'Category'"/>
      <xsl:otherwise>
        <xsl:copy>
          <xsl:copy-of select="@*"/>
          <xsl:apply-templates/>
        </xsl:copy>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>

  <!-- Strips out any links to Category*, and transforms dangling WikiName references to special text. -->
  <xsl:template match="link">
    <xsl:variable name="end">
      <xsl:value-of select="@linkend"/>
    </xsl:variable>
    <xsl:choose>
      <xsl:when test="//section[@id=$end]">
        <xsl:copy-of select="."/>
      </xsl:when>
      <xsl:when test="substring($end,1,8) = 'Category'"/>
      <xsl:otherwise>
        <emphasis role="italics"><emphasis role="underline">
          <xsl:value-of select="text()"/>
        </emphasis></emphasis>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>
</xsl:stylesheet>
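
The nesting rule used by this stylesheet (an id without a '/' starts a chapter, and ids that begin with that chapter's id plus '/' become its sections) can be sketched in Perl as below. This covers only the grouping rule, not the rest of the stylesheet, and the sub name group_into_chapters and the sample ids are assumptions.

```perl
use strict;
use warnings;

# Group page ids into chapters: an id without '/' starts a chapter, and
# ids beginning with "<chapter id>/" are listed as that chapter's sections.
sub group_into_chapters {
    my @ids = sort @_;
    my %chapters;
    for my $id (grep { !m{/} } @ids) {
        $chapters{$id} = [ grep { index($_, "$id/") == 0 } @ids ];
    }
    return \%chapters;
}

# Sample page ids, not taken from a real wiki.
my $book = group_into_chapters("Design", "Design/Overview", "FrontPage", "Design/Api");
for my $chapter (sort keys %$book) {
    print "$chapter: @{ $book->{$chapter} }\n";
}
```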