MoinMoin is a wiki engine that I frequently use and recommend, mostly because it is so easy to set up. It requires Python and that’s it – a simple web server is built-in, and it uses regular files to store data.
There are a couple of downsides to wikis, however:
- Web access is needed to read the content.
- If issued as a reference (e.g. a standards document), the content might change at any time.
Fortunately, Moin supports rendering wiki pages in DocBook format, which can be transformed into PDF documents. After a few hours of work in the wee hours of the morning, I cobbled together a set of scripts and XSL transformations that generate reasonable-looking PDF versions of the wiki content.
The basic steps:
- A Perl script crawls the wiki (using wget), looking for pages to include in the PDF.
- Run the pages that we want to keep through an XSLT process to convert each individual DocBook <article> into a <chapter> instead.
- Concatenate all the modified DocBook instances into a DocBook <book>.
- Run dblatex, an open source tool that converts DocBook instances into PDFs.
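To make the first and third steps more concrete, here is a minimal Python sketch (the actual scripts are Perl and XSLT). It reuses the `?action=format&mimetype=xml/docbook` query string and the `<sections>` wrapper element that the scripts rely on; the function names `docbook_url` and `wrap_fragments` and the example host are assumptions made purely for illustration.

```python
import re

# Query string that asks Moin to render a page as DocBook (taken from the Perl script).
DOCBOOK_SUFFIX = "?action=format&mimetype=xml/docbook"

# Matches an XML declaration such as <?xml version="1.0"?>.
XML_DECL = re.compile(r"<\?xml[^>]+\?>")


def docbook_url(base_url: str, page: str) -> str:
    """Build the DocBook export URL for a wiki page path like '/FrontPage'."""
    return base_url.rstrip("/") + page + DOCBOOK_SUFFIX


def wrap_fragments(fragments: list[str], root: str = "sections") -> str:
    """Strip each fragment's XML declaration and wrap them all in one root element."""
    bodies = [XML_DECL.sub("", fragment).strip() for fragment in fragments]
    return f"<{root}>" + "".join(bodies) + f"</{root}>"


if __name__ == "__main__":
    # wiki.example.com is a placeholder host, not a real wiki.
    print(docbook_url("http://wiki.example.com/", "/FrontPage"))
    print(wrap_fragments(['<?xml version="1.0"?><section id="A"/>', '<section id="B"/>']))
```

Stripping the per-page XML declarations matters because a well-formed document may only carry one, at the very start.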
What I learned:
- The Perl XML::XSLT module doesn’t support substr (or any?) functions. I ended up using Saxon instead.
- A lot of minor tweaks were necessary to get the content in the right place for dblatex to render the PDF in a fashion that looked reasonable, e.g. stripping leading slashes, moving the page title from one XML subtree to another, changing <ulink> tags to <link> tags.
- MoinMoin doesn’t like a User-Agent of known automated spidering tools. I had to set the User-Agent that wget advertised to a value Moin didn’t recognize.
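The User-Agent workaround is not specific to wget. As a rough sketch, this is how the same idea might look with Python's standard `urllib.request`; the value `foobar` is what the Perl script passes to wget's `-U` option, while the helper names `build_request` and `fetch` are hypothetical.

```python
import urllib.request

# Same placeholder value the Perl script passes to wget via -U.
CUSTOM_USER_AGENT = "foobar"


def build_request(url: str, user_agent: str = CUSTOM_USER_AGENT) -> urllib.request.Request:
    """Create a request that advertises a custom User-Agent instead of the library default."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url: str) -> bytes:
    """Download a URL using the custom User-Agent (performs a real network request)."""
    with urllib.request.urlopen(build_request(url)) as response:
        return response.read()
```

Calling `fetch` with a page URL from your own wiki should then get past a spider check that only looks at the User-Agent header, assuming nothing else blocks the request.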
Below are the scripts I put together (you need to download Saxon separately).
moinToPdf.pl
#!/usr/bin/perl
use strict;
use English;
die "Specify URL and name of book." if @ARGV < 2;
my $BASE_URL=$ARGV[0];
my $BOOK_NAME=$ARGV[1];
my %VISITED_URLS;
my $URL_SUFFIX = "?action=format&mimetype=xml/docbook";
my @SPIDER_URLS;
push @SPIDER_URLS, "/FrontPage";
my $TMP_DIR=`mktemp -d /tmp/moinToPdf-XXXXX`;
chomp $TMP_DIR;
#print STDOUT "$TMP_DIR\n";
#print STDOUT "---\n";
# Crawl the wiki: each pass downloads every queued page, then queues the new links it finds.
while (scalar @SPIDER_URLS > 0) {
    my @RELATIVE_URLS;
    my $WGET_STRING = "";
    # Build one wget command line covering every queued page.
    while (scalar @SPIDER_URLS > 0) {
        my $RELATIVE_URL = pop @SPIDER_URLS;
        my $FULL_URL=${BASE_URL} . ${RELATIVE_URL} . $URL_SUFFIX;
        push @RELATIVE_URLS, $RELATIVE_URL;
        if (length $WGET_STRING > 0) {
            $WGET_STRING = $WGET_STRING . " ";
        }
        $WGET_STRING = $WGET_STRING . "\'${FULL_URL}\'";
        #print STDOUT "$BASE_URL$RELATIVE_URL\n";
    }
    #print STDOUT "---\n";
    # -U sets a User-Agent that Moin does not recognize as a spider.
    `wget -P $TMP_DIR -q -U foobar ${WGET_STRING}`;
    while (scalar @RELATIVE_URLS > 0) {
        my $RELATIVE_URL = pop @RELATIVE_URLS;
        my $BASE_RELATIVE_URL = `basename "$RELATIVE_URL"`;
        chomp $BASE_RELATIVE_URL;
        $RELATIVE_URL = "/" . $BASE_RELATIVE_URL;
        # The downloaded file is named after the last URL component plus the query string.
        my $TMP_FILE = $TMP_DIR . $RELATIVE_URL . $URL_SUFFIX;
        $TMP_FILE=~s/xml\/docbook/xml%2Fdocbook/;
        #print STDOUT "$TMP_FILE\n";
        # Skip pages that failed to download or came back empty.
        `grep -q '' '$TMP_FILE' 2>/dev/null`;
        if ($? != 0) {
            #print STDOUT "SKIP: $TMP_FILE\n";
            next;
        }
        # Collect the <ulink> elements (one per line) so new pages can be queued.
        my $output=`java -jar saxon8.jar \'$TMP_FILE\' getWikiNames.xsl`;
        my $DOCBOOK;
        if ($RELATIVE_URL) {
            $DOCBOOK=substr($RELATIVE_URL,1) . ".xml";
        } else {
            $DOCBOOK="FrontPage.xml";
        }
        my $DIR=`dirname "$DOCBOOK"`;
        chomp $DIR;
        `mkdir -p "$DIR"`;
        `java -jar saxon8.jar -o "$DOCBOOK" "$TMP_FILE" transformArticle.xsl`;
        foreach (split '\n', $output) {
            chomp $_;
            # Skip output lines that carry no link target.
            my ($NEW_URL) = /url="([^"]+)"/;
            next unless defined $NEW_URL;
            # Link text; captured but not used below.
            my ($NAME) = /<ulink[^>]*>(.*)<\/ulink>/;
            if ($NEW_URL=~/action=AttachFile.*do=get.*/) {
                # Download attachments (e.g. images) into the current directory.
                my $FILE_URL = "${BASE_URL}${NEW_URL}";
                my $FILE_NAME = "$FILE_URL";
                $FILE_NAME=~s/.*target=(.*)/$1/;
                `wget -q -U foobar -O "$FILE_NAME" "$FILE_URL"`;
            } elsif (
                $NEW_URL=~/^\/.*/
                and not defined $VISITED_URLS{$NEW_URL}
                and not $NEW_URL=~/^\/OtherUser/
                and not $NEW_URL=~/^\/HelpOn/
                and not $NEW_URL=~/^\/Category/
                and not $NEW_URL=~/^\/SystemPages/
                and not $NEW_URL=~/^\/MoinMoin/
                and not $NEW_URL=~/^\/WhyWikiWorks/
                and not $NEW_URL=~/^\/RecentChanges/
                and not $NEW_URL=~/^\/WikiCourse/
                and not $NEW_URL=~/^\/AutoAdminGroup/
                and not $NEW_URL=~/^\/HelpContents/
                and not $NEW_URL=~/^\/HelpMiscellaneous/
                and not $NEW_URL=~/^\/WikiWikiWeb/
                and not $NEW_URL=~/^\/SiteNavigation/
                and not $NEW_URL=~/^\/RandomPage/
                and not $NEW_URL=~/^\/WantedPages/
                and not $NEW_URL=~/^\/WordIndex/
                and not $NEW_URL=~/^\/FindPage/
                and not $NEW_URL=~/^\/WikiName/
                and not $NEW_URL=~/^\/InterWiki/
                and not $NEW_URL=~/^\/TitleIndex/
                and not $NEW_URL=~/^\/SyntaxReference/
                and not $NEW_URL=~/^\/HelpIndex/
                and not $NEW_URL=~/^\/HelpForBeginners/
                and not $NEW_URL=~/^\/WikiSandBox/
                and not $NEW_URL=~/.*action=AttachFile.*/
            ) {
                # A local page we have not seen yet: queue it for the next pass.
                $VISITED_URLS{$NEW_URL} = 1;
                push @SPIDER_URLS, $NEW_URL;
            }
        }
    }
}
#`rm -rf $TMP_DIR`;
my $BOOK_DATE=`date +%Y%m%d%H%M`;
chomp $BOOK_DATE;
my $BOOK_TMP="${BOOK_NAME}.${BOOK_DATE}.tmp";
my $BOOK="${BOOK_NAME}.${BOOK_DATE}.xml";
#print STDOUT "BookDate: $BOOK_DATE";
#print STDOUT "BookInterim: $BOOK_TMP";
#print STDOUT "Book: $BOOK";
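# Wrap all converted pages in a single <sections> root, which polishBook.xsl expects.
open(FILE, ">$BOOK_TMP") or die "Cannot write $BOOK_TMP: $!";
print FILE '<?xml version="1.0"?>';
print FILE '<sections>';
foreach (`find . -name '*.xml'`) {
    chomp $_;
    open(FILE2, "$_") or die "Cannot read $_: $!";
    while (<FILE2>) {
        # Drop each page's own XML declaration.
        s/<\?xml[^>]+\?>//;
        print FILE $_;
    }
    close FILE2;
}
print FILE '</sections>';
close FILE;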
`java -jar saxon8.jar -o $BOOK $BOOK_TMP polishBook.xsl`;
`dblatex $BOOK`;
`gzip $BOOK`;
`rm -f $BOOK_TMP`;
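The long `elsif` condition in the script boils down to a small filter: follow only wiki-relative links that have not been visited, are not attachments, and do not point at Moin's built-in pages. A compact Python sketch of that logic follows; the prefix tuple is only a subset of the script's list, and `should_crawl`, `enqueue_links`, and the page name `/ProjectPlan` are hypothetical.

```python
# A subset of the page-name prefixes the Perl script skips; extend as needed.
EXCLUDED_PREFIXES = ("/HelpOn", "/Category", "/RecentChanges", "/WikiSandBox")


def should_crawl(url: str, visited: set[str]) -> bool:
    """Return True for unvisited, wiki-relative page links that are not excluded."""
    if not url.startswith("/"):
        return False  # external link
    if "action=AttachFile" in url:
        return False  # attachments are downloaded separately
    if url.startswith(EXCLUDED_PREFIXES):
        return False  # built-in help and system pages
    return url not in visited


def enqueue_links(links: list[str], visited: set[str], queue: list[str]) -> None:
    """Mark each acceptable link as visited and add it to the crawl queue."""
    for url in links:
        if should_crawl(url, visited):
            visited.add(url)
            queue.append(url)


if __name__ == "__main__":
    visited: set[str] = set()
    queue: list[str] = []
    enqueue_links(["/ProjectPlan", "/HelpOnEditing", "/ProjectPlan"], visited, queue)
    print(queue)
```

Marking a link as visited at the moment it is queued, as the Perl script does, keeps the same page from being queued twice when several pages link to it.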
getWikiNames.xsl
<?xml version="1.0" ?> <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
<!-- Stylesheet finds the WikiName links and spits them out for the Perl script to determine the next URLs. -->
<xsl:template match="ulink">
<xsl:copy-of select="."/><xsl:text>
</xsl:text>
</xsl:template>
<xsl:template match="/">
<xsl:apply-templates select="//ulink"/>
</xsl:template>
</xsl:stylesheet>
transformArticle.xsl
<?xml version="1.0" ?> <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
<xsl:template match="/">
<xsl:apply-templates/>
</xsl:template>
<!-- Default copy rules. -->
<xsl:template match="text()">
<xsl:value-of select="."/>
</xsl:template>
<xsl:template match="@*">
<xsl:copy-of select="."/>
</xsl:template>
<xsl:template match="*">
<xsl:copy>
<xsl:apply-templates/>
</xsl:copy>
</xsl:template>
<!-- Converts articles into sections. -->
<xsl:template match="/article">
<section>
<xsl:attribute name="id">
<xsl:value-of select="articleinfo/title"/>
</xsl:attribute>
<xsl:apply-templates/>
</section>
</xsl:template>
<xsl:template match="/article/section">
<xsl:apply-templates/>
</xsl:template>
<!-- Strip out <articleinfo/> as it's not needed for <section/> or <chapter/> -->
<xsl:template match="/article/articleinfo"/>
<!-- Tables require IDs, and won't render to PDF properly without them. So use informal tables instead. -->
<xsl:template match="table">
<informaltable>
<xsl:copy-of select="@*"/>
<xsl:for-each select="*">
<xsl:choose>
<!-- Sets the column count for tables, to avoid dblatex warning messages. -->
<xsl:when test="name() = 'tgroup'">
<tgroup>
<xsl:attribute name="cols" select="count(colspec)"/>
<xsl:copy-of select="@*"/>
<xsl:for-each select="*">
<xsl:copy>
<xsl:apply-templates/>
</xsl:copy>
</xsl:for-each>
</tgroup>
</xsl:when>
<xsl:otherwise>
<xsl:copy>
<xsl:apply-templates/>
</xsl:copy>
</xsl:otherwise>
</xsl:choose>
</xsl:for-each>
</informaltable>
</xsl:template>
<!-- Fix up any WikiName references to be rid of the leading slash. -->
<xsl:template match="ulink">
<xsl:choose>
<xsl:when test="inlinemediaobject/imageobject/imagedata">
<xsl:variable name="filename" select='replace(@url,".*target=","")'/>
<inlinemediaobject>
<imageobject>
<imagedata>
<xsl:attribute name="fileref">
<xsl:value-of select="$filename"/>
</xsl:attribute>
</imagedata>
</imageobject>
<textobject>
<phrase>
<xsl:value-of select="$filename"/>
</phrase>
</textobject>
</inlinemediaobject>
</xsl:when>
<xsl:otherwise>
<xsl:choose>
<xsl:when test="substring(@url,1,1) = '/'">
<link>
<xsl:attribute name="linkend">
<xsl:value-of select="substring(@url,2)"/>
</xsl:attribute>
<xsl:apply-templates/>
</link>
</xsl:when>
<xsl:otherwise>
<xsl:copy>
<xsl:copy-of select="@*"/>
<!--<xsl:value-of select="@url"/>-->
<xsl:apply-templates/>
</xsl:copy>
</xsl:otherwise>
</xsl:choose>
</xsl:otherwise>
</xsl:choose>
</xsl:template>
</xsl:stylesheet>
polishBook.xsl
<?xml version="1.0" ?> <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
<xsl:template match="/">
<xsl:apply-templates/>
</xsl:template>
<xsl:template match="/sections">
<book>
<xsl:apply-templates>
<xsl:sort select="attribute::id" />
</xsl:apply-templates>
</book>
</xsl:template>
<!-- Converts each section into a chapter. -->
<!-- Uses existence of '/' in @id as an indicator of nesting -->
<xsl:template match="/sections/section[not(contains(@id,'/'))]">
<xsl:variable name='sectionId' select='@id'/>
<chapter>
<xsl:copy-of select="@*"/>
<xsl:for-each select="*">
<xsl:copy>
<xsl:apply-templates/>
</xsl:copy>
</xsl:for-each>
<xsl:for-each
select="/sections/section[starts-with(@id,concat($sectionId, '/'))]">
<section>
<xsl:copy-of select="@*"/>
<xsl:for-each select="*">
<xsl:copy>
<xsl:apply-templates/>
</xsl:copy>
</xsl:for-each>
</section>
</xsl:for-each>
</chapter>
</xsl:template>
<!-- These three templates are the default copy rules. -->
<xsl:template match="text()">
<xsl:value-of select="."/>
</xsl:template>
<xsl:template match="@*">
<xsl:copy-of select="."/>
</xsl:template>
<xsl:template match="*">
<xsl:copy>
<xsl:apply-templates select="*|@*|text()"/>
</xsl:copy>
</xsl:template>
<!-- Strips out any of the Category* pages. -->
<xsl:template match="section">
<xsl:choose>
<xsl:when test="substring(@id,1,8) = 'Category'"/>
<xsl:otherwise>
<xsl:copy>
<xsl:copy-of select="@*"/>
<xsl:apply-templates/>
</xsl:copy>
</xsl:otherwise>
</xsl:choose>
</xsl:template>
<!-- Strips out any links to Category*, and transforms dangling WikiName references to special text. -->
<xsl:template match="link">
<xsl:variable name="end">
<xsl:value-of select="@linkend"/>
</xsl:variable>
<xsl:choose>
<xsl:when test="//section[@id=$end]">
<xsl:copy-of select="."/>
</xsl:when>
<xsl:when test="substring($end,1,8) = 'Category'"/>
<xsl:otherwise>
<emphasis role="italics"><emphasis role="underline">
<xsl:value-of select="text()"/>
</emphasis></emphasis>
</xsl:otherwise>
</xsl:choose>
</xsl:template>
</xsl:stylesheet>
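The chapter-building rule in polishBook.xsl (a page whose id contains no `/` becomes a chapter, and pages whose id starts with that chapter's id plus `/` are nested inside it) can be hard to see through the XSLT. Here is a simplified Python sketch of just that grouping rule. The function name `group_into_chapters` and the sample ids are made up, and unlike the stylesheet this sketch sorts the nested pages and silently ignores subpages whose parent page is missing.

```python
def group_into_chapters(section_ids: list[str]) -> dict[str, list[str]]:
    """Map each top-level page id to the ids of its subpages, sorted by id."""
    ordered = sorted(section_ids)

    # Ids without a '/' are top-level pages and become chapters.
    chapters: dict[str, list[str]] = {
        section_id: [] for section_id in ordered if "/" not in section_id
    }

    # Ids that start with "<chapter id>/" are nested under that chapter.
    for section_id in ordered:
        for chapter_id, children in chapters.items():
            if section_id.startswith(chapter_id + "/"):
                children.append(section_id)
    return chapters


if __name__ == "__main__":
    pages = ["Guide/Setup", "FrontPage", "Guide", "Guide/Advanced/Tuning"]
    for chapter, subpages in group_into_chapters(pages).items():
        print(chapter, subpages)
```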