Porting this site to Pelican

2015-04-29

tagged python, web-development, programming, wordpress, plone, restructured-text, pelican

On making a simpler website in a slightly complicated way.

Background

Way, way back, this site ran on various permutations of PHP-based frameworks. Getting sick of ugly URLs and having to vigilantly update the software so as to avoid hacks (and still periodically falling victim to various hacks), I went in search of a Python-based framework that I could customize, which had a more humane interface and organisation.

So I settled on Plone, and there the site lived for 6-7 years. It let me write articles in restructured text and arrange pages into a nice hierarchy; it rendered sensible-looking URLs, came with a growing set of plugins, and had frameworks for customising page appearance and behaviour. On the downside, performance was sluggish and it consumed a lot of CPU, memory and disk space, the Plone stack kept developing in more and more arcane ways, theming support was dismal and my webhost could best be described as benignly negligent [grokthis].

So I turned to the darkside: WordPress. And the experience was nice. There were plugins for almost anything you wanted. You had the choice of many attractive themes. The WP software stack was reliable, performed well and was easy to install. But now I was being bothered by minor things. I hated having to compose through a web-interface, wanting to write things on my laptop wherever I was. Translating documents I'd written in other ways (e.g. plain text, Word) was painful. While the stack of pages imported from Plone was problematic (as the result of both Plone's and WordPress's joint oddities), a special irritation was documents that changed regularly. Roll out a new version of a software package? Go to the WordPress interface, edit the webpage, edit the online manual, delete the old downloads, add the new uploads ... All this futzing around with documents also made me very conscious that the site contents weren't under version control. Finally there was also the constant need to update WordPress and its plugins [wp].

Then I stumbled across the new generation of static website generators and an idea started to take hold. Buoyed by the cheap price and easy availability of Amazon Web Services, I took the leap. And this is how I set the current website up.

Why Pelican?

Pelican is a static website generator. So why use one of those?

The contents of my website can be kept on a hard disk.
These contents can be committed and preserved in a GitHub repo.
The website can be edited and tested offline before uploading it.
Documents and files elsewhere on the hard disk can be symbolically linked into the site content, rather than just copied.
Posts can be written in ReST (restructured text) or other plain formats.
The site generation is under my control and I can hack at it as desired.

There are many static website generators. So why Pelican?

It's written in Python, my favoured language.
Restructured text is its primary format and I like and use ReST heavily.
It seems to be the major player in Python-based site generators, with active development and a large community.
The page templating language (Jinja2) is a sane and widely-used one.
There's a solid selection of themes, essential for those of us with no artistic talent.
Ditto for plugins.
The framework seemed sane, approachable and ripe for customization.
It seemed like it could render a non-bloggish site.

("Seemed" is not me being sarcastic. Lots of frameworks seem simple until used in anger.)

The main failure points / turn-offs for any alternatives were:

Implementation language
Lack of themes
Apparent inflexibility
Small user community
Incapable of rendering anything but a blog.
Blogs or reports of people using them until they "discovered Pelican".

Escaping WordPress

First the content had to be extracted from WordPress. The first step is simple, as WP can dump an XML file of your site's content.

Now this has to be translated into ReST. Fortunately, Pelican has a tool for this. Unfortunately, it doesn't work as well as it should. There were a lot of niggling little problems in the produced ReST:

Widespread use of backslash-escaped characters, in plain and literal text.
Malformed lists (appearing as -listitem)
Wrapping and indentation removed from literal and code blocks (devastating for formatted Python code)
Chunks of raw html. (There is an option in the importer to just skip all of these. It might be wise to use this by default as none of the output HTML actually had any content.)
Character entities (e.g. &amp;) left in the output rather than being converted to the correct Unicode characters.

Some of these are tricky things to catch. Some of it might be due to my heavy use of codeblocks and literal text. Some of this might be due to historical content imported from Plone. And WordPress' HTML can be a little pathological.

The end result was that most pages would not render 'out of the box' and every page had to be checked. I used a lot of regular expressions and search-and-replace to clean the pages up.
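As a rough illustration of that kind of clean-up, here is a minimal sketch of three mechanical repairs: stripping backslash escapes, fixing glued list markers and decoding character entities. The patterns and the function name clean_rest are my own assumptions for illustration, not the expressions actually used on this site, and they cover only the simplest cases.

```python
import html
import re

# A backslash in front of a punctuation character, as left by the importer.
# The set of characters is an assumption; extend it to match your own content.
ESCAPED_PUNCT = re.compile(r"\\([-_*`#.()\[\]<>|])")

# A bullet glued to its text at the start of a line, e.g. "-listitem".
GLUED_BULLET = re.compile(r"^([ \t]*)-(?=[^\s-])", re.MULTILINE)


def clean_rest(text):
    """Apply a few mechanical repairs to imported ReST (illustrative only)."""
    text = ESCAPED_PUNCT.sub(r"\1", text)   # "\*" becomes "*"
    text = GLUED_BULLET.sub(r"\1- ", text)  # "-item" becomes "- item"
    text = html.unescape(text)              # "&amp;" becomes "&"
    return text
```

Applied blindly, this would also damage literal blocks where a backslash or a leading hyphen is intentional (command-line options, negative numbers), so every page still needs checking by eye afterwards.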

Configuring Pelican

Broadly, configuring and setting up Pelican is straightforward. You call pelican-quickstart, dump your restructured text (or Markdown) files in the content directory and run make. The out-of-the-box behaviour is nice and sensible, but should you want anything more, you're going to have to write a little code.

Pelican wants to turn your site into a blog:

Almost every other static website generator has this tendency baked-in, despite protests that "you can configure X to do anything you want". (Much like a blob of molten pig iron can be turned into anything.) Pelican is arguably more flexible than most, but still has blog-like tendencies, wanting to arrange everything by date and categories and tags. The distinction is made between articles (timed / blog entry-like content) and pages (static, fixed content) but pages are definitely second-class citizens in the Pelican ecosystem. For example, they're not included in the tag or category lists, so they're less findable.

I'm also used to my sites being navigated in a hierarchy (e.g. /science/computational-biology/galaxy, /programming/web-development). It's very handy to give the url of a folder to someone and say "everything about X is listed there". And it's a useful piece of organisation for me - the articles I've written on X can be found here. Pelican only allows a single level of categories. I could make everything a category but that would lead to an explosion of categories. A combination of categories and tags might do the job but I wanted things to rest in a sensible url structure, rather than all gathered in /archive or /articles.

After much prevarication, I decided to hack a solution in. There's a subcategory plugin available for Pelican that allows nested categories. Unfortunately, it requires that you explicitly annotate each article with its full subcategory. Drats. So I forked the GitHub repo and hacked it so that it used the content path to infer the subcategory. Adjusting the Pelican configuration file, I ended up with this scheme:

from os.path import join

# used by the subcategory hack: capture the content path minus its extension
PATH_METADATA = r'(?P<path_no_ext>.*)\..*'

CATEGORY_URL = '{slug}/'
CATEGORY_SAVE_AS = join(CATEGORY_URL, 'index.html')

TAG_URL = '/by-tag/{slug}/'
TAG_SAVE_AS = join(TAG_URL, 'index.html')

# CATEGORY_PREFIX must be defined earlier in the configuration (not shown here)
ARTICLE_URL = join(CATEGORY_PREFIX, '{path_no_ext}/')
ARTICLE_SAVE_AS = join(ARTICLE_URL, 'index.html')

PAGE_URL = '{slug}/'
PAGE_SAVE_AS = join(PAGE_URL, 'index.html')

SUBCATEGORY_URL = '{savepath}/'
SUBCATEGORY_SAVE_AS = join(SUBCATEGORY_URL, 'index.html')

which meant that if the article foo was placed in content/science/geospatial, it was given the subcategory of science/geospatial, rendered as /science/geospatial/foo/index.html, and referred to with the clean url of /science/geospatial/foo/. Clean meaningful urls, meaningful organisation, articles sit under their categories, a bit of time-saving for me.
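To make the mapping concrete, here is a small standalone sketch of how a content path turns into a subcategory and a pair of URLs. It is not the code of the forked plugin; the function name describe_article is hypothetical, and it assumes paths are given relative to the content directory with forward slashes.

```python
import posixpath
import re

# Same pattern as PATH_METADATA: everything before the final extension.
PATH_METADATA = r'(?P<path_no_ext>.*)\..*'


def describe_article(source_path):
    """Map a path relative to the content directory to subcategory and URLs."""
    match = re.match(PATH_METADATA, source_path)
    if match is None:
        raise ValueError('no file extension in %r' % source_path)
    path_no_ext = match.group('path_no_ext')
    # The containing folder doubles as the (nested) subcategory.
    subcategory = posixpath.dirname(path_no_ext)
    url = '/' + path_no_ext + '/'
    save_as = posixpath.join(url, 'index.html')
    return {'subcategory': subcategory, 'url': url, 'save_as': save_as}
```

Calling describe_article('science/geospatial/foo.rst') gives the subcategory science/geospatial, the url /science/geospatial/foo/ and the save path /science/geospatial/foo/index.html.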

No such thing as a single-author site:

No matter what I did, no matter what values I gave to AUTHOR_URL or AUTHOR_SAVE_AS, Pelican would always generate an authors page.

Each theme is different:

There are a lot of nice and attractive themes available for Pelican. However, each works in a different way. If several themes have (say) a sharing toolbar or take a site subtitle or breadcrumb navigation or analytics capacity, they will each use differently named variables and require different configuration. If a plugin introduces new site features, themes might not display them or work with them.

So you're going to have to start hacking at the theme until it does what you want. Fortunately, this is not onerous.

File urls:

I've found the syntax for linking to other assets within the site slightly opaque. It looks simple, and the minimal examples given in the documentation appear to be self-explanatory, but it took a while to work out.

Commenting:

One of the features that appears in some but not all themes is commenting. It's fairly easy to sign up at Disqus and get the requisite HTML and JavaScript fragments to inject into your template for rendering a comment system. I had some minor trouble because various ad-blockers or tracker-blockers stopped the Disqus forms from showing on the site. After they were disabled, it was fine.

A live Twitter feed can be inserted in a similar way.

Duplicate target name:

Not the fault of Pelican but I got a lot of these errors - Duplicate explicit target name: "foo", Duplicate target name - seemingly triggered by hyperlinks that weren't duplicated. The trick seems to be that all links of all types (hyperlinks, citations, footnotes, etc.) within a single ReST document share the same "space". So links with the same text or name will collide:

`Foo <http://xxx.example.org>`_
`Foo <http://yyy.example.com>`_

I had footnotes and citations colliding with the urls within those footnotes. One solution is to make the links anonymous:

`Foo <http://xxx.example.org>`__
`Foo <http://yyy.example.com>`__

See StackOverflow.
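Hunting these collisions down by hand gets tedious, so here is a hedged sketch of a checker that looks for named links of the `text <url>`_ form bound to more than one URL and rewrites just those to the anonymous form. The regular expression and function names are my own; it only handles links with embedded URLs, so collisions with footnotes or citations would still need a manual look.

```python
import re
from collections import defaultdict

# Named hyperlink with an embedded URL: `text <url>`_ (exactly one underscore).
NAMED_LINK = re.compile(r"`([^`<]+?)\s*<([^`>]+)>`_(?!_)")


def conflicting_link_names(text):
    """Return link names (lower-cased) that are bound to more than one URL."""
    urls = defaultdict(set)
    for name, url in NAMED_LINK.findall(text):
        # Names are compared case-insensitively here; treat that as an assumption.
        urls[name.strip().lower()].add(url)
    return {name for name, seen in urls.items() if len(seen) > 1}


def anonymise_conflicts(text):
    """Rewrite only the conflicting links to the anonymous `text <url>`__ form."""
    clashes = conflicting_link_names(text)

    def fix(match):
        if match.group(1).strip().lower() in clashes:
            return match.group(0) + "_"  # one extra underscore makes it anonymous
        return match.group(0)

    return NAMED_LINK.sub(fix, text)
```

Links whose names are unique are left alone, so they can still be referenced by name elsewhere in the document.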

Bad names:

It's really difficult to have tags or categories like c++. The resultant urls you get are ugly. This is an admittedly hard problem, but it would be nice to separate the name and slug for tags and categories.
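What I have in mind is something like the sketch below: a lookup table of hand-picked slugs for awkward names, with ordinary slugification as the fallback. This is a standalone illustration, not a Pelican feature; SLUG_OVERRIDES and slug_for are hypothetical names and the table entries are just examples.

```python
import re

# Hypothetical overrides for names that slugify badly.
SLUG_OVERRIDES = {"c++": "cpp", "c#": "c-sharp", ".net": "dotnet"}


def slug_for(name):
    """Return a URL-friendly slug, preferring a hand-picked override."""
    key = name.strip().lower()
    if key in SLUG_OVERRIDES:
        return SLUG_OVERRIDES[key]
    # Fallback: collapse every run of other characters into one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", key)
    return slug.strip("-")
```

The display name stays as written (c++) while the url uses the override (cpp), which is the separation of name and slug wished for above.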

Category indexes:

It would be really nice to customize index pages for categories. There is a plugin that purports to handle this but it seems to have problems with my url layout and use of subcategory.

Writing tools:

Not a Pelican issue as such, but I really wish there were better tools for writing ReST. Many editors provide an appropriate mode but it's often just syntax colouring. It would be great to have an editor with hotkeys (e.g. type cmd-b to bold this text), and spelling and grammar checkers that pick up malformed text. TextMate has decent hotkeys, PyCharm has spellchecking (that is awkward to use) and many editors pick up some errors. Can we get these all together?

Caching:

Occasionally when I was hacking on the site, changes didn't seem to appear, especially if I was using make devserver (which compiles the site and recompiles it when changes are detected). This puzzled me for a while until I realised that Pelican caches generated pages and only recompiles them if it sees changes in the site content, theme or settings. Plugins are outside that loop. So flush the cache directory and all will be fine.
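Flushing the cache amounts to deleting a directory, which can be scripted as in the sketch below. The directory name "cache" is an assumption (check the cache path in your own settings), and flush_cache is a hypothetical helper, not part of Pelican.

```python
import shutil
from pathlib import Path


def flush_cache(site_root, cache_dir="cache"):
    """Delete the cache directory under site_root; return True if one was removed."""
    target = Path(site_root) / cache_dir
    if target.is_dir():
        shutil.rmtree(target)
        return True
    return False
```

Run it from the project directory, e.g. flush_cache("."), before regenerating the site after a plugin change.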

Verdict

Having complained at length about Pelican's shortcomings, let me reassure you that I'm very happy with it. Yes, it's fairly opinionated about site layout but after you've accepted or hacked on that, the layout is consistent and works well. Several times I thought "surely this change must break Pelican" but the site would work. I like writing in ReST and it's easy to produce material for the site. Files that I've generated elsewhere can easily be linked into the site structure. I don't have to maintain a complex software stack, it's just a collection of HTML pages. It's easy to hack upon. Most of all, it's a tool that lets me do what I want to do and gets out of the way.

Footnotes

[grokthis]

The late Grokthis. To their credit, they offered cheap Plone hosting at a time when any Plone hosting was rare. Unfortunately, service faults cropped up a few times a year and the fastest way to get technical support was to post a complaint to a webhosting forum.

[wp]

I also set up my local neighbourhood association with WordPress and would still recommend it for a lot of use cases, especially blogs or timely content. A hosted solution, like the free WordPress.com blogs, is best (it relieves you of the software updating and security responsibilities) but self-installation and hosting is fairly painless. A big shoutout to my webhost of the time, Linode. They offer a solid and well-priced service.