Some CMS platforms do a simple word count and cut the article at an arbitrary number of words. The problem is that this cuts sentences in half, producing excerpts such as "I reviewed the new Apple" or "Today, a new company launched called". Reviewed a new Apple iPhone? Apple iPad? What's the company that launched? Even if the reader gets the full context, I believe it's never a good idea to stop mid-sentence.
We've been using Movable Type since 2004 (before WordPress became popular), and I've developed my own set of Movable Type plugins over the years. I wrote one of them to address the lack of HTML excerpts in Movable Type.
First, let's run down the problems that exist without my plugin:
- Movable Type's popular MT:EntryExcerpt tag auto-generates an excerpt based on word length (default=40); however, it strips all HTML formatting, including bold, italic, hyperlinks, quotes, etc.
- If you simply cut an entry at an arbitrary character length, you could cut in the middle of an HTML tag (img src, a href, etc.) or, just as bad, leave an unclosed (orphaned) HTML tag that causes massive layout or formatting problems, e.g. a missing closing </div> tag, or a missing closing </b> tag that leaves your entire website "bolded".
- Existing word-counting Movable Type plugins can also break HTML, leaving orphaned tags, and even at best they cut in the middle of sentences.
- Even if you get automatic HTML excerpts working, you may not like the excerpt chosen and wish to override it.
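To make the second problem concrete, here is a small Python sketch (Python is used only for illustration; the plugin itself is Perl). The function name `cut_is_inside_tag` and the sample entry are hypothetical. It checks whether a naive character cut lands in the middle of an HTML tag.

```python
def cut_is_inside_tag(html, length):
    """Return True if cutting html at `length` characters lands inside a tag.

    A cut is inside a tag when the last "<" before the cut point has not
    yet been followed by a closing ">".
    """
    prefix = html[:length]
    return prefix.rfind("<") > prefix.rfind(">")


# Hypothetical entry body, used only for this illustration.
body = '<p>Read the <a href="https://example.com/review">full review</a> today.</p>'

print(cut_is_inside_tag(body, 10))  # cut falls in plain text
print(cut_is_inside_tag(body, 30))  # cut falls inside the <a href=...> tag
```

Even when the cut falls in plain text, as in the first call, the `<p>` opened earlier is still left unclosed, which is the orphaned-tag problem described above.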
I developed a Movable Type plugin containing a regular expression (regex) that looks for sentences in the tag you pass to the plugin, typically the MTEntryBody text. Seems simple enough, right? After all, sentences end with punctuation (. ! ?) followed by a space. Well, it's trickier than that. You also have to consider that in HTML a sentence could end with punctuation followed by </span> or a new paragraph (<p>) and not necessarily a space. So now the regex gets a little hairier. I won't go into the nitty-gritty, but you can look at the source code to see how I parse all these conditions.
Once I chop the text up into an array of sentences, I count array elements up to the maximum specified in the 'html_sentences' Movable Type global filter (defined by my plugin). Then I rejoin just those array elements back into the text variable. This will sometimes leave some 'orphaned' tags, but not to worry! HTML::Tidy (or HTML::Lint) to the rescue!
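As a rough illustration of the kind of pattern involved, here is a minimal Python sketch. It is not the plugin's actual Perl regex; the pattern and the `split_sentences` name are my own assumptions. It treats a sentence as ending at . ! or ?, optionally followed by closing tags such as </span>, and then whitespace, a new <p>, or the end of the text.

```python
import re

# Hypothetical pattern, NOT the plugin's real regex.
SENTENCE_END = re.compile(
    r"""
    [.!?]+                # terminal punctuation
    (?:</\w+>)*           # closing tags hugging the punctuation, such as </span>
    (?:\s+|(?=<p\b)|$)    # then whitespace, the start of a new <p>, or end of text
    """,
    re.VERBOSE | re.IGNORECASE,
)


def split_sentences(html):
    """Split html into sentences, each keeping its markup and trailing space."""
    sentences = []
    start = 0
    for match in SENTENCE_END.finditer(html):
        sentences.append(html[start:match.end()])
        start = match.end()
    if start < len(html):
        sentences.append(html[start:])  # leftover text with no terminator
    return sentences


parts = split_sentences("<p>Hi <b>there</b>.</p><p>Really?</p> Yes!")
print(parts)
# ['<p>Hi <b>there</b>.</p>', '<p>Really?</p> ', 'Yes!']
```

A pattern this simple will still split on abbreviations such as "e.g. " and does not cover every block-level tag, which is part of why the real thing gets hairy.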
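The count-and-rejoin step can be sketched in a few lines of Python. This is an illustration under my own assumptions, not the plugin's Perl: `first_sentences` is a hypothetical name, and the list of sentences is written out by hand as if the text had already been split.

```python
def first_sentences(sentences, max_sentences):
    """Rejoin at most max_sentences items from an already-split list."""
    return "".join(sentences[:max_sentences])


# Hypothetical, pre-split entry body: a bulleted list with three sentences.
sentences = [
    "<ul><li>First point.</li>",
    "<li>Second point.</li>",
    "<li>Third point.</li></ul>",
]

excerpt = first_sentences(sentences, 2)
print(excerpt)
# <ul><li>First point.</li><li>Second point.</li>
```

Note that the excerpt opens a <ul> but never closes it. That is exactly the kind of orphaned tag the cleanup step has to repair.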
Basically, my plugin lets you choose an HTML "cleaner", HTML::Tidy or HTML::Lint, both of which automatically fix orphaned tags. So for example, if one of these Perl modules sees <table> with no corresponding </table> in the text, it will automatically add it. If you have a long bulleted list of (<li> tag) sentences, this plugin will automatically put in the closing </ul> HTML tag even if the bulleted list is cut before the end. It's a beauty to behold!
Lastly, I pass the modified text back to the Movable Type tag that called it, where it is then output.
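To show the idea of closing orphaned tags, here is a minimal Python sketch built on the standard library's `html.parser`. It is a stand-in technique (a simple stack of open tags), not how HTML::Tidy works internally and not the plugin's code. The names `OpenTagTracker` and `close_orphaned_tags` are hypothetical.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class OpenTagTracker(HTMLParser):
    """Keep a stack of tags that have been opened but not yet closed."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.open_tags = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag; ignore stray closers.
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass


def close_orphaned_tags(fragment):
    """Append closing tags for everything still open at the end of fragment."""
    tracker = OpenTagTracker()
    tracker.feed(fragment)
    tracker.close()
    closers = "".join(f"</{tag}>" for tag in reversed(tracker.open_tags))
    return fragment + closers


print(close_orphaned_tags("<ul><li>First point.</li><li>Second point.</li>"))
# <ul><li>First point.</li><li>Second point.</li></ul>
```

A real cleaner does far more than this sketch, such as handling implicitly closed elements and malformed attributes, which is why the plugin leans on a dedicated module instead.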
- HTML::Lint (optional; if you choose not to use it and don't install the Perl module, you will need to remove it from the plugin to prevent Perl compile errors)
- HTML::Tidy (my preferred HTML cleaner)
- Install the HTML::Tidy Perl module using CPAN - Download: http://search.cpan.org/~petdance/HTML-Tidy-1.54/lib/HTML/Tidy.pm or use the 'cpan' command from a Linux shell and install it that way.
- Install tidyp from GitHub - Download: https://github.com/petdance/tidyp [Required by HTML::Tidy] Direct link: http://cloud.github.com/downloads/petdance/tidyp/tidyp-1.04.tar.gz
- Install the HTML::Lint Perl module using CPAN - Download: http://search.cpan.org/~petdance/HTML-Lint-2.10/lib/HTML/Lint.pm or use the 'cpan' command from a Linux shell and install it that way.
- Download my HTMLExcerpt.pl plugin. [Get it from GitHub instead for the latest version].
- Extract HTMLExcerpt.zip to the /mt/plugins/ folder
- By default it uses HTML::Tidy. If you wish to change to HTML::Lint, simply edit HTMLExcerpt.pl and change line 12:
my $cleaner = "Tidy"; # change to "Lint" if you want HTML::Lint to clean HTML/orphaned tags
You can use the global filter 'html_sentences' in any MT tag you like, but most likely you will use it within <$MTEntryBody$>:
<$MTEntryBody html_sentences="4"$> <-Pulls 4 sentences with HTML formatting.
Optionally you can allow only certain HTML tags like so:
<$MTEntryBody html_sentences="4" sanitize="a href,p,br,i,em,strong,blockquote,ol,ul,li,script"$>
Override Automatic Excerpt
On rare occasions you may wish to override the automatic excerpt. It's very easy to do. Borrowing a page from WordPress, all you have to do is insert <!-- pagebreak --> into the text body where you want the excerpt to end. The plugin will grab the entire HTML block of code up to and including <!-- pagebreak -->. You can add this via the HTML source code in the MT editor, or if you use the TinyMCE WYSIWYG editor, you can simply click the page break button in the toolbar.
My blog's home page uses the Entry Summary template along with the HTMLExcerpt plugin to pull 4 sentences. Here are two sample screenshots of the output. Note the <blockquote> HTML tag working in the 1st example and the (green) hyperlinks working in the 2nd example:
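The override logic amounts to a substring search. Here is a minimal Python sketch of it, assuming the marker is matched literally. The `manual_excerpt` name is hypothetical and this is not the plugin's Perl.

```python
PAGEBREAK = "<!-- pagebreak -->"


def manual_excerpt(body):
    """Return body up to and including the marker, or None if it is absent."""
    index = body.find(PAGEBREAK)
    if index == -1:
        return None  # no override; fall back to the automatic excerpt
    return body[: index + len(PAGEBREAK)]


print(manual_excerpt("<p>Intro.</p><!-- pagebreak --><p>More.</p>"))
# <p>Intro.</p><!-- pagebreak -->
```

Returning None when the marker is missing lets the caller fall back to sentence counting.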
If you download and use my Movable Type HTMLExcerpt plugin, let me know how you like it in the comments!