Damn MTGoogleSearch!


I’ve been using MTGoogleSearch for Related Entries on my MovableType blog. Unfortunately, some of the related entries have UTF-8 characters in their URL titles, which changes my webpage’s default iso-8859-1 encoding to UTF-8. If at least one of the Related Entries URL titles has a UTF-8 character, funky characters display in the blog body. That is, all of my em-dashes, quotes, apostrophes, etc. in the blog body are messed up. Even though I explicitly specify the encoding (using <$MTPublishCharset$>) in the template and it's set to iso-8859-1 within mt.cfg, I guess MT actually encodes and saves the file in UTF-8 format due to the UTF-8 characters returned by MTGoogleSearch. Interestingly, when I View Source in Notepad and do a Save As, it displays UTF-8 as the encoding instead of the usual ANSI.
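For anyone curious why the punctuation breaks, here is a minimal Python sketch of what is probably going on. It assumes the entry text was stored as Windows-1252 bytes (what Notepad calls ANSI) and that the page ends up being read as UTF-8; the sample string is made up for illustration and this is not MT's own code.

```python
# Sample text with a curly apostrophe (U+2019) and an em-dash (U+2014).
text = "It\u2019s a test \u2014 really"

# Bytes as a Windows-1252 ("ANSI") workflow would store them.
legacy_bytes = text.encode("cp1252")

# Reading those bytes as UTF-8: the lone high bytes are invalid UTF-8,
# so each one becomes the replacement character U+FFFD.
as_utf8 = legacy_bytes.decode("utf-8", errors="replace")

# The reverse mistake: UTF-8 bytes read as Windows-1252 turn each
# punctuation mark into a run of three junk characters.
reverse = text.encode("utf-8").decode("cp1252")

print(repr(as_utf8))
print(repr(reverse))
```

Either direction produces the kind of funky characters described above; which one you see depends on which encoding the bytes are really in and which one the browser assumes.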

Compare this page: http://blog.tmcnet.com/blog/testblog/main-test.asp (has MTGoogleSearch/Related Entries)

with

http://blog.tmcnet.com/blog/testblog/main-test2.asp (deleted MTGoogleSearch/Related Entries from template)

Same page – I just deleted MTGoogleSearch code from the template.

Notice the funky characters in the first one.

I could probably re-encode all my posts (ISO-8859-1) to UTF-8, but that’s a huge hassle. At least, I think it is. I tried changing MovableType’s default encoding to UTF-8 and rebuilt my site, but then my posts within the MT database had even more funky characters. I'd have to go into each blog post (in the hundreds) and fix all the funky characters and re-save. Uhhh no thanks.

There should be a way of forcing MTGoogleSearch to strip UTF-8 characters or just ignore them without changing the page’s encoding.
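As a rough idea of what such an option could do, here is a small Python sketch. The function name `to_latin1_safe` and its `drop` flag are hypothetical; MTGoogleSearch does not offer this as far as I know, and the plugin itself is not written in Python. The sketch keeps anything ISO-8859-1 can represent and either drops the rest or turns it into numeric character references.

```python
def to_latin1_safe(title: str, drop: bool = False) -> str:
    """Return a title that is safe to emit on an ISO-8859-1 page.

    Characters outside ISO-8859-1 are dropped when drop=True,
    otherwise replaced with numeric character references (&#NNNN;).
    """
    errors = "ignore" if drop else "xmlcharrefreplace"
    # Encode to ISO-8859-1 with the chosen error handler, then decode
    # back so the caller gets a plain string again.
    return title.encode("iso-8859-1", errors=errors).decode("iso-8859-1")


sample = "Caf\u00e9 \u2014 \u65e5\u672c"  # made-up title for illustration
print(to_latin1_safe(sample))
print(to_latin1_safe(sample, drop=True))
```

The entity version is usually the nicer choice because the browser still renders the original characters while the page bytes stay pure ISO-8859-1.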

Grrr!!! For now I removed the MTGoogleSearch Related Entries feature from my home page, but I'll leave it on the individual blog posts. For some reason UTF-8 characters appear much more often on my main MT template than my individual archive template.

Update! 04/20/2005
I found an alternate solution to strange characters showing up in my blog. The solution is to download the MTStripControlChars plugin. This fixed most of the weird characters, but not all of them. I customized the MTStripControlChars file to add other character mappings such as copyright symbols, registered trademarks, the letter 'e' with an accent (é), and other mappings. I had to break out the old ASCII chart of characters and perform some decimal-to-hexadecimal conversion, which was then added to the MTStripControlChars.pl file. Then you simply put <$MTEntryBody strip_controlchars="2"$> into various locations in the blog's template and presto bango it works!

(Essentially it translates the (would-be) Windows-1252 characters into the corresponding Unicode numeric entities.)
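The same idea can be sketched in a few lines of Python. This is not the plugin's actual code (the plugin is a Perl file with its own mapping table); the function name `cp1252_to_entities` is made up, and the sketch leans on Python's built-in cp1252 codec instead of a hand-built chart.

```python
def cp1252_to_entities(raw: bytes) -> str:
    """Turn Windows-1252 bytes into ASCII text with numeric entities."""
    out = []
    # Decoding as cp1252 maps bytes such as 0x97 to their real Unicode
    # code points (0x97 becomes U+2014, the em-dash).
    for ch in raw.decode("cp1252", errors="replace"):
        # Plain ASCII passes through; everything else becomes &#NNNN;
        out.append(ch if ord(ch) < 128 else f"&#{ord(ch)};")
    return "".join(out)


# e-acute, em-dash, copyright, registered trademark, curly double quotes
print(cp1252_to_entities(b"caf\xe9 \x97 \xa9 \xae \x93hi\x94"))
```

For the chart lookup itself, the accented e is decimal 233, which is E9 in hex, and that is why it shows up as `\xe9` in the sample bytes.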


5 Comments

Hello,

The easiest way I found to convert my weblog from iso-8859-1 to utf-8 is to use the import/export feature.

I export my weblog to a file, which I download to my computer. I open it in a good text editor (I used UltraEdit) that can convert to and from UTF-8. Once it is converted to UTF-8 and saved, I send it back to the server (binary mode!) and import the entries...

and voilà!
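If you would rather script the conversion than use an editor, a minimal Python sketch could look like the following. The file names are placeholders, the function `convert_export` is hypothetical, and it assumes the export is really Windows-1252 (the usual case when a page is labeled iso-8859-1 but contains curly quotes and em-dashes).

```python
from pathlib import Path


def convert_export(src: str, dst: str, source_encoding: str = "cp1252") -> None:
    """Re-encode an exported entries file from a legacy encoding to UTF-8."""
    # Work on raw bytes so line endings are left exactly as they were.
    text = Path(src).read_bytes().decode(source_encoding)
    Path(dst).write_bytes(text.encode("utf-8"))


if __name__ == "__main__":
    # Placeholder file names; adjust to wherever the export was saved.
    if Path("export.txt").exists():
        convert_export("export.txt", "export-utf8.txt")
```

Decoding will raise an error if the file contains one of the few bytes that Windows-1252 leaves undefined, which is a useful warning that the file is not in the encoding you assumed.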
