Converting zenkaku to hankaku

For historical reasons, Chinese, Japanese and Korean word processors allow certain characters (including the Roman alphabet and Arabic numerals) to be entered using wide variants called fullwidth (zenkaku; 全角) characters instead of — or rather, in addition to — the ordinary halfwidth (hankaku; 半角) characters used by everyone else.

When preparing Japanese text for translation in CAT tools like OmegaT, it often helps to convert zenkaku characters to their hankaku equivalents. The Japanese version of Microsoft Word has a built-in feature that will do this, but it’s a little bit annoying because it also converts katakana characters. All I really want to do is convert the non-Japanese characters.

Here’s a Perl script I’ve been using to do this inside TextWrangler:

#!/usr/bin/perl -w

# File: ZtoH.pl
# Author: Phil Ronan, japanesetranslator.co.uk

# Convert zenkaku to hankaku

# Prepare Japanese UTF-8 plain-text files for translation by
# converting full-width (zenkaku) characters to their half-width
# (hankaku) counterparts. Katakana characters are not converted.

# This script was written for use as a TextWrangler plugin, but
# can also be used as a command line tool -- simply pipe in the
# text you want to convert, and the results will be delivered
# to stdout.

use utf8;
use Encode;
binmode STDOUT, ":utf8";

my $s;

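while (<>) {
  $s = decode_utf8($_);
  # Fullwidth space and punctuation
  $s =~ tr/　！＂＃＄％＆＇（）＊＋，－．／/ !"#$%&'()*+,-.\//;
  # Fullwidth digits, capital letters and more punctuation
  $s =~ tr/０-９：；＜＝＞？＠Ａ-Ｚ［＼］＾/0-9:;<=>?@A-Z[\\]^/;
  # Fullwidth lowercase letters, remaining punctuation and currency signs
  $s =~ tr/＿｀ａ-ｚ｛｜｝〜￠￡￢￣￤￥￦/_`a-z{|}\~¢£¬¯¦¥₩/;
  print $s;
}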

(You can download the script here, but you’ll need to rename it to ZtoH.pl before running it. Make sure you save the script using UTF-8 encoding.)

If you’re using TextWrangler, simply place this file inside your Unix Filters directory (~/Library/Application Support/TextWrangler/Unix Support/Unix Filters). You should then see this script listed under Unix Filters in the #! menu. Update: In more recent versions of TextWrangler, the text filters have been moved to Text » Apply Text Filter.

If you don’t have TextWrangler or you’re running some other system, then you can still use this script as long as you have Perl installed. Just pipe your UTF-8 encoded text through it, and the results will appear on stdout.
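
If the fullwidth characters in the script get mangled when you copy and paste it (some tools silently normalize them to their halfwidth forms), the same mapping can be written with code point escapes instead. The following is only a sketch of that idea, not the script above: the function name zenkaku_to_hankaku is made up for illustration, and it assumes you also want the fullwidth tilde (U+FF5E) converted along with the wave dash (U+301C). It relies on the fact that the fullwidth forms U+FF01 to U+FF5E sit in the same order as ASCII U+0021 to U+007E.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of ZtoH.pl): takes an already-decoded
# character string and returns it with fullwidth characters narrowed.
# Katakana (U+30A0..U+30FF) is outside every range below, so it is untouched.
sub zenkaku_to_hankaku {
    my ($s) = @_;

    # Fullwidth ASCII variants U+FF01..U+FF5E -> ASCII U+0021..U+007E
    $s =~ tr/\x{FF01}-\x{FF5E}/\x{21}-\x{7E}/;

    # Ideographic space -> ordinary space, wave dash -> tilde
    $s =~ tr/\x{3000}\x{301C}/\x{20}\x{7E}/;

    # Fullwidth cent, pound, not sign, macron, broken bar, yen, won
    # -> their ordinary counterparts
    $s =~ tr/\x{FFE0}-\x{FFE6}/\x{A2}\x{A3}\x{AC}\x{AF}\x{A6}\x{A5}\x{20A9}/;

    return $s;
}

binmode STDOUT, ':encoding(UTF-8)';

# Fullwidth "ABC123", an ideographic space, then the katakana word "katakana"
my $sample = "\x{FF21}\x{FF22}\x{FF23}\x{FF11}\x{FF12}\x{FF13}\x{3000}"
           . "\x{30AB}\x{30BF}\x{30AB}\x{30CA}";
print zenkaku_to_hankaku($sample), "\n";
```

Because the source contains nothing but ASCII, it does not matter which encoding the file is saved in. The price is readability: you can no longer see at a glance which characters are being converted.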

