Perl 5.16 vs Ruby 2.0 UTF-8 support

Joel Bernstein joel at fysh.org
Thu Aug 22 16:55:04 BST 2013


You can use Ruby's String#encode method to force UTF-8 encoding on the
string and have invalid byte sequences replaced. At a guess, your Perl code
is happy with the invalid sequence because it's not treating the string as
Unicode at all. I'd expect it to fail in the same way if you force the
filehandle to be opened with binmode :utf8.
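
A minimal sketch of the String#encode approach, in case it helps. The
helper name clean_utf8 and the sample line are made up for illustration;
they are not from your script. It relies on encode replacing invalid bytes
even when the string is already tagged as UTF-8, which I believe is the
behaviour from Ruby 2.1 onwards, so on 2.0 you may need to transcode via
another encoding instead.

```ruby
# Hypothetical helper, not from the original thread: returns a copy of
# +line+ in which every invalid UTF-8 byte sequence is replaced by "?".
def clean_utf8(line)
  line.encode("UTF-8", invalid: :replace, undef: :replace, replace: "?")
end

# Made-up sample data: "\xE9" is Latin-1 "é", which is not valid UTF-8
# on its own, so the string below is tagged UTF-8 but malformed.
bad = "Paul caf\xE9\n"

puts bad.valid_encoding?       # the raw line is not valid UTF-8

cleaned = clean_utf8(bad)
puts cleaned.valid_encoding?   # the cleaned copy is

# Matching the cleaned copy no longer raises ArgumentError.
print cleaned if cleaned =~ /^\w{4}\b/
```

In the loop you posted, that would mean cleaning each line returned by
fh.gets before running the regex against it.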

/joel


On 22 August 2013 17:39, gvim <gvimrc at gmail.com> wrote:

> Can anyone who also uses Ruby enlighten me? For benchmarking purposes this
> Perl 5.16 script works fine parsing a large Maildir folder:
>
> ------------------------------------------------------------
> use 5.016;
> use autodie;
>
> my $dir = 'my/mail/path';
> chdir $dir;
> opendir my $dh, $dir;
>
> while (readdir $dh) {
>   next unless /^\d{4}/;
>   open my $fh, '<', $_;
>   say "\n\n************* Opening $_ *************";
>   while (<$fh>) {
>     chomp;
>     say if /^\w{4}\s/;
>   }
>   close $fh;
> }
> closedir $dh;
>
> ------------------------------------------------------------
>
> However, the equivalent Ruby 2.0 script produces a UTF-8 error after
> parsing 7 files:
>
> ------------------------------------------------------------
> dir = 'my/maildir/path'
> Dir.chdir(dir)
>
> Dir.foreach(dir) do |file|
>   next unless file =~ /^\d{4}/
>   print "\n\n************* Opening #{file} *************\n"
>   fh = File.open(file)
>   while fh.gets do
>     print if $_ =~ /^\w{4}\b/
>   end
>   fh.close
> end
>
> ------------------------------------------------------------
>
> The problematic mail file doesn't display any non-ASCII characters when
> opened in Vim. Here's the Ruby 2.0 error message:
>
>
> ************* Opening 1270516984.M407293P18051.mac,S=1601,W=1645:2,Sb *************
> Paul
> ./1.rb:13:in `block in <main>': invalid byte sequence in UTF-8 (ArgumentError)
>     from ./1.rb:8:in `foreach'
>     from ./1.rb:8:in `<main>'
>
>
> gvim
>
>


More information about the london.pm mailing list