Logo

dev-resources.site

for different kinds of informations.

PWC 296 String Compression

Published at
11/21/2024
Categories
perl
pwc
perlweeklychallenge
Author
Bob Lied
Categories
3 categories in total
perl
open
pwc
open
perlweeklychallenge
open
PWC 296 String Compression

Weekly Challenge 296, Task 1, asks us to implement a simple form of string compression. Run-length encoding as given here is easy to understand and implement, but it does have the short-coming that you can't encode text with numbers in it.

You are given a string of alphabetic characters, $chars.
Write a script to compress the string with run-length
encoding, as shown in the examples.

A compressed unit can be either a single character or a count followed by a character.

BONUS: Write a decompression function.
  • Example 1: Input: $chars = "abbc" Output: "a2bc"
  • Example 2: Input: $chars = "aaabccc" Output: "3ab3c"
  • Example 3: Input: $chars = "abcc" Output: "ab2c"

Thinking in C

I started my career as a C programmer, so when I see string problems, my first instinct is to iterate over characters. This problem seems to lend itself to that. We'll process left to right; if characters repeat we'll count and build up an output string.

sub rle_enc($str)
{
    my @s = split(//, $str);

    my $out = "";
    while ( defined(my $c = shift @s) )
    {
        my $n = 1;
        while ( @s && $s[0] eq $c ) { $n++; shift @s; }
        $out .= ( $n > 1 ? $n : "" ) . $c;
    }

    return $out;
}

Probably the only odd thing worth noting here is that I consume the input string by converting it to an array and using shift instead of using a for loop.

Perl easily moves between numeric and string types, so the expression ($n > 1 ? $n : "") handles the problem of omitting 1s easily, where a C compiler or other strongly-typed language would tell you that you can't mix types like that.

The bonus decoding is pretty easy, too. Whenever we see a number, remove it and replicate the following character.

sub rle_dec($str)
{
    my $out;
    while ( length($str) > 0 )
    {
        my $n = ( ( $str =~ s/^(\d+)// ) ? $1 : 1 );
        $out .= (substr($str, 0, 1, '') x $n);
    }
    return $out;
}

Useful Perl features in this code:

  • s/^(\d+)// -- Capturing the matching digit string while deleting it
  • ... x $n -- The replication operator creates $n copies

Thinking in Perl

Another way of approaching this problem is that we're substituting one pattern for another. Anyplace we find a repeated character, we'll replace it with the number of occurrences and one instance of that character.

We've seen the repeated-character pattern several times in the last few weeks of challenges. It's any character; capture it, and find that character again immediately following -- /(.)\1*/. We'll also want to capture the entire group of matching characters, so add a set of parentheses around that, which makes the single character the second capture group ($2) instead of $1 -- /((.)\2*/.

So that's the pattern we want to match, now for the replacement. We need the length of the repeated string. That's the $1 capture group, so the straight-forward way is to build up an output string from repeated matches:

    my $out = "";
    while ( $str =~ s/^((.)\2*)//g )
    {
        my $n = length($1);
        $out .= ( $n > 1 ? $n : "" ) . $2;
    }
    return $out;

What I'd like to do is to avoid the loop and just make a substitution right in the s/// operation. We'll need to figure out the length of the $1 string match. Does Perl help us? Yes, it does.

As a side effect of regular-expression captures, Perl creates the special array variables @- and @+. These represent the offsets in the string where a capture group begins and ends. The $1 group starts at $-[1] and ends just before $+[1], so the length of $1 is $+[1]-$-[1]. If this sort of punctuation overload makes you twitchy, you can make it more readable by adding use English; and substituting Cobol-level verbosity: $LAST_MATCH_END[1]-$LAST_MATCH_START[1].

So what we want to do during the substitution is evaluate a little bit of code:

my $n = $LAST_MATCH_END[1] - $LAST_MATCH_START[1];
( $n > 1 : $n : "" ) . $2

We can do that in Perl. Adding the /e flag to the substitution operator lets us put code in the replacement.
s/.../my $n=$+[1]-$-[1];($n>1?$n:"").$2/e

Two more flags will reduce the problem to a one-liner. We want to do this globally, so we need /g, of course. And we want the result of the substitution as an output value. Normally the s/// operator returns the number of substitutions; to yield the modified string use the /r flag.

The final, dense code looks like:

sub rleRE($str)
{
    return ( $str =~ s/((.)\2*)/my $n=$+[1]-$-[1];($n>1?$n:"").$2/ger );
}

What about the decoding bonus? Even easier with regular expressions. We can exploit the /e flag again. Everywhere that we find the pattern of a number followed by a character, replace it with the replicated character.

sub rle_dec_RE($str)
{
    return $str =~ s/(\d+)(.)/$2x$1/ger;
}

The breakdown:

  • /(\d+)(.)/ -- Capture an integer ($1) and the character following it ($2).
  • /$2x$1/ -- This is code, not a string. The x is the Perl replication operator.
  • s///ger -- As before, we want to do this globally (g), using expression evaluation in the replacement (e), and returning the modified string (r).

But does it blend?

One final note: performance. I did a simple benchmark (included in the code on GitHub), which suggests that the regular expression encoding is about 25% slower than the character-at-a-time version. Pattern matching is concisely expressive, but there is a bit of machinery under the hood that isn't free. Of course, if performance really mattered, I'd write this in C. And if compression really mattered, I'd choose a different algorithm altogether.

Featured ones: