dev-resources.site
for different kinds of informations.
PWC 296 String Compression
Weekly Challenge 296, Task 1, asks us to implement a simple form of string compression. Run-length encoding as given here is easy to understand and implement, but it does have the short-coming that you can't encode text with numbers in it.
You are given a string of alphabetic characters, $chars.
Write a script to compress the string with run-length
encoding, as shown in the examples.
A compressed unit can be either a single character or a count followed by a character.
BONUS: Write a decompression function.
-
Example 1: Input:
$chars = "abbc"
Output:"a2bc"
-
Example 2: Input:
$chars = "aaabccc"
Output:"3ab3c"
-
Example 3: Input:
$chars = "abcc"
Output:"ab2c"
Thinking in C
I started my career as a C programmer, so when I see string problems, my first instinct is to iterate over characters. This problem seems to lend itself to that. We'll process left to right; if characters repeat we'll count and build up an output string.
sub rle_enc($str)
{
my @s = split(//, $str);
my $out = "";
while ( defined(my $c = shift @s) )
{
my $n = 1;
while ( @s && $s[0] eq $c ) { $n++; shift @s; }
$out .= ( $n > 1 ? $n : "" ) . $c;
}
return $out;
}
Probably the only odd thing worth noting here is that I consume the input string by converting it to an array and using shift
instead of using a for
loop.
Perl easily moves between numeric and string types, so the expression ($n > 1 ? $n : "")
handles the problem of omitting 1s easily, where a C compiler or other strongly-typed language would tell you that you can't mix types like that.
The bonus decoding is pretty easy, too. Whenever we see a number, remove it and replicate the following character.
sub rle_dec($str)
{
my $out;
while ( length($str) > 0 )
{
my $n = ( ( $str =~ s/^(\d+)// ) ? $1 : 1 );
$out .= (substr($str, 0, 1, '') x $n);
}
return $out;
}
Useful Perl features in this code:
-
s/^(\d+)//
-- Capturing the matching digit string while deleting it -
... x $n
-- The replication operator creates$n
copies
Thinking in Perl
Another way of approaching this problem is that we're substituting one pattern for another. Anyplace we find a repeated character, we'll replace it with the number of occurrences and one instance of that character.
We've seen the repeated-character pattern several times in the last few weeks of challenges. It's any character; capture it, and find that character again immediately following -- /(.)\1*/
. We'll also want to capture the entire group of matching characters, so add a set of parentheses around that, which makes the single character the second capture group ($2
) instead of $1
-- /((.)\2*/
.
So that's the pattern we want to match, now for the replacement. We need the length of the repeated string. That's the $1
capture group, so the straight-forward way is to build up an output string from repeated matches:
my $out = "";
while ( $str =~ s/^((.)\2*)//g )
{
my $n = length($1);
$out .= ( $n > 1 ? $n : "" ) . $2;
}
return $out;
What I'd like to do is to avoid the loop and just make a substitution right in the s///
operation. We'll need to figure out the length of the $1
string match. Does Perl help us? Yes, it does.
As a side effect of regular-expression captures, Perl creates the special array variables @-
and @+
. These represent the offsets in the string where a capture group begins and ends. The $1
group starts at $-[1]
and ends just before $+[1]
, so the length of $1
is $+[1]-$-[1]
. If this sort of punctuation overload makes you twitchy, you can make it more readable by adding use English;
and substituting Cobol-level verbosity: $LAST_MATCH_END[1]-$LAST_MATCH_START[1]
.
So what we want to do during the substitution is evaluate a little bit of code:
my $n = $LAST_MATCH_END[1] - $LAST_MATCH_START[1];
( $n > 1 : $n : "" ) . $2
We can do that in Perl. Adding the /e
flag to the substitution operator lets us put code in the replacement.
s/.../my $n=$+[1]-$-[1];($n>1?$n:"").$2/e
Two more flags will reduce the problem to a one-liner. We want to do this globally, so we need /g
, of course. And we want the result of the substitution as an output value. Normally the s///
operator returns the number of substitutions; to yield the modified string use the /r
flag.
The final, dense code looks like:
sub rleRE($str)
{
return ( $str =~ s/((.)\2*)/my $n=$+[1]-$-[1];($n>1?$n:"").$2/ger );
}
What about the decoding bonus? Even easier with regular expressions. We can exploit the /e
flag again. Everywhere that we find the pattern of a number followed by a character, replace it with the replicated character.
sub rle_dec_RE($str)
{
return $str =~ s/(\d+)(.)/$2x$1/ger;
}
The breakdown:
-
/(\d+)(.)/
-- Capture an integer ($1
) and the character following it ($2
). -
/$2x$1/
-- This is code, not a string. Thex
is the Perl replication operator. -
s///ger
-- As before, we want to do this globally (g), using expression evaluation in the replacement (e), and returning the modified string (r).
But does it blend?
One final note: performance. I did a simple benchmark (included in the code on GitHub), which suggests that the regular expression encoding is about 25% slower than the character-at-a-time version. Pattern matching is concisely expressive, but there is a bit of machinery under the hood that isn't free. Of course, if performance really mattered, I'd write this in C. And if compression really mattered, I'd choose a different algorithm altogether.
Featured ones: