Gabor Szabo: Code Maven

For a long time I have wanted to extend the Perl Maven site to include articles about other subjects as well, but I think publishing articles about Python, Ruby or even JavaScript on a site called Perl-something would be strange. Finally I reached the conclusion that the domain I had bought earlier for another purpose would be perfect for this. Hence a few days ago I started to post on the Code Maven site.

For the full article visit Code Maven

Perl Foundation News: Perl::Lint Bug Fixes and New Policies

Last week Taiki released version 0.11 of Perl::Lint to CPAN.

This new version fixes several bugs reported after Perl::Lint's release in November.

Taiki has not achieved everything outlined in his grant proposal: implementing some of Perl::Lint's policies has been much harder than expected. However, Perl::Lint provides a useful tool for analysing Perl source code. So, The Perl Foundation's grants committee would like to treat this grant as successfully completed soon.

As usual, we welcome comments on this grant from the Perl community.

Tinypig Blog: Command Line Project Manager (clpm) v1.0.1 released

clpm is designed to make managing sets of files easier at the Unix command line. It's free and open source. Please view the demo here. Find more info at http://tinypig.com/clpm

PAL-Blog: Attention: Freshly hatched!

Shortly after Zoe had learned that she is going to be a big sister, she proudly (and unexpectedly) told us that her Glubschi owl is pregnant too. After a short explanation she even believed us that owls are birds and therefore lay eggs instead of giving birth to live young.

perlancar's blog: pericmd 003: What’s wrong with Getopt::Long


So what’s wrong with Getopt::Long? Nothing, really. As an option parsing library it’s doing perfectly okay. There is some cruft, and things I personally would like to change (I’ll talk about them later), but these are minor. There are also alternative libraries with different/”fresh” approaches, like Docopt, which improve option parsing in some aspects, albeit introducing new problems of their own (we’ll also get into these later), but really the old Getopt::Long way is still fine for most cases.

As a CLI library, however, it is inadequate. Well, that’s unfair. What I meant to say is that Getopt::Long is not a full CLI library. It can be part of such a library, but a CLI library will want functionality not provided by Getopt::Long. For example, routing, which is just a fancy way of saying a nice way of mapping options to actions/commands. Getopt::Long does not provide such routing; it mostly deals with assigning values to options. So, in a more complex CLI application you’ll see a cascading if statement like this:

GetOptions(
    'clean'      => sub { $action = "clean" },
    'register'   => sub { $action = "register" },
    'unregister' => sub { $action = "unregister" },
    'install'    => sub { $action = "install" },
    'uninstall'  => sub { $action = "uninstall" },
    ...
);

if ($action eq 'clean') {
    ...
} elsif ($action eq 'register') {
    ...
} elsif ($action eq 'unregister') {
    ...
} elsif ($action eq 'install') {
    ...
} elsif ($action eq 'uninstall') {
    ...
}

It would be nice if a CLI library lets us specify a routing of options/subcommands to actions, much like what we usually see in web frameworks.
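
To make the idea concrete, here is a minimal sketch in plain Perl (the handler names are invented for illustration; this is not Getopt::Long's or Perinci::CmdLine's API) of how a dispatch table could replace the cascading if above:

use strict;
use warnings;

# Hypothetical handlers, just for illustration
sub do_clean     { print "cleaning...\n" }
sub do_install   { print "installing...\n" }
sub do_uninstall { print "uninstalling...\n" }
sub show_usage   { print "Usage: $0 <clean|install|uninstall>\n" }

# The "routing table": subcommand name => action
my %routes = (
    clean     => \&do_clean,
    install   => \&do_install,
    uninstall => \&do_uninstall,
);

my $subcommand = shift(@ARGV) // '';
my $handler    = $routes{$subcommand} || \&show_usage;
$handler->();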

Technically speaking, you can perform an action directly in an option handler, like I usually do with --help or --version, but this will be done immediately when that option is encountered by Getopt::Long, so you cannot see the options that follow it (this is usually okay for help/version messages because they don’t need to read other options and exit early anyway):

GetOptions(
    'help|h|?'  => sub { say "help message ..."; exit 0 },
    'version|v' => sub { say "foo version $VERSION"; exit 0 },
    ...
);

Another feature that Getopt::Long does not provide is automatic help message generation. Well, actually Getopt::Long does have an auto_help switch, but this will only dump the POD instead of generating a help message from its options specification. The main reason is that the specification does not contain rich enough (meta)data to generate a proper help message. Getopt::Long::Descriptive attempts to remedy this, and I’ll talk about it too in another post.

Also missing from Getopt::Long is the concept of subcommands, which most CLI applications will need to have once they grow big/complex enough, e.g. git, perlbrew. Getopt::Long is purely about options and does not care about arguments or subcommands at all. There are other libraries, some of them based on Getopt::Long, that tackle this, and we’ll talk about those libraries in the future.


Perl Foundation News: January 2015 Grant Votes

The Grants Committee has concluded the voting of the January round.

Proposal in this round

Voting Results

Title   Yes   No   Score
Ado     1     4

Definition of the score is found in 3.2 of the rules.

Details

While it is always difficult to make a decision on grant proposals, this was a particularly tough one. We understand this is a serious project with a good amount of interest from the community, including multiple code committers.

The main reason that led us to the negative conclusion is as follows: there are already plenty of web frameworks and CMSes. It is not a bad thing to have another one, but we are not able to fund this "another one", which may or may not attract new users given that there are just so many choices. In this specific, highly competitive area, we tend to spend money on the marketing, promotion or tutorials of software which is completed, established and/or adopted, or otherwise on initiatives to solve a specific problem.

We wish Ado every success.

Next round

The next round is in March. You can submit proposals now.

perlancar's blog: pericmd 002: Getopt::Long


Getopt::Long (cpanratings) is an established Perl module for parsing command-line options. A bit of history (via some googling): it has been included as part of the Perl distribution since perl-5.0.0 back in 1994. It was actually created as newgetopt.pl during the pre-module days of perl 4.x. The creator is Netherlands-based Johan Vromans (CPAN ID: JV), who also wrote the Perl Pocket Reference (one of the well-respected Perl books, if I remember correctly). Amazingly, after 20+ strong years, Johan still maintains Getopt::Long to this very day. So, deep kudos for that.

Why the ::Long in the name, you might ask? Because before newgetopt.pl there was actually a Perl 4 library called getopt.pl, which became Getopt::Std in Perl 5. It can only handle single-letter options, so we will not consider using it in this series.

Here’s a typical CLI program written using Getopt::Long, taken from my scripts repository:

#!/usr/bin/env perl

use 5.010;
use strict;
use warnings;

use Getopt::Long;

my %opts = (parts => ["/backup"], all=>0, percent_used=>95);

GetOptions(
    "part=s"             => $opts{parts},
    "all|a"              => \$opts{all},
    "percent-used|pct=i" => \$opts{percent_used},
    "help|h|?"           => \$opts{help},
);

if ($opts{help}) {
    print <<_;
Usage: $0 [opts]
Options:
  --part=S  Add partition S to warn, eg. --part=/home --part=/
  --all     Warn all partitions (except tmpfs)
  --percent-used=N  Warn if percentage of usage exceeds N (default: 95)

_
    exit 0;
}

my $i;
for (`df -h`) {
    next unless $i++;
    chomp;
    my ($fs, $blocks, $used, $avail, $pctuse, $mp) = split /\s+/, $_;
    $pctuse =~ s/%//;
    #say "DEBUG: mp=$mp, pctuse=$pctuse";

    next if $fs =~ /^(tmpfs)$/;
    if (!$opts{all}) { next unless $mp ~~ $opts{parts} }
    next unless $pctuse >= $opts{percent_used};
    say "Disk usage in $mp has exceeded $pctuse% (only $avail left)!";
}

As you can see, using the module is pretty straightforward: you just specify a hash of option specifications and handlers and pass it to the GetOptions() function, which will search @ARGV for the options. The function will modify @ARGV, stripping all the known options and leaving only unknown options and arguments.

The only thing you’ll need to familiarize yourself with is basically the options specification. Short of a formal grammar, you might want to glance at the syntax of the options spec via the regex specified in Getopt::Long::Util, copy-pasted here for convenience:

    $optspec =~ qr/\A
               (?:--)?
               (?P<name>[A-Za-z0-9_][A-Za-z0-9_-]*)
               (?P<aliases> (?: \| (?:[^:|!+=:-][^:|!+=:]*) )*)?
               (?:
                   (?P<is_neg>!) |
                   (?P<is_inc>\+) |
                   (?:
                       =
                       (?P<type>[siof])
                       (?P<desttype>|[%@])?
                       (?:
                           \{
                           (?: (?P<min_vals>\d+), )?
                           (?P<max_vals>\d+)
                           \}
                       )?
                   ) |
                   (?:
                       :
                       (?P<opttype>[siof])
                       (?P<desttype>|[%@])
                   ) |
                   (?:
                       :
                       (?P<optnum>\d+)
                       (?P<desttype>|[%@])
                   )
                   (?:
                       :
                       (?P<optplus>\+)
                       (?P<desttype>|[%@])
                   )
               )?
               \z/x
                   or return undef;

So, an option is more or less a “word” which can be followed by zero or more aliases. An alias is usually a single letter (like “a” for “all”) or a shorter version of the option (like “percent-used” vs “pct”), but can also be some non-alphanumeric characters (like “?” as alias for “help”).

If an option requires a value, you can add “=” followed by one of these letters: s (for string), i (for int), f (for float), o (nowadays seldom used?). Getopt::Long actually allows you to specify an optional value, e.g. by specifying “debug:i” instead of “debug=i”, so the user can specify either of these options: --debug, --debug=9 (or --debug 9).

Getopt::Long even allows you to specify multiple values, e.g. “rgb-color=i{3}”, so the user can specify options like: --rgb-color 255 255 0. However, you can’t do something like this (specifying a different type for each value): “person-age=s,i”. I seldom use this feature, though.
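
To make the value specifications above concrete, here is a small, self-contained example (the option names are made up for illustration):

use strict;
use warnings;
use Getopt::Long;

my %opt;
my @rgb;
GetOptions(
    'name=s'         => \$opt{name},   # requires a string value
    'debug:i'        => \$opt{debug},  # optional integer value: --debug or --debug=9
    'rgb-color=i{3}' => \@rgb,         # exactly three integer values
) or die "Error in command-line arguments\n";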

There are other features of Getopt::Long, some very useful, some rarely used nowadays. We will cover these in later posts, because Perinci::CmdLine uses Getopt::Long for command-line option parsing.

Compared to other (newer) options processing modules on CPAN that reinvent the wheel, chances are Getopt::Long is still your best bet. It’s battle tested, comes with Perl 5 out of the box, and already covers most of the common needs of options processing. Most of the other options processing modules don’t even do the two things that we usually take for granted when using common Unix/GNU utilities: short option bundling and auto abbreviation.

Short option bundling is a feature that lets you bundle two or more single-letter options in a single command-line argument. For example, when using rsync I usually write: rsync -avz … instead of rsync -a -v -z … The short options must not take values, except for the last one.

Auto abbreviation is a feature that lets you type only partial (the first letters of the) option names, as long as it is unambiguous. For example, if you have an option called --process-directories, you can just specify --process-dir or even --process, as long as it’s unambiguous with the other existing options. This is convenient.
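
Both features are easy to see in a small script; note that auto abbreviation is on by default, while bundling has to be enabled explicitly (the option names here are just examples):

use strict;
use warnings;
use Getopt::Long qw(:config bundling);

my %opt;
GetOptions(
    'archive|a'           => \$opt{archive},
    'verbose|v'           => \$opt{verbose},
    'compress|z'          => \$opt{compress},
    'process-directories' => \$opt{process_dirs},
) or die "Usage: $0 [-avz] [--process-directories]\n";

# These all work:
#   script -a -v -z
#   script -avz              (short option bundling)
#   script --process-dir     (auto abbreviation, if unambiguous)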

In the next post I’ll write about what’s wrong with Getopt::Long.


perlancar's blog: pericmd 001: Creating command-line application with Perinci::CmdLine


I’m starting a (potentially long, hopefully once daily) series of blog posts about Perinci::CmdLine, my library for creating command-line applications in Perl, and issues with command-line applications in general. Since I want to spend only between 10-30 minutes writing each post, I expect each post to be short, and a single topic, if large enough, might span several posts. I was inspired by byterock’s series of posts last year where he chronicled his adventure with Moose (and then with XS, in another series), as well as Louis Erickson's series on the Perl Best Practices. The decision to do 10-30 minutes per daily post was first made last December when I did an advent calendar. I found that that amount of time spent per post is about right because it does not disrupt my daily flow too much. If I have a longer block of time available, e.g. 1-2 hours, I can always create several posts and schedule them to be published in the subsequent days, giving me more time to come up with the idea for the next unwritten post.

The series will begin with a list of features that I need/want in a command-line application library, and a review of other libraries/frameworks available on CPAN and in other languages and what’s “wrong” (and “right”) about each of them, and why I ended up creating my own.

After that, the next posts will continue with the background and architecture of Perinci::CmdLine itself. And finally with building one (or several) command-line applications step by step, highlighting related features of Perinci::CmdLine as we go along.

Requirements

Ok, enough introduction. Let’s enumerate all the features that I need from a command-line application library. Ideally, for me at least, I want as many of the needed features as possible, to allow me to just write the “business logic” and not deal with all the mundane, tedious plumbing tasks.

A typical command-line application (hereafter called a CLI app for short, while the library will be called a CLI library) needs to do these things:

  • parse command-line options and arguments
  • decide which code to handle which tasks (routing)
  • display usage/help message (typically invoked via --help, -h, or -?)
  • display result/output
  • return appropriate exit code

Also:

  • A CLI app should have a proper manpage.
  • Sometimes a CLI app wants to be interactive and prompt/ask the user questions.
  • Often we also want to display a progress indicator as a visual hint to the user, if we are doing something that might take a long time.
  • Specific to a CLI app, we also want tab completion.
  • And lastly, we want fast startup time. I define fast as under 0.05-0.1s, because anything over that starts to get noticeable.

In the next post we’ll discuss Getopt::Long.


Perl Foundation News: Deadlines Approaching!

As you prepare for this year's YAPC::NA be sure to take advantage of the early bird rates being offered by the Little America Hotel in Salt Lake City! Reserve your room(s) prior to March 1st 2015 and receive a nightly discount.

After March 1st, rates will go up $20/room/night. So be sure to book today. The reservation link can be found at yapcna.org/yn2015/location.html.

CALLING ALL SPEAKERS!

Don't forget that submissions for your YAPC::NA::2015 talks are due before midnight (23:59 EST) on March 1st 2015. Please submit your proposal by visiting yapcna.org/yn2015/newtalk.

PAL-Blog: It doesn't get more German than this: the health insurance card

Three things are typically German: complaining about everything, bureaucracy, and replacing things that work with more complicated ones. SEPA shows how German the EU has already become. At the national level, our healthcare system has traditionally been the front-runner in bureaucracy and complexity.

Laufeyjarson writes... » Perl: PBP: 079 Tabular Ternaries

Mr. Conway suggests that you can cascade uses of the ternary (?:) operator instead of an if-elsif-else chain.  He feels this is much more readable.  I strongly disagree.

I’m going to leave it to you to dig up a copy of the book and go look at page 122, as I don’t feel like typing it all in here.  I’ll just say that every time I run into this construct, I pretty much can’t read it at all and have to reformat it as if-elsif-else blocks to understand what the mess is doing.  I find the ?: chain hides the control flow greatly, and mashes a whole bunch of details into a too-dense mess.
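
For readers without the book at hand, the construct in question looks roughly like this (an illustrative sketch, not Mr. Conway's actual example from page 122):

use strict;
use warnings;

my $score = 82;

# The cascaded "tabular" ternary:
my $reward = $score >= 90 ? 'gold star'
           : $score >= 75 ? 'silver star'
           : $score >= 50 ? 'bronze star'
           :                'try again';

# The if/elsif/else chain I end up rewriting it into:
my $reward2;
if    ($score >= 90) { $reward2 = 'gold star'   }
elsif ($score >= 75) { $reward2 = 'silver star' }
elsif ($score >= 50) { $reward2 = 'bronze star' }
else                 { $reward2 = 'try again'   }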

Perhaps this is just me, but I find this construct unreadable, uneditable, and unmaintainable.  If you’re doing this, I think you’re doing something wrong.

Since the ternary cascade is (in my opinion) far less readable and has the same performance characteristics, there’s no good reason not to use the more readable version.

Of course, Mr. Conway says exactly the same thing, just that the way I need to see it is the unreadable one.

The Onion Stand: My CPAN Pull Request Challenge for January: Fuse


So! Back in December I subscribed to Neil Bowers' CPAN Pull Request Challenge. In a nutshell, throughout 2015 every participant will get an email in the first days of each month with a CPAN module that needs a little bit of love, and we have until the end of the month to submit at least one pull request for that module's Github repository.

The module I got was "Fuse".

What is Fuse?


I must admit to my ignorance as I had never heard of Fuse before. So the first thing I did was jump to Fuse's page on MetaCPAN and have a look. Fuse is a Perl interface to the FUSE library (Filesystem in USErspace), and lets you implement filesystems in Perl. Wow! That's pretty cool, right?!

Where to start


If Neil was pointing me to Fuse, it means there's probably work for me to do. So what I did was read the documentation, CPAN Testers reports, open issues and the Github page. Then I downloaded the source code and grep'd for tags like "FIXME" and "TODO". In the end it was a very well-thought-out distribution and I think the authors did - and are still doing - a pretty nice job. Still, this is what I thought I could help with:
  • The (external) FUSE kernel/lib doesn't seem to be linked anywhere in the main doc, just on the README;
  • The SYNOPSIS and DESCRIPTION could be a bit more verbose - maybe I'm just not used to the FUSE library, but I felt a lot was being taken for granted;
  • The README (displayed on github) doesn't point to a lot of useful places in the Perl-sphere (other than the examples and how to install);
  • There is no license information on the main pod - though it's there on the META resources;
  • The latest version is not indexed for some reason;
  • There are a lot of FAIL and UNKNOWN reports on CPAN Testers.

First things first


There were several small things for me to attempt here, so rather than making a huge bundled Pull Request for Fuse, I chose to compartmentalize them into smaller, separate PRs. This way the authors would probably be more comfortable reviewing my changes and I wouldn't depend on all of them being accepted. This is super easy with git and github: just create a branch from master and make the pull request from your specific branch to the author's master. This will let you easily go back to their master (instead of to the one you might have changed) and work on something else independently.

1. Doc fixes

Since I was really short of free time this month, I decided to start with the lowest hanging fruit. So I created a "garu/doc_patches" branch and tried my best to improve on the existing documentation. I expanded the SYNOPSIS, put external links whenever I missed them, and did some minor tweaks here and there with the pod formatting to improve readability. I also moved the README to markdown so it looks awesome on Github while still being nice to read on the console. When I was satisfied, I sent out the pull request.

This was great and really helped me understand what Fuse was all about and how to use it. I recommend anyone trying to learn and contribute to a project to first read the docs and try to improve them whenever you find something you don't quite understand. Most of the time the ones writing the documentation are too familiar with the code and API so they can take a lot of it for granted.

2. WHY U NO INDEX?

There was an open ticket saying 0.16.1 was not indexed. Upon further investigation, I saw versions 0.16_01 and 0.16.1 were in fact indexed, but available as if they were older releases. Why weren't they showing up as the latest? I went to #metacpan on irc.perl.org and asked around.

As haarg++ pointed out, Fuse inadvertently mixed single dotted and double dotted version formats, releasing '0.16' (instead of '0.16.0') and '0.16.1'. When the CPAN indexer compares version numbers, 0.16.1 becomes 0.016001 and plain 0.16 becomes 0.160000, and as such, "0.16" is indeed greater than "0.16.1".
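
You can see the comparison for yourself with the version module (a small demonstration I added; it is not part of the original post):

use strict;
use warnings;
use version;

my $plain  = version->parse('0.16');    # plain decimal version
my $dotted = version->parse('0.16.1');  # dotted-decimal version, i.e. v0.16.1

printf "0.16   numifies to %s\n", $plain->numify;
printf "0.16.1 numifies to %s\n", $dotted->numify;
print "0.16 compares as greater than 0.16.1\n" if $plain > $dotted;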

While I believe version numbers should be boring, I understand why some people go for the major.minor.revision semantic versioning format. So I went ahead and filed a pull request bumping up the version to 0.160.2, and another one that bumps it to 0.17. Both numbers are greater than 0.16 so they'll definitely show as latest, and the authors can just pick whichever format they prefer and discard the other PR.

3. Making tests pass

Finally, I noticed that all the UNKNOWN tags on CPAN Testers were there because FUSE was not installed. Implementing an Alien::FUSE module seems like a good idea, but it is a bit out of scope for the PR Challenge and the limited free time I had. So I went on and installed FUSE for OS X using homebrew. To my surprise, the Makefile.PL was still not finding it:

    $ perl Makefile.PL
    Cannot build for platform: darwin
    Please install OSXFUSE from http://osxfuse.github.com/

I checked Makefile.PL's source and found out what was wrong: it's trying to find libfuse using the "pkg-config" tool. Although "pkg-config" is rather ubiquitous in major Linux systems, my OS X did not have it. So I made another PR that checks for the existence of either 'pkg-config' or 'ppkg-config' (a pure-Perl alternative to 'pkg-config' available with the excellent PkgConfig CPAN module) early in the Makefile.PL, dying with an explanatory message if neither is available. There's even an opportunity here to simply depend on PkgConfig and use the module version of the commands instead of relying on external "(p)pkg-config" tools, but I thought I'd rather wait a bit and see how well received this change is first.

Now the module compiles on OSX, but I still can't make the tests pass. I dug a bit and found some commands like read_buf and write_buf are required for testing, but are only available in libfuse versions 2.9.0 and above. However, osxfuse seems to use a different version equivalence and the latest version (released just a month ago) is still 2.7.4. I couldn't find an osxfuse <=> libfuse version equivalence, and because of this, the Fuse module doesn't create the proper bindings for such functions and the tests fail. There might be other issues still, but I stopped there, filing a ticket and hoping someone else will pick it up where I left off.

Wrapping up


I had a lot of fun finding out about Fuse and playing with it. By the end of January I was able to file 4 pull requests and 1 RT ticket to the main repo. I'm sad I wasn't able to get the tests to pass on OS X, but hopefully someone else will read this and be inspired :)

I'd like to thank Neil for this awesome PR Challenge idea, Graham "plicease" Ollis for the great PkgConfig module and for quickly replying to and merging the small changes I proposed that allowed me to use it on Fuse, and Graham "haarg" Knop and everybody from #metacpan for their amazing support.

Also, of course, a huge thank you to Dobrica Pavlinušić, Derrik Pates, Mark Glines and the entire Fuse team for such a great module, and for allowing me the opportunity to be a part of it.

Can't wait to see which module I'll get next month! :)

perlancar's blog: Progress indicator in terminal-based applications


In terminal-based applications, there are various ways one can show progress indicators:

Simple series of dots:

Copying files .
Copying files ..
Copying files ...
Copying files .....
Copying files ......

Spinning cursors:

Copying files |
Copying files /
Copying files -
Copying files \
Copying files |

Or progress bars (of various styles):

Copying files [=         ] 10%
Copying files [==        ] 20%
Copying files [===       ] 30%
Copying files [====      ] 40%
Copying files [=====     ] 50%

With Progress::Any, you can update progress indicators and then show them in various forms just by switching output (Progress::Any::Output::*) modules, without changing your code. So far these are the (useful) output modules available for terminal applications: TermProgressBarColor (for progress bars) and TermSpin (for spinning cursors). Here’s some very minimal code to show progress:

use Progress::Any '$progress';
use Progress::Any::Output 'TermProgressBarColor', width=>40;
$|++;
print "Copying files:";
for (1..10) {
    $progress->update;
    sleep 1;
}
$progress->finish;
print "done\n";

Sample output:

Copying files:   0% [ ==              ]1s left         
Copying files:   0% [  ==             ]1s left         
Copying files:   0% [   ==            ]1s left         
Copying files:   0% [    ==           ]1s left         
Copying files:   0% [     ==          ]1s left         
...
Copying files: done

Since we have not provided any target, the completion percentage and ETA (estimated time of “arrival”, or completion) cannot be calculated. To specify the target, you can add:

$progress->target(10);

before the loop. Output:

Copying files:  10% [=                ]1s left         
Copying files:  20% [===              ]4s left         
...
Copying files:  90% [================ ]1s left         
Copying files: done

To use another output, just change this output line, e.g.:

use Progress::Any::Output 'TermSpin';

Sample output:
Copying files: |
Copying files: /
Copying files: -
...
Copying files: done

The spinning cursor output does not show completion percentage or ETA.

There are many other features available, e.g. TermProgressBarColor can show a text message inside the bar and can autoset its width to the whole terminal width. You can have separate progress indicators for multiple (sub)tasks. And there are other outputs for non-terminal applications, e.g. for GUI or for logging. Please refer to the documentation for all these.

One last thing I want to note here is that the terminal progress output modules have been written to work together with Log::Any, particularly with the ScreenColoredLevel adapter. The log adapter and progress output modules all write to terminal, and care has been taken so they do not clobber each other’s outputs. Here’s a full example:

#!/usr/bin/env perl

use 5.010;
use strict;
use warnings;

use Log::Any '$log';
use Log::Any::Adapter 'ScreenColoredLevel', min_level=>'info';

use Progress::Any '$progress';
use Progress::Any::Output ('TermProgressBarColor', show_delay=>2);

use Time::HiRes qw(sleep time);

$progress->target(3);

$log->info('Starting');

$log->info('Doing task 1/3 ...');
{
    my $target1 = 30+int(40*rand());
    for (1..$target1) {
        $progress->update(pos => $_/$target1);
        if (rand() < 0.025) {
            $log->warn("There's something wrong with item $_");
        }
        sleep 0.1;
    }
}

$log->info('Doing task 2/3 ...');
{
    my $target2 = 30+int(40*rand());
    for (1..$target2) {
        $progress->update(pos => 1 + $_/$target2);
        sleep 0.1;
    }
}

$log->info('Doing task 3/3 ...');
{
    my $target3 = 30+int(40*rand());
    for (1..$target3) {
        $progress->update(pos => 2 + $_/$target3);
        sleep 0.1;
    }
}

$log->info('Finished');

One nice thing about the example above is the usage of TermProgressBarColor’s show_delay parameter (TermSpin also has the same parameter, BTW). By setting this parameter, the progress output will not be shown until the specified number of seconds has passed (the counter is reset to zero whenever a new log message is displayed to the screen). This has the effect of less chatty output (only log messages are shown, without progress reports), but if several seconds pass without any more messages being displayed to the screen, the progress output automatically shows.


dagolden: Moving CPAN RT tickets to Github, now improved

When I wrote about migrating RT tickets to Github a couple years ago, it was a quick-and-dirty job. This week, I finally cleaned it up and added some features I'd been missing:

  • Automatically closing the migrated RT ticket with a hyperlink to the new Github ticket
  • Migrating "stalled" RT tickets, as well as "new" and "open" tickets
  • Including all comments, not just the opening one, along with creator email and creation time stamp
  • Including hyperlinks to any file attachments
  • Including a list of all "requestors" email addresses (people following the ticket)
  • Improving distribution name detection for people who don't use Dist::Zilla
  • Adding a "dry-run" mode
  • Adding a "single-ticket" mode

Here's a draft of what a newly migrated ticket might look like (from a test run, not an actual migration, but it shows an attachment):

Migrated ticket example

Or, here's an actual ticket migrated over for IO::CaptureOutput: Bad joining of STDERR + STDOUT in windows [rt.cpan.org #55164].

I've created a github repository for this at zzz-rt-to-github to make it easy for people to grab the script or contribute back. The README file there has instructions.

Enjoy!

NEILB: The meaning of MetaCPAN favorites

Until now I haven't really been sure what interpretation to put on MetaCPAN favorites, so I haven't favorited anything. You won't be surprised to hear that the PR challenge prompted me to think about favorites, and your response to my interpretation is likely to be "well, yeah, duh!".

mvuets' notes on #cpanpr: January: SOAP-Lite

Su-Shee> mvuets: oh dear god you have soap::lite? my condolences..

SOAP-Lite is a collection of modules that enables you to write SOAP clients and servers, no surprise. As the name suggests, the distribution is not comprehensive; for instance it lacks full support for WSDL. But in general it’s good enough. The current maintainer is PHRED.

If you know what SOAP is, you can figure out what SOAP-Lite is about. And if you have used it, you likely know much more than me. But if you believe that SOAP is a way to shout a cleansing agent, maybe I can tell you something new…

What is SOAP


I presume you are already familiar with the concept of a web service. To communicate effectively, both client and server should speak the same language. That means a client should know what actions a service can perform, how to pass data and in what format, what communication channel to use, how to serialize data, and so on.

There are many bits and ways to put it all together and make a web service. The modern de facto way seems to be to follow the REST style and utilize HTTP, URI, and JSON. But before that was even cool, there were other options, and SOAP was one of them.

SOAP is a messaging framework that defines means for exchanging information and is extensively based on XML. It is standardized by W3C, has a few big company names behind its emergence, and used to be an acronym for “Simple Object Access Protocol”. Overall SOAP feels very enterprisey, which may be either bliss or sorrow—it depends on your spirit.

First, let’s say a few words about SOAP’s advantages and why it stays relevant nowadays.

  • It’s documented and defined by the specification.
  • Since it’s XML, one may benefit from other technologies in the stack: XML Schema, XML Namespaces, XSLT, XPath, XQuery, etc.
  • It’s well backed up by enterprise languages like C# and Java.
  • Besides HTTP, SOAP messages may also be transmitted over SMTP, FTP, XMPP, and likely other application layer protocols. (I personally have never seen this used in practice, but I admit it sounds intriguing.)
  • Quite a few web services out there operate on SOAP only. Just to name one: AdWords, Google’s main source of revenue.
  • A SOAP service API may be described in WSDL, the Web Services Description Language, which, when used properly, is extremely powerful and handy. I will show a brief example of its usage later.

Why is SOAP distasteful

During the tide of protests against SOPA someone made a great pun that, I think, shares a similar sentiment about SOAP:

http://web.archive.org/web/20130708174927/http://stopsoap.com/

So why do many people not like SOAP? Here are a couple of reasons.

It’s XML. Yes, it’s dualistic. While there are good parts, it’s also too complex, bloated, hard to write by humans, easy to parse wrong, and so forth.

SOAP takes it even further. Take a look at this request-response example captured from an actual web service:

<?xml version="1.0" encoding="UTF-8"?>
<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/">
  <soapenv:Body>
    <helloName xmlns="http://nks34.t320">
      <name>John Doe</name>
    </helloName>
  </soapenv:Body>
</soapenv:Envelope>

<?xml version="1.0" encoding="utf-8"?>
<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/">
  <soapenv:Body>
    <ns:helloNameResponse xmlns:ns="http://nks34.t320">
      <ns:return>Hello John Doe</ns:return>
    </ns:helloNameResponse>
  </soapenv:Body>
</soapenv:Envelope>

Making a SOAP request is like invoking a remote function by name and passing it arguments. Using this analogy, the gist of the request above can be expressed as nks34_t320::helloName("John Doe"). And the reply here is just a string: Hello John Doe. Everything else is scaffolding that’s necessary for SOAP.

Namespaces. These come from XML. They are supposed to resolve element name conflicts when mixing a few XML documents from different XML applications. But it seems that no one ever gets it right.

Look at these two XML document examples. They both are semantically equivalent. They both carry the same data. They both have the same structure. The only difference is in how they are spelled. And it’s up to you (and your serializer) whether to use a default namespace (bare xmlns attribute); or to use an explicit namespace prefix, which may be any custom string you like (e.g. xmlns:feed).

<f:article xmlns:f="http://example.org/feed">
  <f:title>Breaking news!</f:title>
  <f:body xmlns:b="http://www.w3.org/1999/xhtml">
    <b:div>
      <b:p>Young boy living on Mars discovers that
      his <b:em>parents are robots</b:em>.</b:p>
    </b:div>
  </f:body>
</f:article>

<article xmlns="http://example.org/feed">
  <title>Breaking news!</title>
  <body>
    <div xmlns="http://www.w3.org/1999/xhtml">
      <p>Young boy living on Mars discovers that
      his <em>parents are robots</em>.</p>
    </div>
  </body>
</article>

Unfortunately, many SOAP clients and servers take it literally. They expect messages either to use no default namespaces, or to have only explicit prefixes. And sometimes the prefixes must be specific strings. In other words: where they should look at the essence of XML elements (e.g. element A comes from namespace B, which was declared earlier by URI C), they instead expect concrete sequences of bytes.

That issue is not necessarily XML’s fault. But XML is hard, and it leads to poor quality services. Now, because you have to use such services, you have to adapt your clients accordingly. This increases code complexity and leads to potential problems in client code as well.

By the way, to overcome the aforementioned namespace issues, SOAP::Lite provides two attributes: default_ns($uri) and ns($uri, $prefix). They give you some control over formatting requests.

It’s standardized. So you think you can rely on the specification when writing your client to interact with other SOAP code out there. If only everyone followed specifications… The spec mostly gives you illusory confidence that is destroyed right away by harsh reality.

Here is another example of inconsistency among implementations. A SOAP message is called an Envelope. It contains one mandatory element, Body, and may have two optional elements, Header and Fault. Header might carry an authentication token, query rate quota, cache information, etc. A server reply might use Fault to indicate request processing status and errors. In reality some services resort to HTTP status codes and custom headers (remember: SOAP is designed to be used with a variety of transport protocols, not only HTTP). Some report errors in the Body. Some return special values instead of the wanted result, like -1. And don’t get me started on the myriad of different Body format styles.

From my own experience, the last headache I remember having was a Baidu pay-per-click advertising SOAP service that refused to accept argument-less requests containing xsi:nil="true". Fortunately I could work around the problem by explicitly passing optional arguments.

SOAP::Lite

In light of all these SOAP pain points, you would like to have a generic SOAP library that can handle them. And the SOAP::Lite module does a pretty good job. So let’s run a quick example and see what it looks like to make a simple SOAP client.

Objective: get the current weather report for Amsterdam, NY, USA with the help of the CDYNE Weather web service.

First you want to find out how to access the web service and what arguments it accepts. One way to do it is to look at its WSDL file. But you will soon find yourself sick of tons of angle brackets and become this >_<.

Thankfully, there are tools that can simplify the routine. One of them is a web SOAP client hosted at service-repository.com. Given the WSDL, you get a list of the functions the web service provides. You are interested in GetCityWeatherByZIP. Click on it to see the location (the endpoint URI to send requests to) and a form showing the required arguments. The single argument is ZIP, the U.S. postal ZIP code of the city in question. In our case it’s 12010.

Now, after a series of trials and errors, I came up with this code that does what was intended. Follow the comments in the code below.

use v5.20;
use warnings;
use SOAP::Lite;
use Data::Printer;

# Build a client object
my $client = SOAP::Lite->new;

# Set an endpoint URL
$client->proxy('http://wsf.cdyne.com/WeatherWS/Weather.asmx');
# Set default XML namespace
$client->default_ns('http://ws.cdyne.com/WeatherWS/');
# Set SOAPAction HTTP header: .NET service demands it in format "$ns/$method"
$client->on_action( sub { join '', @_ } );

# Make a call to the GetCityWeatherByZIP function
# and pass one untyped named argument
my $res = $client->call('GetCityWeatherByZIP',
    SOAP::Data->name('ZIP' => 12010),
);

# The XML response gets parsed automatically
# and is transformed into native Perl data structures
my $body = $res->result;

# If the service utilized the Fault SOAP element,
# this code would catch errors
if ($res->fault) {
    die $res->faultstring;
}

# ...but it does not. You have to do custom error checking
if ($body->{Success} ne "true") {
    die $body->{ResponseText};
}

# Dump the entire reply to screen
p $body;

After running it, this is the output that I got:

{
  City               ⇒ "Amsterdam",
  Description        ⇒ "Cloudy",
  Pressure           ⇒ "29.67R",
  RelativeHumidity   ⇒ 63,
  Remarks            ⇒ "",
  ResponseText       ⇒ "City Found",
  State              ⇒ "NY",
  Success            ⇒ "true",
  Temperature        ⇒ 45,
  Visibility         ⇒ "",
  WeatherID          ⇒ 14,
  WeatherStationCity ⇒ "Albany",
  Wind               ⇒ "W17G24",
  WindChill          ⇒ ""
}

This is SOAP::Lite in a nutshell. I’ll spare you further details of the module features. But you can refer to the official SOAP::Lite documentation, if you want to know more.

In my next post I will tell you what I learned while working on the assignment and what pull requests I actually made. Stay tuned!

Nestoria Dev Blog: Module of the month January 2015: Test::Deep

Test::Deep is a very useful extension to Test::More. Its exported subroutines are so generally useful you’ll soon rely on them as much as is_deeply. A while back, we blogged about how we include some modules automatically in all our tests with Test::Kit, and including Test::Deep there was a no-brainer.
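
If you haven't used it before, here is a small, self-contained taste of the matchers Test::Deep gives you (the data and test are invented for illustration):

use Test::More tests => 1;
use Test::Deep;

my $got = {
    name    => 'Alice',
    id      => 42,
    roles   => [ 'admin', 'user' ],
    updated => time(),
};

cmp_deeply(
    $got,
    {
        name    => 'Alice',
        id      => re(qr/^\d+$/),         # any numeric-looking id
        roles   => bag('user', 'admin'),  # same elements, any order
        updated => ignore(),              # must exist, value doesn't matter
    },
    'user record has the expected shape',
);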

Its primary contributor is Ricardo Signes, also known as RJBS. Frankly, RJBS's contributions to Perl go far beyond Test::Deep: he has been the pumpking many times since Perl 5.11.4, and he has contributed many modules to CPAN. Ricardo will be receiving $1/week for a year on Gratipay as a sign of our gratitude for Test::Deep, although we know he deserves much more than that. Congratulations, Ricardo!

If you’re not using Test::Deep, you really should check it out.

Perl News: Perl QA Hackathon 2015

The Perl Quality Assurance Hackathon 2015 has been confirmed; this year it takes place between the 16th and 19th of April in Berlin.

There is an interesting article about the details and why it is important, including a call for donations to help support this valuable work.

Laufeyjarson writes... » Perl: PBP: 078 Value Switches

The idea behind this Best Practice is to explain how a table lookup can be used instead of an if/elsif/elsif/elsif/elsif/else block to make selections from a set of constants.  This is almost always a good choice!

Mr. Conway’s examples are clear and make the point well.  One thing I have seen done is to use a hash: the keys hold the values to check for, and the values hold code blocks containing what to do.  Indexing hashes is very quick, and the code block then does what is needed.  A set of these definitions, and their usage, is pretty clear and enormously flexible.
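
A minimal sketch of that hash-based approach (the events and actions here are invented for illustration):

use strict;
use warnings;

my %handler_for = (
    created => sub { print "welcome aboard\n"    },
    updated => sub { print "profile refreshed\n" },
    deleted => sub { print "goodbye\n"           },
);

my $event = 'updated';
if ( my $handler = $handler_for{$event} ) {
    $handler->();    # dispatch straight from the table
}
else {
    warn "no handler for event '$event'\n";
}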

NEILB: The first challenge PR I received

I received my first pull request (PR) originating from the PR challenge today. NGLENN had been assigned HTML::ParseBrowser, a module I adopted a couple of years ago. He did the thing I've most been meaning to do, but never got around to. Unprompted. Aces!

Sawyer X: Web scraping continued

I recently gave a talk at AmsterdamX.pm about web scraping. I provided a few examples of scraping (most of them on my Github repo), and amongst them, a few relating to the January assignments page Neil has put up.

The first was to simply check if any of them are mine. I check if I'm the last person who released or if the repo is under my username. It's not very accurate because someone else might release and some projects are under a Github organization, but it's a good start.

Then I thought, I could check for people I know personally, like my colleagues. I did this in the second script. Using Acme::CPANAuthors I apply the same logic I did before, but to a large audience.

By then the output was difficult to read. The third script put everything in a hash so I could display it better.

During my talk I also mentioned I had wanted to merge Web::Query and AnyEvent and the following script achieves just that.

Since you can provide wq() with content on which it would select and not just URLs for it to fetch, I can use AnyEvent::HTTP to fetch and then feed the input into wq(). Fun!
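
Roughly, the combination looks like this (a hedged sketch of the approach, not Sawyer's actual script; the URL is a placeholder):

use strict;
use warnings;
use AnyEvent;
use AnyEvent::HTTP;
use Web::Query;

my $cv = AnyEvent->condvar;

# Fetch asynchronously, then hand the fetched content to wq()
http_get 'http://example.com/', sub {
    my ( $body, $headers ) = @_;

    wq($body)->find('a')->each( sub {
        my ( $i, $elem ) = @_;
        print $elem->attr('href') // '', "\n";
    } );

    $cv->send;
};

$cv->recv;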

I have another scraper up my sleeve and I'll write a separate post about that.

Sawyer X: My January's Pull Request Challenge (part 2)

In the previous post I discussed my January Pull Request Challenge contribution. It's only part 1 because there's another part: the contributions others made during their PRC which were related to projects I'm in charge of.

I selfishly apply the "My January's Pull Request Challenge" label on changes others made to code I'm responsible for because it's part of my experience too, and it's this experience I want to share.

David Pottage received Module::Starter, which is still under my wing. Module::Starter has accumulated many issues (both on RT and Github) and needed more care and love than I had time for.

David contacted me asking for suggestions for topics to work on. He looked into several issues and decided to dive in. The first issue he fixed was the tracker meta data. Module::Starter was on Github but many tickets were opened in RT because the repository meta data was incorrect. David remedied it and prompted me to move the existing tickets to Github. Done.

Then David followed up with no less than six additional pull requests, solving issues and adding features. Wow. The work David has done for Module::Starter in a single month has been so great it left me in awe.

The Pull Request Challenge is not just about socializing, communicating, and providing an opportunity for developers to improve and contribute; it is also about improving the state of important projects that need it. By the same token, the PRC is not just about improving the state of important projects that need it; it is also about socializing, communicating, and providing an opportunity for developers to improve and contribute.

Thank you, David, and thank you, Neil, who started this crazy adventure! :)

Sawyer X: My January's Pull Request Challenge (part 1)

While I was focused on the social aspects of the PR Challenge, such as the IRC channel (opping literally everyone), the guides (wrote several), the repo (plus organization), and lately even a small parser for the web page Neil created (which will appear in another post), I still had my own responsibilities - mainly, my own PR challenge, and taking care of others' PR challenge contributions that fell under my purview.

I received Maroš Kollár's MooseX::App.

I looked at several angles: docs, bugs, major improvements (speed, practices, etc.), and big blinking "what the hell is this?" lights. Unfortunately none were available. The project was maintained well, had a really decent code-base, good practices, and clear code - it was terrible. :)

I emailed Maroš to ask how I could help. He emailed me back and we talked about a few options. I ended up providing the following very simple pull request.

I wouldn't say I made a major contribution to MooseX::App. I didn't. Some things are now immutable (a quality I'm close to tattooing on my body - and there's a joke there on read-write mode) and coderefs are used for values. This isn't something that affected performance, no bugs were fixed, and no features were added. The code is simply more consistent and can now avoid accounting for more variables changing in run-time. Meh.

I had planned to do more. Maroš and I discussed a few more ideas and he updated the TODO file to reflect them. Unfortunately I had no time to do so. I had more and more projects coming in and January turned out to be far from the care-free month I had hoped it would be.

However, I did gain a new insight into meta programming with Moose - a topic I never had the need to delve into deeply. Reading other people's code (especially good code) teaches you a lot. I also had the pleasure of interacting with another CPAN author and getting that "we're all in this together" feeling that I enjoy so much on CPAN.

Even though my time was scarce, I enjoyed providing the ever-so-little help I did, and I look forward to my February challenge, which I hope to contribute more to.

My next post will be about work others have contributed to me during their January PR Challenge.

Hacking Thy Fearful Symmetry: Crafting Cheatsheets

I am balefully under-leveraging the tools at my disposal.

Not that I think I am alone in my plight, mind you. Technology barrels down its virtual highway so fast we barely have the time to register any newcomer before it screams past us and dopplers itself off our field of vision. That, or it's a leviathan of such proportions that only grasping its general shape rightfully requires diligent years of study.

Documentation is, of course, the main instrument that we have to break those mind-mustangs. One of my personal traditions is to acquire one core dead-tree reference for any major technology I dabble in. But while that helps with the big picture, it's less effective for the day-to-day use of new tools -- minor tools, like plugins and addons that are too ephemeral or infinitesimal to appear in pulp-enshrouded tomes. For those slender software slivers, it usually pays off to bolt down a few notes and form a cheatsheet that one can refer to quickly and easily.

Sounds simple, right? But, in my case, things get complicated by two things: (a) my penmanship is encryption-grade terrible and (b) I'm not a man who leaves good enough alone. So I went on a quest to set up, if not the most reasonable, at least the snazziest cheatsheet system possible.

First challenge: cheatsheets must be easy to jot down. For that, no fancy markup, just something wikiish or markdownese. Bonus points if it lives within the editor I already use most of the time.

Happily enough, such a beast exists: vimwiki, a vim-based personal wiki. Its native syntax is tame enough (and it supports Markdown too), and it has really nice support for tables. For those not convinced, here's exhibit A: my cheatsheet for Impaired.

But I said I wanted to have those cheatsheets near at hand, and for once I was not being metaphorical -- I want to print those cheatsheets. As luck would have it, vimwiki does generate HTML output of its pages. By default the pages are fairly bland, but it's nothing that a little styling, and a little mangling, can't improve.

To get from the HTML document to a printer-ready PDF, we have several options. We could go via any browser, or use PhantomJS, but I elected to go with Prince; it's free for personal use, and is very, very good at what it does.

And just like that we have our PDFs. Are we done? Not quite. Just to be difficult, I also want to print those cheatsheets as booklets. I could write a small script to juggle and rearrange the pages of the PDF documents to achieve that, but it happens that there's a tool that already does that.

One job sent to the printer later, and the first of many cheatsheets adorns my desk.

impaired cheatsheet

Gabor Szabo: 2014 in Numbers

Before really planning for 2015, it is worth looking back at 2014.

For the full article visit 2014 in Numbers

:: Luca Ferrari ::: Perl, printf and qw to rescue!

When dealing with fixed/padded strings, nothing is better, in my opinion, than the printf family of functions. However, printf has a couple of problems when trying to format complex data, especially when compared to pack(). The first problem is that the formatting string can end up very hard to read; for instance, consider the following one: qq(%-4s%1s%09d%1s%-50s%-50s%1s%08d%-4s%-16s%-100s)
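
The title hints at the remedy; presumably something along these lines (my own sketch, not necessarily the author's exact code), where the format is assembled from a qw() list so each field stays readable:

use strict;
use warnings;

# Build the same format string from readable pieces; qw() splits on
# whitespace, so joining with '' reproduces the original string exactly.
my $format = join '', qw(
    %-4s %1s %09d %1s %-50s
    %-50s %1s %08d %-4s %-16s %-100s
);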

Perl Foundation News: Maintaing Perl 5: Grant Report for November 2014

Tony Cook writes:

Approximately 38 tickets were reviewed, and 10 patches were applied.

[perl #121337] is interesting in the number of things that were wrong with the first test in t/op/utf8cache.t.

This was reported by Daniel Dragan where op/utf8cache.t was producing an error message, but was still passing:

  1..15
  '-' is not recognized as an internal or external command, operable program or batch file.
  ok 1
  ok 2 - quadratic pos
  ...

This turned out to be a problem from when the test was originally written in 2009 - it uses C< open CHILD, '-|' > which doesn't work on Win32.

It was further muddled when the code was converted to use test.pl instead of manual TAP output in 2011. Previously the code would attempt to load Devel::Peek, but the modified code only checked $Config{extensions}, so the module wasn't available to produce the test output.

Finally, the test included no description, and the test performed was obscure enough that it was difficult to tell what it was testing for. The original commit message is fairly obscure.

(and post-finally, I missed committing a change to use and allow a skip count for skip_without_dynamic_extension(), covered later by Jarkko.)

I spent a lot of time on [perl #108276] this month, which was originally reported by Daniel Dragan in 2012 as a deep recursion bug in Perl_scalarvoid().

Others followed up with attempted reproducers which revealed different deep recursion bugs, including a suggested fix, with the last activity in July 2014.

I spent some time cleaning up the suggested fix and extending it to two of the other reported recursion issues.

In June this year, Dave Mitchell added -DPERL_OP_PARENT as a build option for perl, adding a mechanism for finding the parent of a given op. Dave suggested after I posted my patches that once that's the default, most of our recursive tree-walking would become trivially fixable.

I made a first attempt to fix this, but unfortunately found some more recent changes had broken -DPERL_OP_PARENT builds, which I tracked down and fixed.

I then produced a patch to use the new PERL_OP_PARENT links for the simplest of the recursion cases, finalize_op().

I also spent some time trying to apply this to op_free(), but haven't been able to produce a working build yet.

Hours   Activity
22.14   #108276 review, re-test and apply to blead, re-work whitespace patch
        #108276 testing, apply to blead
        #108276 testing of various reproducers
        #108276 produce a S_finalize_op() patch, comment
        #108276 review comment, try PERL_OP_PARENT, comment, track down PERL_OP_PARENT assertion failure
        #108276 debug and produce fix for PERL_OP_PARENT issues
        #108276 apply to blead, try non-stack traversal again, debugging, hair tearing
        #108276 track down re/pat.t crash (can't handle reduced stack), post traversal patch
        #108276 consider a more general iteration mechanism and comment
        #108276 op_free() and PERL_OP_PARENT
        #108276 debugging
        #108276 more debugging
5.86    #119439 podcheck updates
        #119439 try to extract_pod() with Pod::Simple, but fallback to a simple extractor
        #119439 updates to podcheck.t, Pod-Checker
0.28    #120487 re-test and apply to blead
2.29    #121337 produce a fix, testing, comment - utf8cache.t broken
        #121337 re-test and apply to blead
0.80    #122002 re-test and push to blead - -Dmksymlinks
0.23    #122823 re-test and apply to blead - make clean
0.93    #123063 review ticket, produce a patch and comment
0.33    #123089 comment and close - svpeek.t on android
0.87    #123096 review patch and referenced thread, apply to blead - IO::Socket::connected documentation
0.68    #123105 merge the mess, try to make a patch work, comment - op-type patchset
        #123105 review discussion
1.57    #123119 review and comment
        #123119 comment
0.30    #123123 review, comment about missing patch - nmake and DynaLoader
0.58    #123124 review, test and apply to blead - constants in make_ext.pl
0.18    #123130 review, test and apply to blead
0.27    #123131 review and consider a related bug
1.00    #123135 reproduce and start bisect - $^N utf8 bug
        #123135 produce a patch and comment
1.85    #123136 review
        #123136 review code, make some changes (not applied to blead)
1.30    #123145 reproduce, code reading - gv_const_sv
        #123145 test some code and fail
1.60    #123158 research and comment
1.02    #123163 reproduce cygwin build issue
        #123163 prepare a fix after discussion with khw
        #123163 re-test and push to blead
0.90    #123166 review, testing, apply to blead
0.73    #123172 testing and comment
0.93    #123188 review
        #123188 review ticket, research
0.50    #123199 review ticket and code
0.95    #123211 review and testing
        #123211 review test results and comment
0.81    #123253 review patch and comment - SvIV_please() optimization
        #123253 testing, apply to blead
0.22    #123266 comment - glibc bug no longer present
2.26    #123278 review, testing, comment - ParseXS and RETVALSV
        #123278 apply to blead
0.87    #123292 review ticket and code
3.30    #123297 try some fixes, testing, IRC discussion
        #123297 review Jarkko's change, comment
        #123297 reply Jarkko's email
5.62    #45331 research, some code
        #45331 more research, testing - win32 stat on \\?\UNC
        #45331 make it work in a basic way, need non-DST to test
1.67    analyze Itanium thread failures, email to Tux
0.80    help khw with win32 build issues
0.40    look into HP-UX failures - give up, too slow
0.43    reply bulk88's ParseXS mail
2.97    review bisect results and try to debug, discuss with jhi, push a fix
2.18    track down fork test crash on 64-bit Win32, bisect.pl for win32
1.18    try to reproduce netbsd test failures and start bisect

70.80 hours total

PAL-Blog: How do I tell my child...

Zoe wished for a baby more than a year ago, right when her best friend found out that she was getting one. Still, we thought about how to break it to her as gently as possible that she will soon have to share Mum and Dad with someone (else) who needs more attention than she does. In the end I took the matter into my own hands and gave her a little riddle...

manchicken here... » perl: 2015 Pull Request Challenge: A Successful January!

So I just got word that the maintainer of App::CLI::Extension merged my changes!

I’m very excited to have participated in January and I’m looking forward to seeing what my February assignment will be.

Laufeyjarson writes... » Perl: PBP: 077 Multipart Selections

The PBP says “Avoid cascading an if.”  At first glance, this sounds like a pretty harsh suggestion.  As it turns out, I rarely ever have any issue following this suggestion.  Well organized code seems not to need long if/elsif/elsif/elsif/elsif/else blocks.

Mr. Conway gives two reasons to avoid this construct: performance and readability.  I find both of them plausible arguments, but I don't think the problem is serious enough to need preventing.  In my experience, good code doesn't run into it.  I forget Perl even has a specific elsif clause most of the time.

The next couple of entries cover common ways to avoid them.
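One common alternative (just a minimal sketch of my own, not an example from the book; the action names are made up) is a dispatch table, where each case maps directly to the code that handles it:

use strict;
use warnings;

# Hypothetical dispatch table replacing an if/elsif/.../else cascade:
# each recognised action name maps straight to its handler.
my %handler = (
    start  => sub { print "starting\n" },
    stop   => sub { print "stopping\n" },
    status => sub { print "checking status\n" },
);

my $action = shift(@ARGV) // 'status';
my $code   = $handler{$action}
    or die "Unknown action '$action'\n";
$code->();

Looking up the hash and calling the code reference scales to any number of cases without growing an elsif chain.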

 

brian d foy: CPAN Cleaning Day 2457044: Compiler::Lexer

In my quest to clean up my CPAN distributions and to normalize them, I've been working on CPAN::Critic, my unreleased tool that looks at my directory and complains about things I don't like. That's going nicely for the most part, but I ran into a small problem that's taken me down a bit of a rabbit hole of C++. All this in the same week I wrote a Python program.

I want to check that the minimum version of Perl the code requires is the same as the MIN_PERL_VERSION. We have two modules to do that, the PPI-based Perl::MinimumVersion and the Compiler::Lexer-based Perl::MinimumVersion::Fast.

However, I'm in love with postfix dereferencing, a v5.20 feature. PPI reports 5.004 and Compiler::Lexer thinks 5.008, because neither handles the latest syntax enhancements:

#!/Users/brian/bin/perls/perl5.20.0
use v5.20;

use Perl::MinimumVersion;
use Perl::MinimumVersion::Fast;

my $code = '$array_ref->@*';

my $version = Perl::MinimumVersion->new( \$code )->minimum_version;
say "min version is $version"; # 5.004

my $fast = Perl::MinimumVersion::Fast->new( \$code )->minimum_version;
say "min version is $fast"; # 5.008

I started looking at Compiler::Lexer to see what I would need to do to add postfix dereferencing. If someone wants to take this on as a grant proposal for TPF, that would work best for me. Aside from that, I dusted off one of my C++ books (or, more correctly, found the moving box it never left) and got to work. Instead of fooling with the XS stuff and the Perl interface, I decided to write a short C++ program to dump the tokens.

// Note: the #include targets in the original post were stripped by the HTML
// rendering; lexer.hpp (Compiler::Lexer's C++ header, which declares Lexer,
// Token and Tokens) and cstdio (for printf) are my best guesses.
#include "lexer.hpp"
#include <cstdio>

typedef Lexer * Compiler_Lexer;
int show_token( Token *token );

int main() {
    const char *filename = "foo.pl";
    bool verbose = false;
    Lexer *lexer = new Lexer(filename, verbose);

    // tokenize the script and walk the resulting token list
    const char *script = "$array->@*";
    Tokens *tokens = lexer->tokenize( (char *)script );

    size_t size = tokens->size();
    for (size_t i = 0; i < size; i++) {
        Token *token = tokens->at(i);
        show_token( token );
    }

    return 0;
}

// dump the interesting fields of a single token
int show_token( Token *token ) {
    printf(
        "------------\nStype: %d\nType: %d\nKind: %d\nName: %s\nData: %s\n",
        token->stype,
        token->info.type,
        token->info.kind,
        token->info.name,
        token->_data
    );
    return 1;
}

That took me much longer than I'd like to admit. I wasted most of that time just getting a C++ environment working. I haven't touched the language in over a decade. I kept typing Perl's my in front of variable names, leaving off parens, and ending lists with commas.

That's where I am now as I dig into the code to see how it parses and guesses what is going on. After it sees the -> token, I have to tell it that a following sigil starts a dereference, and then parse all the stuff, including possible slices, that come next.

Perlgeek.de : A new Perl 6 community server - update

In my previous post I announced my plans for a new Perl 6 community server (successor to feather.perl6.nl), and now I'd like to share some updates.

Thanks to the generosity of the Perl 6 community, the server has been ordered and paid for. I am now in the process of contacting those donors who haven't paid yet, leaving them the choice of re-purposing their pledge towards ongoing costs (traffic, public IPv4 addresses, domain(s), SSL certs if necessary) and maintenance, or withdrawing their pledges.

Some details of the hardware we'll get:

  • CPU: Intel® Xeon® Haswell-EP Series Processor E5-2620 v3, 2.40 GHz, 6-Core Socket 2011-3, 15MB Cache
  • RAM: 4x8GB DDR4 PC2133 Reg. ECC 2R
  • HD: 2x 2TB SATA3-HD

The vendor has told me that all parts have arrived, and will be assembled today or tomorrow.

Currently I lean towards using KVM to create three virtual hosts: one for websites (*.perl6.org, perlcabal.syn), one for general hacking and IRC activity, and one for high-risk stuff (evalbots, try.rakudo.org, ...).

I've secured the domain p6c.org (for "perl 6 community"), and the IPv4 range 213.95.82.52 - 213.95.82.62 and the IPv6 net 2001:780:101:ff00::/64.

So the infrastructure is in place, now I'm waiting for the delivery of the hardware.

Perlgeek.de : CPAN Pull Request Challenge: A call to the CPAN authors

The 2015 CPAN Pull Request Challenge is ramping up, and so far nearly two hundred volunteers have signed up, pledging to make one pull request for a CPAN distribution for each month of the year.

So here's a call to all the CPAN authors: please be supportive, and if you don't want your CPAN distributions to be part of the challenge, please send an email to neil at bowers dot com, stating your PAUSE ID and the fact that you want to be excluded.

How to be supportive? The first step is to act on pull requests. If you don't have time for a review, please say so; getting some response, even if it's "it'll be some time 'till I get around to reviewing this" is much better than none.

The volunteers have varied backgrounds; some are seasoned veterans, others are beginners who will make their first contribution to Open Source. So please be patient and encouraging.

If you have specific requirements for contributions, add a file called CONTRIBUTING or CONTRIBUTING.md to your github repositories where you state those requirements.

And of course, be civil. But that goes without saying, right? :-)

(And to those CPAN authors who haven't done it yet: put your distributions on github, so that you're not left out of the fun!)

Happy New Year everybody, and have a great deal of fun!

See also: Resources for the CPAN Pull Request Challenge.

Dave's Free Press: Journal: Devel::CheckLib can now check libraries' contents

Perlgeek.de : Rakudo's Abstract Syntax Tree

After or while a compiler parses a program, the compiler usually translates the source code into a tree format called Abstract Syntax Tree, or AST for short.

The optimizer works on this program representation, and then the code generation stage turns it into a format that the platform underneath it can understand. Actually I wanted to write about the optimizer, but noticed that understanding the AST is crucial to understanding the optimizer, so let's talk about the AST first.

The Rakudo Perl 6 Compiler uses an AST format called QAST. QAST nodes derive from the common superclass QAST::Node, which sets up the basic structure of all QAST classes. Each QAST node has a list of child nodes, possibly a hash map for unstructured annotations, an attribute (confusingly) named node for storing the lower-level parse tree (which is used to extract line numbers and context), and a bit of extra infrastructure.

The most important node classes are the following:

QAST::Stmts
A list of statements. Each child of the node is considered a separate statement.
QAST::Op
A single operation that usually maps to a primitive operation of the underlying platform, like adding two integers, or calling a routine.
QAST::IVal, QAST::NVal, QAST::SVal
Those hold integer, float ("numeric") and string constants respectively.
QAST::WVal
Holds a reference to a more complex object (for example a class) which is serialized separately.
QAST::Block
A list of statements that introduces a separate lexical scope.
QAST::Var
A variable
QAST::Want
A node that can evaluate to different child nodes, depending on the context it is compiled in.

To give you a bit of a feel of how those node types interact, I want to give a few examples of Perl 6 code, and what ASTs it could produce. (It turns out that Perl 6 is quite a complex language under the hood, and usually produces a more complicated AST than the obvious one; I'll ignore that for now, in order to introduce you to the basics.)

Ops and Constants

The expression 23 + 42 could, in the simplest case, produce this AST:

QAST::Op.new(
    :op('add'),
    QAST::IVal.new(:value(23)),
    QAST::IVal.new(:value(42)),
);

Here a QAST::Op encodes a primitive operation, an addition of two numbers. The :op argument specifies which operation to use. The child nodes are two constants, both of type QAST::IVal, which hold the operands of the low-level operation add.

Now the low-level add operation is not polymorphic, it always adds two floating-point values, and the result is a floating-point value again. Since the arguments are integers and not floating point values, they are automatically converted to float first. That's not the desired semantics for Perl 6; actually the operator + is implemented as a subroutine of name &infix:<+>, so the real generated code is closer to

QAST::Op.new(
    :op('call'),
    :name('&infix:<+>'),    # name of the subroutine to call
    QAST::IVal.new(:value(23)),
    QAST::IVal.new(:value(42)),
);

Variables and Blocks

Using a variable is as simple as writing QAST::Var.new(:name('name-of-the-variable')), but it must be declared first. This is done with QAST::Var.new(:name('name-of-the-variable'), :decl('var'), :scope('lexical')).

But there is a slight caveat: in Perl 6 a variable is always scoped to a block. So while you can't ordinarily mention a variable prior to its declaration, there are indirect ways to achieve that (lookup by name, and eval(), to name just two).

So in Rakudo there is a convention to create QAST::Block nodes with two QAST::Stmts children. The first holds all the declarations, and the second all the actual code. That way all the declarations always come before the rest of the code.

So my $x = 42; say $x compiles to roughly this:

QAST::Block.new(
    QAST::Stmts.new(
        QAST::Var.new(:name('$x'), :decl('var'), :scope('lexical')),
    ),
    QAST::Stmts.new(
        QAST::Op.new(
            :op('p6store'),
            QAST::Var.new(:name('$x')),
            QAST::IVal.new(:value(42)),
        ),
        QAST::Op.new(
            :op('call'),
            :name('&say'),
            QAST::Var.new(:name('$x')),
        ),
    ),
);

Polymorphism and QAST::Want

Perl 6 distinguishes between native types and reference types. Native types are closer to the machine, and their type name is always lower case in Perl 6.

Integer literals are polymorphic in that they can be either a native int or a "boxed" reference type Int.

To model this in the AST, QAST::Want nodes can contain multiple child nodes. The compile-time context decides which of those is actually used.

So the integer literal 42 actually produces not just a simple QAST::IVal node but rather this:

QAST::Want.new(
    QAST::WVal(Int.new(42)),
    'Ii',
    QAST::IVal(42),
)

(Note that Int.new(42) is just a nice notation to indicate a boxed integer object; it doesn't quite work like this in the code that translates Perl 6 source code into ASTs).

The first child of a QAST::Want node is the one used by default, if no other alternative matches. Then comes a list where the elements at odd indexes are format specifications (here Ii for integers) and the elements at even indexes are the ASTs to use in that case.

An interesting format specification is 'v' for void context, which is always chosen when the return value from the current expression isn't used at all. In Perl 6 this is used to eagerly evaluate lazy lists that are used in void context, and for several optimizations.
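To make the pairing of selectors and alternatives a bit more concrete, here is a hedged sketch (my own, not actual Rakudo output) of a QAST::Want node with both an integer and a void alternative; in void context the cheapest possible code would be selected:

QAST::Want.new(
    QAST::WVal.new(Int.new(42)),      # default: the boxed Int object
    'Ii',
    QAST::IVal.new(:value(42)),       # native integer context
    'v',
    QAST::Op.new( :op('null') ),      # void context: no value is needed at all
);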

Dave's Free Press: Journal: I Love Github

Dave's Free Press: Journal: Palm Treo call db module

Ocean of Awareness: Removing obsolete versions of Marpa from CPAN

Marpa::XS, Marpa::PP, and Marpa::HTML are obsolete versions of Marpa, which I have been keeping on CPAN for the convenience of legacy users. All new users should look only at Marpa::R2.

I plan to delete the obsolete releases from CPAN soon. Legacy users who need copies will still be able to find them on BackPAN.

I do this because their placement on CPAN makes them "attractive nuisances" -- they show up in searches and generally make it harder to find Marpa::R2, which is the version that new users should be interested in. There is also some danger that a new user could, by mistake, use the obsolete versions instead of Marpa::R2.

It's been some time since someone has reported a bug in their code, so they should be stable for legacy applications. I would usually promise to fix serious bugs that affect legacy users, but unfortunately, especially in the case of Marpa::XS, it is a promise I would have trouble keeping. Marpa::XS depends on Glib, and uses a complex build which I last performed on a machine I no longer use for development.

For this reason, a re-release to CPAN with deprecatory language is also not an option. I probably would not want to do so anyway -- the CPAN infrastructure by default pushes legacy users into upgrading, which always carries some risk. New deprecatory language would add no value for the legacy users, and they are the only audience these releases exist to serve.

Comments

Comments on this post can be made in Marpa's Google group, or on our IRC channel: #marpa at freenode.net. To learn more about Marpa, there's the official web site maintained by Ron Savage. I also have a Marpa web site.

Perlgeek.de : A new Perl 6 community server - call for funding

So far, many Perl 6 developers have used feather as a generic development server. Juerd, who has generously provided this server for us for free for many years, has announced that it will be shut down at the end of the year.

My daytime job is at a b2b IT outsourcing and hosting company called noris network, and they have agreed to sponsor the hosting/housing of a 1U 19" server in one of their state-of-the-art data centers in Nürnberg, Germany.

What's missing is the actual hardware. Some folks in the community have already agreed to participate in funding the hardware, though I have few concrete pledges.

So here is the call to action: If you want to help the Perl 6 community with a one-time donation towards a new community server, please send me an e-mail to moritz at faui2k3 dot org, specifying the amount you're willing to pledge, and whether you want to stay private as a donor. I accept money transfer by paypal and wire transfer (SWIFT). Direct hardware donations are also welcome. (Though actual money transfers will be deferred until the final decision on what hardware to buy, and thus the total amount required, has been made.)

How much do we need?

Decent, used 1U servers seem to start at about 250€, though 350€ would get us a lot more bang (mostly RAM and hard disk space). And in general, the more the merrier. (Cheaper offers exist, for example on ebay, but usually they are without hard disks, so the need for extra drives makes them more expensive in total).

With more money, even beefier hardware and/or spare parts and/or a maintenance contract and/or new hardware would be an option.

What do we need it for?

The main tasks for the server are:

  • Hosting websites like perl6.org and the synopses
  • Hosting infrastructure like the panda metadata server
  • Be available for smoke runs of the compilers, star distributions and module ecosystem.
  • Be available as a general development machine for people who don't have Linux available and/or don't have enough resources to build some Perl 6 compilers on their own machines comfortably.
  • A place for IRC sessions for community members
  • A backup location for community services like the IRC logs, the camelia IRC eval bot etc. Those resources are currently hosted elsewhere, though having another option for hosting would be very valuable.
  • A webspace for people who want to host Perl 6-related material.
  • It is explicitly not meant as a general hosting platform, nor as a mail server.

Configuration

If the hardware we get is beefy enough, I'd like to virtualize the server into two to three components. One for hosting the perl6.org and related websites that should be rather stable, and one for the rest of the system. If resources allow it, and depending on feedback I get, maybe a third virtual system for high-risk stuff like evalbot.

As operating system I'll install Debian Jessie (the current testing), simply because I'll end up maintaining the system, and it's the system I'm most familiar with.

Dave's Free Press: Journal: Graphing tool

Dave's Free Press: Journal: XML::Tiny released

Perlgeek.de : Pattern Matching and Unpacking

When talking about pattern matching in the context of Perl 6, people usually think about regex or grammars. Those are indeed very powerful tools for pattern matching, but not the only one.

Another powerful tool for pattern matching and for unpacking data structures uses signatures.

Signatures are "just" argument lists:

sub repeat(Str $s, Int $count) {
    #     ^^^^^^^^^^^^^^^^^^^^  the signature
    # $s and $count are the parameters
    return $s x $count
}

Nearly all modern programming languages have signatures, so you might say: nothing special, move along. But there are two features that make them more useful than signatures in other languages.

The first is multi dispatch, which allows you to write several routines with the same name, but with different signatures. While extremely powerful and helpful, I don't want to dwell on it here. Look at Chapter 6 of the "Using Perl 6" book for more details.

The second feature is sub-signatures. It allows you to write a signature for a single parameter.

Which sounds pretty boring at first, but for example it allows you to do declarative validation of data structures. Perl 6 has no built-in type for an array where each slot must be of a specific but different type. But you can still check for that in a sub-signature:

sub f(@array [Int, Str]) {
    say @array.join: ', ';
}
f [42, 'str'];      # 42, str
f [42, 23];         # Nominal type check failed for parameter '';
                    # expected Str but got Int instead in sub-signature
                    # of parameter @array

Here we have a parameter called @array, and it is followed by square brackets, which introduce a sub-signature for an array. When calling the function, the array is checked against the signature (Int, Str), and so if the array doesn't contain exactly one Int and one Str, in this order, a type error is thrown.

The same mechanism can be used not only for validation, but also for unpacking, which means extracting some parts of the data structure. This simply works by using variables in the inner signature:

sub head(*@ [$head, *@]) {
    $head;
}
sub tail(*@ [$, *@tail]) {
    @tail;
}
say head <a b c>;      # a
say tail <a b c>;      # b c

Here the outer parameter is anonymous (the @), though it's entirely possible to use variables for both the inner and the outer parameter.

The anonymous parameter can even be omitted, and you can write sub tail( [$, *@tail] ) directly.

Sub-signatures are not limited to arrays. For working on arbitrary objects, you surround them with parentheses instead of brackets, and use named parameters inside:

multi key-type ($ (Numeric :$key, *%)) { "Number" }
multi key-type ($ (Str     :$key, *%)) { "String" }
for (42 => 'a', 'b' => 42) -> $pair {
    say key-type $pair;
}
# Output:
# Number
# String

This works because the => constructs a Pair, which has a key and a value attribute. The named parameter :$key in the sub-signature extracts the attribute key.

You can build quite impressive things with this feature, for example red-black tree balancing based on multi dispatch and signature unpacking. (More verbose explanation of the code.) Most use cases aren't this impressive, but still it is very useful to have occasionally. Like for this small evaluator.

Dave's Free Press: Journal: YAPC::Europe 2007 report: day 2

Perlgeek.de : YAPC Europe 2013 Day 3

The second day of YAPC Europe climaxed in the river boat cruise, Kiev's version of the traditional conference dinner. It was a largish boat traveling on the Dnipro river, with food, drinks and lots of Perl folks. Not having fixed tables, and having to get up to fetch food and drinks, led to a lot of circulation, and thus to meeting many more people than at traditional dinners. I loved it.

Day 3 started with a video message from next year's YAPC Europe organizers, advertising for the upcoming conference and talking a bit about the opportunities that Sofia offers. Tempting :-).

Monitoring with Perl and Unix::Statgrab was more about the metrics that are available for monitoring, and less about doing stuff with Perl. I was a bit disappointed.

The "Future Perl Versioning" Discussion was a very civilized discussion, with solid arguments. Whether anybody changed their minds remain to be seen.

Carl Mäsak gave two great talks: one on reactive programming, and one on regular expressions. I learned quite a bit in the first one, and simply enjoyed the second one.

After the lunch (tasty again), I attended Jonathan Worthington's third talk, MoarVM: a metamodel-focused runtime for NQP and Rakudo. Again this was a great talk, based on great work done by Jonathan and others during the last 12 months or so. MoarVM is a virtual machine designed for Perl 6's needs, as we understand them now (as opposed to parrot, which was designed towards Perl 6 as it was understood around 2003 or so, which is considerably different).

How to speak manager was both amusing and offered a nice perspective on interactions between managers and programmers. Some of this advice assumed a non-tech-savvy manager, and thus didn't quite apply to my current work situation, but was still interesting.

I must confess I don't remember too much of the rest of the talks that evening. I blame five days of traveling, hackathon and conference taking their toll on me.

The third session of lightning talks was again an interesting mix, containing interesting technical tidbits, the usual "we are hiring" slogans, some touching and thoughtful moments, and finally a song by Piers Cawley. He had written the lyrics in the previous 18 hours (including sleep), to (afaict) a traditional Irish song. Standing up in front of ~300 people and singing a song that you haven't really had time to practise takes a huge amount of courage, and I admire Piers both for his courage and his great performance. I hope it was recorded, and makes its way to the public soon.

Finally the organizers spoke some closing words, and received their well-deserved share of applause.

As you might have guessed from this and the previous blog posts, I enjoyed this year's YAPC Europe very much, and found it well worth attending, and well organized. I'd like to give my heart-felt thanks to everybody who helped to make it happen, and to my employer for sending me there.

This being only my second YAPC, I can't make any far-reaching comparisons, but compared to YAPC::EU 2010 in Pisa I had an easier time making acquaintances. I cannot tell what the big difference was, but the buffet-style dinners at the pre-conference meeting and the river boat cruise certainly helped to increase the circulation and thus the number of people I talked to.

Dave's Free Press: Journal: YAPC::Europe 2007 travel plans

Perlgeek.de : A small regex optimization for NQP and Rakudo

Recently I read the course material of the Rakudo and NQP Internals Workshop, and had an idea for a small optimization for the regex engine. Yesterday night I implemented it, and I'd like to walk you through the process.

As a bit of background, the regex engine that Rakudo uses is actually implemented in NQP, and used by NQP too. The code I am about to discuss all lives in the NQP repository, but Rakudo profits from it too.

In addition one should note that the regex engine is mostly used for parsing with grammars, a process which involves nearly no scanning. Scanning is the process where the regex engine first tries to match the regex at the start of the string, and if it fails there, moves to the second character in the string, tries again etc. until it succeeds.

But regexes that users write often involve scanning, and so my idea was to speed up regexes that scan, and where the first thing in the regex is a literal. In this case it makes sense to find possible start positions with a fast string search algorithm, for example the Boyer-Moore algorithm. The virtual machine backends for NQP already implement that as the index opcode, which can be invoked as start = index haystack, needle, startpos, where the string haystack is searched for the substring needle, starting from position startpos.
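In NQP code the opcode is available as nqp::index; here is a tiny sketch of the semantics just described (the variable names and sample strings are mine):

my $haystack := 'one two one';
my $needle   := 'one';
# position of the first occurrence of $needle at or after position 1,
# or -1 if there is none
my $start := nqp::index($haystack, $needle, 1);
say($start);    # 8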

From reading the course material I knew I had to search for a regex type called scan, so that's what I did:

$ git grep --word scan
3rdparty/libtommath/bn_error.c:   /* scan the lookup table for the given message
3rdparty/libtommath/bn_mp_cnt_lsb.c:   /* scan lower digits until non-zero */
3rdparty/libtommath/bn_mp_cnt_lsb.c:   /* now scan this digit until a 1 is found
3rdparty/libtommath/bn_mp_prime_next_prime.c:                   /* scan upwards 
3rdparty/libtommath/changes.txt:       -- Started the Depends framework, wrote d
src/QRegex/P5Regex/Actions.nqp:                     QAST::Regex.new( :rxtype<sca
src/QRegex/P6Regex/Actions.nqp:                     QAST::Regex.new( :rxtype<sca
src/vm/jvm/QAST/Compiler.nqp:    method scan($node) {
src/vm/moar/QAST/QASTRegexCompilerMAST.nqp:    method scan($node) {
Binary file src/vm/moar/stage0/NQPP6QRegexMoar.moarvm matches
Binary file src/vm/moar/stage0/QASTMoar.moarvm matches
src/vm/parrot/QAST/Compiler.nqp:    method scan($node) {
src/vm/parrot/stage0/P6QRegex-s0.pir:    $P5025 = $P5024."new"("scan" :named("rx
src/vm/parrot/stage0/QAST-s0.pir:.sub "scan" :subid("cuid_135_1381944260.6802") 
src/vm/parrot/stage0/QAST-s0.pir:    push $P5004, "scan"

The binary files and .pir files are generated code included just for bootstrapping, and not interesting for us. The files in 3rdparty/libtommath are there for bigint handling, thus not interesting for us either. The rest are good matches: src/QRegex/P6Regex/Actions.nqp is responsible for compiling Perl 6 regexes to an abstract syntax tree (AST), and src/vm/parrot/QAST/Compiler.nqp compiles that AST down to PIR, the assembly language that the Parrot Virtual Machine understands.

So, looking at src/QRegex/P6Regex/Actions.nqp the place that mentions scan looked like this:

    $block<orig_qast> := $qast;
    $qast := QAST::Regex.new( :rxtype<concat>,
                 QAST::Regex.new( :rxtype<scan> ),
                 $qast,
                 ($anon
                      ?? QAST::Regex.new( :rxtype<pass> )
                      !! (nqp::substr(%*RX<name>, 0, 12) ne '!!LATENAME!!'
                            ?? QAST::Regex.new( :rxtype<pass>, :name(%*RX<name>) )
                            !! QAST::Regex.new( :rxtype<pass>,
                                   QAST::Var.new(
                                       :name(nqp::substr(%*RX<name>, 12)),
                                       :scope('lexical')
                                   ) 
                               )
                          )));

So to make the regex scan, the AST (in $qast) is wrapped in QAST::Regex.new(:rxtype<concat>,QAST::Regex.new( :rxtype<scan> ), $qast, ...), plus some stuff I don't care about.

To make the optimization work, the scan node needs to know what to scan for, if the first thing in the regex is indeed a constant string, aka literal. If it is, $qast is either directly of rxtype literal, or a concat node where the first child is a literal. As a patch, it looks like this:

--- a/src/QRegex/P6Regex/Actions.nqp
+++ b/src/QRegex/P6Regex/Actions.nqp
@@ -667,9 +667,21 @@ class QRegex::P6Regex::Actions is HLL::Actions {
     self.store_regex_nfa($code_obj, $block, QRegex::NFA.new.addnode($qast))
     self.alt_nfas($code_obj, $block, $qast);
 
+    my $scan := QAST::Regex.new( :rxtype<scan> );
+    {
+        my $q := $qast;
+        if $q.rxtype eq 'concat' && $q[0] {
+            $q := $q[0]
+        }
+        if $q.rxtype eq 'literal' {
+            nqp::push($scan, $q[0]);
+            $scan.subtype($q.subtype);
+        }
+    }
+
     $block<orig_qast> := $qast;
     $qast := QAST::Regex.new( :rxtype<concat>,
-                 QAST::Regex.new( :rxtype<scan> ),
+                 $scan,
                  $qast,

Since scan nodes have always been empty so far, the code generators don't look at their child nodes, and adding one with nqp::push($scan, $q[0]); won't break anything on backends that don't support this optimization yet (which after just this patch were all of them). Running make test confirmed that.

My original patch did not contain the line $scan.subtype($q.subtype);, and later on some unit tests started to fail, because regex matches can be case insensitive, but the index op works only case sensitive. For case insensitive matches, the $q.subtype of the literal regex node would be ignorecase, so that information needs to be carried on to the code generation backend.

Once that part was in place, and some debug nqp::say() statements confirmed that it indeed worked, it was time to look at the code generation. For the parrot backend, it looked like this:

    method scan($node) {
        my $ops := self.post_new('Ops', :result(%*REG<cur>));
        my $prefix := self.unique('rxscan');
        my $looplabel := self.post_new('Label', :name($prefix ~ '_loop'));
        my $scanlabel := self.post_new('Label', :name($prefix ~ '_scan'));
        my $donelabel := self.post_new('Label', :name($prefix ~ '_done'));
        $ops.push_pirop('repr_get_attr_int', '$I11', 'self', %*REG<curclass>, '"$!from"');
        $ops.push_pirop('ne', '$I11', -1, $donelabel);
        $ops.push_pirop('goto', $scanlabel);
        $ops.push($looplabel);
        $ops.push_pirop('inc', %*REG<pos>);
        $ops.push_pirop('gt', %*REG<pos>, %*REG<eos>, %*REG<fail>);
        $ops.push_pirop('repr_bind_attr_int', %*REG<cur>, %*REG<curclass>, '"$!from"', %*REG<pos>);
        $ops.push($scanlabel);
        self.regex_mark($ops, $looplabel, %*REG<pos>, 0);
        $ops.push($donelabel);
        $ops;
    }

While a bit intimidating at first, staring at it for a while quickly made clear what kind of code it emits. First three labels are generated, to which the code can jump with goto $label: One as a jump target for the loop that increments the cursor position ($looplabel), one for doing the regex match at that position ($scanlabel), and $donelabel for jumping to when the whole thing has finished.

Inside the loop there is an increment (inc) of the register that holds the current position (%*REG<pos>), that position is compared to the end-of-string position (%*REG<eos>), and if it is larger, the cursor is marked as failed.

So the idea is to advance the position by one, and then instead of doing the regex match immediately, call the index op to find the next position where the regex might succeed:

--- a/src/vm/parrot/QAST/Compiler.nqp
+++ b/src/vm/parrot/QAST/Compiler.nqp
@@ -1564,7 +1564,13 @@ class QAST::Compiler is HLL::Compiler {
         $ops.push_pirop('goto', $scanlabel);
         $ops.push($looplabel);
         $ops.push_pirop('inc', %*REG<pos>);
-        $ops.push_pirop('gt', %*REG<pos>, %*REG<eos>, %*REG<fail>);
+        if nqp::elems($node.list) && $node.subtype ne 'ignorecase' {
+            $ops.push_pirop('index', %*REG<pos>, %*REG<tgt>, self.rxescape($node[0]), %*REG<pos>);
+            $ops.push_pirop('eq', %*REG<pos>, -1, %*REG<fail>);
+        }
+        else {
+            $ops.push_pirop('gt', %*REG<pos>, %*REG<eos>, %*REG<fail>);
+        }
         $ops.push_pirop('repr_bind_attr_int', %*REG<cur>, %*REG<curclass>, '"$!from"', %*REG<pos>);
         $ops.push($scanlabel);
         self.regex_mark($ops, $looplabel, %*REG<pos>, 0);

The index op returns -1 on failure, so the condition for a cursor fail is slightly different than before.

And as mentioned earlier, the optimization can only be safely done for matches that don't ignore case. Maybe with some additional effort that could be remedied, but it's not as simple as case-folding the target string, because some case folding operations can change the string length (for example ß becomes SS while uppercasing).

After successfully testing the patch, I came up with a small, artificial benchmark designed to show a difference in performance for this particular case. And indeed, it sped it up from 647 ± 28 µs to 161 ± 18 µs, which is roughly a factor of four.

You can see the whole thing as two commits on github.

What remains to do is implementing the same optimization on the JVM and MoarVM backends, and of course other optimizations. For example the Perl 5 regex engine keeps track of minimal and maximal string lengths for each subregex, and can anchor a regex like /a?b?longliteral/ to 0..2 characters before a match of longliteral, and generally use that meta information to fail faster.

But for now I am mostly encouraged that doing a worthwhile optimization was possible in a single evening without any black magic, or too intimate knowledge of the code generation.

Update: the code generation for MoarVM now also uses the index op. The logic is the same as for the parrot backend, the only difference is that the literal needs to be loaded into a register (whose name fresh_s returns) before index_s can use it.

Perlgeek.de : Quo Vadis Perl?

The last two days we had a gathering in the town named Perl (yes, a place with that name exists). It's a lovely little town close to the borders with France and Luxembourg, and our meeting was titled "Perl Reunification Summit".

Sadly I only managed to arrive in Perl on Friday late in the night, so I missed the first day. Still it was totally worth it.

We tried to answer the question of how to make the Perl 5 and the Perl 6 community converge on a social level. While we haven't found the one true answer to that, we did find that discussing the future together, both on a technical and on a social level, already brought us closer together.

It was quite a touching moment when Merijn "Tux" Brand explained that he had been skeptical of Perl 6 before the summit, and now sees it as the future.

We also concluded that copying API design is a good way to converge on a technical level. For example Perl 6's IO subsystem is in desperate need of a cohesive design. However, neither the Perl 6 specification nor the Rakudo development team has much experience in that area, and copying from successful Perl 5 modules is a viable approach here. Path::Class and IO::All (excluding the crazy parts) were mentioned as targets worth looking at.

There is now also an IRC channel to continue our discussions -- join #p6p5 on irc.perl.org if you are interested.

We also discussed ways to bring parallel programming to both perls. I missed most of the discussion, but did hear that one approach is to make it easier to send serialized objects to other processes, and thus distribute work among several cores.

Patrick Michaud gave a short ad-hoc presentation on implicit parallelism in Perl 6. There are several constructs where the language allows parallel execution, for example for Hyper operators, junctions and feeds (think of feeds as UNIX pipes, but ones that allow passing of objects and not just strings). Rakudo doesn't implement any of them in parallel right now, because the Parrot Virtual Machine does not provide the necessary primitives yet.

Besides the "official" program, everybody used the time in meat space to discuss their favorite projects with everybody else. For example I took some time to discuss the future of doc.perl6.org with Patrick and Gabor Szabgab, and the relation to perl6maven with the latter. The Rakudo team (which was nearly completely present) also discussed several topics, and I was happy to talk about the relation between Rakudo and Parrot with Reini Urban.

Prior to the summit my expectations were quite vague. That's why it's hard for me to tell if we achieved what we and the organizers wanted. Time will tell, and we want to summarize the result in six to nine months. But I am certain that many participants have changed some of their views in positive ways, and left the summit with a warm, fuzzy feeling.

I am very grateful to have been invited to such a meeting, and enjoyed it greatly. Our host and organizers, Liz and Wendy, took care of all of our needs -- travel, food, drinks, space, wifi, accommodation, more food, entertainment, food for thought, you name it. Thank you very much!

Update: Follow the #p6p5 hash tag on twitter if you want to read more, I'm sure other participants will blog too.

Other blog posts on this topic: PRS2012 – Perl5-Perl6 Reunification Summit by mdk and post-yapc by theorbtwo

Dave's Free Press: Journal: Wikipedia handheld proxy

Dave's Free Press: Journal: Bryar security hole

Dave's Free Press: Journal: Thankyou, Anonymous Benefactor!

Dave's Free Press: Journal: Number::Phone release

Dave's Free Press: Journal: Ill

Dave's Free Press: Journal: CPANdeps upgrade

Perlgeek.de : iPod nano 5g on linux -- works!

For Christmas I got an iPod nano (5th generation). Since I use only Linux on my home computers, I searched the Internet for how well it is supported by Linux-based tools. The results looked bleak, but they were mostly from 2009.

Now (December 2012) on my Debian/Wheezy system, it just worked.

The iPod nano 5g presents itself as an ordinary USB storage device, which you can mount without problems. However simply copying files on it won't make the iPod show those files in the play lists, because there is some meta data stored on the device that must be updated too.

There are several user-space programs that allow you to import and export music from and to the iPod, and update those meta data files as necessary. The first one I tried, gtkpod 2.1.2, worked fine.

Other user-space programs reputed to work with the iPod are rhythmbox and amarok (both of which not only organize but also play music).

Although I don't think anything really depends on some particular versions here (except that you need a new enough version of gtkpod), here is what I used:

  • Architecture: amd64
  • Linux: 3.2.0-4-amd64 #1 SMP Debian 3.2.35-2
  • Userland: Debian GNU/Linux "Wheezy" (currently "testing")
  • gtkpod: 2.1.2-1

Dave's Free Press: Journal: CPANdeps

Dave's Free Press: Journal: Module pre-requisites analyser

Dave's Free Press: Journal: Perl isn't dieing

Dave's Free Press: Journal: YAPC::Europe 2007 report: day 3

Perlgeek.de : The Fun of Running a Public Web Service, and Session Storage

One of my websites, Sudokugarden, recently surged in traffic, from about 30k visitors per month to more than 100k visitors per month. Here's the tale of what that meant for the server side.

As a bit of background, I built the website in 2007, when I knew a lot less about the web and programming. It runs on a host that I share with a few friends; I don't have root access on that machine, though when the admin is available, I can generally ask him to install stuff for me.

Most parts of the websites are built as static HTML files, with Server Side Includes. Parts of those SSIs are Perl CGI scripts. The most popular part though, which allows you to solve Sudoku in the browser and keeps hiscores, is written as a collection of Perl scripts, backed by a mysql database.

When at peak times the site had more than 10k visitors a day, lots of visitors would get a nasty mysql: Cannot connect: Too many open connections error. The admin wasn't available for bumping the connection limit, so I looked for other solutions.

My first action was to check the logs for spammers and crawlers that might have hammered the page, and I found and banned some; but the bulk of the traffic looked completely legitimate, and the problem persisted.

Looking at the seven year old code, I realized that most pages didn't actually need a database connection, if only I could remove the session storage from the database. And, in fact, I could. I used CGI::Session, which has pluggable backends. Switching to a file-based session backend was just a matter of changing the connection string and adding a directory for session storage. Luckily the code was clean enough that this only affected a single subroutine. Everything was fine.
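The change looked roughly like this (a hedged sketch, not the site's actual code; the paths are made up):

use strict;
use warnings;
use CGI;
use CGI::Session;

my $cgi = CGI->new;

# before: sessions lived in MySQL, e.g.
#   CGI::Session->new('driver:mysql', $cgi, { Handle => $dbh });
# after: file-based backend -- only the driver string and a storage
# directory change, the rest of the session handling stays the same
my $session = CGI::Session->new(
    'driver:file', $cgi,
    { Directory => '/path/to/sessions' },   # made-up path
);
print $session->header;   # re-sends the session cookie with the HTTP header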

For a while.

Then, about a month later, the host ran out of free disk space. Since it is used for other stuff too (like email, and web hosting for other users) it took me a while to make the connection to the file-based session storage. What happened was 3 million session files on an ext3 file system with a block size of 4 kilobyte. A session is only about 400 byte, but since a file uses up a multiple of the block size, the session storage amounted to 12 gigabyte of used-up disk space, which was all that was left on that machine.

Deleting those sessions turned out to be a problem; I could only log in as my own user, which doesn't have write access to the session files (which are owned by www-data, the Apache user). The solution was to upload a CGI script that deleted the session, but of course that wasn't possible at first, because the disk was full. In the end I had to delete several gigabyte of data from my home directory before I could upload anything again. (Processes running as root were still writing to reserved-to-root portions of the file system, which is why I had to delete so much data before I was able to write again).

Even when I was able to upload the deletion script, it took quite some time to actually delete the session files; mostly because the directory was too large, and deleting files on ext3 is slow. When the files were gone, the empty session directory still used up 200MB of disk space, because the directory index doesn't shrink on file deletion.

Clearly a better solution to session storage was needed. But first I investigated where all those sessions came from, and banned a few spamming IPs. I also changed the code to only create sessions when somebody logs in, not give every visitor a session from the start.

My next attempt was to write the sessions to an SQLite database. It uses about 400 bytes per session (plus a fixed overhead for the db file itself), so it uses only a tenth of the storage space that the file-based storage used. The SQLite database has no connection limit, though the old-ish version that was installed on the server doesn't seem to have very fine-grained locking either; within a few days I saw errors that the session database was locked.

So I added another layer of workaround: creating a separate session database per leading IP octet. So now there are up to 255 separate session databases (plus a 256th for all IPv6 addresses; a decision that will have to be revised when IPv6 usage rises). After a few days of operation, it seems that this setup works well enough. But suspicious as I am, I'll continue monitoring both disk usage and errors from Apache.
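For the curious, the per-octet selection only takes a few lines; this is a hedged sketch of the idea (not the production code, and the paths are made up):

use strict;
use warnings;
use CGI;
use CGI::Session;

my $cgi = CGI->new;

# choose a separate SQLite session database per leading IPv4 octet,
# with a shared fallback file for IPv6 clients
my $ip    = $ENV{REMOTE_ADDR} // '';
my $octet = $ip =~ /\A(\d+)\./ ? $1 : 'ipv6';

my $session = CGI::Session->new(
    'driver:sqlite', $cgi,
    { DataSource => "/path/to/sessions/sessions-$octet.db" },
);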

So, what happens if this solution fails to work out? I can see basically two approaches: move the site to a server that's fully under my control, and use redis or memcached for session storage; or implement sessions with signed cookies that are stored purely on the client side.

Perlgeek.de : YAPC Europe 2013 Day 2

The second day of YAPC Europe was enjoyable and informative.

I learned about ZeroMQ, which is a bit like sockets on steroids. Interesting stuff. Sadly Design decisions on p2 didn't quite qualify as interesting.

Matt's PSGI archive is a project to rewrite Matt's infamous script archive in modern Perl. Very promising, and a bit entertaining too.

Lunch was very tasty, more so than the usual mass catering. Kudos to the organizers!

After lunch, jnthn talked about concurrency, parallelism and asynchrony in Perl 6. It was a great talk, backed by great work on the compiler and runtime. Jonathan's talks are always to be recommended.

I think I didn't screw up my own talk too badly, at least the timing worked fine. I just forgot to show the last slide. No real harm done.

I also enjoyed mst's State of the Velociraptor, which was a summary of what went on in the Perl world in the last year. (Much better than the YAPC::EU 2010 talk with the same title).

The Lightning talks were as enjoyable as those from the previous day. So all fine!

Next up is the river cruise, I hope to blog about that later on.

Perlgeek.de : The REPL trick

A recent discussion on IRC prompted me to share a small but neat trick with you.

If there are things you want to do quite often in the Rakudo REPL (the interactive "Read-Evaluate-Print Loop"), it makes sense to create a shortcut for them. And creating shortcuts for often-used stuff is what programming languages excel at, so you do it right in a Perl module:

use v6;
module REPLHelper;

sub p(Mu \x) is export {
    x.^mro.map: *.^name;
}

I have placed mine in $HOME/.perl6/repl.

And then you make sure it's loaded automatically:

$ alias p6repl="perl6 -I$HOME/.perl6/repl/ -MREPLHelper"
$ p6repl
> p Int
Int Cool Any Mu
>

Now you have a neat one-letter function which tells you the parents of an object or a type, in method resolution order. And a way to add more shortcuts when you need them.

Dave's Free Press: Journal: Travelling in time: the CP2000AN

Perlgeek.de : New Perl 6 community server now live, accepting signups

The new Perl 6 community server is now alive and kicking.

As planned, I've set up KVM virtualization, and so far there are two guest systems. hack.p6c.org is meant for general Perl 6 development activity (which also includes irssi/weechat sessions), and is equipped with 20GB RAM to handle multiple concurrent rakudo-jvm compilations :-). It runs a pretty bare-bones Debian Jessie.

Update: there is now a website for the new server.

www.p6c.org is the web server where I plan to host perl6.org and related (sub-)domains. It's not as beefy as hack, but sufficiently large to compile and run Rakudo, in preparation for future Perl 6-based web hosting. Currently I'm running a copy of several perl6.org subdomains on it (with the domain name p6c instead of perl6 for test purposes); the plan is to switch the perl6.org DNS over once all of the websites have been copied/migrated.

If you have a Perl 6 related use for a shell account or for serving websites, please request an account by email (moritz.lenz@gmail.com) or IRC (moritz on freenode and magnet), including:

  1. Your desired username
  2. What you want to do on the machine(s) (not necessary for #perl6 regulars)
  3. Which of the machine(s) you need access to
  4. Optionally an openssh public key
  5. Whether you'd be willing to help a bit with sysadmin tasks (mostly apt-get update && apt-get dist-upgrade, restarting hung services, killing huge processes)
  6. Software you need installed (it's OK to not know this up-front)

Note that feather.perl6.nl will shut down soon (no fixed date yet, but "end of 2014" is expected), so if you rely on feather now, you should consider migrating to the new server.

The code of conduct is pretty simple:

  1. Be reasonable in your resource usage.
  2. Use technical means to limit your resource usage so that it doesn't accidentally explode (ulimit comes to mind).
  3. Limit yourself to legal and Perl 6-related use cases (no warez).
  4. Help your fellow hackers.

The standard disclaimer applies:

  • Expect no privacy. There will potentially be many root users, who could all read your files and memory.
  • There are no promises of continued service or even support. Your account can be terminated without notice.
  • Place of jurisdiction in Nürnberg, Germany. You have to comply with German law while using the server. (Note that this puts pretty high standards on privacy for any user data you collect, including from web applications). It's your duty to inform yourself about the applicable laws. Illegal activities will be reported to the authorities.

With all that said, happy hacking!

Dave's Free Press: Journal: YAPC::Europe 2007 report: day 1

Dave's Free Press: Journal: Thanks, Yahoo!

Ocean of Awareness: Parsing: Top-down versus bottom-up

Comparisons between top-down and bottom-up parsing are often either too high-level or too low-level. Overly high-level treatments reduce the two approaches to buzzwords, and the comparison to a recitation of received wisdom. Overly low-level treatments get immersed in the minutiae of implementation, and the resulting comparison is as revealing as placing two abstractly related code listings side by side. In this post I hope to find the middle level; to shed light on why advocates of bottom-up and top-down parsing approaches take the positions they do; and to speculate about the way forward.

Top-down parsing

The basic idea of top-down parsing is as brutally simple as anything in programming: Starting at the top, we add pieces. We do this by looking at the next token and deciding then and there where it fits into the parse tree. Once we've looked at every token, we have our parse tree.

In its purest form, this idea is too simple for practical parsing, so top-down parsing is almost always combined with lookahead. Lookahead of one token helps a lot. Longer lookaheads are very sparsely used. They just aren't that helpful, and since the number of possible lookaheads grows exponentially, they get very expensive very fast.

Top-down parsing has an issue with left recursion. It's straightforward to see why. Take an open-ended expression like

    a + b + c + d + e + f + [....]

Here the plus signs continue off to the right, and adding any of them to the parse tree requires a dedicated node which must be above the node for the first plus sign. We cannot put that first plus sign into a top-down parse tree without having first dealt with all those plus signs that follow it. For a top-down strategy, this is a big, big problem.

Even in the simplest expression, there is no way of counting the plus signs without looking to the right, quite possibly a very long way to the right. When we are not dealing with simple expressions, this rightward-looking needs to get sophisticated. There are ways of dealing with this difficulty, but all of them share one thing in common -- they are trying to make top-down parsing into something that it is not.

Advantages of top-down parsing

Top-down parsing does not look at the right context in any systematic way, and in the 1970's it was hard to believe that top-down was as good as we can do. (It's not all that easy to believe today.) But its extreme simplicity is also top-down parsing's great strength. Because a top-down parser is extremely simple, it is very easy to figure out what it is doing. And easy to figure out means easy to customize.

Take another of the many constructs incomprehensible to a top-down parser:

    2 * 3 * 4 + 5 * 6
    

How do top-down parsers typically handle this? Simple: as soon as they realize they are faced with an expression, they give up on top-down parsing and switch to a special-purpose algorithm.

These two properties -- easy to understand and easy to customize -- have catapulted top-down parsing to the top of the heap. Behind their different presentations, combinator parsing, PEG, and recursive descent are all top-down parsers.

Bottom-up parsing

Few theoreticians of the 1970's imagined that top-down parsing might be the end of the parsing story. Looking to the right in ad hoc ways clearly does help. It would be almost paradoxical if there was no systematic way to exploit the right context.

In 1965, Don Knuth found an algorithm to exploit right context. Knuth's LR algorithm was, like top-down parsing as I have described it, deterministic. Determinism was thought to be essential -- allowing more than one choice easily leads to a combinatorial explosion in the number of possibilities that have to be considered at once. When parsers are restricted to dealing with a single choice, it is much easier to guarantee that they will run in linear time.

Knuth's algorithm did not try to hang each token from a branch of a top-down parse tree as soon as it was encountered. Instead, Knuth suggested delaying that decision. Knuth's algorithm collected "subparses".

When I say "subparses" in this discussion, I mean pieces of the parse that contain all the decisions necessary to construct the part of the parse tree that is below them. But subparses do not contain any decisions about what is above them in the parse tree. Put another way, subparses know who they are, but not where they belong.

Subparses may not know where they belong, but knowing who they are is enough for them to be assembled into larger subparses. And, if we keep assembling the subparses, eventually we will have a "subparse" that is the full parse tree. And at that point we will know both who everyone is and where everyone belongs.

Knuth's algorithm stored subparses by shifting them onto a stack. The operation to do this was called a "shift". (Single tokens of the input are treated as subparses with a single node.) When there was enough context to build a larger subparse, the algorithm popped one or more subparses off the stack, assembled a larger subparse, and put the resulting subparse back on the stack. This operation was called a "reduce", based on the idea that its repeated application eventually "reduces" the parse tree to its root node.

In handling the stack, we will often be faced with choices. One kind of choice is between using what we already have on top of the stack to assemble a larger subparse; or pushing more subparses on top of the stack instead ("shift/reduce"). When we decide to reduce, we may encounter the other kind of choice -- we have to decide which rule to use ("reduce/reduce").

Like top-down parsing, bottom-up parsing is usually combined with lookahead. For the same lookahead, a bottom-up parser parses everything that a top-down parser can handle, and more.

Formally, Knuth's approach is now called shift/reduce parsing. I want to demonstrate why theoreticians, and for a long time almost everybody else as well, were so taken with this method. I'll describe how it works on some examples, including two very important ones that stump top-down parsers: arithmetic expressions and left-recursion. My purpose here is to bring to light the basic concepts, and not to guide an implementor. There are excellent implementation-oriented presentations in many other places. The Wikipedia article, for example, is excellent.

Bottom-up parsing solved the problem of left recursion. In the example from above,

    a + b + c + d + e + f + [....]

we simply build one subparse after another, as rapidly as we can. In the terminology of shift/reduce, whenever we can reduce, we do. Eventually we will have run out of tokens, and will have reduced until there is only one element on the stack. That one remaining element is the subparse that is also, in fact, our full parse tree.

The top-down parser had a problem with left recursion precisely because it needed to build top-down. To build top-down, it needed to know about all the plus signs to come, because these needed to be fitted into the parse tree above the current plus sign. But when building bottom-up, we don't need to know anything about the plus signs that will be above the current one in the parse tree. We can afford to wait until we encounter them.

But if working bottom-up solves the left recursion problem, doesn't it create a right recursion problem? In fact, for a bottom-up parser, right recursion is harder, but not much. That's because of the stack. For a right recursion like this:

    a = b = c = d = e = f = [....]

we use a strategy opposite to the one we used for the left recursion. For left recursion, we reduced whenever we could. For right recursion, when we have a choice, we always shift. This means we will immediately shift the entire input onto the stack. Once the entire input is on the stack, we have no choice but to start reducing. Eventually we will reduce the stack to a single element. At that point, we are done. Essentially, what we are doing is exactly what we did for left recursion, except that we use the stack to reverse the order.

Arithmetic expressions like

    2 * 3 * 4 + 5 * 6

require a mixed strategy. Whenever we have a shift/reduce choice, and one of the operators is on the stack, we check to see if the topmost operator is a multiply or an addition operator. If it is a multiply operator, we reduce. In all other cases, if there is a shift/reduce choice, we shift.

In the discussion above, I have pulled the strategy for making stack decisions (shift/reduce and reduce/reduce) out of thin air. Clearly, if bottom-up parsing was going to be a practical parsing algorithm, the stack decisions would have to be made algorithmically. In fact, discovering a practical way to do this was a far from trivial task. The solution in Knuth's paper was considered (and apparently intended) to be mathematically provocative, rather than practical. But by 1979, it was thought a practical way to make stack decisions had been found and yacc, a parser generator based on bottom-up parsing, was released. (Readers today may be more familiar with yacc's successor, bison.)

The fate of bottom-up parsing

With yacc, it looked as if the limitations of top-down parsing were past us. We now had a parsing algorithm that could readily and directly parse left and right recursions, as well as arithmetic expressions. Theoreticians thought they'd found the Holy Grail.

But not all of the medieval romances had happy endings. And as I've described elsewhere, this story ended badly. Bottom-up parsing was driven by tables which made the algorithm fast for correct inputs, but unable to accurately diagnose faulty ones. The subset of grammars parsed was still not quite large enough, even for conservative language designers. And bottom-up parsing was very unfriendly to custom hacks, which made every shortcoming loom large. It is much harder to work around a problem in a bottom-up parser than it is to deal with a similar shortcoming in a top-down parser. After decades of experience with bottom-up parsing, top-down parsing has re-emerged as the algorithm of choice.

Non-determinism

For many, the return to top-down parsing answers the question that we posed earlier: "Is there any systematic way to exploit right context when parsing?" So far, the answer seems to be a rather startling "No". Can this really be the end of the story?

It would be very strange if the best basic parsing algorithm we know is top-down. Above, I described at some length some very important grammars that can be parsed bottom-up but not top-down, at least not directly. Progress like this seems like a lot to walk away from, and especially to walk back all the way to what is essentially a brute force algorithm. This perhaps explains why lectures and textbooks persist in teaching bottom-up parsing to students who are very unlikely to use it. Because the verdict from practitioners seems to be in, and likely to hold up on appeal.

Fans of deterministic top-down parsing and proponents of deterministic bottom-up parsing share an assumption: For a practical algorithm to be linear, it has to be deterministic. But is this actually the case?

It's not, in fact. To keep bottom-up parsing deterministic, we restricted ourselves to a stack. But what if we track all possible subpieces of parses? For efficiency, we can link them and put them into tables, making the final decisions in a second pass, once the tables are complete. (The second pass replaces the stack-driven see-sawing back and forth of the deterministic bottom-up algorithm, so it's not an inefficiency.) Jay Earley in 1968 came up with an algorithm to do this, and in 1991 Joop Leo added a memoization to Earley's algorithm which made it linear for all deterministic grammars.

The "deterministic grammars" are exactly the bottom-up parseable grammars with lookahead -- the set of grammars parsed by Knuth's algorithm. So that means the Earley/Leo algorithm parses, in linear time, everything that a deterministic bottom-up parser can parse, and therefore every grammar that a deterministic top-down parser can parse. (In fact, the Earley/Leo algorithm is linear for a lot of ambiguous grammars as well.)

Top-down parsing had the advantage that it was easy to know where you are. The Earley/Leo algorithm has an equivalent advantage -- its tables know where it is, and it is easy to query them programmatically. In 2010, this blogger modified the Earley/Leo algorithm to have the other big advantage of top-down parsing: The Marpa algorithm rearranges the Earley/Leo parse engine so that we can stop it, perform our own logic, and restart where we left off. A quite useable parser based on the Marpa algorithm is available as open source.
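
As a concrete taste of this, here is a minimal sketch of my own (not the author's code) using the Marpa::R2 module from CPAN, which is based on the Earley/Leo approach discussed here; the grammar adverbs and method calls reflect my reading of that module's Scanless ("SLIF") interface. The point to notice is that the Sum rule is left recursive and is handed to the parser exactly as written -- no grammar rewriting is needed.

    #!/usr/bin/perl
    # A minimal sketch (mine, not the author's code) using Marpa::R2 from
    # CPAN.  The Sum rule below is left recursive and is used exactly as
    # written; grammar and method details are my reading of the module's
    # Scanless ("SLIF") interface.
    use strict;
    use warnings;
    use Marpa::R2;
    use Data::Dumper;

    my $dsl = q{
        :default ::= action => [name,values]
        lexeme default = latm => 1

        Sum      ::= Sum '+' Number | Number
        Number     ~ [\d]+

        :discard   ~ whitespace
        whitespace ~ [\s]+
    };

    my $grammar = Marpa::R2::Scanless::G->new( { source => \$dsl } );
    my $recce   = Marpa::R2::Scanless::R->new( { grammar => $grammar } );

    my $input = '1 + 2 + 3 + 4 + 5 + 6';
    $recce->read( \$input );

    # The parse value is a tree of [rule-name, child, ...] array refs,
    # built up bottom-to-top from the left-recursive Sum rule.
    print Dumper( ${ $recce->value() } );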

Comments

Comments on this post can be made in Marpa's Google group, or on our IRC channel: #marpa at freenode.net. To learn more about Marpa, there's the official web site maintained by Ron Savage. I also have a Marpa web site.

Dave's Free Press: Journal: POD includes

Dave's Free Press: Journal: cgit syntax highlighting

Perlgeek.de : First day at YAPC::Europe 2013 in Kiev

Today was the first "real" day of YAPC Europe 2013 in Kiev. In the same sense that it was the first real day, we had quite a nice "unreal" conference day yesterday, with a day-long Perl 6 hackathon, and in the evening a pre-conference meeting at a Soviet-style restaurant with tasty food and beverages.

The talks started with a few words of welcome, and then the announcement that the YAPC Europe next year will be in Sofia, Bulgaria, with the small side note that there were actually three cities competing for that honour. Congratulations to Sofia!

Larry's traditional keynote was quite emotional, and he had to fight tears a few times. Having had cancer and related surgeries in the past year, he still does his perceived duty to the Perl community, which I greatly appreciate.

Afterwards Dave Cross talked about 25 years of Perl in 25 minutes, which was a nice walk through some significant developments in the Perl world, though a bit hasty. Maybe picking fewer events and spending a bit more time on the selected few would give a smoother experience.

Another excellent talk that ran out of time was on Redis. Having experimented a wee bit with Redis in the past month, this was a real eye-opener on the wealth of features we might have used for a project at work, but in the end we didn't. Maybe we will eventually revise that decision.

Ribasushi talked about how hard benchmarking really is, and while I was (in principle) aware of the fact that it's hard to get right, there were still several significant factors that I overlooked (like the CPU's tendency to scale frequency in response to thermal and power-management considerations). I also learned that I should use Dumbbench instead of the Benchmark.pm core module. Sadly it didn't install for me (Capture::Tiny tests failing on Mac OS X).

The Perl 6 is dead, long live Perl 5 talk was much less inflammatory than the title would suggest (maybe due to Larry touching on the subject briefly during the keynote). It was mostly about how Perl 5 is used in the presenter's company, which was mildly interesting.

After a tasty free lunch I attended jnthn's talk on Rakudo on the JVM, which was (as is typical for jnthn's talks) both entertaining and taught me something, even though I had followed the project quite a bit.

Thomas Klausner's Bread::Board by example made me want to refactor the OTRS internals very badly, because it is full of the anti-patterns that Bread::Board can solve in a much better way. I think that the OTRS code base is big enough to warrant the usage of Bread::Board.

I enjoyed Denis' talk on Method::Signatures, and was delighted to see that most syntax is directly copied from Perl 6 signature syntax. Talk about Perl 6 sucking creativity out of Perl 5 development.

The conference ended with a session of lightning talks, something which I always enjoy. Many lightning talks had a slightly funny tone or undertone, while still talking about interesting stuff.

Finally there was the "kick-off party", beverages and snacks sponsored by booking.com. There (and really the whole day, and yesterday too) I not only had conversations with my "old" Perl 6 friends, but also talked with many interesting people I never met before, or only met online before.

So all in all it was a nice experience, both from the social side, and from quality and contents of the talks. Venue and food are good, and the wifi too, except when it stops working for a few minutes.

I'm looking forward to two more days of conference!

(Updated: Fixed Thomas' last name)

Ocean of Awareness: What makes a parsing algorithm successful?

What makes a parsing algorithm successful? Two factors, I think. First, does the algorithm parse a workably-defined set of grammars in linear time? Second, does it allow the application to intervene in the parse with custom code? When parsing algorithms are compared, typically neither of these gets much attention. But the successful algorithms do one or the other.

Does the algorithm parse a workably-defined set of grammars in linear time?

"Workably-defined" means more than well-defined in the mathematical sense -- the definition has to be workable. That is, the definition must be something that, with reasonable effort, a programmer can use in practice.

The algorithms in regular expression engines are workably-defined. A regular expression, in the pure sense, consists of a sequence of symbols, usually shown by concatenation:

a b c

or a choice among sequences, usually shown by a vertical bar:

a | b | c

or a repetition of any of the above, typically shown with a star:

a*

or any recursive combination of these. True, if this definition is new to you, it can take time to get used to. But vast numbers of working programmers are very much "used to it", can think in terms of regular expressions, and can determine if a particular problem will yield to treatment as a regular expression, or not.
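
For readers who think in Perl, here is a tiny illustration of my own of those three constructs, and of a recursive combination of them:

    #!/usr/bin/perl
    # The three pure regular-expression constructs, written as Perl patterns
    # (my own illustration of the definitions above).
    use strict;
    use warnings;

    print "concatenation ok\n" if 'abc'  =~ /\Aabc\z/;         # a b c
    print "alternation ok\n"   if 'b'    =~ /\A(?:a|b|c)\z/;   # a | b | c
    print "repetition ok\n"    if 'aaaa' =~ /\Aa*\z/;          # a*

    # And a recursive combination of all three:
    print "combination ok\n"   if 'abcbc' =~ /\Aa(?:b|c)*\z/;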

Parsers in the LALR family (yacc, bison, etc.) do not have a workably defined set of grammars. LALR is perfectly well-defined mathematically, but even experts in parsing theory are hard put to decide if a particular grammar is LALR.

Recursive descent also does not have a workably defined set of grammars. Recursive descent doesn't even have a precise mathematical description -- you can say that recursive descent is LL, but in practice LL tables are rarely used. Also in practice, the LL logic is extended with every other trick imaginable, up to and including switching to other parsing algorithms.

Does it allow the user to intervene in the parse?

It is not easy for users to intervene in the processing of a regular expression, though some implementations attempt to allow such efforts. LALR parsers are notoriously opaque. Those who maintain the LALR-driven Perl parser have tried to supplement its abilities with custom code, with results that will not encourage others making the same attempt.

Recursive descent, on the other hand, has no parse engine -- it is 100% custom code. You don't get much friendlier than that.
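
As a reminder of what "100% custom code" looks like, here is a minimal hand-written recursive descent parser of my own for sums of products of integers. Every rule is an ordinary subroutine, so the programmer can insert whatever custom logic is needed at any point; the price, as noted earlier in this collection, is that the repetition inside term() and expr() stands in for the left-recursive rules that a top-down parser cannot use directly.

    #!/usr/bin/perl
    # A minimal hand-written recursive descent parser (my own sketch) for
    # sums of products of integers.  Each rule is a plain subroutine, so
    # custom code can be inserted at any point in the parse.
    use strict;
    use warnings;
    use Data::Dumper;

    my @tokens;    # token stream, set by parse()
    my $pos;       # current position in @tokens

    sub peek       { $pos < @tokens ? $tokens[$pos] : undef }
    sub next_token { $tokens[ $pos++ ] }

    # Factor ::= number
    sub factor {
        my $t = next_token();
        die "expected a number\n" unless defined $t && $t =~ /\A\d+\z/;
        return $t;
    }

    # Term ::= Factor ( '*' Factor )*   -- iteration instead of left recursion
    sub term {
        my $node = factor();
        while ( defined peek() and peek() eq '*' ) {
            next_token();
            $node = [ '*', $node, factor() ];
        }
        return $node;
    }

    # Expr ::= Term ( '+' Term )*
    sub expr {
        my $node = term();
        while ( defined peek() and peek() eq '+' ) {
            next_token();
            $node = [ '+', $node, term() ];
        }
        return $node;
    }

    sub parse {
        my ($text) = @_;
        @tokens = split ' ', $text;
        $pos    = 0;
        my $tree = expr();
        die "unexpected trailing input\n" if defined peek();
        return $tree;
    }

    print Dumper( parse('2 * 3 * 4 + 5 * 6') );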

Conclusions

Regular expressions are a success, and will remain so, because the set of grammars they handle is very workably-defined. Applications using regular expressions have to take what the algorithm gives them, but what it gives them is very predictable.

For example, an application can write regular expressions on the fly, and the programmer can be confident they will run as long as they are well-formed. And it is easy to determine if the regular expression is well-formed. (Whether it actually does what you want is a separate issue.)

Recursive descent does not handle a workably-defined set of grammars, and it also has to be hand-written. But it makes up for this by allowing the user to step into the parsing process anywhere, and "get his hands dirty". Recursive descent does nothing for you, but it does allow you complete control. This is enough to make recursive descent the current algorithm of choice for major parsing projects.

As I have chronicled elsewhere, LALR was once, on highly convincing theoretical grounds, seen as the solution to the parsing problem. But while mathematically well-defined, LALR was not workably defined. And it was very hostile to applications that tried to alter, or even examine, its syntax-driven workings. After decades of trying to make it work, the profession has abandoned LALR almost totally.

What about Marpa?

Marpa has both properties: its set of grammars is workably-defined. And, while Marpa is syntax-driven like LALR and regular expressions, it also allows the user to stop the parse engine, communicate with it about the state of the parse, do her own parsing for a while, and restart the parse engine at any point she wants.

Marpa's workable definition has a nuance that the one for regular expressions does not. For regular expressions, linearity is a given -- they parse in linear time or fail. Marpa parses a much larger class of grammars, the context-free grammars -- anything that can be written in BNF. BNF is used to describe languages in standards, and is therefore itself a kind of "gold standard" for a workable definition of a set of grammars. However, Marpa does not parse everything that can be written in BNF in linear time.

Marpa's linearly-parsed set of grammars is smaller than the context-free grammars, but it is still very large, and it is still workably-defined. Marpa will parse any unambiguous language in linear time, unless it contains unmarked middle recursions. An example of a "marked" middle recursion is the language described by

S ::= a S a | x

a string of which is "aaaxaaa", where the "x" marks the middle. An example of an "unmarked" middle recursion is the language described by

S ::= a S a | a

a string of which is "aaaaaaa", where nothing marks the middle, so that you don't know until the end where the middle of the recursion is. If a human can reliably find the middle by eyeball, the middle recursion is marked. If a human can't, then the middle recursion might be unmarked.

Marpa also parses a large set of ambiguous grammars linearly, and this set of grammars is also workably-defined. Marpa parses an ambiguous grammar in linear time if

  • It has no unmarked middle recursions.
  • All right recursions are unambiguous.
  • There are no cycles. A cycle occurs, for example, if there is a rule A ::= A in the grammar.
  • Marpa's level of ambiguity at any location is bounded by a constant.

The term "level of ambiguity" requires a bit of explanation. At any given location, there can be as many rules "in play" as you like, without affecting the level of ambiguity. The key question: What is the maximum number of different origins that a rule might have? (The "origin" of a rule is the location where it begins.) That is, can a rule currently in play have at most 20 different origins? Or could it have its origin at every location so far? If the maximum number of origins is 20 or any other fixed constant, the level of ambiguity is "bounded". But if the maximum number of origins keeps growing as the length of the input grows, the level of ambiguity is unbounded.

For the unambiguous case, Marpa's workable definition encompasses a much larger class of grammars, but is no more complex than that for regular expressions. If you want to extend even further, and work with ambiguous grammars, the definition remains quite workable. Of the four restrictions needed to ensure linearity, the one requiring a bounded level of ambiguity is the only one that might force you to exercise real vigilance -- once you get into ambiguity, unboundedness is easy to slip into.

As for the other three, cycles never occur in practical grammars, and Marpa reports them, so that you simply fix them when they happen. Most recursions will be left recursions, which are unrestricted. My experience has been that, in practical grammars, unmarked middle recursions and ambiguous right recursions are not especially tempting features. If you note whenever you use a right recursion, checking that it is not ambiguous, and if you note whenever you use a middle recursion, checking that it is marked, then you will stay linear.

To learn more about Marpa, there's the official web site maintained by Ron Savage. I also have a Marpa web site.

Comments

Comments on this post can be made in Marpa's Google group, or on our IRC channel: #marpa at freenode.net.

Dave's Free Press: Journal: CPAN Testers' CPAN author FAQ

Ocean of Awareness: Parsing: a timeline

[ Revised 22 Oct 2014 ]

1960: The ALGOL 60 spec comes out. It specifies, for the first time, a block structured language. The ALGOL committee is well aware that nobody knows how to parse such a language. But they believe that, if they specify a block-structured language, a parser for it will be invented. Risky as this approach is, it pays off ...

1961: Ned Irons publishes his ALGOL parser. In fact, the Irons parser is the first parser of any kind to be described in print. Ned's algorithm is a left parser -- a form of recursive descent. Unlike modern recursive descent, the Irons algorithm is general and syntax-driven. "General" means it can parse anything written in BNF. "Syntax-driven" (aka declarative) means that parser is actually created from the BNF -- the parser does not need to be hand-written.

1961: Almost simultaneously, hand-coded approaches to left parsing appear. These we would now recognize as recursive descent. Over the following years, hand-coding approaches will become more popular for left parsers than syntax-driven algorithms. Three factors are at work:

  • In the 1960s, memory and CPU are both extremely limited. Hand-coding pays off, even when the gains are small.
  • Pure left parsing is a very weak parsing technique. Hand-coding is often necessary to overcome its limits. This is as true today as it was in 1961.
  • Left parsing works well in combination with hand-coding -- they are a very good fit.

1965: Don Knuth invents LR parsing. Knuth is primarily interested in the mathematics. Knuth describes a parsing algorithm, but it is not thought practical.

1968: Jay Earley invents the algorithm named after him. Like the Irons algorithm, Earley's algorithm is syntax-driven and fully general. Unlike the Irons algorithm, it does not backtrack. Earley's core idea is to track everything about the parse in tables. Earley's algorithm is enticing, but it has three major issues:

  • First, there is a bug in the handling of zero-length rules.
  • Second, it is quadratic for right recursions.
  • Third, the bookkeeping required to set up the tables is, by the standards of 1968 hardware, daunting.

1969: Frank DeRemer describes a new variant of Knuth's LR parsing. DeRemer's LALR algorithm requires only a stack and a state table of quite manageable size.

1972: Aho and Ullman describe a straightforward fix to the zero-length rule bug in Earley's original algorithm. Unfortunately, this fix involves adding even more bookkeeping to Earley's.

1975: Bell Labs converts its C compiler from hand-written recursive descent to DeRemer's LALR algorithm.

1977: The first "Dragon book" comes out. This soon-to-be classic textbook is nicknamed after the drawing on the front cover, in which a knight takes on a dragon. Emblazoned on the knight's lance are the letters "LALR". From here on out, to speak lightly of LALR will be to besmirch the escutcheon of parsing theory.

1979: Bell Laboratories releases Version 7 UNIX. V7 includes what is, by far, the most comprehensive, useable and easily available compiler writing toolkit yet developed. Central to the toolkit is yacc, an LALR based parser generator. With a bit of hackery, yacc parses its own input language, as well as the language of V7's main compiler, the portable C compiler. After two decades of research, it seems that the parsing problem is solved.

1987: Larry Wall introduces Perl 1. Perl embraces complexity like no previous language. Larry uses LALR very aggressively -- to my knowledge more aggressively than anyone before or since.

1991: Joop Leo discovers a way of speeding up right recursions in Earley's algorithm. Leo's algorithm is linear for just about every unambiguous grammar of practical interest, and many ambiguous ones as well. In 1991 hardware is six orders of magnitude faster than 1968 hardware, so that the issue of bookkeeping overhead has receded in importance. This is a major discovery. When it comes to speed, the game has changed in favor of the Earley algorithm. But Earley parsing is almost forgotten. It will be 20 years before anyone writes a practical implementation of Leo's algorithm.

1990's: Earley's is forgotten. So everyone in LALR-land is content, right? Wrong. Far from it, in fact. Users of LALR are making unpleasant discoveries. While LALR automatically generates their parsers, debugging them is so hard they could just as easily write the parser by hand. Once debugged, their LALR parsers are fast for correct inputs. But almost all they tell the users about incorrect inputs is that they are incorrect. In Larry's words, LALR is "fast but stupid".

2000: Larry Wall decides on a radical reimplementation of Perl -- Perl 6. Larry does not even consider using LALR again.

2002: Aycock&Horspool publish their attempt at a fast, practical Earley's parser. Missing from it is Joop Leo's improvement -- they seem not to be aware of it. Their own speedup is limited in what it achieves and the complications it introduces can be counter-productive at evaluation time. But buried in their paper is a solution to the zero-length rule bug. And this time the solution requires no additional bookkeeping.

2006: GNU announces that the GCC compiler's parser has been rewritten. For three decades, the industry's flagship C compilers have used LALR as their parser -- proof of the claim that LALR and serious parsing are equivalent. Now, GNU replaces LALR with the technology that it replaced a quarter century earlier: recursive descent.

2000 to today: With the retreat from LALR comes a collapse in the prestige of parsing theory. After a half century, we seem to be back where we started. If you took Ned Irons's original 1961 algorithm, changed the names and dates, and translated the code from the mix of assembler and ALGOL into Haskell, you could easily republish it today, and bill it as revolutionary and new.

Marpa

Over the years, I had come back to Earley's algorithm again and again. Around 2010, I realized that the original, long-abandoned vision -- an efficient, practical, general and syntax-driven parser -- was now, in fact, quite possible. The necessary pieces had fallen into place.

Aycock&Horspool had solved the zero-length rule bug. Joop Leo had found the speedup for right recursion. And the issue of bookkeeping overhead had pretty much evaporated on its own. Machine operations are now a billion times faster than in 1968, and probably no longer relevant in any case -- cache misses are now the bottleneck.

But while the original issues with Earley's disappeared, a new issue emerged. With a parsing algorithm as powerful as Earley's behind it, a syntax-driven approach can do much more than it can with a left parser. But with the experience with LALR in their collective consciousness, few modern programmers are prepared to trust a purely declarative parser. As Lincoln said, "Once a cat's been burned, he won't even sit on a cold stove."

To be accepted, Marpa needed to allow procedural parsing, not just declarative parsing. So Marpa allows the user to specify events -- occurrences of symbols and rules -- at which declarative parsing pauses. While paused, the application can call procedural logic and single-step forward token by token. The procedural logic can hand control back over to syntax-driven parsing at any point it likes. The Earley tables can provide the procedural logic with full knowledge of the state of the parse so far: all rules recognized in all possible parses so far, and all symbols expected. Earley's algorithm is now an even better companion for hand-written procedural logic than recursive descent.
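
Here is a sketch of my own of what that pause-and-resume loop can look like, using Marpa::R2's Scanless interface; the DSL adverbs and recognizer methods below are my reading of that module's documented API, not code from this post. Declarative parsing pauses before each Number lexeme; procedural code reads the lexeme itself (here, substituting twice its literal value) and then hands control back to the parse engine. The procedural section could equally well single-step through several tokens with lexeme_read() before calling resume().

    #!/usr/bin/perl
    # A hedged sketch (mine) of pause-and-resume parsing with Marpa::R2's
    # Scanless interface; method and adverb names are my reading of that
    # module's documentation.  Parsing pauses before each Number lexeme,
    # procedural code reads the lexeme itself, then resumes the engine.
    use strict;
    use warnings;
    use Marpa::R2;
    use Data::Dumper;

    my $dsl = q{
        :default ::= action => [name,values]
        lexeme default = latm => 1

        Sum      ::= Sum '+' Number | Number
        Number     ~ [\d]+
        :lexeme    ~ Number pause => before event => 'before_number'

        :discard   ~ whitespace
        whitespace ~ [\s]+
    };

    my $grammar = Marpa::R2::Scanless::G->new( { source => \$dsl } );
    my $recce   = Marpa::R2::Scanless::R->new( { grammar => $grammar } );

    my $input = '1 + 2 + 3';
    my $pos   = $recce->read( \$input );

    while ( $pos < length $input ) {
        for my $event ( @{ $recce->events() } ) {
            my ($name) = @{$event};
            next unless $name eq 'before_number';

            # Procedural logic: read the paused-at lexeme ourselves,
            # supplying our own value (twice the literal number).
            my ( $start, $length ) = $recce->pause_span();
            my $literal = substr $input, $start, $length;
            $recce->lexeme_read( 'Number', $start, $length, 2 * $literal );
        }
        $pos = $recce->resume();    # back to syntax-driven parsing
    }

    print Dumper( ${ $recce->value() } );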

For more

For more about Marpa, there's the official web site maintained by Ron Savage. I also have a Marpa web site. Comments on this post can be made in Marpa's Google group.

Perlgeek.de : Correctness in Computer Programs and Mathematical Proofs

While reading On Proof and Progress in Mathematics by Fields Medal winner Bill Thurston (recently deceased, I was sorry to hear), I came across this gem:

The standard of correctness and completeness necessary to get a computer program to work at all is a couple of orders of magnitude higher than the mathematical community’s standard of valid proofs. Nonetheless, large computer programs, even when they have been very carefully written and very carefully tested, always seem to have bugs.

I noticed that mathematicians are often sloppy about the scope of their symbols. Sometimes they use the same symbol for two different meanings, and you have to guess from context which one is meant.

This kind of sloppiness generally doesn't have an impact on the validity of the ideas that are communicated, as long as it's still understandable to the reader.

I guess one reason is that most mathematical publications still stick to one-letter symbol names, and there aren't that many letters in the alphabets that are generally accepted for usage (Latin, Greek, a few letters from Hebrew). And in the programming world we snort derisively at FORTRAN 77, which limited variable names to a length of 6 characters.

Ocean of Awareness: Reporting mismatched delimiters

In many contexts, programs need to identify non-overlapping pieces of a text. One very direct way to do this is to use a pair of delimiters. One delimiter of the pair marks the start and the other marks the end. Delimiters can take many forms: Quote marks, parentheses, curly braces, square brackets, XML tags, and HTML tags are all delimiters in this sense.

Mismatching delimiters is easy to do. Traditional parsers are often poor at reporting these errors: hopeless after the first mismatch, and for that matter none too precise about the first one. This post outlines a scaleable method for the accurate reporting of mismatched delimiters. I will illustrate the method with a simple but useable tool -- a utility which reports mismatched brackets.

The example script

The example script, bracket.pl, reports mismatched brackets in the set:

() {} []

They are expected to nest without overlaps. Other text is treated as filler. bracket.pl is not smart about things like strings or comments. This does have the advantage of making bracket.pl mostly language-agnostic.

Because it's intended primarily to be read as an illustration of the technique, bracket.pl's grammar is a basic one. The grammar that bracket.pl uses is so simple that an emulator of bracket.pl could be written using recursive descent. I hope the reader who goes on to look into the details will see that this technique scales to more complex situations, in a way that a solution based on a traditional parser will not.

Error reports

The description of how the method works will make more sense after we've looked at some examples of the diagnostics bracket.pl produces. To be truly useful, bracket.pl must report mismatches that span many lines, and it can do this. But single-line examples are easier to follow. All the examples in this post will be contained in a single line. Consider the string '((([))'. bracket.pl's diagnostics are:

* Line 1, column 1: Opening '(' never closed, problem detected at end of string
((([))
^
====================
* Line 1, column 4: Missing close ], problem detected at line 1, column 5
((([))
   ^^

In the next example bracket.pl realizes that it cannot accept the ')' at column 16, without first closing the set of curly braces started at column 5. It identifies the problem, along with both of the locations involved.

* Line 1, column 5: Missing close }, problem detected at line 1, column 16
[({({x[]x{}x()x)})]
    ^          ^

So far, so good. But an important advantage of bracket.pl has yet to be seen. Most compilers, once they report a first mismatched delimiter, produce error messages that are unreliable -- so unreliable that they are useless in practice. bracket.pl repairs a mismatched bracket before continuing, so that it can do a reasonable job of analyzing the text that follows. Consider the text '({]-[(}-[{)'. The output of bracket.pl is

* Line 1, column 1: Missing close ), problem detected at line 1, column 3
({]-[(}-[{)
^ ^
====================
* Line 1, column 2: Missing close }, problem detected at line 1, column 3
({]-[(}-[{)
 ^^
====================
* Line 1, column 3: Missing open [
({]-[(}-[{)
  ^
====================
* Line 1, column 5: Missing close ], problem detected at line 1, column 7
({]-[(}-[{)
    ^ ^
====================
* Line 1, column 6: Missing close ), problem detected at line 1, column 7
({]-[(}-[{)
     ^^
====================
* Line 1, column 7: Missing open {
({]-[(}-[{)
      ^
====================
* Line 1, column 9: Missing close ], problem detected at line 1, column 11
({]-[(}-[{)
        ^ ^
====================
* Line 1, column 10: Missing close }, problem detected at line 1, column 11
({]-[(}-[{)
         ^^
====================
* Line 1, column 11: Missing open (
({]-[(}-[{)
          ^

Each time, bracket.pl corrects itself, and accurately reports the next set of problems.

A difficult error report

To be 100% accurate, bracket.pl would have to guess the programmer's intent. This is, of course, not possible. Let's look at a text where bracket.pl's guesses are not so good: {{]}. Here we will assume the closing square bracket is a typo for a closing curly brace. Here's the result:

* Line 1, column 1: Missing close }, problem detected at line 1, column 3
{{]}
^ ^
====================
* Line 1, column 2: Missing close }, problem detected at line 1, column 3
{{]}
 ^^
====================
* Line 1, column 3: Missing open [
{{]}
  ^
====================
* Line 1, column 4: Missing open {
{{]}
   ^

Instead of one error, bracket.pl finds four.

But even in this case, the method is fairly good, especially when compared with current practice. The problem is at line 1, column 3, and the first three messages all identify this as one of their potential problem locations. It is reasonable to believe that a programmer, especially once he becomes used to this kind of mismatch reporting, will quickly find the first mismatch and fix it. For this difficult case, bracket.pl may not be much better than the state of the art, but it is certainly no worse.

How it works

For full details of the workings of bracket.pl there is the code, which is heavily commented. This section provides a conceptual overview.

bracket.pl uses two features of Marpa: left-eideticism and the Ruby Slippers. By left-eidetic, I mean that Marpa knows everything there is to know about the parse at, and to the left of, the current position. As a consequence, Marpa also knows exactly which of its input symbols can lead to a successful parse, and is able to stop as soon as it knows that the parse cannot succeed.

In the Ruby Slippers technique, we arrange for parsing to stop whenever we encounter an input which would cause parsing to fail. The application then asks Marpa, "OK. What input would allow the parse to continue?" The application takes Marpa's answer to this question, and uses it to concoct an input that Marpa will accept.

In this case, bracket.pl creates a virtual token which fixes the mismatch of brackets. Whatever the missing bracket may be, bracket.pl invents a bracket of that kind, and adds it to the virtual input. This done, parsing and error detection can proceed as if there was no problem. Of course, the error which made the Ruby Slippers token necessary is recorded, and those records are the source of the error reports we saw above.

To make its error messages as informative as possible in the case of missing closing brackets, bracket.pl needs to report the exact location of the opening bracket. Left-eideticism again comes in handy here. Once the virtual closing bracket is supplied to Marpa, bracket.pl asks, "That bracketed text that I just closed -- where did it begin?" The Marpa parser tracks the start location of all symbol and rule instances, so it is able to provide the application with the exact location of the starting bracket.

When bracket.pl encounters a problem at a point where there are unclosed opening brackets, it has two choices. It can be optimistic or it can be pessimistic. "Optimistic" means it can hope that something later in the input will close the opening bracket. "Pessimistic" means it can decide that "all bets are off" and use Ruby Slippers tokens to close all the currently active open brackets.

bracket.pl uses the pessimistic strategy. While the optimistic strategy sounds better, in practice the pessimistic one seems to provide better diagnostics. The pessimistic strategy does report some fixable problems as errors. But the optimistic one can introduce spurious fixes. These hide the real errors, and it is worse to miss errors than it is to overreport them. Even when the pessimistic strategy overreports, its first error message will always accurately identify the first problem location.
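
To show the shape of that strategy without the Marpa machinery, here is a deliberately simplified, stack-based sketch of my own -- the real bracket.pl drives a grammar with Ruby Slippers tokens instead. When a closing bracket is rejected, the open brackets are virtually closed one by one, each closure recorded as an error, until the rejected bracket either finds its match or is reported as missing its opening bracket.

    #!/usr/bin/perl
    # A deliberately simplified, stack-based sketch (mine -- the real
    # bracket.pl drives a Marpa grammar and Ruby Slippers tokens instead)
    # of the pessimistic repair strategy described above.
    use strict;
    use warnings;

    my %closer_for = ( '(' => ')', '{' => '}', '[' => ']' );
    my %opener_for = reverse %closer_for;

    sub check {
        my ($text) = @_;
        my ( @open, @errors );    # @open holds [ bracket, column ] frames
        for my $col ( 1 .. length $text ) {
            my $char = substr $text, $col - 1, 1;
            if ( exists $closer_for{$char} ) {       # an opening bracket
                push @open, [ $char, $col ];
            }
            elsif ( exists $opener_for{$char} ) {    # a closing bracket
                # Virtually close mismatched opens, recording each repair.
                while ( @open and $closer_for{ $open[-1][0] } ne $char ) {
                    my ( $bracket, $where ) = @{ pop @open };
                    push @errors,
                        "Column $where: Missing close $closer_for{$bracket}, "
                      . "problem detected at column $col";
                }
                if (@open) {
                    pop @open;    # the repairs (or none) made it match
                }
                else {
                    push @errors, "Column $col: Missing open $opener_for{$char}";
                }
            }
            # anything else is treated as filler
        }
        push @errors,
            "Column $_->[1]: Opening '$_->[0]' never closed, "
          . "problem detected at end of string"
          for reverse @open;
        return @errors;
    }

    # For '((([))' this reports the missing ']' for the bracket opened at
    # column 4 and the never-closed '(' at column 1 -- the same two
    # problems as in the first example above.
    print "$_\n" for check('((([))');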

While bracket.pl is already useable, I think of it as a prototype. Beyond that, the problem of matching delimiters is in fact very general, and I believe these techniques may have very wide application.

For more

The example script of this post is a Github gist. For more about Marpa, there's the official web site maintained by Ron Savage. I also have a Marpa web site. Comments on this post can be made in Marpa's Google group.

Dave's Free Press: Journal: YAPC::Europe 2006 report: day 3
