brian d foy: Saint Perl 6 Hack Day wrap-up

Saint Perl 6 started with its hack day instead of putting it at the end. I can summarize some of the proceedings, but some of the end-of-day reports were delivered in Russian. Someone else will have to fill in the blanks.

I mentioned my CPAN Testers from GitHub idea and some people looked into it. Miyagawa has a gitpan.pl gist that fakes out CPAN.pm with CPAN::Inject. That's interesting, but many of the tools, such as CPAN::Reporter, depend on the various CPAN clients. Miyagawa's cpanminus can install from GitHub, and there's Garu's App::cpanminus::reporter, but that fragilely depends on the output of cpanminus as it does its work. It's a simple matter of programming to fix that. From there, the group moved on to other things (which someone else will have to write about).

I mostly worked alone and at a feverish pace to make the PerlPowerTools.com website. I didn't really mean to do that, but I saw a gh-pages branch in the PerlPowerTools repo. That's from GitHub Pages, a way to host websites through a GitHub account. I must have pushed a button at some time to automatically create that.

GitHub Pages can use an orphaned branch that has no history. It's like having a separate repo in your repo. Strange but true.

I can also use my own domain name by creating a file named CNAME. Wow. So, I registered PerlPowerTools.com, set that up on Cloudflare, and created a single page website. It was almost too easy.

To go a step further, I moved some things around on the website to make room for translations. I had to adjust the GitHub pages templates to find the right files from the translation directories, but I figured that out quickly too. Once I had that set up, I tweeted that I was ready for translations. About an hour later, @shvgn had forked the repo, translated the page, and sent his first ever pull request. Now there's a Russian translation in time for my talk on it tomorrow!

brian d foy: CPAN Testers from Github?

At the pre-Saint Perl 6 dinner, we were talking about ideas for the hack day. I mentioned something David Farrell and I had talked about a couple of weeks ago.

I'd love to kick off CPAN testers from Github. Instead of uploading a dev version to PAUSE, I'd commit to some special branch, or create a special tag, or something, then get the same results I get now.

I have no idea how this would work or how the CPAN Testers would find out about the new releases. Maybe there's a central service that polls Github repos and acts as the single point for CPAN Testers to poll.

How it works depends on what the testers are willing to do and what's easiest to get going.

Ideas? Comments? How could this thing work?

No time to wait: Mixins in Perl

If you want to use mixins in Perl you don't have to install anything or play with the symbol table yourself. It's right there, in the core.
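
The original post isn't excerpted here, so purely as a hedged illustration (my sketch, not necessarily the technique the author has in mind): one core-only way to get mixin-like behaviour is to let Exporter inject plain subs into the consuming package, where they then behave as methods.

use strict;
use warnings;

# Core-only mixin sketch: Exporter can export subs into whatever package
# calls import(), and those subs then resolve as that class's methods.
package Greeter;
use Exporter 'import';
our @EXPORT_OK = qw(greet);

sub greet {
    my $self = shift;
    return "Hello, I am $self->{name}";
}

package Person;
Greeter->import('greet');    # mix greet() into Person

sub new { my ( $class, $name ) = @_; return bless { name => $name }, $class }

package main;
print Person->new('Alice')->greet, "\n";    # Hello, I am Alice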

Hacking Thy Fearful Symmetry: Fun in POD-land

I love to tinker with tools, utilities, tweaks, anything that can be used to grease the production chain into stupendously slick efficiency. So it stands to reason that I'd be drawn to documentation, its format, its processors, the tools to display and search it. Heck, I've even been known to actually read it, now and then.

Documentation, in the Perl sphere, means POD. It's a fairly simple markup format, but with just enough twists to make things... interesting.

As POD is as old as Perl, there are plenty of modules out there to parse it, and to generate output in pretty much all the usual formats. There are none, however, that do exactly what I want.

What do I want? I want simplicity. I want extensibility. I want trivially simple manipulations. And I think I want a peppermint tea. Don't move, I'll be back in 5 minutes.

Aaah, that's better. Where was I? Oh yes, wants. To be more pragmatic, there are two use cases I'm pursuing.

The first one is the transformation of POD documents into any other format. I did a first foray into that when I played with PDF documents and Pod::Manual. That project is still on the backburner, and these days I'm also eyeing exporting Perl distribution documentation as Dash/Zeal docsets.

The second one is POD extension/manipulation in the context of distribution building. "But there are Pod::Elemental and Pod::Weaver for that!", I hear you say. And you are right. But I have a confession to make:

Pod::Elemental and Pod::Weaver scare the everlasting bejeesus outta me.

Although fear can keep me away only for so long. Underneath the initial gasp, I think what dissatisfies me is that each POD solution I found on CPAN that gives me DOM-manipulating powers is a special snowflake of a parser. Meaning that I have to learn its DOM structure, its node-descending rules, all that jazz. Meh. It's all hard work. Why couldn't it be simpler? Like, couldn't it be just like a jQuery-enabled page, where getting a section could be as simple as $('head1.synopsis') and moving it elsewhere a single $some_section.insert_after($some_other_section)?

If you found yourself nodding at that last paragraph, keep reading. If, on the other hand, you felt the cold finger of dread run down your spine, you might want to go and get your comfort blanket before soldiering on...

It's all too complicated. Let's use XML!

To be honest, that's not something I was expecting to say of my own volition. But, really, this is the type of stuff XML was born to do. Considering how well-understood XML/HTML is, it kinda makes sense to use it as the core format. And with that type of document, we don't have to invent a way to move around the DOM tree -- XPath and CSS selectors are already there for that. And lookee, lookee, there's Web::Query that would give us access to CSS and XPath selectors (for XPath selectors, just wait a few days for its next release to be out), and nifty jQuery-like DOM-manipulating methods.
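
To give a taste of the sort of interface I'm pining for, here's a toy example of mine (not Pod::Knit code) using Web::Query on a plain HTML fragment:

use strict;
use warnings;
use Web::Query;

# CSS-style selection with jQuery-style chaining
my $w = wq( '<div><h1 class="synopsis">SYNOPSIS</h1><p>blah blah</p></div>' );

print $w->find('h1.synopsis')->text, "\n";    # SYNOPSIS
print $w->find('p')->text, "\n";              # blah blah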

In a nutshell, that's what I wanted to try: create a prototype of a pipeline that would slurp in some POD, let me easily muck with it, and spit it out as whatever is desired.

The prototype, Pod::Knit, exists. It's still extremely alpha, but it's already at a point where an owner's tour might be in order. Here, follow me...

POD comes in...

First thing on the agenda: read some POD and convert it to some XML document. Now, writing this part myself, considering how many POD parsers are out there, would be silly. Well, sillier than the rest of my plan, that is. So I went shopping on CPAN.

At first, I found Pod::POM. Its HTML output takes some liberties with the formatting attributes, so I wrote a quick generic XML view module.

... And only then realized the parser isn't easily extended to accept new POD elements. Dammit.

So I switched to Pod::Simple and Pod::Simple::DumpAsXML, making the POD-to-XML journey look like:

use Pod::Simple::DumpAsXML;

my $parser = Pod::Simple::DumpAsXML->new;

# capture the XML in $xml instead of printing it to STDOUT
$parser->output_string( \my $xml );

# $source_code holds the Perl source (with its embedded POD) to convert
$parser->parse_string_document( $source_code );

print $xml;


So far, so good.

Close the doors, we're altering the patient

Now the fun part comes: modifying the document.

As I want extensibility and modularity, I went for a plugin approach, where the POD document (presented as a thinly wrapped Web::Query object) would be passed through all the plugins in different stages.

And to make it real, I crafted a set of plugins that would exercise the basic manipulations I'd expect the framework to support:

---
plugins:
    # create the '=head1 NAME' section from the package/#ABSTRACT lines
    - Abstract
    # add an AUTHORS section
    - Authors:
        authors:
            - Yanick Champoux
    # grok '=method' elements
    - Methods
    # sort the POD elements in the given order
    - Sort:
        order:
            - NAME
            - SYNOPSIS
            - DESCRIPTION
            - METHODS
            - '*'
            - AUTHORS

Let's now see the different processing stages, and how those plugins implement them.

Stage 1: POD parser configuration

First, when the Pod::Simple parser is created, each plugin is given the chance to tweak it. For the moment, this is mostly to give them the opportunity to declare new POD elements. For example, the 'Methods' plugin has

package Pod::Knit::Plugin::Methods;

use Moose;

with 'Pod::Knit::Plugin';

sub setup_parser {
    my( $self, $parser ) = @_;

    $parser->accept_directive_as_processed( 'method' );
}


Stage 2: Preprocessing, aka putting those Russian Dolls together

The second stage is the "preprocessing" stage, where plugins take the raw output of Pod::Simple::DumpAsXML and groom it into the desired base structure. In most cases, that means turning the raw flat list of elements given by Pod::Simple into a structured form.

For example, the raw head elements look like

    <head1>DESCRIPTION</head1>
    <para>Blah blah</para>
    <verbatimformatted>$foo->bar</verbatimformatted>
    <para>More blah</para>
    <head1>OTHER SECTION</head1>
    ...


but what we want is

    <head1>
        <title>DESCRIPTION</title>
        <para>Blah blah</para>
        <verbatimformatted>$foo->bar</verbatimformatted>
        <para>More blah</para>
    </head1>
    ...


There's an implicit plugin, HeadsToSections, that takes care of that. And in our example, the plugin 'Methods' does the same thing for =method elements, slurping in the relevant following elements:

sub preprocess {
    my( $self, $doc ) = @_;

    $doc->find( 'method' )->each(sub{
            $_->html(
                '<title>'. $_->html . '</title>'
            );
            my $done = 0;
            my $method = $_;
            $_->find( \'./following::*' )->each(sub{
                return if $done;

                my $tagname = $_->tagname;

                return if $done = !grep { $tagname eq $_ } 
                                        qw/ para verbatimformatted /;

                $_->detach;
                $method->append($_);
            });
    });

}


Stage 3: Do your thing

Finally, the stage where we can expect the document to be in the proper format, and where the plugins can go wild.

Things can be inserted. Based just on configuration items:

package Pod::Knit::Plugin::Authors;

use Moose;

use Web::Query;

with 'Pod::Knit::Plugin';

has "authors" => (
    isa => 'ArrayRef',
    is => 'ro',
    lazy => 1,
    default => sub {
        my $self = shift;
        [];
    },
);

sub transform {
    my( $self, $doc ) = @_;

    my $section = wq( '<over-text>' );
    for ( @{ $self->authors } ) {
        $section->append(
            '<item-text>' . $_ . '</item-text>'
        );
    }

    # section() will return the existing
    # section with that title, or create
    # a new one if it doesn't exist yet
    $doc->section( 'authors' )->append(
        $section
    );
}


Or by looking at the source code or whatever Pod::Knit makes accessible to the plugins.

package Pod::Knit::Plugin::Abstract;

use Moose;

use Web::Query;

with 'Pod::Knit::Plugin';

sub transform {
    my( $self, $doc ) = @_;

    my( $package, $abstract ) =
        $self->source_code =~ /^\s*package\s+(\S+);\s*^\s*#\s*ABSTRACT:\s*(.*?)\s*$/m
            or return;

    $doc->section( 'name' )->append(
        join '',
        '<para>',
            join( ' - ', $package, $abstract ),
        '</para>'
    );
}


Things can also be modified.

package Pod::Knit::Plugin::Methods;

...

sub transform {
    my( $self, $doc ) = @_;

    my $section = $doc->section( 'methods' );

    $doc->find( 'method' )->each(sub{
        $_->detach;
        $_->tagname( 'head2' );
        $section->append($_);
    });

}



Or reordered.

package Pod::Knit::Plugin::Sort;

use Moose;

with 'Pod::Knit::Plugin';

has "order" => (
    isa => 'ArrayRef',
    is => 'ro',
    lazy => 1,
    default => sub {
        []
    },
);

sub transform {
    my( $self, $doc ) = @_;

    my $i = 1;
    my %rank = map { uc($_) => $i++ } @{ $self->order };
    $rank{'*'} ||= $i;   # not given? all last

    my %sections;
    $doc->find('head1')->each(sub{
            $_->detach;

            my $title = uc $_->find('title')->first->text =~ s/^\s+|\s+$//gr;
            $sections{$title} = $_;
    });

    for my $s ( sort { ($rank{$a}||$rank{'*'}) <=> ($rank{$b}||$rank{'*'}) } keys %sections ) {
        $doc->append( $sections{$s} );
    }
}

1;


Basically, anything goes.

Delicious sausages come out

With the transformed document being XML with a specific schema, we're now free to use whatever transformation engine we want. To make the prototype go full circle, I had to re-translate that XML into POD. And to do that, I resorted to good ol' insane XML::XSS:

package Pod::Knit::Output::POD;

use Moose::Role;

use XML::XSS;

sub as_pod {
    my $self = shift;

    my $xss = XML::XSS->new;

    $xss->set( 'document' => {
        pre => "=pod\n\n",
        post => "=cut\n\n",
    });

    $xss->set( "head$_" => {
        pre => "=head$_ ",
    }) for 1..4;

    $xss->set( 'title' => {
        pre => '',
        post => "\n\n",
    });

    $xss->set( 'verbatimformatted' => {
        pre => '',
        content => sub {
            my( $self, $node ) = @_;
            my $output = $self->render( $node->childNodes );
            $output =~ s/^/    /mgr;
        },
        post => "\n\n",
    });

    $xss->set( 'item-text' => {
        pre => "=item ",
        post => "\n\n",
    });

    $xss->set( 'over-text' => {
        pre => "=over\n\n",
    });

    $xss->set( '#text' => {
        filter => sub {
            s/^\s+|\s+$//mgr;
        }
    } );

    $xss->set( 'para' => {
        content => sub {
            my( $self, $node ) = @_;
            my $output = $self->render( $node->childNodes );
            $output =~ s/^\s+|\s+$//g;
            return $output . "\n\n";
        },
    } );

    $xss->render( $self->as_xml );
}

1;


Mind you, it's nowhere near complete. But it's enough to make Pod::Knit take

package Foo::Bar;
# ABSTRACT: Do things

=head1 SYNOPSIS

    blah blah
    blah

=method one

Do things

    $self->one
    and   some    stuff


=method two

Do other things.

Not used often.

=head1 DESCRIPTION

Blah

=head2 subtitle

More blah

=over

=item foo

something

=item bar

=back



and end up with

=pod

=head1 name

Foo::Bar - Do things

=head1 SYNOPSIS

    blah blah
    blah


=head1 DESCRIPTION

Blah

=head2 subtitle

More blah

=over

=item foo

something

=item bar

=head1 methods

=head2 one

Do things

    $self->one
    and   some    stuff

=head2 two

Do other things.

Not used often.

=head1 authors

=over

=item Yanick Champoux

=cut


Laufeyjarson writes... » Perl: PBP: 067 C-Style Loops

The Best Practices cast C-style for loops out of our lexicon. The PBP says simply, “No, don’t do this.” Personally, I don’t know why, because I’ve never had a problem with them; I came from C, and that’s how you write for loops. However, I’ve seen other people who had a less tortured background struggle with them. They are apparently not clear to some people.

When I’m in a bad mood, I’ll tell them to learn the darn language they’re using, and get over it.  Then I realize that’s the equivalent of “Get off my lawn!” and I try to reel myself back in.

One thing that is true about the C-style for loop is that, if you actually need what it does, it can almost always be written as a while loop instead. While I like having the increment step in the for header instead of in the body where it could be missed (hello, infinite loop!), it isn’t critical, and the same loop written as a while is clearer to a lot of people.

Perl also provides another handy loop, foreach, for counting and array iteration, which handles a lot of the cases you’d use the C-style for loop for anyway.
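
For illustration (my example, not from the book or this post), here is the same counting loop written all three ways:

use strict;
use warnings;

my @items = qw( alpha beta gamma );

# C-style for loop: condition and increment both up front
for ( my $i = 0; $i < @items; $i++ ) {
    print "$i: $items[$i]\n";
}

# The same loop as a while: the increment moves into the body,
# where it is easier to miss (hello, infinite loop!)
my $i = 0;
while ( $i < @items ) {
    print "$i: $items[$i]\n";
    $i++;
}

# The foreach form, which covers most counting and iteration cases
for my $index ( 0 .. $#items ) {
    print "$index: $items[$index]\n";
}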

I generally don’t use the C form any more, but I do occasionally think, “It would be perfect here!”

Perl.org NOC: rt.cpan.org SSL certificate update

Tis the season for SSL certificate renewals.  For the past five years, rt.cpan.org's SSL certificate was sponsored by IT-Kartellet.  This year, Best Practical Solutions has picked up the cert as part of our ongoing support of rt.cpan.org.

Please let us know of any problems with the update at our usual address, rt-cpan-admin at bestpractical.com.

brian d foy: The 2014 White Camel Awards

In the lands where the camel roams, the white camel is a rare and revered individual. Each year, The Perl Foundation recognizes significant non-technical achievement in Perl with the White Camel Awards. This is the 15th year we've done this, and I don't think anyone will be surprised by this year's recipients. Oh, and happy birthday Perl (from Saint Petersburg and Saint Perl 6, GMT+0300)!

Perl community - Amalia Pomian

Amalia Pomian takes care of everything when organizing the cluj.pm events: booking the place to hold the meetings, creating the schwag, taking care that the guest speakers have a great itinerary here, arranging the talks, promoting the events, keeping in touch with all the participants, and most other things.

Perl user groups - VM Brasseur

VM Brasseur now runs the San Francisco Perl mongers (now on Meetup) and has been instrumental in keeping that group running smoothly and constantly growing. She also runs the Perl Companies project to mine job adverts data to identify organizations using Perl. Coincidentally, Fred Moyer, the former organizer of the same group, received a White Camel Award last year.

Perl advocacy - Neil Bowers

Neil Bowers went on a tear this year with CPAN advocacy and participation. He's highlighted areas that need attention, advocated for different and better ways to handle CPAN, and motivated the community to take up the good fight. His blog, The good, the bad, and the beautiful, is a gold mine of CPAN advice. Curiously, it was only this year that he attended his first Perl mongers meeting.

Neil, a bit annoyed that we apparently pass over Mark Keating each year, awarded him a Silver Camel. We'd love to give Mark his own White Camel once he moves on from his roles inside The Perl Foundation (the people who give out the award)!

Ovid: ZipRecruiter Wants You

By now I'm sure that some of you have heard about ZipRecruiter, the job board startup that recently picked up $63 million in funding and whose backend is written almost entirely in Perl using DBIx::Class, Catalyst, and Template Toolkit. And they use sqitch for sane database management.

You'll probably recognize some of the names of people who work with them. Randal Schwartz has been consulting with them for a year. Mark Jason Dominus works there and has released some nifty open source software he wrote for them. While I don't claim to be as talented as Randal or Mark, I have been consulting there for a while now, and it's huge amounts of fun. There are also tons of men and women who aren't as publicly involved in tech communities but who are nonetheless very talented.

They're growing like mad and hiring for quite a few positions. And yes, they need Perl developers (and Python, and Javascript, and, and, and ...). And they do allow remote work.

Aside from the benefits they list in working there, here are some that I know readers of this blog will appreciate:

  • They're happy to hear new ideas
  • They love it when people write tests
  • They love to see refactoring to cleaner designs

I might add that when they picked up that $63 million in funding, they were already very profitable and are still growing like mad. Come join us and tell 'em I sent you (note: I don't get any perks for this. I just like the company and want to help Perl devs, too).

Perl Hacks: Slideshare Stats

For many years (since the end of 2007, apparently) I’ve been uploading the slides from my talks and training courses to Slideshare.

This morning I got an email from them, telling me that they had made their analytics pages freely available. I don’t know if this is a permanent change or a special offer, but the link (which will only work for logged in users) is http://www.slideshare.net/insight.

There’s a lot of information there and I look forward to digging into it in a lot more detail. But I thought it would be interesting to share the list of my top ten most popular slide decks.

Title                                       Views
Introduction to Perl – Day 1                71722
LPW: Beginners Perl                         50935
Modern Web Development with Perl            33034
Modern Perl for Non-Perl Programmers        27376
Matt’s PSGI Archive                         24341
Introduction to Web Programming with Perl   22544
Introduction to Perl – Day 2                20489
Introduction to Modern Perl                 17709
Introducing Modern Perl                     13871
Modern Core Perl                            11337

A lot of those courses are aimed at people who are starting Perl from scratch. I guess it’s true that there are plenty of people out there who still want to learn Perl.

The post Slideshare Stats appeared first on Perl Hacks.

Ovid: Veure: Building the Look-and-Feel

No Perl in this post. This is mainly for the people who've asked me to keep blogging about the creation of Veure.

In trying to explain to a designer what was going on, I had to make it clear that "space stations" in Veure aren't the tiny doughnut things that Ronnie Raygun envisioned launching (those of you who are old enough might just remember the "Raygun" reference).

Instead, as far in the future as Veure is imagined, creating a new space station involves dropping a robot on a large asteroid, letting it hollow out said asteroid, spinning it, sealing it, and then building on the inside. I threw together this (poor) concept art to show her what I meant (the white bar at the top is the central lighting bar for the station).

Source image of Plzeň od Karlova courtesy Wikimedia Commons

That was important because she might help us further refine our look and feel to differentiate ourselves from our competitors. Frankly, many of the most popular games in this space seem embarrassing in how amateur they look. One, in fact, is still optimized for 800x600 displays and many of the others would have looked amateurish in the 90s, not to mention today. That's not to say that this is necessarily the most brilliant design I can come up with:

However, even that is a huge improvement over this:

That's the main character page from Hobo Wars, a game that's been running for years and lists 16 people on staff (to be fair, the company has a few other games out there and that's very interesting).

The biggest player in this space remains Torn City (sign up for Torn and message me in the game if you want to understand this market better) and they've recently updated to a much more modern look and feel. However, like many in this space, they claim to be an RPG (role-playing game) when, in fact, they're not. They're what we call an MMOBBG (massively-multiplayer online bulletin board game) where there's a great game with many things to do, but calling them RPGs is like claiming you can role play a king in chess. There's no world; it's all game. That quibble aside, Torn has done a great job of cleaning up their act and looking modern. Here's the publicly viewable portion of my character page there:

So, improving the look and feel means more work, but that will be a side-issue compared to the main work of building the universe. Currently, I'm improving the mission system to allow more flexible objectives. Wear-and-tear on spaceships is being implemented and it's possible that some interesting work in creating a self-managed galactic economy might start soon.

On the business side, we have a rough (and conservative) financial plan drawn up, but it's complicated by the fact that we don't appear to have real competitors in our space and for the business model we're following, there's not much information out there.

We also are starting work on the marketing aspects. As most experienced indy game developers will tell you, marketing is the difference between a great game and a successful one. Some aspects are hard. How do you make an intro video for a text-based game? We'll also need screen shots, press releases, a landing page, and so on. None of which we can launch prior to sorting out trademark.

Sadly, I don't realistically think the alpha will launch before the end of 2015. We could get it done much sooner, but ...

  • Contributor legal agreements are being researched
  • Getting some required trademarks is proving time-consuming
  • No one is working full-time on this

But rest assured, working we are.

As an aside, if you know of any MMOBBGs which are both popular and really offer role-playing, I would love to know about them.

As usual, let me know what you think!

Perl Foundation News: 3 Perl Interns Accepted for the Outreach Program for Women

The winter round of the Outreach Program for Women has begun and will run from the 9th December 2014 to the 9th March 2015. There are forty-four participants in this round and three of them will be working on Perl. When we announced that we would be taking part in the program again we had funding for one intern. There is additional funding available for good candidates and thanks to the generosity of the GNOME Foundation and their sponsors we have three interns this round.

Snigdha Dagar will be working with her mentor Sawyer X on Dancer. We have two MetaCPAN interns, Rose Ames and Andreea Monica Pirvulescu, who will be working with Olaf Alders and Matt Phillips.

I would like to wish the interns every success on their work and I look forward to reading about their achievements.

Perl.org NOC: Greylisting

We have experimentally enabled greylisting on the mail servers that accept mail for cpan.org and pm.org and some other domains.

We expect that this will result in a reduction in the amount of spam that makes it through our filters.

You may notice a slight delay in mail delivery, as your MTA may have to retry to satisfy the greylisting filter.

Sawyer X: New Dancer2 release already waiting on CPAN: 0.157000

Hey everyone!

A new version of Dancer2 has been shipped yesterday and is already waiting for you at a mirror nearby!

It has come to our attention that some people don’t follow public announcements made on mailing lists, so I will also post an announcement on this blog whenever I release a new Dancer2 version (except for patch versions).

Having a weekly release (which is a new habit we’re trying to maintain) means we will usually not have big changes to report. We’re happy about that. We get to spend time working on iterative improvements that eventually end up as bigger changes.

This release carries the following interesting changes:

  • Fixes for installing on Windows.
  • Better skeleton when scaffolding.
  • No server tokens in scaffolded production environment config file.
  • LWP::UserAgent no longer required (not even for testing), same with Test::TCP.
  • Lots of improvements to the migration document.

I’d like to thank all those involved in this release (by order of appearance in the Changes log): Dávid Kovács, Chi Trinh, Christian Walde, and Gabor Szabo.

Changes file follows:

BUG FIXES

  • GH #799: Set current request earlier so log formats using requests will work. (Sawyer X)
  • GH #650: Provide default environment to app for templating. (Dávid Kovács, Chi Trinh)
  • GH #800: Better portability code, for different Windows situations. (Christian Walde)
  • Less littering of the test directories with session files. (Sawyer X)

ENHANCEMENT

  • GH #810: strict && warnings in the app.pl. (Sawyer X)
  • Use to_app keyword in skeleton. (Sawyer X)
  • GH #801: Under production, server tokens are disabled. (Sawyer X)
  • GH #588, #779: Remove LWP::UserAgent in favor of HTTP::Tiny. (Dávid Kovács, simbabque, Sawyer X)
  • Remove all usages of Test::TCP in favor of Plack::Test. (Sawyer X)

DOCUMENTATION

  • GH #802: Remove indication of warnings configuration option and add explanation in migration document. (Sawyer X)
  • GH #806: Link in main docs to the migration document. (Gabor Szabo)
  • GH #807: Update migration document with more session data, changes to app.pl, and Template::Toolkit configuration. (Gabor Szabo)
  • GH #813: Update migration document with information on encoding and usage of Plack::Request internally. (Gabor Szabo, Sawyer X)

Laufeyjarson writes... » Perl: PBP: 066 Negative Control Statements

The PBP simply states, “Don’t use unless or until at all.”  I don’t agree with the strength of that statement.

I find it much clearer in most situations to write “Unless Thing” than “If Not Thing”. That is especially true in the postfix cases, where I feel if is tolerable.

Unless blocks can be a little less clear, but simple expressions are fine.  If you need to use extra parens just for the “not”, then unless might be right.  If it’s already very complex, the not and a regular if will be okay too.  It depends on the complexity of the statement.

Oddly, I don’t feel the same about until.  I would rather see “while Not Thing” than “until Thing”.  I used to use a C-like do { … } until (); block, but have gotten over it.
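
A few of the forms under discussion, side by side (my own illustration, not from the book):

use strict;
use warnings;

my %config = ( verbose => 0 );

# Postfix unless versus a negated postfix if
print "quiet mode\n" unless $config{verbose};
print "quiet mode\n" if !$config{verbose};

# A simple unless block is still clear enough
unless ( $config{verbose} ) {
    print "suppressing extra output\n";
}

# "until Thing" versus "while Not Thing": the loops are equivalent,
# and the paragraph above prefers the second spelling
my $done = 0;
until ( $done )  { $done = 1 }

$done = 0;
while ( !$done ) { $done = 1 }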

There’s a big sidebar (see page 99) that discusses this, and the conclusions there are almost exactly what I come to.  I do think it’s worth the effort, but that it can get confusing.

Laufeyjarson writes... » Perl: PBP: 065 Other Postfix Modifiers

The PBP suggests one simple thing for using other postfix control structures, such as unless, for, while, and until.  It says: Don’t.

All the comments Mr. Conway makes about block forms being easier to read in most cases, and about avoiding $_, are strong as far as I’m concerned.  Not being able to name the iterator means that if you’re doing something more than trivial, you wind up using $_ or the even stranger defaults.


print for grep {defined $_} @generated_lines;

This makes my teeth squeak, especially when it’s so much clearer in the block form, for the cost of two extra lines with { and } on them.  Oh wait, that’s a benefit here.
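
For comparison, here is that loop in the block form being argued for (my rendering, not Conway's):

my @generated_lines = ( "one\n", undef, "two\n" );

# Two extra lines, but the iterator has a name and the defined()
# check reads as an explicit guard clause
for my $line (@generated_lines) {
    next unless defined $line;
    print $line;
}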

In general, avoid most of the postfix forms, except when the block form isn’t actually clearer.

brian d foy: My Perl recruitment thoughts

Dave Cross posted his Perl Recruitment Thoughts, which led to the same tired responses we see every time someone is frustrated enough to bring it up. Again. In the past decade I've written this post about every six months, decided it wasn't worth the shitstorm I'd get for posting it, and let it die. This time, I'll write just the highlights, turn off comments, and let people who care enough to rant do it on their own blogs.

First, Dave does quite a bit of work to make new Perl programmers. He teaches accessible and cheap classes in London (and anywhere that will have him). I don't work in the UK, so I can't speak to the particular things he sees. I teach all over the US, write the books, and occasionally step into companies to un-screw up whatever they have going on. Here's what I've learned in 20 years of doing this, but, as I said, just the highlights.


It doesn't matter what universities teach. If you think a university is a trade school and that the graduates are going to show up ready to work, it's already game over. If someone coming out of a university can't pick up a new language, why would you hire them? They should already have the skills to learn new tools. Outside of Perl, they are going to have to learn all sorts of things to be useful, including...

Your architecture matters more than the language. It's much harder to figure out how all the pieces fit together than it is to use some Perl in a method. Avar said as much in the reddit thread. But, you didn't design what you have and the documentation is a mess, even though you provide a wiki that no one updates. Everything accreted over years from a succession of programmers who quit when they got tired of the mess. You don't have anyone in charge of the idea, so you let anyone with a keyboard do it based on whatever fire you want to put out that week. And you have no tests, so everyone is afraid to change things. Because...

You offer no professional development. I have more than a couple of very successful customers who make their own Perl programmers. They have a career ladder that takes people from almost no tech skills and turns them into programmers in a couple of years. Yeah, years. They are pro-active in professional development and there's something for new hires in the trenches to aspire to. Are you fully mining the job market or filling a position for several months until you wear out someone? Which leads to...

Your company has a bad reputation. I've tried to help at more than a few places where the word about town is to avoid your company. Part of my work is always to track down people who used to work on the code to find out what advice they have. Most of the time I get warnings that aren't technical and the people who might fill the job know the same gossip. But, that doesn't matter if...

You aren't doing something interesting. The really good Programmers I know don't care that much about the tools as long as they are decent tools. They want to work on interesting problems. Of all the problems out there, most aren't interesting. Of the ones that can be, bad management can make them intolerable. Some companies think they can make up for that with money. Some think they can offer equity because they have a three year exit plan that involves a buy out, so...

You don't pay enough. Well, you don't for the level of pre-packaged skills you want to parachute into your mess. Even then, you kill them with the death of a thousand cuts. You start by saying you want them to work on a test suite and within a month they are fighting fires like the rest of the burnt out crew. Your daily agile standup takes 45 minutes and people sit down. You never get out of the mess that's causing the problem. But...

Pay doesn't matter if you suck. I know plenty of really good Programmers who'd rather be poor than work in most environments managers let them have. I know of very few places where the programmers don't gripe about the obstacles to getting things done, and many of those gripes are social obstacles. Sometimes that means you need to fire particular people that bring down the entire team. I know many companies that bleed talent because they don't get rid of the non-performing black holes of negativity who don't document the institutional knowledge that has become their job security.

The people you want to hire don't know about your job. If you're merely posting job adverts, you're only getting the people who don't have jobs. I've never hired a Perl programmer that way. I hire them away from jobs they already have when they aren't looking. This is why I (and others) invented Perl mongers. We designed Perl mongers as a business networking medium. Presentations were rare in the beginning. Drinking and socializing were the intent. Personal relationships lead to opportunities. That's not what happens now in many places. You don't know the rock stars who might turn around your company because you don't even know who they are, much less what they're interested in.

You aren't where anyone is. You might be aces in everything, but if you've set up shop where nobody is and where nobody wants to go, don't expect the same response to a job opening that you'd get in San Francisco, New York, or London.


In short, the language doesn't matter. There's much more going on in the job market and the employment opportunities that people like to blame on the language. As programmers, however, we know the ultimate excuse of the poor worker is the tools.

We don't need everyone using Perl to make it smart for businesses; we need just enough to make it easy to get things done. We have enough. Anyone wanting to use the language is going to find an engaged, interested, enthusiastic, and motivated community. They are going to find fresh releases. They will find libraries, modules, and frameworks to handle what they need. They will get their questions answered by top-shelf people. They will find answers in StackOverflow. The employers have all the tools they need to create Perl programmers if that's the language they want to use. There's not anything we can do to make it an order of magnitude easier for them.

But then, my other rant is that we've run out of Programmers. We have people who program for money, but that's not the same thing. Out of all the people in the world, only so many have the talent, skill, and motivation to design (not just type) computer programs. I think that number is very small. That you like and enjoy futzing with computers doesn't make you a Programmer any more than me reading gun magazines or firing a pistol at the range makes me a Navy SEAL.

Your real trick is to hire one real Programmer and let him handle a crew of people with moderate skills (perhaps no talent, though). But then, you'd have to actually think about organizational dynamics and how to train a tech person to be a manager, and then not piss them off so they leave.


Perl Foundation News: Maintaining the Perl 5 Core: Report for Month 14

Dave Mitchell writes:

I spent nearly all my time last month developing a new tool for benchmarking perl itself, Porting/bench.pl. See

http://nntp.perl.org/group/perl.perl5.porters/222802

for the announcement.

Summary

2:00 [perl #123156] /\G^/ seems abnormally slow
0:40 [perl #123198] Memory leak in regex appears in 5.20.1
0:43 [perl #123202] Slow global pattern match in taint mode with input from utf8
41:22 create Porting/bench.pl
11:20 process p5p mailbox

56:05 Total (HH:MM)

As of 2014/11/30, since the beginning of the grant:

59.0 weeks
803.8 total hours
13.6 average hours per week

There are 396 hours left on the grant

PAL-Blog: To Santa Claus

All children learn one golden rule: if you want presents, you have to send your wish list to Santa Claus in time. All children? Well, almost all. Zoe wants presents, of course, but writing a wish list is sooooo boring. Today - 10 days before Christmas - she finally managed to pull herself together and do it. Will that still be in time for Santa Claus?

Perl Hacks: Dev Assistant

A couple of days ago, I updated my laptop to Fedora 21. One of the new features was an application called DevAssistant which claimed that:

It does not matter if you only recently discovered the world of software development, or if you have been coding for two decades, there’s always something DevAssistant can do to make your life easier.

I thought it was worth investigating – particularly when I saw that it had support for Perl.

Starting the GUI and pressing the Perl button gives me two options: “Basic Class” and “Dancer”. I chose the “Basic Class” option. That gave me a dialogue box where I could give my new project a name. I chose “MyClass” (it’s only an example!). This created a directory called MyClass in my home directory and put two files in that directory. Here are the contents of those two files.

main.pl

#!/usr/bin/perl

#use strict;
use warnings;

use POSIX qw(strftime);

use myClass;

my $myClass = new myClass( "Holiday", "Baker Street", "Sherlock Holmes");
my $tm = strftime "%m/%d/%Y", localtime;
$myClass->enterBookedDate($tm);

print ("The hotel name is ". $myClass->getHotelName() . "\n");
print ("The hotel street is ". $myClass->getStreet() . "\n");
print ("The hotel is booked on the name ". $myClass->getGuestName() . "\n");
print ("Accomodation starts at " . $myClass->getBookedDate() . "\n");

myClass.pm

package myClass;

use strict;
use warnings;

sub new {
    my $class = shift;
    my $self = {
        _hotelName => shift,
        _street => shift,
        _name => shift,
        _date => undef
    };
    bless $self, $class;
    return $self;
}

sub enterBookedDate {
    my ($self) = shift;
    my $date = shift;
    $self->{_date} = $date;
}

sub getHotelName {
    my $self = shift;
    return $self->{_hotelName};
}

sub getStreet {
    my $self = shift;
    return $self->{_street};
}

sub getGuestName {
    my $self = shift;
    return $self->{_name};
}

sub getBookedDate {
    my $self = shift;
    return $self->{_date};
}

1;

It’s great, of course, that the project wants to support Perl. I think that we should do everything we can to help them. But it’s clear to me that they don’t have anyone on the team who knows anything about modern Perl practices.

So who wants to volunteer to help them?
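
For what it's worth, here is a rough sketch (mine, not an official DevAssistant template) of how the same class might look with more modern conventions, using Moo for the accessors:

package MyClass;

use strict;
use warnings;
use Moo;    # Moose would work just as well here

has hotel_name  => ( is => 'ro', required => 1 );
has street      => ( is => 'ro', required => 1 );
has guest_name  => ( is => 'ro', required => 1 );
has booked_date => ( is => 'rw' );

1;

# Elsewhere, the caller side might look like:
#
#   use POSIX qw(strftime);
#   my $booking = MyClass->new(
#       hotel_name => 'Holiday',
#       street     => 'Baker Street',
#       guest_name => 'Sherlock Holmes',
#   );
#   $booking->booked_date( strftime '%m/%d/%Y', localtime );
#   print 'The hotel name is ', $booking->hotel_name, "\n";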

Update: So it turns out that the dev team are really responsive to pull requests :-)

The post Dev Assistant appeared first on Perl Hacks.

Gabor Szabo: Perl Maven under 100,000 and above 200,000

For the full article visit Perl Maven under 100,000 and above 200,000

Perl Hacks: Perl Recruitment Thoughts

Not many weeks go by when I don’t hear of another Perl-using company that has been evaluating alternative technologies. In most cases, it’s not because they think that Perl is a bad language to use. The most common reason I hear is that it is becoming harder and harder to find good Perl programmers.

On Quora I recently saw a question asking what job opportunities were like for Perl programmers. This is how I answered:

Right now is a good time to be a Perl programmer. Perl is losing mindshare. Very few new Perl programmers are arriving on the scene and quite a lot of former Perl programmers have moved away from the language to what they see as more lucrative, enjoyable or saleable languages.

But there are still a lot of companies with a lot of Perl code. That all needs to be maintained and enhanced. And many of those companies continue to write new projects in Perl too.

All of which means that it’s a seller’s market for good Perl skills. That won’t last forever, of course. To be honest, I’d be surprised if it lasts for more than five or ten years (well, unless Perl 6 takes off quickly). But it’ll do me for the next few years at least.

I’m putting a positive spin on it, but it’s getting to be a real problem. As more programmers abandon Perl, it becomes harder to find good Perl programmers, which makes it more likely that companies will abandon Perl, which leads to fewer Perl jobs, which convinces even more programmers to abandon Perl. It’s a vicious circle.

I’m not sure how we get to the root of that problem, but I do have some suggestions for one particular area. A client recently asked me for suggestions on how they can improve their hit rate for recruiting good Perl programmers. My suggestions all revolved around making your company better known in the Perl community (because that’s where many of the better Perl programmers are).

I know that many of the Perl-using companies already know this. But in the interests of levelling the playing field, I thought it was worth sharing some of my suggestions.

Perl Mongers Social Meetings

Do you have a local Perl Mongers group? If so, they almost certainly have monthly social meetings. And in many cases they will welcome a company that puts a few quid behind the bar for drinks at one of those meetings. For smaller groups (and there are many smaller groups) you might even offer to buy them dinner.

It’s worth contacting them before doing this. Just turning up and flashing your money around might be seen as rude. And some groups might object to this kind of commercialisation. But it’s always worth asking.

Perl Mongers Technical Meeting

Some Perl Mongers groups have technical meetings (either instead of or as well as social meetings). In this case, instead of meeting in a pub (or bar or restaurant), they’ll meet in the offices of a friendly local company and some of the members will give presentations to the group. Many groups struggle to find venues for these kinds of meetings. Why not offer your office? And perhaps throw in some pizza and beer.

Perl Workshop

The next step up from technical meetings is Perl workshops. Many Perl Mongers groups organise annual one-day workshops. There can be many talks taking place across a number of tracks over the course of (usually) a day. The organisers often like to make these events free (mainly, it seems, because charging for stuff like this adds a whole new layer of complexity). But it’s not free to put on these events so they rely heavily on sponsors. Can you help pay for the venue? Or the printing? Or the catering? Different events will have different opportunities available. Contact the organisers.

YAPC

Workshops are national and (usually) one-day events. YAPC are international conferences that span many days. They have all the same requirements, but bigger. So they need more money. And, of course, sponsors can be at the conference telling potential employees just how wonderful it is to work for them.

The Perl Foundation

The Perl Foundation are the organisation that promotes Perl, holds various Perl trademarks and hosts many Perl web sites. They issue grants for people to work on various Perl-related projects. They never have enough money. They love companies who donate money to them as thanks for the benefit that Perl brings. How much you donate is up to you, but as a guide, most announcements seem to be in the $10,000 range.

In each of these cases, the idea is really to show the Perl community how much you value Perl by helping various Perl organisations to organise events that raise people’s awareness of Perl. Everyone wins. The sponsors get seen as good people to work for and the events themselves demonstrate that modern Perl is still a great language.

So the next time someone in your company asks how they can find good Perl people, consider a different approach. Can you embed your company in the consciousness of the Perl community and make yourselves look more attractive to some of the best Perl programmers in the world?

The post Perl Recruitment Thoughts appeared first on Perl Hacks.

NEILB: Proposed convention for todo lists on CPAN

I think it would be helpful to establish (more of) a convention for recording your todo list for a distribution with the distribution itself. Some dists already have a TODO file. I can't find any proposed conventions for this (eg in Perl Best Practices), so how about we say it's markdown, call it TODO.md, and get MetaCPAN to present it on a distribution's home page, like it does for the Changes in the most recent release?

Perl Foundation News: Grants Committee 2014 Year-End Report

Grants Committee Year-End Report is here in PDF format.

It's a bit on the formal side; I'll post a less formal one at my Grants Committee Secretary blog.

The Cattle Grid: Graphemes, code points, characters and bytes

The origins of Unicode date back to 1987, but it wasn't until the late '90s that it became well known, and general adoption really picked up after the year 2000. General adoption was possible mainly thanks to UTF-8, the encoding (dating back to 1993, by the way) which provided full compatibility with the US-ASCII character set. Anyway, this is a history that most of us know, and by now it's clear to most that characters do not map to bytes anymore. Here's a small Perl 5 example for this:

# Perl 5
use v5.20;
use Encode qw/encode/;

my $snoopy = "cit\N{LATIN SMALL LETTER E WITH ACUTE}";

say '==> ' . $snoopy;
say 'Characters (code points): ' . length $snoopy;
say 'Bytes in UTF-8: ' . length encode('UTF-8', $snoopy);
say 'Bytes in UTF-16: ' . length encode('UTF-16', $snoopy);
say 'Bytes in ISO-8859-1: ' . length encode('ISO-8859-1', $snoopy);

The output is (as expected):

==> cité
Characters (code points): 4
Bytes in UTF-8: 5
Bytes in UTF-16: 10
Bytes in ISO-8859-1: 4

OK, this is well known. However, if you assume that you are safe when thinking in characters instead of bytes, well, you're wrong. Here are two examples, one in JavaScript (ECMAScript) and one in Perl 5:

/* JavaScript */
var snoopy = "cit\u00E9";
var lucy = "cit\u0065\u0301";

window.document.write('Code points ' + snoopy + ': ' + snoopy.length);
window.document.write('Code points ' + lucy + ': ' + lucy.length);

# Perl 5
my $snoopy = "cit\N{LATIN SMALL LETTER E WITH ACUTE}";
my $lucy = "cit\N{LATIN SMALL LETTER E}\N{COMBINING ACUTE ACCENT}";

say "Code points $snoopy: " . length $snoopy;
say "Code points $lucy: " . length $lucy;

The output of both these scripts is:

Code points cité: 4
Code points cité: 5

Ach! What happened here, with the same (apparently 4-character) string ending up with two different lengths?!?

First of all, we should ditch the concept of character, which is way too vague (not to mention that in some contexts it's still a byte) and use the concepts of code point and grapheme. A code point is any "thing" that the Unicode Consortium assigned a code to, while a grapheme is the visual thing you actually see on the computer screen. Both strings in our example have 4 graphemes. However, snoopy contains an é using latin small letter e with acute (U+00E9), while in lucy the accented e is made up of two different code points: latin small letter e (U+0065) and combining acute accent (U+0301); since the accent is combining, it joins with the letter before it into a single grapheme.

Comparison is a problem as well, as the two strings will not compare equal to each other - and this might not be what you expect. This is a non-problem in languages such as Perl 6:

# Perl 6
# These are like "length" in JavaScript and Perl 5
say $snoopy.codes;    # 4
say $lucy.codes;      # 5

# These actually count the graphemes
say $snoopy.graphs;   # 4
say $lucy.graphs;     # 4

If you don't have Perl 6 on hand, you need to normalize strings, which means bringing them to the same Unicode form. In JavaScript this is possible only starting from ECMAScript 6. Even though current browsers (Firefox 34, Chrome 39 at the time of this article) do not fully support it (not surprisingly, as the standard will be finalized in 2015), Unicode normalization is (luckily) already there. Let's see some examples:

/* JavaScript */
window.document.write('NFD code points ' + snoopy + ': ' + snoopy.normalize('NFD').length);
window.document.write('NFD code points ' + lucy + ': ' + lucy.normalize('NFD').length);
window.document.write('NFC code points ' + snoopy + ': ' + snoopy.normalize('NFC').length);
window.document.write('NFC code points ' + lucy + ': ' + lucy.normalize('NFC').length);

# Perl 5
use Unicode::Normalize;

say "NFC code points $snoopy: " . length NFC($snoopy);
say "NFC code points $lucy: " . length NFC($lucy);
say "NFD code points $snoopy: " . length NFD($snoopy);
say "NFD code points $lucy: " . length NFD($lucy);

The output should be:

NFC code points cité: 4
NFC code points cité: 4
NFD code points cité: 5
NFD code points cité: 5

We're using a couple of normalization forms here. One is NFD (canonical decomposition), where all the code points are decomposed: in this case, the é always ends up made of 2 code points. The second one is NFC (canonical decomposition followed by canonical composition), where you get a string with all characters made of one code point where possible (not all the combining code point sequences of the string may be representable as single code points, so even in the NFC form the number of graphemes might be different from the number of code points): in this case, the é becomes made of one code point.

In this specific case, since snoopy is fully composed and lucy is fully decomposed, you could (de)compose only one of the strings. This should, however, be avoided, since you likely don't know what's in the strings you get - so always normalize both. Please note that there's much more behind normalization: you can take a look here for more information.

So it's now clear enough how to know the length of a string in bytes, code points and graphemes. But what should be the default way of determining a string length? There's no unique answer to this: most languages return the number of code points, while others such as Perl 6 return the number of graphemes. If you have a database field which can hold up to a certain number of characters, that probably means code points, so you should use those to check the length of a string. If you are validating the length of some user input, you likely want to use graphemes: a user would not understand a "please enter a maximum of 4 characters" error when entering cité. The length in bytes is necessary when you are working with memory or disk space: of course, the length in bytes should be determined on the string encoded in the character set you plan to use.

It's worth noting that an approach such as "well, I'll just write cité in my code instead of using all those ugly code points" is not recommended. First of all, most of the time you are not the one writing the string: you take input from somewhere. Then, by writing this code:

var str1 = "cité";
var str2 = "cité";

window.document.write(str1 + ' - ' + str1.length + '');
window.document.write(str2 + ' - ' + str2.length + '');

I've been able to get this result:

cité - 4
cité - 5

You should be able to copy and paste the above code and get an identical result, because my browser and blog software didn't normalize it (which is scary enough, but useful in this particular case)....
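
As a footnote to the comparison problem mentioned above (my addition, not part of the original article), the usual fix is the same: normalize both strings to the same form before comparing them.

# Perl 5
use v5.20;
use Unicode::Normalize qw(NFC);

my $snoopy = "cit\N{LATIN SMALL LETTER E WITH ACUTE}";
my $lucy   = "cit\N{LATIN SMALL LETTER E}\N{COMBINING ACUTE ACCENT}";

say $snoopy eq $lucy           ? 'equal' : 'not equal';   # not equal
say NFC($snoopy) eq NFC($lucy) ? 'equal' : 'not equal';   # equal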

Laufeyjarson writes... » Perl: PBP: 064 Postfix Selectors

Having given a concrete statement of “… always use the block form of if,” the PBP then gives you a time and place to use a postfix if. It says that’s okay if you’re using it for a flow control statement like next, return, last, etc.

The goal of this exception is to make those control structures more visible, and keep them from being hidden deep in code.  Fine.

I see this as similar to the exception I like at the start of the function, and find it a fine use of a postfix if.
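
A small example of the kind of thing this exception covers (mine, not from the book):

use strict;
use warnings;

LINE:
while ( my $line = <DATA> ) {
    next LINE if $line =~ /^\s*#/;    # skip comments: the control flow stays visible
    last LINE if $line =~ /^END$/;    # bail out early, again right up front
    print $line;
}

__DATA__
# a comment
real data
END
more data we never reach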

The Cattle Grid: Brittany + Normandy + Guernsey 2012

And now the 2012 diary arrives as well. This time we chose the French coast on the English Channel: from Cognac all the way to Dunkirk, with a few digressions....

Perlgeek.de : A new Perl 6 community server - update

In my previous post I announced my plans for a new Perl 6 community server (successor to feather.perl6.nl), and now I'd like to share some updates.

Thanks to the generosity of the Perl 6 community, the server has been ordered and paid for. I am now in the process of contacting those donors who haven't paid yet, leaving them the choice to re-purpose their pledge to ongoing costs (traffic, public IPv4 addresses, domain(s), SSL certs if necessary) and maintenance, or to withdraw their pledges.

Some details of the hardware we'll get:

  • CPU: Intel® Xeon® Haswell-EP Series Processor E5-2620 v3, 2.40 GHz, 6-Core Socket 2011-3, 15MB Cache
  • RAM: 4x8GB DDR4 PC2133 Reg. ECC 2R
  • HD: 2x 2TB SATA3-HD

The vendor has told me that all parts have arrived, and will be assembled today or tomorrow.

Currently I lean towards using KVM to create three virtual hosts: one for websites (*.perl6.org, perlcabal.syn), one for general hacking and IRC activity, and one for high-risk stuff (evalbots, try.rakudo.org, ...).

I've secured the domain p6c.org (for "perl 6 community"), and the IPv4 range 213.95.82.52 - 213.95.82.62 and the IPv6 net 2001:780:101:ff00::/64.

So the infrastructure is in place, now I'm waiting for the delivery of the hardware.

Dave's Free Press: Journal: Devel::CheckLib can now check libraries' contents

Perlgeek.de : Rakudo's Abstract Syntax Tree

After or while a compiler parses a program, the compiler usually translates the source code into a tree format called an Abstract Syntax Tree, or AST for short.

The optimizer works on this program representation, and then the code generation stage turns it into a format that the platform underneath it can understand. Actually I wanted to write about the optimizer, but noticed that understanding the AST is crucial to understanding the optimizer, so let's talk about the AST first.

The Rakudo Perl 6 Compiler uses an AST format called QAST. QAST nodes derive from the common superclass QAST::Node, which sets up the basic structure of all QAST classes. Each QAST node has a list of child nodes, possibly a hash map for unstructured annotations, an attribute (confusingly) named node for storing the lower-level parse tree (which is used to extract line numbers and context), and a bit of extra infrastructure.

The most important node classes are the following:

QAST::Stmts
A list of statements. Each child of the node is considered a separate statement.
QAST::Op
A single operation that usually maps to a primitive operation of the underlying platform, like adding two integers, or calling a routine.
QAST::IVal, QAST::NVal, QAST::SVal
Those hold integer, float ("numeric") and string constants respectively.
QAST::WVal
Holds a reference to a more complex object (for example a class) which is serialized separately.
QAST::Block
A list of statements that introduces a separate lexical scope.
QAST::Var
A variable.
QAST::Want
A node that can evaluate to different child nodes, depending on the context it is compiled in.

To give you a bit of a feel for how those node types interact, I want to give a few examples of Perl 6 code and the ASTs it could produce. (It turns out that Perl 6 is quite a complex language under the hood, and usually produces a more complicated AST than the obvious one; I'll ignore that for now, in order to introduce you to the basics.)

Ops and Constants

The expression 23 + 42 could, in the simplest case, produce this AST:

QAST::Op.new(
    :op('add'),
    QAST::IVal.new(:value(23)),
    QAST::IVal.new(:value(42)),
);

Here a QAST::Op encodes a primitive operation, an addition of two numbers. The :op argument specifies which operation to use. The child nodes are two constants, both of type QAST::IVal, which hold the operands of the low-level operation add.

Now the low-level add operation is not polymorphic: it always adds two floating-point values, and the result is a floating-point value again. Since the arguments are integers and not floating-point values, they are automatically converted to float first. That's not the desired semantics for Perl 6; actually the operator + is implemented as a subroutine named &infix:<+>, so the real generated code is closer to

QAST::Op.new(
    :op('call'),
    :name('&infix:<+>'),    # name of the subroutine to call
    QAST::IVal.new(:value(23)),
    QAST::IVal.new(:value(42)),
);

Variables and Blocks

Using a variable is as simple as writing QAST::Var.new(:name('name-of-the-variable')), but it must be declared first. This is done with QAST::Var.new(:name('name-of-the-variable'), :decl('var'), :scope('lexical')).

But there is a slight caveat: in Perl 6 a variable is always scoped to a block. So while you can't ordinarily mention a variable prior to its declaration, there are indirect ways to achieve that (lookup by name, and eval(), to name just two).

So in Rakudo there is a convention to create QAST::Block nodes with two QAST::Stmts children. The first holds all the declarations, and the second all the actual code. That way all the declarations always come before the rest of the code.

So my $x = 42; say $x compiles to roughly this:

QAST::Block.new(
    QAST::Stmts.new(
        QAST::Var.new(:name('$x'), :decl('var'), :scope('lexical')),
    ),
    QAST::Stmts.new(
        QAST::Op.new(
            :op('p6store'),
            QAST::Var.new(:name('$x')),
            QAST::IVal.new(:value(42)),
        ),
        QAST::Op.new(
            :op('call'),
            :name('&say'),
            QAST::Var.new(:name('$x')),
        ),
    ),
);

Polymorphism and QAST::Want

Perl 6 distinguishes between native types and reference types. Native types are closer to the machine, and their type name is always lower case in Perl 6.

Integer literals are polymorphic in that they can be either a native int or a "boxed" reference type Int.

To model this in the AST, QAST::Want nodes can contain multiple child nodes. The compile-time context decides which of those is actually used.

So the integer literal 42 actually produces not just a simple QAST::IVal node but rather this:

QAST::Want.new(
    QAST::WVal.new(Int.new(42)),
    'Ii',
    QAST::IVal.new(:value(42)),
)

(Note that Int.new(42) is just a nice notation to indicate a boxed integer object; it doesn't quite work like this in the code that translates Perl 6 source code into ASTs.)

The first child of a QAST::Want node is the one used by default, if no other alternative matches. Then comes a list where the elements at odd indexes are format specifications (here Ii for integers) and the elements at even indexes are the ASTs to use in those cases.

An interesting format specification is 'v' for void context, which is always chosen when the return value from the current expression isn't used at all. In Perl 6 this is used to eagerly evaluate lazy lists that are used in void context, and for several optimizations.
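
As a rough illustration of the shape such a node could take (the routine names below are made up for this sketch, not code that Rakudo actually emits), a QAST::Want node with a void-context alternative might look like this:

QAST::Want.new(
    QAST::Op.new( :op('call'), :name('&build-list') ),    # default: keep the (lazy) result
    'v',
    QAST::Op.new(                                          # void context: evaluate eagerly
        :op('call'),
        :name('&eagerly-evaluate'),
        QAST::Op.new( :op('call'), :name('&build-list') ),
    ),
)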

Dave's Free Press: Journal: I Love Github

Dave's Free Press: Journal: Palm Treo call db module

Ocean of Awareness: Removing obsolete versions of Marpa from CPAN

Marpa::XS, Marpa::PP, and Marpa::HTML are obsolete versions of Marpa, which I have been keeping on CPAN for the convenience of legacy users. All new users should look only at Marpa::R2.

I plan to delete the obsolete releases from CPAN soon. For legacy users who need copies, they will still be available on backPAN.

I do this because their placement on CPAN makes them "attractive nuisances" -- they show up in searches and generally make it harder to find Marpa::R2, which is the version that new users should be interested in. There is also some danger that a new user could, by mistake, use the obsolete versions instead of Marpa::R2.

It's been some time since someone has reported a bug in their code, so they should be stable for legacy applications. I would usually promise to fix serious bugs that affect legacy users, but unfortunately, especially in the case of Marpa::XS, it is a promise I would have trouble keeping. Marpa::XS depends on Glib, and uses a complex build which I last performed on a machine I no longer use for development.

For this reason, a re-release to CPAN with deprecatory language is also not an option. I probably would not want to do so anyway -- the CPAN infrastructure by default pushes legacy users into upgrading, which always carries some risk. New deprecatory language would add no value for the legacy users, and they are the only audience these releases exist to serve.

Comments

Comments on this post can be made in Marpa's Google group, or on our IRC channel: #marpa at freenode.net. To learn more about Marpa, there's the official web site maintained by Ron Savage. I also have a Marpa web site.

Perlgeek.de : A new Perl 6 community server - call for funding

So far, many Perl 6 developers have used feather as a generic development server. Juerd, who has generously provided this server for us for free for many years, has announced that it will be shut down at the end of the year.

My daytime job is at a b2b IT outsourcing and hosting company called noris network, and they have agreed to sponsor the hosting/housing of a 1U 19" server in one of their state-of-the-art data centers in Nürnberg, Germany.

What's missing is the actual hardware. Some folks in the community have already agreed to participate in funding the hardware, though I have few concrete pledges.

So here is the call to action: If you want to help the Perl 6 community with a one-time donation towards a new community server, please send me an e-mail to moritz at faui2k3 dot org, specifying the amount you're willing to pledge, and whether you want to stay private as a donor. I accept money transfer by paypal and wire transfer (SWIFT). Direct hardware donations are also welcome. (Though actual money transfers will be deferred until the final decision on what hardware to buy, and thus on the total amount required.)

How much do we need?

Decent, used 1U servers seem to start at about 250€, though 350€ would get us a lot more bang (mostly RAM and hard disk space). And in general, the more the merrier. (Cheaper offers exist, for example on ebay, but usually they are without hard disks, so the need for extra drives makes them more expensive in total).

With more money, even beefier hardware and/or spare parts and/or a maintenance contract and/or new hardware would be an option.

What do we need it for?

The main tasks for the server are:

  • Hosting websites like perl6.org and the synopses
  • Hosting infrastructure like the panda metadata server
  • Be available for smoke runs of the compilers, star distributions and module ecosystem.
  • Be available as a general development machine for people who don't have Linux available and/or don't have enough resources to comfortably build some Perl 6 compilers on their own machines.
  • A place for IRC sessions for community members
  • A backup location for community services like the IRC logs, the camelia IRC eval bot etc. Those resources are currently hosted elsewhere, though having another option for hosting would be very valuable.
  • A webspace for people who want to host Perl 6-related material.
  • It is explicitly not meant as a general hosting platform, nor as a mail server.

Configuration

If the hardware we get is beefy enough, I'd like to virtualize the server into two to three components. One for hosting the perl6.org and related websites that should be rather stable, and one for the rest of the system. If resources allow it, and depending on feedback I get, maybe a third virtual system for high-risk stuff like evalbot.

As operating system I'll install Debian Jessie (the current testing), simply because I'll end up maintaining the system, and it's the system I'm most familiar with.

Dave's Free Press: Journal: Graphing tool

Dave's Free Press: Journal: XML::Tiny released

Perlgeek.de : Pattern Matching and Unpacking

When talking about pattern matching in the context of Perl 6, people usually think of regexes or grammars. Those are indeed very powerful tools for pattern matching, but not the only ones.

Another powerful tool for pattern matching and for unpacking data structures uses signatures.

Signatures are "just" argument lists:

sub repeat(Str $s, Int $count) {
    #     ^^^^^^^^^^^^^^^^^^^^  the signature
    # $s and $count are the parameters
    return $s x $count
}

Nearly all modern programming languages have signatures, so you might say: nothing special, move along. But there are two features that make them more useful than signatures in other languages.

The first is multi dispatch, which allows you to write several routines with the same name, but with different signatures. While extremely powerful and helpful, I don't want to dwell on it here. Look at Chapter 6 of the "Using Perl 6" book for more details.
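
As a tiny, hypothetical illustration of the idea (the routine names are made up), two candidates with the same name are picked apart purely by their signatures:

multi greet(Str $name)  { say "Hello, $name!" }
multi greet(Int $count) { say "Hello to all $count of you!" }
greet 'Ada';    # Hello, Ada!
greet 3;        # Hello to all 3 of you!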

The second feature is sub-signatures. It allows you to write a signature for a single parameter.

Which sounds pretty boring at first, but it allows you, for example, to do declarative validation of data structures. Perl 6 has no built-in type for an array where each slot must be of a specific but different type. But you can still check for that in a sub-signature:

sub f(@array [Int, Str]) {
    say @array.join: ', ';
}
f [42, 'str'];      # 42, str
f [42, 23];         # Nominal type check failed for parameter '';
                    # expected Str but got Int instead in sub-signature
                    # of parameter @array

Here we have a parameter called @array, and it is followed by square brackets, which introduce a sub-signature for an array. When the function is called, the array is checked against the signature (Int, Str), so if the array doesn't contain exactly one Int followed by one Str, a type error is thrown.

The same mechanism can be used not only for validation, but also for unpacking, which means extracting some parts of the data structure. This simply works by using variables in the inner signature:

sub head(*@ [$head, *@]) {
    $head;
}
sub tail(*@ [$, *@tail]) {
    @tail;
}
say head <a b c >;      # a
say tail <a b c >;      # b c

Here the outer parameter is anonymous (the @), though it's entirely possible to use variables for both the inner and the outer parameter.

The anonymous parameter can even be omitted, and you can write sub tail( [$, *@tail] ) directly.
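
Spelled out, that shorter form looks like this and behaves the same as the tail above:

sub tail( [$, *@tail] ) {
    @tail;
}
say tail <a b c>;   # b c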

Sub-signatures are not limited to arrays. For working on arbitrary objects, you surround them with parentheses instead of brackets, and use named parameters inside:

multi key-type ($ (Numeric :$key, *%)) { "Number" }
multi key-type ($ (Str     :$key, *%)) { "String" }
for (42 => 'a', 'b' => 42) -> $pair {
    say key-type $pair;
}
# Output:
# Number
# String

This works because the => constructs a Pair, which has a key and a value attribute. The named parameter :$key in the sub-signature extracts the attribute key.
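
As a small, hypothetical example of pulling out both attributes at once (the *% mirrors the style above and simply ignores anything else the object might expose):

sub describe ($ (:$key, :$value, *%)) {
    say "$key => $value";
}
my $pair = 42 => 'answer';
describe $pair;     # 42 => answer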

You can build quite impressive things with this feature, for example red-black tree balancing based on multi dispatch and signature unpacking. (More verbose explanation of the code.) Most use cases aren't this impressive, but still it is very useful to have occasionally. Like for this small evaluator.

Dave's Free Press: Journal: YAPC::Europe 2007 report: day 2

Perlgeek.de : YAPC Europe 2013 Day 3

The second day of YAPC Europe climaxed in the river boat cruise, Kiev's version of the traditional conference dinner. It was a largish boat traveling on the Dnipro river, with food, drinks and lots of Perl folks. Not having fixed tables, and having to get up to fetch food and drinks, led to a lot of circulation, and thus to meeting many more people than at traditional dinners. I loved it.

Day 3 started with a video message from next year's YAPC Europe organizers, advertising the upcoming conference and talking a bit about the opportunities that Sofia offers. Tempting :-).

Monitoring with Perl and Unix::Statgrab was more about the metrics that are available for monitoring, and less about doing stuff with Perl. I was a bit disappointed.

The "Future Perl Versioning" Discussion was a very civilized discussion, with solid arguments. Whether anybody changed their minds remain to be seen.

Carl Mäsak gave two great talks: one on reactive programming, and one on regular expressions. I learned quite a bit in the first one, and simply enjoyed the second one.

After the lunch (tasty again), I attended Jonathan Worthington's third talk, MoarVM: a metamodel-focused runtime for NQP and Rakudo. Again this was a great talk, based on great work done by Jonathan and others during the last 12 months or so. MoarVM is a virtual machine designed for Perl 6's needs, as we understand them now (as opposed to parrot, which was designed towards Perl 6 as it was understood around 2003 or so, which is considerably different).

How to speak manager was both amusing and offered a nice perspective on interactions between managers and programmers. Some of this advice assumed a non-tech-savvy manager, and thus didn't quite apply to my current work situation, but it was still interesting.

I must confess I don't remember too much of the rest of the talks that evening. I blame five days of traveling, hackathon and conference for taking their toll on me.

The third session of lightning talks was again an interesting mix, containing interesting technical tidbits, the usual "we are hiring" slogans, some touching and thoughtful moments, and finally a song by Piers Cawley. He had written the lyrics in the previous 18 hours (including sleep), to (afaict) a traditional Irish song. Standing up in front of ~300 people and singing a song that you haven't really had time to practise takes a huge amount of courage, and I admire Piers both for his courage and his great performance. I hope it was recorded, and makes its way to the public soon.

Finally the organizers spoke some closing words, and received their well-deserved share of applause.

As you might have guessed from this and the previous blog posts, I enjoyed this year's YAPC Europe very much, and found it well worth attending, and well organized. I'd like to give my heart-felt thanks to everybody who helped to make it happen, and to my employer for sending me there.

This being only my second YAPC, I can't make any far-reaching comparisons, but compared to YAPC::EU 2010 in Pisa I had an easier time making acquaintances. I cannot tell what the big difference was, but the buffet-style dinners at the pre-conference meeting and the river boat cruise certainly helped to increase the circulation and thus the number of people I talked to.

Dave's Free Press: Journal: YAPC::Europe 2007 travel plans

Perlgeek.de : A small regex optimization for NQP and Rakudo

Recently I read the course material of the Rakudo and NQP Internals Workshop, and had an idea for a small optimization for the regex engine. Yesterday night I implemented it, and I'd like to walk you through the process.

As a bit of background, the regex engine that Rakudo uses is actually implemented in NQP, and used by NQP too. The code I am about to discuss all lives in the NQP repository, but Rakudo profits from it too.

In addition one should note that the regex engine is mostly used for parsing grammar, a process which involves nearly no scanning. Scanning is the process where the regex engine first tries to match the regex at the start of the string, and if it fails there, moves to the second character in the string, tries again etc. until it succeeds.

But regexes that users write often involve scanning, and so my idea was to speed up regexes that scan, and where the first thing in the regex is a literal. In this case it makes sense to find possible start positions with a fast string search algorithm, for example the Boyer-Moore algorithm. The virtual machine backends for NQP already implement that as the index opcode, which can be invoked as start = index haystack, needle, startpos, where the string haystack is searched for the substring needle, starting from position startpos.
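
For reference, the same kind of search is available from ordinary Perl 6 code through the built-in index routine (the strings below are just an example):

my $haystack = 'the quick brown fox';
say index($haystack, 'brown');    # 10
# the low-level index opcode used by the backends returns -1 if the needle is not found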

From reading the course material I knew I had to search for a regex type called scan, so that's what I did:

$ git grep --word scan
3rdparty/libtommath/bn_error.c:   /* scan the lookup table for the given message
3rdparty/libtommath/bn_mp_cnt_lsb.c:   /* scan lower digits until non-zero */
3rdparty/libtommath/bn_mp_cnt_lsb.c:   /* now scan this digit until a 1 is found
3rdparty/libtommath/bn_mp_prime_next_prime.c:                   /* scan upwards 
3rdparty/libtommath/changes.txt:       -- Started the Depends framework, wrote d
src/QRegex/P5Regex/Actions.nqp:                     QAST::Regex.new( :rxtype<sca
src/QRegex/P6Regex/Actions.nqp:                     QAST::Regex.new( :rxtype<sca
src/vm/jvm/QAST/Compiler.nqp:    method scan($node) {
src/vm/moar/QAST/QASTRegexCompilerMAST.nqp:    method scan($node) {
Binary file src/vm/moar/stage0/NQPP6QRegexMoar.moarvm matches
Binary file src/vm/moar/stage0/QASTMoar.moarvm matches
src/vm/parrot/QAST/Compiler.nqp:    method scan($node) {
src/vm/parrot/stage0/P6QRegex-s0.pir:    $P5025 = $P5024."new"("scan" :named("rx
src/vm/parrot/stage0/QAST-s0.pir:.sub "scan" :subid("cuid_135_1381944260.6802") 
src/vm/parrot/stage0/QAST-s0.pir:    push $P5004, "scan"

The binary files and .pir files are generated code included just for bootstrapping, and not interesting for us. The files in 3rdparty/libtommath are there for bigint handling, thus not interesting for us either. The rest are good matches: src/QRegex/P6Regex/Actions.nqp is responsible for compiling Perl 6 regexes to an abstract syntax tree (AST), and src/vm/parrot/QAST/Compiler.nqp compiles that AST down to PIR, the assembly language that the Parrot Virtual Machine understands.

So, looking at src/QRegex/P6Regex/Actions.nqp the place that mentions scan looked like this:

    $block<orig_qast> := $qast;
    $qast := QAST::Regex.new( :rxtype<concat>,
                 QAST::Regex.new( :rxtype<scan> ),
                 $qast,
                 ($anon
                      ?? QAST::Regex.new( :rxtype<pass> )
                      !! (nqp::substr(%*RX<name>, 0, 12) ne '!!LATENAME!!'
                            ?? QAST::Regex.new( :rxtype<pass>, :name(%*RX<name>) )
                            !! QAST::Regex.new( :rxtype<pass>,
                                   QAST::Var.new(
                                       :name(nqp::substr(%*RX<name>, 12)),
                                       :scope('lexical')
                                   ) 
                               )
                          )));

So to make the regex scan, the AST (in $qast) is wrapped in QAST::Regex.new(:rxtype<concat>,QAST::Regex.new( :rxtype<scan> ), $qast, ...), plus some stuff I don't care about.

To make the optimization work, the scan node needs to know what to scan for, if the first thing in the regex is indeed a constant string, aka literal. If it is, $qast is either directly of rxtype literal, or a concat node where the first child is a literal. As a patch, it looks like this:

--- a/src/QRegex/P6Regex/Actions.nqp
+++ b/src/QRegex/P6Regex/Actions.nqp
@@ -667,9 +667,21 @@ class QRegex::P6Regex::Actions is HLL::Actions {
     self.store_regex_nfa($code_obj, $block, QRegex::NFA.new.addnode($qast))
     self.alt_nfas($code_obj, $block, $qast);
 
+    my $scan := QAST::Regex.new( :rxtype<scan> );
+    {
+        my $q := $qast;
+        if $q.rxtype eq 'concat' && $q[0] {
+            $q := $q[0]
+        }
+        if $q.rxtype eq 'literal' {
+            nqp::push($scan, $q[0]);
+            $scan.subtype($q.subtype);
+        }
+    }
+
     $block<orig_qast> := $qast;
     $qast := QAST::Regex.new( :rxtype<concat>,
-                 QAST::Regex.new( :rxtype<scan> ),
+                 $scan,
                  $qast,

Since scan nodes have always been empty so far, the code generators don't look at their child nodes, and adding one with nqp::push($scan, $q[0]); won't break anything on backends that don't support this optimization yet (which after just this patch were all of them). Running make test confirmed that.

My original patch did not contain the line $scan.subtype($q.subtype);, and later on some unit tests started to fail, because regex matches can be case insensitive, but the index op only works case-sensitively. For case insensitive matches, the $q.subtype of the literal regex node would be ignorecase, so that information needs to be carried on to the code generation backend.

Once that part was in place, and some debug nqp::say() statements confirmed that it indeed worked, it was time to look at the code generation. For the parrot backend, it looked like this:

    method scan($node) {
        my $ops := self.post_new('Ops', :result(%*REG<cur>));
        my $prefix := self.unique('rxscan');
        my $looplabel := self.post_new('Label', :name($prefix ~ '_loop'));
        my $scanlabel := self.post_new('Label', :name($prefix ~ '_scan'));
        my $donelabel := self.post_new('Label', :name($prefix ~ '_done'));
        $ops.push_pirop('repr_get_attr_int', '$I11', 'self', %*REG<curclass>, '"$!from"');
        $ops.push_pirop('ne', '$I11', -1, $donelabel);
        $ops.push_pirop('goto', $scanlabel);
        $ops.push($looplabel);
        $ops.push_pirop('inc', %*REG<pos>);
        $ops.push_pirop('gt', %*REG<pos>, %*REG<eos>, %*REG<fail>);
        $ops.push_pirop('repr_bind_attr_int', %*REG<cur>, %*REG<curclass>, '"$!from"', %*REG<pos>);
        $ops.push($scanlabel);
        self.regex_mark($ops, $looplabel, %*REG<pos>, 0);
        $ops.push($donelabel);
        $ops;
    }

While a bit intimidating at first, staring at it for a while quickly made clear what kind of code it emits. First, three labels are generated, to which the code can jump with goto $label: one as a jump target for the loop that increments the cursor position ($looplabel), one for doing the regex match at that position ($scanlabel), and $donelabel for jumping to when the whole thing has finished.

Inside the loop there is an increment (inc) of the register that holds the current position (%*REG<pos>); that position is compared to the end-of-string position (%*REG<eos>), and if it is larger, the cursor is marked as failed.

So the idea is to advance the position by one, and then instead of doing the regex match immediately, call the index op to find the next position where the regex might succeed:

--- a/src/vm/parrot/QAST/Compiler.nqp
+++ b/src/vm/parrot/QAST/Compiler.nqp
@@ -1564,7 +1564,13 @@ class QAST::Compiler is HLL::Compiler {
         $ops.push_pirop('goto', $scanlabel);
         $ops.push($looplabel);
         $ops.push_pirop('inc', %*REG<pos>);
-        $ops.push_pirop('gt', %*REG<pos>, %*REG<eos>, %*REG<fail>);
+        if nqp::elems($node.list) && $node.subtype ne 'ignorecase' {
+            $ops.push_pirop('index', %*REG<pos>, %*REG<tgt>, self.rxescape($node[0]), %*REG<pos>);
+            $ops.push_pirop('eq', %*REG<pos>, -1, %*REG<fail>);
+        }
+        else {
+            $ops.push_pirop('gt', %*REG<pos>, %*REG<eos>, %*REG<fail>);
+        }
         $ops.push_pirop('repr_bind_attr_int', %*REG<cur>, %*REG<curclass>, '"$!from"', %*REG<pos>);
         $ops.push($scanlabel);
         self.regex_mark($ops, $looplabel, %*REG<pos>, 0);

The index op returns -1 on failure, so the conditions for a cursor fail are slightly different than before.

And as mentioned earlier, the optimization can only be safely done for matches that don't ignore case. Maybe with some additional effort that could be remedied, but it's not as simple as case-folding the target string, because some case folding operations can change the string length (for example ß becomes SS while uppercasing).
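
That length change is easy to see from Perl 6 itself:

say 'ß'.chars;      # 1
say 'ß'.uc;         # SS
say 'ß'.uc.chars;   # 2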

After successfully testing the patch, I came up with a small, artificial benchmark designed to show a difference in performance for this particular case. And indeed, it sped it up from 647 ± 28 µs to 161 ± 18 µs, which is roughly a factor of four.

You can see the whole thing as two commits on github.

What remains to do is implementing the same optimization on the JVM and MoarVM backends, and of course other optimizations. For example the Perl 5 regex engine keeps track of minimal and maximal string lengths for each subregex, and can anchor a regex like /a?b?longliteral/ to 0..2 characters before a match of longliteral, and generally use that meta information to fail faster.

But for now I am mostly encouraged that doing a worthwhile optimization was possible in a single evening without any black magic, or too intimate knowledge of the code generation.

Update: the code generation for MoarVM now also uses the index op. The logic is the same as for the parrot backend, the only difference is that the literal needs to be loaded into a register (whose name fresh_s returns) before index_s can use it.

Perlgeek.de : Quo Vadis Perl?

The last two days we had a gathering in town named Perl (yes, a place with that name exists). It's a lovely little town next to the borders to France and Luxembourg, and our meeting was titled "Perl Reunification Summit".

Sadly I only managed to arrive in Perl on Friday late in the night, so I missed the first day. Still it was totally worth it.

We tried to answer the question of how to make the Perl 5 and the Perl 6 community converge on a social level. While we haven't found the one true answer to that, we did find that discussing the future together, both on a technical and on a social level, already brought us closer together.

It was quite a touching moment when Merijn "Tux" Brand explained that he was skeptical of Perl 6 before the summit, and now sees it as the future.

We also concluded that copying API design is a good way to converge on a technical level. For example Perl 6's IO subsystem is in desperate need of a cohesive design. However, neither the Perl 6 specification nor the Rakudo development team has much experience in that area, and copying from successful Perl 5 modules is a viable approach here. Path::Class and IO::All (excluding the crazy parts) were mentioned as targets worth looking at.

There is now also an IRC channel to continue our discussions -- join #p6p5 on irc.perl.org if you are interested.

We also discussed ways to bring parallel programming to both perls. I missed most of the discussion, but did hear that one approach is to make it easier to send serialized objects to other processes, and thus distribute work among several cores.

Patrick Michaud gave a short ad-hoc presentation on implicit parallelism in Perl 6. There are several constructs where the language allows parallel execution, for example hyper operators, junctions and feeds (think of feeds as UNIX pipes, but ones that allow passing objects and not just strings). Rakudo doesn't implement any of them in parallel right now, because the Parrot Virtual Machine does not provide the necessary primitives yet.

Besides the "official" program, everybody used the time in meat space to discuss their favorite projects with everybody else. For example I took some time to discuss the future of doc.perl6.org with Patrick and Gabor Szabgab, and the relation to perl6maven with the latter. The Rakudo team (which was nearly completely present) also discussed several topics, and I was happy to talk about the relation between Rakudo and Parrot with Reini Urban.

Prior to the summit my expectations were quite vague. That's why it's hard for me to tell if we achieved what we and the organizers wanted. Time will tell, and we want to summarize the result in six to nine months. But I am certain that many participants have changed some of their views in positive ways, and left the summit with a warm, fuzzy feeling.

I am very grateful to have been invited to such a meeting, and enjoyed it greatly. Our host and organizers, Liz and Wendy, took care of all of our needs -- travel, food, drinks, space, wifi, accommodation, more food, entertainment, food for thought, you name it. Thank you very much!

Update: Follow the #p6p5 hash tag on twitter if you want to read more, I'm sure other participants will blog too.

Other blogs posts on this topic: PRS2012 – Perl5-Perl6 Reunification Summit by mdk and post-yapc by theorbtwo

Dave's Free Press: Journal: Wikipedia handheld proxy

Dave's Free Press: Journal: Bryar security hole

Dave's Free Press: Journal: Thankyou, Anonymous Benefactor!

Dave's Free Press: Journal: Number::Phone release

Dave's Free Press: Journal: Ill

Dave's Free Press: Journal: CPANdeps upgrade

Perlgeek.de : iPod nano 5g on linux -- works!

For Christmas I got an iPod nano (5th generation). Since I use only Linux on my home computers, I searched the Internet for how well it is supported by Linux-based tools. The results looked bleak, but they were mostly from 2009.

Now (December 2012) on my Debian/Wheezy system, it just worked.

The iPod nano 5g presents itself as an ordinary USB storage device, which you can mount without problems. However, simply copying files onto it won't make the iPod show those files in the playlists, because there is some meta data stored on the device that must be updated too.

There are several user-space programs that allow you to import and export music from and to the iPod, and update those meta data files as necessary. The first one I tried, gtkpod 2.1.2, worked fine.

Other user-space programs reputed to work with the iPod are rhythmbox and amarok (both of which not only organize but also play music).

Although I don't think anything really depends on some particular versions here (except that you need a new enough version of gtkpod), here is what I used:

  • Architecture: amd64
  • Linux: 3.2.0-4-amd64 #1 SMP Debian 3.2.35-2
  • Userland: Debian GNU/Linux "Wheezy" (currently "testing")
  • gtkpod: 2.1.2-1

Dave's Free Press: Journal: CPANdeps

Dave's Free Press: Journal: Module pre-requisites analyser

Dave's Free Press: Journal: Perl isn't dieing

Dave's Free Press: Journal: YAPC::Europe 2007 report: day 3

Perlgeek.de : The Fun of Running a Public Web Service, and Session Storage

One of my websites, Sudokugarden, recently surged in traffic, from about 30k visitors per month to more than 100k visitors per month. Here's the tale of what that meant for the server side.

As a bit of background, I built the website in 2007, when I knew a lot less about the web and programming. It runs on a host that I share with a few friends; I don't have root access on that machine, though when the admin is available, I can generally ask him to install stuff for me.

Most parts of the websites are built as static HTML files, with Server Side Includes. Parts of those SSIs are Perl CGI scripts. The most popular part though, which allows you to solve Sudoku in the browser and keeps hiscores, is written as a collection of Perl scripts, backed by a mysql database.

When at peak times the site had more than 10k visitors a day, lots of visitors would get a nasty mysql: Cannot connect: Too many open connections error. The admin wasn't available for bumping the connection limit, so I looked for other solutions.

My first action was to check the logs for spammers and crawlers that might have hammered the page, and I found and banned some; but the bulk of the traffic looked completely legitimate, and the problem persisted.

Looking at the seven-year-old code, I realized that most pages didn't actually need a database connection, if only I could remove the session storage from the database. And, in fact, I could. I used CGI::Session, which has a pluggable backend. Switching to a file-based session backend was just a matter of changing the connection string and adding a directory for session storage. Luckily the code was clean enough that this only affected a single subroutine. Everything was fine.

For a while.

Then, about a month later, the host ran out of free disk space. Since it is used for other stuff too (like email, and web hosting for other users) it took me a while to make the connection to the file-based session storage. What happened was 3 million session files on an ext3 file system with a block size of 4 kilobytes. A session is only about 400 bytes, but since a file uses up a multiple of the block size, the session storage amounted to 12 gigabytes of used-up disk space, which was all that was left on that machine.

Deleting those sessions turned out to be a problem; I could only log in as my own user, which doesn't have write access to the session files (which are owned by www-data, the Apache user). The solution was to upload a CGI script that deleted the session, but of course that wasn't possible at first, because the disk was full. In the end I had to delete several gigabyte of data from my home directory before I could upload anything again. (Processes running as root were still writing to reserved-to-root portions of the file system, which is why I had to delete so much data before I was able to write again).

Even when I was able to upload the deletion script, it took quite some time to actually delete the session files; mostly because the directory was too large, and deleting files on ext3 is slow. When the files were gone, the empty session directory still used up 200MB of disk space, because the directory index doesn't shrink on file deletion.

Clearly a better solution to session storage was needed. But first I investigated where all those sessions came from, and banned a few spamming IPs. I also changed the code to only create sessions when somebody logs in, rather than giving every visitor a session from the start.

My next attempt was to write the sessions to an SQLite database. It uses about 400 bytes per session (plus a fixed overhead for the db file itself), so it uses only a tenth of the storage space that the file-based storage used. The SQLite database has no connection limit, though the old-ish version that was installed on the server doesn't seem to have very fine-grained locking either; within a few days I saw errors that the session database was locked.

So I added another layer of workaround: creating a separate session database per leading IP octet. So now there are up to 255 separate session databases (plus a 256th for all IPv6 addresses; a decision that will have to be revised when IPv6 usage rises). After a few days of operation, it seems that this setup works well enough. But suspicious as I am, I'll continue monitoring both disk usage and errors from Apache.

So, what happens if this solution fails to work out? I can see basically two approaches: move the site to a server that's fully under my control, and use redis or memcached for session storage; or implement sessions with signed cookies that are stored purely on the client side.

Perlgeek.de : YAPC Europe 2013 Day 2

The second day of YAPC Europe was enjoyable and informative.

I learned about ZeroMQ, which is a bit like sockets on steroids. Interesting stuff. Sadly Design decisions on p2 didn't quite qualify as interesting.

Matt's PSGI archive is a project to rewrite Matt's infamous script archive in modern Perl. Very promising, and a bit entertaining too.

Lunch was very tasty, more so than the usual mass catering. Kudos to the organizers!

After lunch, jnthn talked about concurrency, parallelism and asynchrony in Perl 6. It was a great talk, backed by great work on the compiler and runtime. Jonathan's talks are always to be recommended.

I think I didn't screw up my own talk too badly, at least the timing worked fine. I just forgot to show the last slide. No real harm done.

I also enjoyed mst's State of the Velociraptor, which was a summary of what went on in the Perl world in the last year. (Much better than the YAPC::EU 2010 talk with the same title).

The Lightning talks were as enjoyable as those from the previous day. So all fine!

Next up is the river cruise, I hope to blog about that later on.

Perlgeek.de : Stop The Rewrites!

What follows is a rant. If you're not in the mood to read a rant right now, please stop and come back in an hour or two.

The Internet is full of people who know better than you how to manage your open source project, even if they only know some bits and pieces about it. News at 11.

But there is one particular instance of that advice that I hear often applied to Rakudo Perl 6: Stop the rewrites.

To be honest, I can fully understand the sentiment behind that advice. People see that it has taken us several years to get where we are now, and in their opinion, that's too long. And now we shouldn't waste our time with rewrites, but get the darn thing running already!

But software development simply doesn't work that way. Especially not if your target is moving, as is Perl 6. (Ok, Perl 6 isn't moving that much anymore, but there are still areas we don't understand very well, so our current understanding of Perl 6 is a moving target).

At some point or another, you realize that with your current design, you can only pile workaround on top of workaround, and hope that the whole thing never collapses.

[Picture of a Jenga tower; image courtesy of sermoa]

Those people who spread the good advice to never do any major rewrites again, they never address what you should do when you face such a situation. Build the tower of workarounds even higher, and pray to Cthulhu that you can build it robust enough to support a whole stack of third-party modules?

Curiously this piece of advice occasionally comes from people who otherwise know a thing or two about software development methodology.

I should also add that since the famous "nom" switchover, which admittedly caused lots of fallout, we have had three major rewrites of subsystems (longest-token matching of alternations, bounded serialization and qbootstrap), all three of which caused no new test failures, and two of which caused no fallout in the module ecosystem at all. In return, we have much faster startup (factor 3 to 4 faster) and a much more correct regex engine.

Perlgeek.de : The REPL trick

A recent discussion on IRC prompted me to share a small but neat trick with you.

If there are things you want to do quite often in the Rakudo REPL (the interactive "Read-Evaluate-Print Loop"), it makes sense to create a shortcut for them. And creating shortcuts for often-used stuff is what programming languages excel at, so you do it right in a Perl module:

use v6;
module REPLHelper;

sub p(Mu \x) is export {
    x.^mro.map: *.^name;
}

I have placed mine in $HOME/.perl6/repl.

And then you make sure it's loaded automatically:

$ alias p6repl="perl6 -I$HOME/.perl6/repl/ -MREPLHelper"
$ p6repl
> p Int
Int Cool Any Mu
>

Now you have a neat one-letter function which tells you the parents of an object or a type, in method resolution order. And a way to add more shortcuts when you need them.
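
For example, a second (hypothetical) helper along the same lines could list the methods a type declares itself:

sub meth(Mu \x) is export {
    x.^methods(:local).map: *.name;
}

After restarting the REPL, entering meth Int would then display the names of the methods declared in Int itself.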

Dave's Free Press: Journal: Travelling in time: the CP2000AN

Perlgeek.de : New Perl 6 community server now live, accepting signups

The new Perl 6 community server is now alive and kicking.

As planned, I've set up KVM virtualization, and so far there are two guest systems. hack.p6c.org is meant for general Perl 6 development activity (which also includes irssi/weechat sessions), and is equipped with 20GB RAM to handle multiple concurrent rakudo-jvm compilations :-). It runs a pretty bare-bones Debian Jessie.

www.p6c.org is the web server where I plan to host perl6.org and related (sub-)domains. It's not as beefy as hack, but sufficiently large to compile and run Rakudo, in preparation for future Perl 6-based web hosting. Currently I'm running a copy of several perl6.org subdomains on it (with the domain name p6c instead of perl6 for test purposes); the plan is to switch the perl6.org DNS over once all of the websites have been copied/migrated.

If you have a Perl 6 related use for a shell account or for serving websites, please request an account by email (moritz@faui2k3.org) or IRC (moritz on freenode and magnet), including:

  1. Your desired username
  2. What you want to do on the machine(s) (not necessary for #perl6 regulars)
  3. Which of the machine(s) you need access to
  4. Optionally an openssh public key
  5. Whether you'd be willing to help a bit with sysadmin tasks (mostly apt-get update && apt-get dist-upgrade, restarting hung services, killing huge processes)
  6. Software you need installed (it's OK to not know this up-front)

Note that feather.perl6.nl will shut down soon (no fixed date yet, but "end of 2014" is expected), so if you rely on feather now, you should consider migrating to the new server.

The code of conduct is pretty simple:

  1. Be reasonable in your resource usage.
  2. Use technical means to limit your resource usage so that it doesn't accidentally explode (ulimit comes to mind).
  3. Limit yourself to legal and Perl 6-related use cases (no warez).
  4. Help your fellow hackers.

The standard disclaimer applies:

  • Expect no privacy. There will potentially be many root users, who could all read your files and memory.
  • There are no promises of continued service or even support. Your account can be terminated without notice.
  • Place of jurisdiction in Nürnberg, Germany. You have to comply with German law while using the server. (Note that this puts pretty high standards on privacy for any user data you collect, including from web applications). It's your duty to inform yourself about the applicable laws. Illegal activities will be reported to the authorities.

With all that said, happy hacking!

Dave's Free Press: Journal: YAPC::Europe 2007 report: day 1

Dave's Free Press: Journal: Thanks, Yahoo!

Ocean of Awareness: Parsing: Top-down versus bottom-up

Comparisons between top-down and bottom-up parsing are often either too high-level or too low-level. Overly high-level treatments reduce the two approaches to buzzwords, and the comparison to a recitation of received wisdom. Overly low-level treatments get immersed in the minutiae of implementation, and the resulting comparison is as revealing as placing two abstractly related code listings side by side. In this post I hope to find the middle level; to shed light on why advocates of bottom-up and top-down parsing approaches take the positions they do; and to speculate about the way forward.

Top-down parsing

The basic idea of top-down parsing is as brutally simple as anything in programming: Starting at the top, we add pieces. We do this by looking at the next token and deciding then and there where it fits into the parse tree. Once we've looked at every token, we have our parse tree.

In its purest form, this idea is too simple for practical parsing, so top-down parsing is almost always combined with lookahead. Lookahead of one token helps a lot. Longer lookaheads are very sparsely used. They just aren't that helpful, and since the number of possible lookaheads grows exponentially, they get very expensive very fast.

Top-down parsing has an issue with left recursion. It's straightforward to see why. Take an open-ended expression like

    a + b + c + d + e + f + [....]

Here the plus signs continue off to the right, and adding any of them to the parse tree requires a dedicated node which must be above the node for the first plus sign. We cannot put that first plus sign into a top-down parse tree without having first dealt with all those plus signs that follow it. For a top-down strategy, this is a big, big problem.

Even in the simplest expression, there is no way of counting the plus signs without looking to the right, quite possibly a very long way to the right. When we are not dealing with simple expressions, this rightward-looking needs to get sophisticated. There are ways of dealing with this difficulty, but all of them share one thing in common -- they are trying to make top-down parsing into something that it is not.

Advantages of top-down parsing

Top-down parsing does not look at the right context in any systematic way, and in the 1970's it was hard to believe that top-down was as good as we can do. (It's not all that easy to believe today.) But its extreme simplicity is also top-down parsing's great strength. Because a top-down parser is extremely simple, it is very easy to figure out what it is doing. And easy to figure out means easy to customize.

Take another of the many constructs incomprehensible to a top-down parser:

    2 * 3 * 4 + 5 * 6
    

How do top-down parsers typically handle this? Simple: as soon as they realize they are faced with an expression, they give up on top-down parsing and switch to a special-purpose algorithm.

These two properties -- easy to understand and easy to customize -- have catapulted top-down parsing to the top of the heap. Behind their different presentations, combinator parsing, PEG, and recursive descent are all top-down parsers.

Bottom-up parsing

Few theoreticians of the 1970's imagined that top-down parsing might be the end of the parsing story. Looking to the right in ad hoc ways clearly does help. It would be almost paradoxical if there was no systematic way to exploit the right context.

In 1965, Don Knuth found an algorithm to exploit right context. Knuth's LR algorithm was, like top-down parsing as I have described it, deterministic. Determinism was thought to be essential -- allowing more than one choice easily leads to a combinatorial explosion in the number of possibilities that have to be considered at once. When parsers are restricted to dealing with a single choice, it is much easier to guarantee that they will run in linear time.

Knuth's algorithm did not try to hang each token from a branch of a top-down parse tree as soon as it was encountered. Instead, Knuth suggested delaying that decision. Knuth's algorithm collected "subparses".

When I say "subparses" in this discussion, I mean pieces of the parse that contain all the decisions necessary to construct the part of the parse tree that is below them. But subparses do not contain any decisions about what is above them in the parse tree. Put another way, subparses know who they are, but not where they belong.

Subparses may not know where they belong, but knowing who they are is enough for them to be assembled into larger subparses. And, if we keep assembling the subparses, eventually we will have a "subparse" that is the full parse tree. And at that point we will know both who everyone is and where everyone belongs.

Knuth's algorithm stored subparses by shifting them onto a stack. The operation to do this was called a "shift". (Single tokens of the input are treated as subparses with a single node.) When there was enough context to build a larger subparse, the algorithm popped one or more subparses off the stack, assembled a larger subparse, and put the resulting subparse back on the stack. This operation was called a "reduce", based on the idea that its repeated application eventually "reduces" the parse tree to its root node.

In handling the stack, we will often be faced with choices. One kind of choice is between using what we already have on top of the stack to assemble a larger subparse; or pushing more subparses on top of the stack instead ("shift/reduce"). When we decide to reduce, we may encounter the other kind of choice -- we have to decide which rule to use ("reduce/reduce").

Like top-down parsing, bottom-up parsing is usually combined with lookahead. For the same lookahead, a bottom-up parser parses everything that a top-down parser can handle, and more.

Formally, Knuth's approach is now called shift/reduce parsing. I want to demonstrate why theoreticians, and for a long time almost everybody else as well, were so taken with this method. I'll describe how it works on some examples, including two very important ones that stump top-down parsers: arithmetic expressions and left recursion. My purpose here is to bring to light the basic concepts, and not to guide an implementor. There are excellent implementation-oriented presentations in many other places. The Wikipedia article, for example, is excellent.

Bottom-up parsing solved the problem of left recursion. In the example from above,

    a + b + c + d + e + f + [....]

we simply build one subparse after another, as rapidly as we can. In the terminology of shift/reduce, whenever we can reduce, we do. Eventually we will have run out of tokens, and will have reduced until there is only one element on the stack. That one remaining element is the subparse that is also, in fact, our full parse tree.

The top-down parser had a problem with left recursion precisely because it needed to build top-down. To build top-down, it needed to know about all the plus signs to come, because these needed to be fitted into the parse tree above the current plus sign. But when building bottom-up, we don't need to know anything about the plus signs that will be above the current one in the parse tree. We can afford to wait until we encounter them.

But if working bottom-up solves the left recursion problem, doesn't it create a right recursion problem? In fact, for a bottom-up parser, right recursion is harder, but not much. That's because of the stack. For a right recursion like this:

    a = b = c = d = e = f = [....]

we use a strategy opposite to the one we used for the left recursion. For left recursion, we reduced whenever we could. For right recursion, when we have a choice, we always shift. This means we will immediately shift the entire input onto the stack. Once the entire input is on the stack, we have no choice but to start reducing. Eventually we will reduce the stack to a single element. At that point, we are done. Essentially, what we are doing is exactly what we did for left recursion, except that we use the stack to reverse the order.

Arithmetic expressions like

    2 * 3 * 4 + 5 * 6

require a mixed strategy. Whenever we have a shift/reduce choice, and one of the operators is on the stack, we check to see if the topmost operator is a multiply or an addition operator. If it is a multiply operator, we reduce. In all other cases, if there is a shift/reduce choice, we shift.

In the discussion above, I have pulled the strategy for making stack decisions (shift/reduce and reduce/reduce) out of thin air. Clearly, if bottom-up parsing was going to be a practical parsing algorithm, the stack decisions would have to be made algorithmically. In fact, discovering a practical way to do this was a far from trivial task. The solution in Knuth's paper was considered (and apparently intended) to be mathematically provocative, rather than practical. But by 1979, it was thought a practical way to make stack decisions had been found and yacc, a parser generator based on bottom-up parsing, was released. (Readers today may be more familiar with yacc's successor, bison.)

The fate of bottom-up parsing

With yacc, it looked as if the limitations of top-down parsing were past us. We now had a parsing algorithm that could readily and directly parse left and right recursions, as well as arithmetic expressions. Theoreticians thought they'd found the Holy Grail.

But not all of the medieval romances had happy endings. And as I've described elsewhere, this story ended badly. Bottom-up parsing was driven by tables which made the algorithm fast for correct inputs, but unable to accurately diagnose faulty ones. The subset of grammars parsed was still not quite large enough, even for conservative language designers. And bottom-up parsing was very unfriendly to custom hacks, which made every shortcoming loom large. It is much harder to work around a problem in a bottom-up parser than it is to deal with a similar shortcoming in a top-down parser. After decades of experience with bottom-up parsing, top-down parsing has re-emerged as the algorithm of choice.

Non-determinism

For many, the return to top-down parsing answers the question that we posed earlier: "Is there any systematic way to exploit right context when parsing?" So far, the answer seems to be a rather startling "No". Can this really be the end of the story?

It would be very strange if the best basic parsing algorithm we know is top-down. Above, I described at some length some very important grammars that can be parsed bottom-up but not top-down, at least not directly. Progress like this seems like a lot to walk away from, and especially to walk back all the way to what is essentially a brute force algorithm. This perhaps explains why lectures and textbooks persist in teaching bottom-up parsing to students who are very unlikely to use it. Because the verdict from practitioners seems to be in, and likely to hold up on appeal.

Fans of deterministic top-down parsing, and proponents of deterministic bottom-up parsing share an assumption: For a practical algorithm to be linear, it has to be deterministic. But is this actually the case?

It's not, in fact. To keep bottom-up parsing deterministic, we restricted ourselves to a stack. But what if we track all possible subpieces of parses? For efficiency, we can link them and put them into tables, making the final decisions in a second pass, once the tables are complete. (The second pass replaces the stack-driven see-sawing back and forth of the deterministic bottom-up algorithm, so it's not an inefficiency.) Jay Earley in 1968 came up with an algorithm to do this, and in 1991 Joop Leo added a memoization to Earley's algorithm which made it linear for all deterministic grammars.

The "deterministic grammars" are exactly the bottom-up parseable grammars with lookahead -- the set of grammars parsed by Knuth's algorithm. So that means the Earley/Leo algorithm parses, in linear time, everything that a deterministic bottom-up parser can parse, and therefore every grammar that a deterministic top-down parser can parse. (In fact, the Earley/Leo algorithm is linear for a lot of ambiguous grammars as well.)

Top-down parsing had the advantage that it was easy to know where you are. The Earley/Leo algorithm has an equivalent advantage -- its tables know where it is, and it is easy to query them programmatically. In 2010, this blogger modified the Earley/Leo algorithm to have the other big advantage of top-down parsing: The Marpa algorithm rearranges the Earley/Leo parse engine so that we can stop it, perform our own logic, and restart where we left off. A quite useable parser based on the Marpa algorithm is available as open source.

Comments

Comments on this post can be made in Marpa's Google group, or on our IRC channel: #marpa at freenode.net. To learn more about Marpa, there's the official web site maintained by Ron Savage. I also have a Marpa web site.

Dave's Free Press: Journal: POD includes

Dave's Free Press: Journal: cgit syntax highlighting

Perlgeek.de : First day at YAPC::Europe 2013 in Kiev

Today was the first "real" day of YAPC Europe 2013 in Kiev. In the same sense that it was the first real day, we had quite a nice "unreal" conference day yesterday, with a day-long Perl 6 hackathon, and in the evening a pre-conference meeting at a Soviet-style restaurant with tasty food and beverages.

The talks started with a few words of welcome, and then the announcement that the YAPC Europe next year will be in Sofia, Bulgaria, with the small side note that there were actually three cities competing for that honour. Congratulations to Sofia!

Larry's traditional keynote was quite emotional, and he had to fight tears a few times. Having had cancer and related surgeries in the past year, he still does his perceived duty to the Perl community, which I greatly appreciate.

Afterwards Dave Cross talked about 25 years of Perl in 25 minutes, which was a nice walk through some significant developments in the Perl world, though a bit hasty. Maybe picking fewer events and spending a bit more time on the selected few would give a smoother experience.

Another excellent talk that ran out of time was on Redis. Having experimented a wee bit with Redis in the past month, this was a real eye-opener on the wealth of features we might have used for a project at work, but in the end we didn't. Maybe we will eventually revise that decision.

Ribasushi talked about how hard benchmarking really is, and while I was (in principle) aware of the fact that it's hard to get right, there were still several significant factors that I had overlooked (like the CPU's tendency to scale frequency in response to thermal and power-management considerations). I also learned that I should use Dumbbench instead of the Benchmark.pm core module. Sadly it didn't install for me (Capture::Tiny tests failing on Mac OS X).

The "Perl 6 is dead, long live Perl 5" talk was much less inflammatory than the title would suggest (maybe due to Larry touching on the subject briefly during the keynote). It was mostly about how Perl 5 is used in the presenter's company, which was mildly interesting.

After a tasty free lunch I attended jnthn's talk on Rakudo on the JVM, which was (as is typical for jnthn's talks) both entertaining and instructive, even though I had already followed the project quite a bit.

Thomas Klausner's Bread::Board by example made me want to refactor the OTRS internals very badly, because it is full of the anti-patterns that Bread::Board can solve in a much better way. I think that the OTRS code base is big enough to warrant the usage of Bread::Board.

I enjoyed Denis' talk on Method::Signatures, and was delighted to see that most syntax is directly copied from Perl 6 signature syntax. Talk about Perl 6 sucking creativity out of Perl 5 development.

The conference ended with a session of lightning talks, something which I always enjoy. Many lightning talks had a slightly funny tone or undertone, while still talking about interesting stuff.

Finally there was the "kick-off party", with beverages and snacks sponsored by booking.com. There (and really the whole day, and yesterday too) I not only had conversations with my "old" Perl 6 friends, but also talked with many interesting people I had never met before, or had only met online.

So all in all it was a nice experience, both from the social side and from the quality and contents of the talks. The venue and the food are good, and so is the wifi, except when it stops working for a few minutes.

I'm looking forward to two more days of conference!

(Updated: Fixed Thomas' last name)

Ocean of Awareness: What makes a parsing algorithm successful?

What makes a parsing algorithm successful? Two factors, I think. First, does the algorithm parse a workably-defined set of grammars in linear time? Second, does it allow the application to intervene in the parse with custom code? When parsing algorithms are compared, typically neither of these gets much attention. But the successful algorithms do one or the other.

Does the algorithm parse a workably-defined set of grammars in linear time?

"Workably-defined" means more than well-defined in the mathematical sense -- the definition has to be workable. That is, the definition must be something that, with reasonable effort, a programmer can use in practice.

The algorithms in regular expression engines are workably-defined. A regular expression, in the pure sense, consists of a sequence of symbols, usually shown by concatenation:

a b c

or a choice among sequences, usually shown by a vertical bar:

a | b | c

or a repetition of any of the above, typically shown with a star:

a*

or any recursive combination of these. True, if this definition is new to you, it can take time to get used to. But vast numbers of working programmers are very much "used to it", can think in terms of regular expressions, and can determine whether a particular problem will yield to treatment as a regular expression or not.
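
A few one-liners, using Perl's pattern syntax on made-up strings and restricting ourselves to the pure regular-expression constructs, may make this concrete:

# concatenation: a, then b, then c
'abc'   =~ /abc/;
# choice among alternatives
'b'     =~ /a|b|c/;
# repetition, zero or more times
'aaaa'  =~ /a*/;
# and a recursive combination of the above
'abcbb' =~ /a(b|c)*/;

All four match.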

Parsers in the LALR family (yacc, bison, etc.) do not have a workably defined set of grammars. LALR is perfectly well-defined mathematically, but even experts in parsing theory are hard put to decide if a particular grammar is LALR.

Recursive descent also does not have a workably defined set of grammars. Recursive descent doesn't even have a precise mathematical description -- you can say that recursive descent is LL, but in practice LL tables are rarely used. Also in practice, the LL logic is extended with every other trick imaginable, up to and including switching to other parsing algorithms.

Does it allow the user to intervene in the parse?

It is not easy for users to intervene in the processing of a regular expression, though some implementations attempt to allow such efforts. LALR parsers are notoriously opaque. Those who maintain the LALR-driven Perl parser have tried to supplement its abilities with custom code, with results that will not encourage others making the same attempt.

Recursive descent, on the other hand, has no parse engine -- it is 100% custom code. You don't get much friendlier than that.
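
To see what "100% custom code" means in practice, here is a minimal hand-written recursive descent sketch in Perl, for a toy grammar invented for this illustration (with the natural left-recursive form of the expression rule rewritten as iteration, since recursive descent cannot use left recursion directly). Because every decision point is ordinary code, custom logic -- error recovery, feedback to the lexer, even switching algorithms -- can be dropped in anywhere.

use strict;
use warnings;

# Toy grammar, one hand-written subroutine per nonterminal:
#   E ::= T ('+' T)*
#   T ::= n
my @tokens;    # the remaining input

sub parse {
    @tokens = @_;
    return expression() && !@tokens;    # succeed only if all input is consumed
}

sub expression {    # E ::= T ('+' T)*
    return 0 unless term();
    while ( @tokens and $tokens[0] eq '+' ) {
        shift @tokens;                  # consume '+'
        return 0 unless term();         # a custom error message could go here
    }
    return 1;
}

sub term {          # T ::= n
    return 0 unless @tokens and $tokens[0] eq 'n';
    shift @tokens;                      # consume 'n'
    return 1;
}

my $verdict = parse(qw(n + n + n)) ? 'parsed' : 'no parse';
print "$verdict\n";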

Conclusions

Regular expressions are a success, and will remain so, because the set of grammars they handle is very workably-defined. Applications using regular expressions have to take what the algorithm gives them, but what it gives them is very predictable.

For example, an application can write regular expressions on the fly, and the programmer can be confident they will run as long as they are well-formed. And it is easy to determine if the regular expression is well-formed. (Whether it actually does what you want is a separate issue.)

Recursive descent does not handle a workably-defined set of grammars, and it also has to be hand-written. But it makes up for this by allowing the user to step into the parsing process anywhere, and "get his hands dirty". Recursive descent does nothing for you, but it does allow you complete control. This is enough to make recursive descent the current algorithm of choice for major parsing projects.

As I have chronicled elsewhere, LALR was once, on highly convincing theoretical grounds, seen as the solution to the parsing problem. But while mathematically well-defined, LALR was not workably defined. And it was very hostile to applications that tried to alter, or even examine, its syntax-driven workings. After decades of trying to make it work, the profession has abandoned LALR almost totally.

What about Marpa?

Marpa has both properties: its set of grammars is workably-defined. And, while Marpa is syntax-driven like LALR and regular expressions, it also allows the user to stop the parse engine, communicate with it about the state of the parse, do her own parsing for a while, and restart the parse engine at any point she wants.

Marpa's workable definition has a nuance that the one for regular expressions does not. For regular expressions, linearity is a given -- they parse in linear time or fail. Marpa parses a much larger class of grammars, the context-free grammars -- anything that can be written in BNF. BNF is used to describe languages in standards, and is therefore itself a kind of "gold standard" for a workable definition of a set of grammars. However, Marpa does not parse everything that can be written in BNF in linear time.

Marpa's linearly-parsed set of grammars is smaller than the context-free grammars, but it is still very large, and it is still workably-defined. Marpa will parse any unambiguous language in linear time, unless it contains unmarked middle recursions. An example of a "marked" middle recursion is the language described by

S ::= a S a | x

a string of which is "aaaxaaa", where the "x" marks the middle. An example of an "unmarked" middle recursion is the language described by

S ::= a S a | a

a string of which is "aaaaaaa", where nothing marks the middle, so that you don't know until the end where the middle of the recursion is. If a human can reliably find the middle by eyeball, the middle recursion is marked. If a human can't, then the middle recursion might be unmarked.

Marpa also parses a large set of ambiguous grammars linearly, and this set of grammars is also workably-defined. Marpa parses an ambiguous grammar in linear time if

  • It has no unmarked middle recursions.
  • All right recursions are unambiguous.
  • There are no cycles. A cycle occurs, for example, if there is a rule A ::= A in the grammar.
  • The level of ambiguity at any location is bounded by a constant.

The term "level of ambiguity" requires a bit of explanation. At any given location, there can be as many rules "in play" as you like, without affecting the level of ambiguity. The key question: What is the maximum number of different origins that a rule might have? (The "origin" of a rule is the location where it begins.) That is, can a rule currently in play have at most 20 different origins? Or could it have its origin at every location so far? If the maximum number of origins is 20 or any other fixed constant, the level of ambiguity is "bounded". But if the maximum number of origins keeps growing as the length of the input grows, the level of ambiguity is unbounded.

For the unambiguous case, Marpa's workable definition encompasses a much larger class of grammars, but is no more complex than that for regular expressions. If you want to extend even further, and work with ambiguous grammars, the definition remains quite workable. Of the four restrictions needed to ensure linearity, the one requiring a bounded level of ambiguity is the only one that might force you to exercise real vigilance -- once you get into ambiguity, unboundedness is easy to slip into.

As for the other three, cycles never occur in practical grammars, and Marpa reports them, so you simply fix them when they happen. Most recursions will be left recursions, which are unrestricted. My experience has been that, in practical grammars, unmarked middle recursions and ambiguous right recursions are not especially tempting features. If you note whenever you use a right recursion, checking that it is not ambiguous, and if you note whenever you use a middle recursion, checking that it is marked, then you will stay linear.

To learn more about Marpa, there's the official web site maintained by Ron Savage. I also have a Marpa web site.

Comments

Comments on this post can be made in Marpa's Google group, or on our IRC channel: #marpa at freenode.net.

Dave's Free Press: Journal: CPAN Testers' CPAN author FAQ

Ocean of Awareness: Parsing: a timeline

[ Revised 22 Oct 2014 ]

1960: The ALGOL 60 spec comes out. It specifies, for the first time, a block structured language. The ALGOL committee is well aware that nobody knows how to parse such a language. But they believe that, if they specify a block-structured language, a parser for it will be invented. Risky as this approach is, it pays off ...

1961: Ned Irons publishes his ALGOL parser. In fact, the Irons parser is the first parser of any kind to be described in print. Ned's algorithm is a left parser -- a form of recursive descent. Unlike modern recursive descent, the Irons algorithm is general and syntax-driven. "General" means it can parse anything written in BNF. "Syntax-driven" (aka declarative) means that the parser is actually created from the BNF -- the parser does not need to be hand-written.

1961: Almost simultaneously, hand-coded approaches to left parsing appear. These we would now recognize as recursive descent. Over the following years, hand-coding approaches will become more popular for left parsers than syntax-driven algorithms. Three factors are at work:

  • In the 1960's, memory and CPU are both extremely limited. Hand-coding pays off, even when the gains are small.
  • Pure left parsing is a very weak parsing technique. Hand-coding is often necessary to overcome its limits. This is as true today as it was in 1961.
  • Left parsing works well in combination with hand-coding -- they are a very good fit.

1965: Don Knuth invents LR parsing. Knuth is primarily interested in the mathematics. Knuth describes a parsing algorithm, but it is not thought practical.

1968: Jay Earley invents the algorithm named after him. Like the Irons algorithm, Earley's algorithm is syntax-driven and fully general. Unlike the Irons algorithm, it does not backtrack. Earley's core idea is to track everything about the parse in tables. Earley's algorithm is enticing, but it has three major issues:

  • First, there is a bug in the handling of zero-length rules.
  • Second, it is quadratic for right recursions.
  • Third, the bookkeeping required to set up the tables is, by the standards of 1968 hardware, daunting.

1969: Frank DeRemer describes a new variant of Knuth's LR parsing. DeRemer's LALR algorithm requires only a stack and a state table of quite manageable size.

1972: Aho and Ullman describe a straightforward fix to the zero-length rule bug in Earley's original algorithm. Unfortunately, this fix involves adding even more bookkeeping to Earley's.

1975: Bell Labs converts its C compiler from hand-written recursive descent to DeRemer's LALR algorithm.

1977: The first "Dragon book" comes out. This soon-to-be classic textbook is nicknamed after the drawing on the front cover, in which a knight takes on a dragon. Emblazoned on the knight's lance are the letters "LALR". From here on out, to speak lightly of LALR will be to besmirch the escutcheon of parsing theory.

1979: Bell Laboratories releases Version 7 UNIX. V7 includes what is, by far, the most comprehensive, useable and easily available compiler writing toolkit yet developed. Central to the toolkit is yacc, an LALR based parser generator. With a bit of hackery, yacc parses its own input language, as well as the language of V7's main compiler, the portable C compiler. After two decades of research, it seems that the parsing problem is solved.

1987: Larry Wall introduces Perl 1. Perl embraces complexity like no previous language. Larry uses LALR very aggressively -- to my knowledge more aggressively than anyone before or since.

1991: Joop Leo discovers a way of speeding up right recursions in Earley's algorithm. Leo's algorithm is linear for just about every unambiguous grammar of practical interest, and many ambiguous ones as well. 1991 hardware is six orders of magnitude faster than 1968 hardware, so the issue of bookkeeping overhead has receded in importance. This is a major discovery. When it comes to speed, the game has changed in favor of the Earley algorithm. But Earley parsing is almost forgotten. It will be 20 years before anyone writes a practical implementation of Leo's algorithm.

1990's: Earley's is forgotten. So everyone in LALR-land is content, right? Wrong. Far from it, in fact. Users of LALR are making unpleasant discoveries. While LALR automatically generates their parsers, debugging them is so hard they could just as easily write the parser by hand. Once debugged, their LALR parsers are fast for correct inputs. But almost all they tell the users about incorrect inputs is that they are incorrect. In Larry's words, LALR is "fast but stupid".

2000: Larry Wall decides on a radical reimplementation of Perl -- Perl 6. Larry does not even consider using LALR again.

2002: Aycock&Horspool publish their attempt at a fast, practical Earley's parser. Missing from it is Joop Leo's improvement -- they seem not to be aware of it. Their own speedup is limited in what it achieves and the complications it introduces can be counter-productive at evaluation time. But buried in their paper is a solution to the zero-length rule bug. And this time the solution requires no additional bookkeeping.

2006: GNU announces that the GCC compiler's parser has been rewritten. For three decades, the industry's flagship C compilers have used LALR as their parser -- proof of the claim that LALR and serious parsing are equivalent. Now, GNU replaces LALR with the technology that it replaced a quarter century earlier: recursive descent.

2000 to today: With the retreat from LALR comes a collapse in the prestige of parsing theory. After a half century, we seem to be back where we started. If you took Ned Irons' original 1961 algorithm, changed the names and dates, and translated the code from the mix of assembler and ALGOL into Haskell, you could easily republish it today, and bill it as revolutionary and new.

Marpa

Over the years, I had come back to Earley's algorithm again and again. Around 2010, I realized that the original, long-abandoned vision -- an efficient, practical, general and syntax-driven parser -- was now, in fact, quite possible. The necessary pieces had fallen into place.

Aycock&Horspool had solved the zero-length rule bug. Joop Leo had found the speedup for right recursion. And the issue of bookkeeping overhead had pretty much evaporated on its own. Machine operations are now a billion times faster than in 1968, and probably no longer relevant in any case -- cache misses are now the bottleneck.

But while the original issues with Earley's disappeared, a new issue emerged. With a parsing algorithm as powerful as Earley's behind it, a syntax-driven approach can do much more than it can with a left parser. But with the experience with LALR in their collective consciousness, few modern programmers are prepared to trust a purely declarative parser. As Mark Twain put it, "Once a cat's been burned, he won't even sit on a cold stove."

To be accepted, Marpa needed to allow procedural parsing, not just declarative parsing. So Marpa allows the user to specify events -- occurrences of symbols and rules -- at which declarative parsing pauses. While paused, the application can call procedural logic and single-step forward token by token. The procedural logic can hand control back over to syntax-driven parsing at any point it likes. The Earley tables can provide the procedural logic with full knowledge of the state of the parse so far: all rules recognized in all possible parses so far, and all symbols expected. Earley's algorithm is now an even better companion for hand-written procedural logic than recursive descent.
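
As a rough sketch of what that looks like from the application side, here is a small example using the Marpa::R2 SLIF. The toy grammar, lexeme names, event name and input are all invented; the calls used (read, events, pause_span, lexeme_read, resume, value) are the SLIF methods as I understand them, so treat this as a sketch rather than a reference.

use strict;
use warnings;
use Marpa::R2;

# Invented toy grammar: words and numbers, pausing before every number lexeme.
my $dsl = <<'END_OF_DSL';
:default ::= action => [values]
lexeme default = latm => 1
text       ::= piece*
piece      ::= word | number
word         ~ [a-zA-Z]+
number       ~ [0-9]+
:discard     ~ whitespace
whitespace   ~ [\s]+
:lexeme      ~ number pause => before event => 'before_number'
END_OF_DSL

my $grammar = Marpa::R2::Scanless::G->new( { source => \$dsl } );
my $recce   = Marpa::R2::Scanless::R->new( { grammar => $grammar } );

my $input  = 'alpha 42 beta 7 gamma';
my $length = length $input;

# read() stops at each event; we step in, then hand control back with resume().
my $pos = $recce->read( \$input );
while ( $pos < $length ) {
    for my $event ( @{ $recce->events() } ) {
        my ($name) = @{$event};
        next if $name ne 'before_number';
        my ( $start, $span ) = $recce->pause_span();
        my $number = substr $input, $start, $span;
        print "paused before number '$number' at offset $start\n";
        # Procedural logic goes here: consult the parse so far, substitute a
        # different token, single-step, and so on.  This sketch just feeds
        # the paused lexeme back in unchanged.
        $recce->lexeme_read( 'number', $start, $span, $number );
    }
    $pos = $recce->resume();
}
my $value_ref = $recce->value();
print 'parse ', ( defined $value_ref ? 'succeeded' : 'failed' ), "\n";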

For more

For more about Marpa, there's the official web site maintained by Ron Savage. I also have a Marpa web site. Comments on this post can be made in Marpa's Google group.

Perlgeek.de : Correctness in Computer Programs and Mathematical Proofs

While reading On Proof and Progress in Mathematics by Fields Medal winner Bill Thurston (recently deceased, I was sorry to hear), I came across this gem:

The standard of correctness and completeness necessary to get a computer program to work at all is a couple of orders of magnitude higher than the mathematical community’s standard of valid proofs. Nonetheless, large computer programs, even when they have been very carefully written and very carefully tested, always seem to have bugs.

I noticed that mathematicians are often sloppy about the scope of their symbols. Sometimes they use the same symbol for two different meanings, and you have to guess from context which one is meant.

This kind of sloppiness generally doesn't have an impact on the validity of the ideas that are communicated, as long as it's still understandable to the reader.

I guess one reason is that most mathematical publications still stick to one-letter symbol names, and there aren't that many letters in the alphabets that are generally accepted for usage (Latin, Greek, a few letters from Hebrew). And in the programming world we snort derisively at FORTRAN 77, which limited variable names to a length of 6 characters.

Ocean of Awareness: Reporting mismatched delimiters

In many contexts, programs need to identify non-overlapping pieces of a text. One very direct way to do this is to use a pair of delimiters. One delimiter of the pair marks the start and the other marks the end. Delimiters can take many forms: Quote marks, parentheses, curly braces, square brackets, XML tags, and HTML tags are all delimiters in this sense.

Mismatching delimiters is easy to do. Traditional parsers are often poor at reporting these errors: hopeless after the first mismatch, and for that matter none too precise about the first one. This post outlines a scaleable method for the accurate reporting of mismatched delimiters. I will illustrate the method with a simple but useable tool -- a utility which reports mismatched brackets.

The example script

The example script, bracket.pl, reports mismatched brackets in the set:

() {} []

They are expected to nest without overlaps. Other text is treated as filler. bracket.pl is not smart about things like strings or comments. This does have the advantage of making bracket.pl mostly language-agnostic.

Because it's intended primarily to be read as an illustration of the technique, bracket.pl's grammar is a basic one. The grammar that bracket.pl uses is so simple that an emulator of bracket.pl could be written using recursive descent. I hope the reader who goes on to look into the details will see that this technique scales to more complex situations, in a way that a solution based on a traditional parser will not.

Error reports

The description of how the method works will make more sense after we've looked at some examples of the diagnostics bracket.pl produces. To be truly useful, bracket.pl must report mismatches that span many lines, and it can do this. But single-line examples are easier to follow. All the examples in this post will be contained in a single line. Consider the string '((([))'. bracket.pl's diagnostics are:

* Line 1, column 1: Opening '(' never closed, problem detected at end of string
((([))
^
====================
* Line 1, column 4: Missing close ], problem detected at line 1, column 5
((([))
   ^^

In the next example bracket.pl realizes that it cannot accept the ')' at column 16, without first closing the set of curly braces started at column 5. It identifies the problem, along with both of the locations involved.

* Line 1, column 5: Missing close }, problem detected at line 1, column 16
[({({x[]x{}x()x)})]
    ^          ^

So far, so good. But an important advantage of bracket.pl has yet to be seen. Most compilers, once they report a first mismatched delimiter, produce error messages that are unreliable -- so unreliable that they are useless in practice. bracket.pl repairs a mismatched bracket before continuing, so that it can do a reasonable job of analyzing the text that follows. Consider the text '({]-[(}-[{)'. The output of bracket.pl is

* Line 1, column 1: Missing close ), problem detected at line 1, column 3
({]-[(}-[{)
^ ^
====================
* Line 1, column 2: Missing close }, problem detected at line 1, column 3
({]-[(}-[{)
 ^^
====================
* Line 1, column 3: Missing open [
({]-[(}-[{)
  ^
====================
* Line 1, column 5: Missing close ], problem detected at line 1, column 7
({]-[(}-[{)
    ^ ^
====================
* Line 1, column 6: Missing close ), problem detected at line 1, column 7
({]-[(}-[{)
     ^^
====================
* Line 1, column 7: Missing open {
({]-[(}-[{)
      ^
====================
* Line 1, column 9: Missing close ], problem detected at line 1, column 11
({]-[(}-[{)
        ^ ^
====================
* Line 1, column 10: Missing close }, problem detected at line 1, column 11
({]-[(}-[{)
         ^^
====================
* Line 1, column 11: Missing open (
({]-[(}-[{)
          ^

Each time, bracket.pl corrects itself, and accurately reports the next set of problems.

A difficult error report

To be 100% accurate, bracket.pl would have to guess the programmer's intent. This is, of course, not possible. Let's look at a text where bracket.pl's guesses are not so good: {{]}. Here we will assume the closing square bracket is a typo for a closing curly brace. Here's the result:

* Line 1, column 1: Missing close }, problem detected at line 1, column 3
{{]}
^ ^
====================
* Line 1, column 2: Missing close }, problem detected at line 1, column 3
{{]}
 ^^
====================
* Line 1, column 3: Missing open [
{{]}
  ^
====================
* Line 1, column 4: Missing open {
{{]}
   ^

Instead of one error, bracket.pl finds four.

But even in this case, the method is fairly good, especially when compared with current practice. The problem is at line 1, column 3, and the first three messages all identify this as one of their potential problem locations. It is reasonable to believe that a programmer, especially once he becomes used to this kind of mismatch reporting, will quickly find the first mismatch and fix it. For this difficult case, bracket.pl may not be much better than the state of the art, but it is certainly no worse.

How it works

For full details of the workings of bracket.pl there is the code, which is heavily commented. This section provides a conceptual overview.

bracket.pl uses two features of Marpa: left-eideticism and the Ruby Slippers. By left-eidetic, I mean that Marpa knows everything there is to know about the parse at, and to left of, the current position. As a consequence, Marpa also knows exactly which of its input symbols can lead to a successful parse, and is able to stop as soon as it knows that the parse cannot succeed.

In the Ruby Slippers technique, we arrange for parsing to stop whenever we encounter an input which would cause parsing to fail. The application then asks Marpa, "OK. What input would allow the parse to continue?" The application takes Marpa's answer to this question, and uses it to concoct an input that Marpa will accept.

In this case, bracket.pl creates a virtual token which fixes the mismatch of brackets. Whatever the missing bracket may be, bracket.pl invents a bracket of that kind, and adds it to the virtual input. This done, parsing and error detection can proceed as if there was no problem. Of course, the error which made the Ruby Slippers token necessary is recorded, and those records are the source of the error reports we saw above.

To make its error messages as informative as possible in the case of missing closing brackets, bracket.pl needs to report the exact location of the opening bracket. Left-eideticism again comes in handy here. Once the virtual closing bracket is supplied to Marpa, bracket.pl asks, "That bracketed text that I just closed -- where did it begin?" The Marpa parser tracks the start location of all symbol and rule instances, so it is able to provide the application with the exact location of the starting bracket.

When bracket.pl encounters a problem at a point where there are unclosed opening brackets, it has two choices. It can be optimistic or it can be pessimistic. "Optimistic" means it can hope that something later in the input will close the opening bracket. "Pessimistic" means it can decide that "all bets are off" and use Ruby Slippers tokens to close all the currently active open brackets.

bracket.pl uses the pessimistic strategy. While the optimistic strategy sounds better, in practice the pessimistic one seems to provide better diagnostics. The pessimistic strategy does report some fixable problems as errors. But the optimistic one can introduce spurious fixes. These hide the real errors, and it is worse to miss errors than it is to overreport them. Even when the pessimistic strategy overreports, its first error message will always accurately identify the first problem location.
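
To make the repair-and-continue idea concrete without reproducing the Marpa machinery, here is a minimal hand-rolled sketch that uses an explicit stack in place of Marpa's tables (the post notes above that the grammar is simple enough to emulate this way). It has none of bracket.pl's left-eideticism or Ruby Slippers, its strategy is not exactly bracket.pl's pessimistic one, and its messages come out in a different order and wording -- but it shows how inventing the "expected" closing bracket lets the analysis continue past the first problem.

use strict;
use warnings;

my %close_of = ( '(' => ')', '[' => ']', '{' => '}' );
my %open_of  = reverse %close_of;

sub check {
    my ($text) = @_;
    my @stack;    # each entry: [ opening bracket, its column ]
    my $col = 0;
    for my $char ( split //, $text ) {
        $col++;
        if ( exists $close_of{$char} ) {       # an opening bracket
            push @stack, [ $char, $col ];
        }
        elsif ( exists $open_of{$char} ) {     # a closing bracket
            # Pop unmatched openers, inventing the closer each one expected,
            # and report the mismatch -- then carry on.
            while ( @stack and $close_of{ $stack[-1][0] } ne $char ) {
                my ( $open, $open_col ) = @{ pop @stack };
                print "Column $open_col: missing close $close_of{$open}, ",
                    "problem detected at column $col\n";
            }
            if (@stack) { pop @stack }         # a proper match
            else { print "Column $col: missing open $open_of{$char}\n" }
        }
        # anything else is filler and is ignored
    }
    print "Column $_->[1]: opening '$_->[0]' never closed\n" for @stack;
    return;
}

check('({]-[(}-[{)');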

While bracket.pl is already useable, I think of it as a prototype. Beyond that, the problem of matching delimiters is in fact very general, and I believe these techniques may have very wide application.

For more

The example script of this post is a Github gist. For more about Marpa, there's the official web site maintained by Ron Savage. I also have a Marpa web site. Comments on this post can be made in Marpa's Google group.

Dave's Free Press: Journal: YAPC::Europe 2006 report: day 3
