perlancar's blog: Getopt modules 08: Getopt::Tiny


About this mini-article series. Each day for 24 days, I will be reviewing a module that parses command-line options (such module is usually under the Getopt::* namespace). First article is here.

Getopt::Tiny is a module written in 1999-2002 by David Muir Sharnoff (MUIR). David is a veteran Perl programmer who also programs in several other languages like Go, Python, Node.js, and Java. This module is admittedly among his earlier works, and the age shows: for example, it only recognizes the "old" style of long options with a single-dash prefix (-foo), not the "new" style with a double-dash prefix (--foo). But there are a couple of things I want to highlight.

Getopt::Tiny is not a Getopt::Long wrapper; it implements its own parsing. The code is short and simple, as the name ::Tiny would suggest, at around only 115 lines. It currently does not have any CPAN distributions depending on it.

The feeling that I get looking at this module is that it tries to be less redundant than Getopt::Long. For example, instead of an option spec like name=s or name=s@ or name=s%, this module guesses the destination type from the type of reference the option is paired with:

    opt1 => \$str,  # a scalar/string
    opt2 => \@ary,  # an array
    opt3 => \%hash, # a hash

Getting a bit quirky: when generating the usage message, instead of letting the user specify a summary string like some other modules do (e.g. Getopt::Long::Descriptive, Getopt::Simple, Getopt::Long::More), or extracting it from POD like some others do (e.g. Getopt::Euclid, Getopt::Declare), this module searches for specially formatted comments in the source code, i.e.:

# begin usage info
my %flags = (
    opt1 => \$str,  # description for opt1
    opt2 => \@ary,  # description for opt2
    opt3 => \%hash, # description for opt3
);
my %switches = (
    switch1 => \$switch1,  # description for switch1
);
# end usage info
getopt(\@ARGV, \%flags, \%switches);

Another rather quirky thing, also a result of trying (a bit too hard) to be compact: the fourth argument to its getopt() function is a string that will be used in the usage message to symbolize the remaining arguments, e.g. "files", which will generate a usage message like:

Usage: myprog [flags] [switches] files

This means it's okay if @ARGV still contains arguments after the options (flags, switches) have been stripped, e.g.:

% myprog -opt1 val -opt2 val -opt2 val foo bar

But, if the fourth argument is not specified, the usage message is simply:

Usage: myprog [flags] [switches]

And the command line is not allowed to have extra arguments (foo, bar). On the other hand, there's no way to express that command-line arguments are required, or that a flag is required, or a default value.
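As a rough sketch (the flag names, the empty switches hashref, and the 'files' label here are just for illustration), requiredness and defaults can be enforced by hand after getopt() returns:

use strict;
use warnings;
use Getopt::Tiny;

my $name;           # required (enforced by hand below)
my $port = 8080;    # "default" is just the initial value we assign ourselves

# begin usage info
my %flags = (
    name => \$name, # name to greet (required)
    port => \$port, # port to listen on
);
# end usage info
getopt(\@ARGV, \%flags, {}, 'files');

die "missing required flag: -name\n" unless defined $name;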

But at least the code is compact as well as very straightforward, which is better than some other modules that I looked at.


perlancar's blog: Getopt modules 07: Getopt::Std::Strict


About this mini-article series. Each day for 24 days, I will be reviewing a module that parses command-line options (such module is usually under the Getopt::* namespace). First article is here.

Getopt::Std::Strict is a module written by Leo Charre (LEOCHARRE) in 2008. The author is using this module in several of his CPAN distributions, but no one else on CPAN seems to be using it. More "recent" option parsing modules, such as those written after the mid-2000s, even including Docopt (a port of a popular Python library), do not see much adoption on CPAN. This could be caused by various factors, the speculation of which I don't intend to get into right now. But I can say that there are two exceptions: Getopt::Lucid and Getopt::Long::Descriptive, and this might have something to do with the reputation of their authors.

Back to our module of the day. There are not a lot of wrappers for, or "forks" of, Getopt::Std, simply because the module is so simple already. But the author of Getopt::Std::Strict had an itch to scratch: when we are under strict mode, instead of this:

getopts('oif:');

or:

getopts('oif:', \%opts);

one has to do this instead:

use vars qw($opt_o $opt_i $opt_f);
getopts('oif:');

or:

my %opts;
getopts('oif:', \%opts);

which the author probably finds annoying. So he wrote a module that declares the variable(s) for you. To do this, the option specification needs to be passed and processed at compile time. Thus:

use Getopt::Std::Strict 'oif:', 'opt'; # will declare the $opt_* for you
if ($opt_o) { ... } # so $opt_o can be used without you declaring it

This is rather nice because if you mistype $opt_o to $opt_p (an unknown option), this mistake will be caught at compile-time.

As a bonus, the module also provides a %OPT hash and an opt() function. You can access, for example, the -o option via $OPT{o}, or via opt('o'). The former won't catch a typo, e.g. when you type $OPT{p}, but the latter will (opt('p') will die).
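To illustrate (assuming the 'oif:' spec and the import line shown earlier):

use Getopt::Std::Strict 'oif:', 'opt';

my $file = $OPT{f};    # hash access; a typo like $OPT{g} silently yields undef
my $same = opt('f');   # function access; opt('g') would die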

I personally find all this rather unnecessary: I probably would just bear the cost of typing my %opts; or my ($opt_o, $opt_i, $opt_f); and be more explicit.

But I have a couple of suggestions. First, it might be nice if the module died when the user specifies an unknown option (to be more in line with the "Strict" spirit of the name). And second, the bonuses are additional cognitive burdens that I'd personally do without. Or, if they are to stay, to be more strict %OPT could be made to die as well when fed unknown options, e.g. using the tie mechanism.
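As a sketch of that last idea (this is not something Getopt::Std::Strict currently does, and the package name below is made up), a tied hash can refuse lookups of unknown keys:

package My::StrictOptHash;   # hypothetical package name, just for this sketch
use strict;
use warnings;
use Carp qw(croak);
use Tie::Hash;
our @ISA = ('Tie::StdHash');

# die on lookups of keys that were never declared as options
sub FETCH {
    my ($self, $key) = @_;
    croak "Unknown option '$key'" unless exists $self->{$key};
    return $self->{$key};
}

package main;

tie my %OPT, 'My::StrictOptHash';
%OPT = (o => undef, i => undef, f => undef);   # the known option keys

my $o = $OPT{o};   # fine
my $p = $OPT{p};   # dies: Unknown option 'p'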


Perl Foundation News: Grant Proposal: Learning Perl 6, a book from O'Reilly Media

The Grants Committee has received one grant proposal for the November/December round. Before the Committee members vote, we would like to solicit feedback from the Perl community on the proposal.

Review the proposal below and please comment here by December 12th, 2016. The Committee members will start the voting process following that and the conclusion will be announced approximately in one week.

Learning Perl 6, a book from O'Reilly Media

  • Name:

    brian d foy

  • Amount Requested:

    USD 10,000

Synopsis

Partially fund the writing process for Learning Perl 6, a book from O'Reilly Media

Benefits to the Perl Community

Note: I am also running a Kickstarter campaign (https://www.kickstarter.com/projects/1422827986/learning-perl-6). I have not given much thought to the mechanism of The Perl Foundation funding, but I think I'd want it to be through your normal mechanism rather than Kickstarter. I understand that payment on completion is part of that process. That works for me. I would appreciate TPF's help in spreading the word and finding backers.

Perl 6 does not have a tutorial book. To reach further out into the general programming world, people need a gentle introduction to the language. This book is about bringing people into the community more than serving the ones already here.

Some people have written survey books that have covered ancient versions of the language (Perl 6 and Parrot Essentials, Perl 6 Now). Other people are working on cookbook-style example books (Moritz Lenz most notably). Several other efforts have stalled or stopped. A book can be a big chunk of a person's life, and financial stability along with the removal of the distractions of normal commerce are key to success.

The big question is my need for the money, especially when I have a major publisher committed to publishing the book. In short, I've shifted some of O'Reilly's risk by taking it on myself.

First, book sales aren't what they used to be. I'm taking a big gamble here and I have a lot of personal risk, mostly in lost time. Many comments on the Kickstarter have centered around "Why doesn't O'Reilly pay for the book?" Publishers don't "pay" so much as give you advance royalties. That robs the future to pay for the present. I've never taken an advance on a book, and even if I did, a sane publisher wouldn't provide me enough money to allow me to do this. I fully expect Learning Perl 6 to be my least-popular book.

The entire technology book market is shrinking and has been for years. Perl 6 as a usable language is a new technology in an already crowded market. There's a big chance that this is a "front list" book that makes all of its sales in the first months of its release and is never bought again. This situation happens with a dedicated fan base or a highly promoted book that doesn't catch on. I already know the sales numbers for Learning Perl. If I had to guess at a sales target for Learning Perl 6, I'd take one-tenth that number. It's not a motivating amount for me. As the author of several existing books, I don't have the same secondary rewards of new fame and recognition as a first-time author.

Second, book sales aren't what's primarily important to the marketing goals of the Perl 6 community. To be taken seriously as a technology (even if the community is small), someone needs to be able to point to something in the marketplace. A technology manager might take a signalling cue from the existence of a dead-tree book, because the publisher has already judged the risk and committed to the book. Even if we think this is a poor way to make judgements (and granted, we'd often be right), there's the dirty mess of reality versus how we think the world should be. This is important to many people in the Perl 6 community. The lack of a book is a big hurdle for our "force multipliers"—the teachers and trainers who don't have the time to construct a full curriculum themselves but could use a tutorial book that's ready to teach in a segmented, classroom environment.

Third, the Kickstarter amount sounds impressive, but I'm also responsible for all expenses and taxes. Every e-book and print book I give out as a reward is actually a pre-sale. Most of that money flows to the publisher as a sale (of which I still get a royalty). The Kickstarter money is also subject to taxes. I've run a small business for a couple decades; that number doesn't seem that large for the effort. I'm looking for any support out there. When you consider the amount, I think it's much more productive to think about getting what you want at a price that makes sense to you. Considering the levels that TPF has funded similar grants delivering less, I think this is more than reasonable value.

If you could get this book without this grant and without the Kickstarter, I'd support that. Indeed, I've waited for that book just like you. But, no one has stepped up to write it and that book doesn't exist. TPF's role could underpin the community support I'm already receiving. This grant further ensures the end result.

Deliverables

The deliverable is a printed O'Reilly Media book. O'Reilly publishes and promotes the book. I am already under contract with O'Reilly. Brian Jepson is my editor, and we've developed mock cover art for the book (meaning, O'Reilly has assigned us an animal).

I am running a campaign through Kickstarter to fund this book. Several of the rewards involve delivery of the book. The folks at O'Reilly tell me this means I can't have my working sources open as I normally do because it would violate the Robinson-Patman Act (some customers get the same product at a lower price). I can, however, allow access for selected technical reviewers and grant managers.

Project Details

I'm writing a tutorial-style book for beginners. I expect the book to be about the size and scope of the existing Learning Perl book. I'm aiming for 300 pages.

This is not meant to be definitive or a reference (others are working on that). This is not the documentation. A tutorial's task is to introduce the language in steps, using the fewest concepts along the way. My goal is to build a solid foundation for understanding the language, both in syntax and philosophy. Documentation explains a tool but doesn't put it into the context of a task.

This book progresses by slowly introducing concepts and giving readers a chance to practice those. Typically, I explain the fully expanded form of syntax and work my way to the idiomatic representation. This technique helps readers understand the implied parts of the idiomatic form, but also helps them become better readers of other people's code.

Each chapter of the book includes exercises along with explicated answers. Practice is the key to learning.

Inch-stones

A book project isn't straightforward. There's not a single path to the result. Often the work is a big soup until it seemingly miraculously comes together close to the end. The progress in a new book is more experimental and philosophical than concrete. The initial effort is usually a big mess, but that's part of the process.

Project Schedule

I have these major milestones specified in the contract for the book:

  • Two completed chapters by January
  • Half the book by May
  • The whole book by August

It takes approximately three months for a book to make it through the O'Reilly publishing process. This includes one round of technical review, two rounds of copy-editing, an indexer (a live person who indexes the book), and finally injection into the distribution channels.

Additionally, there is a set of people (official technical reviewers or self-selected backers through a Kickstarter reward) who will receive monthly drafts of the book. They will be able to provide feedback and comments, and should be an effective commitment device. They get a PDF that looks just like an O'Reilly book since it comes out of the continuous publishing system.

As for a more detailed schedule, it's been my experience in publishing that they are never honest and no one expects them to be true. I could say that I'd deliver a particular chapter in a particular month, but later discover that it's more important to work on a different chapter. Halfway through I typically have a moment of clarity that makes me go back through everything I've already done to take it in a different direction.

I typically view the schedules as a bit squishy, but I've always been close.

Completeness Criteria

The work will be judged complete when my O'Reilly editor, Brian Jepson, certifies that I've met the conditions imposed by O'Reilly. Once I've done that, O'Reilly mostly takes over and it's a sure thing.

Under the standard author contract, O'Reilly judges the book complete when they think the content and form of the book are at or above the level of their usual quality. If it is not (and this has never been a problem for me), they have the right to ask another author to make changes to bring it up to their standards.

Bio

I'm brian d foy, the author or co-author of many of the existing Perl 5 books from O'Reilly Media, including Learning Perl (Editions 4 to 7), Intermediate Perl (Editions 1 and 2), Mastering Perl (Editions 1 and 2), and Programming Perl (4th Edition). I am a U.S. citizen living in the U.S.

I've worked extensively in the publishing process, have great tools that I've already used for my previous books, and have already set up everything with O'Reilly Media.

perlancar's blog: Getopt modules 05: Getopt::Valid

About this mini-article series. Each day for 24 days, I will be reviewing a module that parses command-line options (such module is usually under the Getopt::* namespace). First article is here.

Getopt::Valid is yet another module that wraps Getopt::Long to try to add some features (examples of such modules that have already been reviewed are Getopt::Long::Descriptive and Getopt::Compact). The module starts with a single specific goal (even the name already reflects that goal nicely), which is to add extended validation. Getopt::Long admittedly only allows for expressing limited validation, e.g. that something needs to be an integer (--name=i) or a floating point number (--name=f). Everything else is not validated, except by ourselves if we assign an option to an option handler (coderef) instead of a scalar reference or an array/hash reference.

The actual design of Getopt::Valid, however, breaks down in places. The thing that pops up the most to me is the inconsistency of names, reflecting a lack of design clarity. The top-level structure to be passed to the GetOptionsValid() routine is called $validation_ref, but it's not exactly a validation specification: it's the program's specification with its list of options. The options are called "params" in one place and "arguments" in other places. The collect_argv method is described as a method to collect "args" (arguments). For an option parsing module, which must differentiate between options and arguments (before and after the options are parsed), the convoluted naming really turns me off.

The OO interface also leaves something to be desired. After instantiation, first you have to call collect_argv(), which again is a misleading name because it actually calls Getopt::Long::GetOptions() to do the actual parsing. Then we have to manually call validate() to do the extra validation. Then we have to manually call yet another method, valid_args(), to get the validated arguments, er, options.

The validators themselves can be in the form of a coderef:

    'name=s' => sub { ... },

or, for simpler cases, a regex object:

    'name=s' => qr/^blah.+/,

Fine by me. But then a third case is allowed, which is a string that means... a description of the option, instead of something related to validation as in the former two cases.

perlancar's blog: Getopt modules 06: Opt::Imistic


About this mini-article series. Each day for 24 days, I will be reviewing a module that parses command-line options (such module is usually under the Getopt::* namespace). First article is here.

Option parsing modules can be categorized into two groups: those that require you to write a specification and those that do not (actually there's a third category: those that let you choose whether or not to supply a spec). Modules that accept a usage text as input, like Docopt, or POD, like Getopt::Euclid, or some other form, fall into the first category. Modules in the second category are usually meant for shorter and simpler scripts. (And those modules are often simple themselves, with short implementations, too simple to contain something worth babbling about. Most of them just collect anything that looks like an option in @ARGV, put it in a hash, and they're done!)

Modules that don't accept a specification face an ambiguity problem when it comes to this syntax:

--foo bar

Is this a flag option --foo (an option that does not require a value) followed by a command-line argument bar, or an option --foo with its value bar?

Some modules resolve this by disallowing the ambiguous syntax. Option value must always be specified using:

--foo=bar

But this is inconvenient to (some) users and is not how most Unix programs behave.

Other modules resolve this by simply assuming that --foo bar always means option --foo with the value bar. In other words, the user must be careful not to put an argument right after a flag option (usually by using -- to separate options from arguments).

Aside from the two approaches mentioned above, other approaches are possible. One such approach is to look at how the option value is used in the program. Using a pragma like overload, we can trap boolean, string, even array/hash operations. For example:

package OptionObject;

use overload
    '""'   => sub { $_[0]{type} = 'scalar'; $_[0]{values}[0] },
    'bool' => sub { $_[0]{type} = 'bool'  ; @{$_[0]{values}} ? 1:0 },
    '@{}'  => sub { $_[0]{type} = 'array' ; $_[0]{values} },
    ;

sub new {
    my $class = shift;
    bless {@_}, $class;
}

Then:

my $opt1 = OptionObject->new(values => ["a", "b"]); # e.g. after user specifies --opt1 a --opt1 b
my $opt2 = OptionObject->new(values => ...);
my $opt3 = OptionObject->new(values => ...);

if ($opt1 =~ /foo/) {
    # this is regex matching, meaning user wants opt1 to be a string/scalar option
}

if ($opt2) {
    # this is boolean testing, meaning user wants opt2 to be a flag option
}

for (@$opt3) {
    # user array-dereferences, she probably wants opt3 to be an array option
}

In the above example code, the overloading mechanism will trigger to let us know that the user wants --opt1 to be a string/scalar option, --opt2 a flag option (which does not take a value), and --opt3 an array option.

CPAN module Opt::Imistic offers something like this approach, although it doesn't push it that far. The module was written by Alastair McGowan-Douglas (ALTREUS) in 2010, and it saw new releases in 2014 and 2015. It does not yet have any CPAN distribution depending on it.

Opt::Imistic tries to solve another ambiguity that non-spec-using modules also face. When we receive this on the command line:

--foo bar --foo baz

sometimes --foo is not meant to accept multiple values, but the user might specify the option multiple times by mistake or for some other reason. With a spec, we can detect this. Without a spec, Opt::Imistic tries to detect this by looking at how the option value is used by the program.

To use Opt::Imistic to parse command-line options, you do this:

use Opt::Imistic;

This will cause the module to parse @ARGV and put the result in %ARGV. For example, if your script is called like this:

% myapp --foo bar --foo baz --qux 1 2 3

Then you'll have $ARGV{foo} and $ARGV{qux} available for you. All options are assumed to take values. The remaining arguments in @ARGV will be [2, 3].

$ARGV{foo} and $ARGV{qux} are actually overloaded objects. If you use them in a scalar context, then you will get a scalar, for example:

open my $fh, "<", $ARGV{foo}; # here, foo is used in scalar context

then $ARGV{foo} will return the value baz (the last one specified). But internally, Opt::Imistic stores all option values as arrays. If you use it as an array, you can:

for my $file (@{ $ARGV{foo} }) { ... }

If this were combined with lazy/delayed parsing, the module could even resolve the ambiguous --foo bar syntax. For example, if it didn't parse until it saw:

if ($ARGV{foo}) {
    ...
}

then it would know that --foo is meant to be a flag option and thus does not take a value. Then Opt::Imistic could parse the options and not slurp bar as the value for --foo. This can be repeated: the command line can be reparsed whenever a new hint is encountered.

I am tempted to write such a proof-of-concept. But in general, I am more interested in option parsing that uses specification. Writing specification is not that much of a pain anyway, and it offers so much more, like the ability to check for unknown options, auto-abbreviation, autogeneration of usage messages, and so on.


Perl Hacks: Hacking Symbol::Approx::Sub

In October, for (I think) the second year, Digital Ocean ran Hacktoberfest – a campaign encouraging people to submit pull requests to Github repos in exchange for free t-shirts.

A few of us thought that this might be a good way to do a small bit of easy Perl advocacy, so we tagged some issues on Perl repos with “hacktoberfest” and waited to see what would happen.

I created a few issues on some of my repos. But the one I concentrated on most was symbol-approx-sub. This is a very silly CPAN module that allows you to make errors in the names of your subroutines. I wrote it many years ago and there’s an article I wrote for The Perl Journal explaining why (and how) I did it.

Long-time readers might remember that in 2014 I wrote an article for the Perl Advent Calendar about Perl::Metrics::Simple. I used Symbol::Approx::Sub as the example module in the article, and it showed me that the module had some depressingly high complexity scores; I planned to get round to doing something about that. Of course, real life got in the way and Symbol::Approx::Sub isn't exactly high on my list of things to do, so nothing happened. Until this October.

Over the month, a lot of changes were made to the module. I probably did about half of it and the rest was pull requests from other people. The fixes include:

  • Better tests (and better test coverage – it’s now at 100%)
  • Using Module::Load to load modules
  • Using real exceptions to report errors
  • Updating the code to remove unnecessary ampersands on subroutine calls
  • Fixing a couple of long-term bugs (that were found by the improved tests)
  • Breaking monolithic subroutines down

And I’m pretty happy with how it all went. The work was mostly completed in October and this morning I finally got round to doing the last couple of admin-y bits and version 3.0.0 of Symbol::Approx::Sub is now on the way to CPAN. You still shouldn’t use it in production code though!

Thanks to everyone who submitted a pull request. I hope you did enough to earn a free t-shirt.

If you want to get involved in fixing or improving other people’s code, there’s the 24 Pull Request Challenge taking place over Advent. Or for more Perl-specific code, there’s the CPAN Pull Request Challenge.

p.s. In the Advent Calendar article, I linked to the HTML version of the results. For comparison, I've also put the new results online. It's a pretty good improvement.

The post Hacking Symbol::Approx::Sub appeared first on Perl Hacks.

Joel Berger: meta::hack 2016

MetaCPAN is the community developed and maintained website and api for finding and learning about Perl modules. This year, we dedicated a long weekend to improving it and oh what a weekend it was!

MetaCPAN had been developed using an early version of Elasticsearch, and though it has served the project well, ES moves quickly. A year or so ago the MetaCPAN developers decided they needed to address the fact that the database was well out of date, along with the daunting number of breaking changes that had happened since then. This work continued at this year's Perl QA Hackathon, where I met Olaf Alders, Mickey Nasriachi, and Leo Lapworth, who were diligently working on the migration.

I hadn’t come to QAH with any specific plans; I hopped aboard with MetaCPAN when I heard that they had started using Minion. Minion is a job queue, generally used with Mojolicious but easily used elsewhere. As part of the Mojolicious project I was happy to get up to speed on the project to be ready to help them if needed.

At QAH the migration made a lot of progress but didn’t quite make it over the hump. Just afterwards, we mentioned that with a few more days of hacking we could have finished. Not only that but we enjoyed each other’s company and we thought it would be fun to do it again.

The thought took shape over the following months, and this past weekend I had the extreme pleasure of attending the (first) Meta::Hack with a group of very talented developers to work on MetaCPAN. The attendees were Olaf, Mickey and Leo, as well as Thomas Sibley, Graham Knop, Brad Lhotsky and Doug Bell. Doug's and my employer, ServerCentral, hosted the event November 17-20, 2016. They also host Chicago Perl Mongers; we are very grateful to work for a company so devoted to supporting Perl and Open Source.

The primary focus of the first two days was the remaining work of the Elasticsearch migration. I'm happy to report that as a result of the hackathon the migration is complete and has been deployed!

While the lead developers were completing that work, I audited the code and wrote some additional tests for the new download_url endpoint. Once the new backend went live it was “all hands on deck” watching for errors in the log and checking for other oddities.

One of the most fun projects was trying to discover why certain search results were getting repeated. We dug through the search, trying more and more ideas, pulling in developer after developer. Finally Graham won the day, figuring out that as a result of the new data structures, we were uniquely filtering not on the name but on an array reference containing the name, defeating the uniqueness check. It may sound obvious now, but it really was quite an adventure getting there!

As a result of that challenge, I became much more familiar with the code surrounding the search itself. Several years ago I had noticed that it was nearly impossible to get search results from the api that mimicked the website’s own results. The reason turns out to be that the website performs a very complex Elasticsearch query from the front-end server, more-or-less bypassing the api to do so. Having learned about the search from the uniqueness fiasco, I spent the next two days moving the search from the web service to the api service.

This change, once merged, will allow api consumers to get search results directly. In this way external projects could have documentation or module search directly from MetaCPAN. Additionally, the search could also be extended to expose additional search parameters, possibly allowing clients to optimize the search for their particular needs.

I was so pleased to be able to take part and contribute in my own way to the furthering of Meta::CPAN. I want to thank the attendees once again, gentlemen and scholars all, and Neil Bowers for organizational help. Further this meeting wouldn’t have been possible without the extraordinarily generous support of our sponsors:

Our platinum sponsors were Booking.com and cPanel. Our gold sponsors were Elastic, FastMail, and Perl Careers. Our silver sponsors were ActiveState, Perl Services, ServerCentral, and Advance Systems. Our bronze sponsors were Vienna.pm, easyname, and the Enlightened Perl Organisation (EPO). My warm thanks to each one!

NEILB: The London Perl Workshop 2016

I went to the London Perl Workshop yesterday. It was the usual excellent mix of good talks, catching up with old friends, and making some new ones. It was also the 10th and final LPW organised by Mark Keating. I came away reminded that what makes a language isn't really the language itself; it's the ecosystem for sharing code (CPAN) and the community. And what a great community we have!

perlancar's blog: Getopt modules 04: Getopt::Compact


About this mini-article series. Each day for 24 days, I will be reviewing a module that parses command-line options (such module is usually under the Getopt::* namespace). First article is here.

Getopt::Compact is a module that was first released in 2004 by Andrew Stewart Williams (ASW) and last updated in 2006. It manages to have 7 CPAN distributions depending on it.

Like Getopt::Long::Descriptive, this module is a wrapper for Getopt::Long mainly to allow users to specify summary string for each option as that's what is lacking in Getopt::Long to produce a useful usage/help message. Getopt::Compact also tries to present a different interface that claims to be more compact if you have a lot of (flag) options. For example, this code using Getopt::Long:

GetOptions(
    "flag1"     => \$opts{flag1},
    "flag2"     => \$opts{flag2},
    "flag3"     => \$opts{flag3},
    "flag4"     => \$opts{flag4},
    "val1|1=s"  => \$opts{val1},
    "val2|2=i"  => \$opts{val2},
    "val3=s@"   => \@{ $opts{val3} },
);

will become like this when using Getopt::Compact:

my $opts = Getopt::Compact->new(
    modes  => [qw/flag1 flag2 flag3 flag4/],
    struct => [
        [["val1", "1"], "Value1 blah blah"],
        [["val2", "2"], "Value2 blah blah", "=i"],
        ["val3"       , "Value3 blah blah", "=s@"],
    ],
)->opts;

But if you always put option values into hash elements (instead of sometimes assigning an option handler), GetOptions provides an alternative interface in which you specify a hashref as the first argument. This makes for a more compact syntax:

GetOptions(
    \%opts,
    qw/flag1 flag2 flag3 flag4
       val1|1=s val2|2=i val3=s@/,
);

So basically what Getopt::Compact makes you do is specify the option in split parts: name|a=s@ in Getopt::Long becomes:

[["name","a"], "summary", "=s@"]

I recommend using Getopt::Long::Descriptive (GLD) instead of this module because: 1) the interface is slightly nicer (no split option specification, so it's more familiar to Getopt::Long users); 2) GLD allows specifying a default value for an option; 3) GLD allows expressing that an option is required.


Perlgeek.de : Perl 6 By Example: Formatting a Sudoku Puzzle

This blog post is part of my ongoing project to write a book about Perl 6.

If you're interested, please sign up for the mailing list at the bottom of the article, or here. It will be low volume (less than an email per month, on average).


As a gentle introduction to Perl 6, let's consider a small task that I recently encountered while pursuing one of my hobbies.

Sudoku is a number-placement puzzle played on a grid of 9x9 cells, subdivided into blocks of 3x3. Some of the cells are filled out with numbers from 1 to 9, some are empty. The objective of the game is to fill out the empty cells so that in each row, column and 3x3 block, each digit from 1 to 9 occurs exactly once.

An efficient storage format for a Sudoku is simply a string of 81 characters, with 0 for empty cells and the digits 1 to 9 for pre-filled cells. The task I want to solve is to bring this into a friendlier format.

The input could be:

000000075000080094000500600010000200000900057006003040001000023080000006063240000

On to our first Perl 6 program:

# file sudoku.p6
use v6;
my $sudoku = '000000075000080094000500600010000200000900057006003040001000023080000006063240000';
for 0..8 -> $line-number {
    say substr $sudoku, $line-number * 9, 9;
}

You can run it like this:

$ perl6 sudoku.p6
000000075
000080094
000500600
010000200
000900057
006003040
001000023
080000006
063240000

There's not much magic in there, but let's go through the code one line at a time.

The first line, starting with a #, is a comment that extends to the end of the line.

use v6;

This line is not strictly necessary, but good practice anyway. It declares the Perl version you are using, here v6, so any version of the Perl 6 language. We could be more specific and say use v6.c; to require exactly the version discussed here. If you ever accidentally run a Perl 6 program through Perl 5, you'll be glad you included this line, because it'll tell you:

$ perl sudoku.p6
Perl v6.0.0 required--this is only v5.22.1, stopped at sudoku.p6 line 1.
BEGIN failed--compilation aborted at sudoku.p6 line 1.

instead of the much more cryptic

syntax error at sudoku.p6 line 4, near "for 0"
Execution of sudoku.p6 aborted due to compilation errors.

The first interesting line is

my $sudoku = '00000007500...';

my declares a lexical variable. It is visible from the point of the declaration to the end of the current scope, which means either to the end of the current block delimited by curly braces, or to the end of the file if it's outside any block. As it is in this example.

Variables start with a sigil, here a '$'. Sigils are what gave Perl the reputation of being line noise, but there is signal in the noise. The $ looks like an S, which stands for scalar. If you know some math, you know that a scalar is just a single value, as opposed to a vector or even a matrix.

The variable doesn't start its life empty, because there's an initialization right next to it. The value it starts with is a string literal, as indicated by the quotes.

Note that there is no need to declare the type of the variable beyond the very vague "it's a scalar" implied by the sigil. If we wanted, we could add a type constraint:

my Str $sudoku = '00000007500...';

But when quickly prototyping, I tend to forego type constraints, because I often don't know yet how exactly the code will work out.

The actual logic happens in the next lines, by iterating over the line numbers 0 to 8:

for 0..8 -> $line-number {
    ...
}

The for loop has the general structure for ITERABLE BLOCK. Here the iterable is a range, and the block is a pointy block. The block starts with ->, which introduces a signature. The signature tells the compiler what arguments the block expects, here a single scalar called $line-number.

Perl 6 allows you to use a dash - or a single quote ' inside identifiers, as long as there is a letter on both sides (to disambiguate it from subtraction).

Again, type constraints are optional. If you choose to include them, it would be for 0..8 -> Int $line-number { ... }.

$line-number is again a lexical variable, and visible inside the block that comes after the signature. Blocks are delimited by curly braces.

say substr $sudoku, $line-number * 9, 9;

Both say and substr are functions provided by the Perl 6 standard library. substr($string, $start, $chars) extracts a substring of (up to) $chars characters length from $string, starting from index $start. Oh, and indexes are zero-based in Perl 6.

say then prints this substring, followed by a line break.

As you can see from the example, function invocations don't need parentheses, though you can add them if you want:

say substr($sudoku, $line-number * 9, 9);

or even

say(substr($sudoku, $line-number * 9, 9));

Making the Sudoku playable

As the output of our script stands now, you can't play the resulting Sudoku even if you printed it, because all those pesky zeros get in your way of actually entering the numbers you carefully deduce while solving the puzzle.

So, let's substitute each 0 with a blank:

# file sudoku.p6
use v6;

my $sudoku = '000000075000080094000500600010000200000900057006003040001000023080000006063240000';
$sudoku = $sudoku.trans('0' => ' ');

for 0..8 -> $line-number {
    say substr $sudoku, $line-number * 9, 9;
}

trans is a method of the Str class. Its argument is a Pair. The boring way to create a Pair would be Pair.new('0', ' '), but since it's so commonly used, there is a shortcut in the form of the fat arrow, =>. The method trans replaces each occurrence of the pair's key with the pair's value, and returns the resulting string.

Speaking of shortcuts, you can also shorten $sudoku = $sudoku.trans(...) to $sudoku.=trans(...). This is a general pattern that turns methods that return a result into mutators.

With the new string substitution, the result is playable, but ugly:

$ perl6 sudoku.p6
       75
    8  94
   5  6  
 1    2  
   9   57
  6  3 4 
  1    23
 8      6
 6324    

A bit of ASCII art makes it bearable:

+---+---+---+
|   | 1 |   |
|   |   |79 |
| 9 |   | 4 |
+---+---+---+
|   |  4|  5|
|   |   | 2 |
|3  | 29|18 |
+---+---+---+
|  4| 87|2  |
|  7|  2|95 |
| 5 |  3|  8|
+---+---+---+

To get the vertical dividing lines, we need to sub-divide the lines into smaller chunks. And since we already have one occurrence of dividing a string into smaller strings of a fixed size, it's time to encapsulate it into a function:

sub chunks(Str $s, Int $chars) {
    gather for 0 .. $s.chars / $chars - 1 -> $idx {
        take substr($s, $idx * $chars, $chars);
    }
}

for chunks($sudoku, 9) -> $line {
    say chunks($line, 3).join('|');
}

The output is:

$ perl6 sudoku.p6
   |   | 75
   | 8 | 94
   |5  |6  
 1 |   |2  
   |9  | 57
  6|  3| 4 
  1|   | 23
 8 |   |  6
 63|24 |   

But how did it work? Well, sub (SIGNATURE) BLOCK declares a subroutine, short sub. Here I declare it to take two arguments, and since I tend to confuse the order of arguments to functions I call, I've added type constraints that make it very likely that Perl 6 catches the error for me.

gather and take work together to create a list. gather is the entry point, and each execution of take adds one element to the list. So

gather {
    take 1;
    take 2;
}

would return the list 1, 2. Here gather acts as a statement prefix, which means it collects all takes from within the for loop.

A subroutine returns the value from the last expression, which here is the gather for ... thing discussed above.

Coming back to the program, the for-loop now looks like this:

for chunks($sudoku, 9) -> $line {
    say chunks($line, 3).join('|');
}

So first the program chops up the full Sudoku string into lines of nine characters, and then for each line, again into a list of three strings of three characters length. The join method turns it back into a string, but with pipe symbols inserted between the chunks.

There are still vertical bars missing at the start and end of the line, which can easily be hard-coded by changing the last line:

    say '|', chunks($line, 3).join('|'), '|';

Now the output is

|   |   | 75|
|   | 8 | 94|
|   |5  |6  |
| 1 |   |2  |
|   |9  | 57|
|  6|  3| 4 |
|  1|   | 23|
| 8 |   |  6|
| 63|24 |   |

Only the horizontal lines are missing, which aren't too hard to add:

my $separator = '+---+---+---+';
my $index = 0;
for chunks($sudoku, 9) -> $line {
    if $index++ %% 3 {
        say $separator;
    }
    say '|', chunks($line, 3).join('|'), '|';
}
say $separator;

Et voila:

+---+---+---+
|   |   | 75|
|   | 8 | 94|
|   |5  |6  |
+---+---+---+
| 1 |   |2  |
|   |9  | 57|
|  6|  3| 4 |
+---+---+---+
|  1|   | 23|
| 8 |   |  6|
| 63|24 |   |
+---+---+---+

There are two new aspects here. The first is the if conditional, which structurally very much resembles the for loop. The second is the divisibility operator, %%. From other programming languages you probably know % for modulo, but since $number % $divisor == 0 is such a common pattern, $number %% $divisor is Perl 6's shortcut for it.

Shortcuts, Constants, and more Shortcuts

Perl 6 is modeled after human languages, which have some kind of compression scheme built in, where commonly used words tend to be short, and common constructs have shortcuts.

As such, there are lots of ways to write the code more succinctly. The first is basically cheating, because the sub chunks can be replaced by a built-in method in the Str class, comb:

# file sudoku.p6
use v6;

my $sudoku = '000000075000080094000500600010000200000900057006003040001000023080000006063240000';
$sudoku = $sudoku.trans: '0' => ' ';

my $separator = '+---+---+---+';
my $index = 0;
for $sudoku.comb(9) -> $line {
    if $index++ %% 3 {
        say $separator;
    }
    say '|', $line.comb(3).join('|'), '|';
}
say $separator;

The if conditional can be applied as a statement postfix:

say $separator if $index++ %% 3;

Except for the initialization, the variable $index is used only once, so there's no need to give it a name. Yes, Perl 6 has anonymous variables:

my $separator = '+---+---+---+';
for $sudoku.comb(9) -> $line {
    say $separator if $++ %% 3;
    say '|', $line.comb(3).join('|'), '|';
}
say $separator;

Since $separator is a constant, we can declare it as one:

constant $separator = '+---+---+---+';

If you want to reduce the line noise factor, you can also forego the sigil, so constant separator = '...'.

Finally, there is another syntax for method calls with arguments: instead of $obj.method(args) you can say $obj.method: args, which brings us to the idiomatic form of the small Sudoku formatter:

# file sudoku.p6
use v6;

my $sudoku = '000000075000080094000500600010000200000900057006003040001000023080000006063240000';
$sudoku = $sudoku.trans: '0' => ' ';

constant separator = '+---+---+---+';
for $sudoku.comb(9) -> $line {
    say separator if $++ %% 3;
    say '|', $line.comb(3).join('|'), '|';
}
say separator;

IO and other Tragedies

A practical script doesn't contain its input as a hard-coded string literal, but reads it from the command line, standard input or a file.

If you want to read the Sudoku from the command line, you can declare a subroutine called MAIN, which gets all command line arguments passed in:

# file sudoku.p6
use v6;

constant separator = '+---+---+---+';

sub MAIN($sudoku) {
    my $substituted = $sudoku.trans: '0' => ' ';

    for $substituted.comb(9) -> $line {
        say separator if $++ %% 3;
        say '|', $line.comb(3).join('|'), '|';
    }
    say separator;
}

This is how it's called:

$ perl6-m sudoku-format-08.p6 000000075000080094000500600010000200000900057006003040001000023080000006063240000
+---+---+---+
|   |   | 75|
|   | 8 | 94|
|   |5  |6  |
+---+---+---+
| 1 |   |2  |
|   |9  | 57|
|  6|  3| 4 |
+---+---+---+
|  1|   | 23|
| 8 |   |  6|
| 63|24 |   |
+---+---+---+

And you even get a usage message for free if you use it wrongly, for example by omitting the argument:

$ perl6-m sudoku.p6 
Usage:
  sudoku.p6 <sudoku> 

You might have noticed that the last example uses a separate variable for the substituted Sudoku string. This is because function parameters (aka variables declared in a signature) are read-only by default. Instead of creating a new variable, I could have also written sub MAIN($sudoku is copy) { ... }.

Classic UNIX programs, such as cat and wc, follow the convention of reading their input from file names given on the command line, or from the standard input if no file names are given on the command line.

If you want your program to follow this convention, lines() provides a stream of lines from either of these sources:

# file sudoku.p6
use v6;

constant separator = '+---+---+---+';

for lines() -> $sudoku {
    my $substituted = $sudoku.trans: '0' => ' ';

    for $substituted.comb(9) -> $line {
        say separator if $++ %% 3;
        say '|', $line.comb(3).join('|'), '|';
    }
    say separator;
}

Get Creative!

You won't learn a programming language from reading a blog, you have to actually use it, tinker with it. If you want to expand on the examples discussed earlier, I'd encourage you to try to produce Sudokus in different output formats.

SVG offers a good ratio of result to effort. This is the rough skeleton of an SVG file for a Sudoku:

<?xml version="1.0" standalone="no"?>
<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN" "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd">
<svg width="304" height="304" version="1.1"
xmlns="http://www.w3.org/2000/svg">
    <line x1="0" x2="300" y1="33.3333" y2="33.3333" style="stroke:grey" />
    <line x1="0" x2="300" y1="66.6667" y2="66.6667" style="stroke:grey" />
    <line x1="0" x2="303" y1="100" y2="100" style="stroke:black;stroke-width:2" />
    <line x1="0" x2="300" y1="133.333" y2="133.333" style="stroke:grey" />
    <!-- more horizontal lines here -->

    <line y1="0" y2="300" x1="33.3333" x2="33.3333" style="stroke:grey" />
    <!-- more vertical lines here -->


    <text x="43.7333" y="124.5"> 1 </text>
    <text x="43.7333" y="257.833"> 8 </text>
    <!-- more cells go here -->
    <rect width="304" height="304" style="fill:none;stroke-width:1;stroke:black;stroke-width:6"/>
</svg>

If you have a Firefox or Chrome browser, you can use it to open the SVG file.

If you are adventurous, you could also write a Perl 6 program that renders the Sudoku as a Postscript (PS) or Embedded Postscript document. It's also a text-based format.

Subscribe to the Perl 6 book mailing list


perlancar's blog: Getopt modules 03: Getopt::Long::Descriptive


About this mini-article series. Each day for 24 days, I will be reviewing a module that parses command-line options (such module is usually under the Getopt::* namespace). First article is here.

Note that from this day on, all the reviewed modules are non-core, since only Getopt::Std and Getopt::Long are core modules. The choice to use these modules must take this factor into account, as your users bear the additional cost of installing the module from CPAN (unless your application bundles the module).

Some of these modules are wrappers for Getopt::Long, either because the author wants to offer a different interface and/or add some missing features. Some of the modules are higher-level: they are more than mere option parsing modules, usually a CLI framework.

Among the missing features often added is the ability to generate a usage message (the other common one is the ability to parse commands/subcommands). When using Getopt::Long, one already specifies a list of options, but there is no way to add a summary string for each option, making it impossible to create a useful/nice usage message. The modules solve this problem either by allowing the user to specify a per-option summary string, or by using/parsing a user-supplied usage text or POD.

Getopt::Long::Descriptive is one such module: it allows you to specify a per-option summary string, as well as a default value for an option and whether an option is required. Judging from the number of reverse dependencies, Getopt::Long::Descriptive is the fourth most popular option parsing module on CPAN with 64 reverse dependencies (after Getopt::Long with 1127, Getopt::Std with 167, and MooseX::Getopt with 134). I have actually reviewed Getopt::Long::Descriptive in one of my Perinci::CmdLine tutorial posts.
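For illustration, here's a minimal sketch of a Getopt::Long::Descriptive-based script (the option names are made up), showing a summary string for each option, a default, and a required option:

use strict;
use warnings;
use Getopt::Long::Descriptive;

my ($opt, $usage) = describe_options(
    'myprog %o <file>',
    [ 'host|h=s',  "the host to connect to", { required => 1 } ],
    [ 'port|p=i',  "the port to connect to", { default  => 443 } ],
    [],
    [ 'verbose|v', "print extra stuff" ],
    [ 'help',      "print usage message and exit" ],
);

print($usage->text), exit if $opt->help;
print "connecting to ", $opt->host, ":", $opt->port, "\n";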

Aside from my minor nitpick described in the linked post, there are two additional notes: Getopt::Long::Descriptive depends on another non-core module, Sub::Exporter, and its startup overhead is about twice that of Getopt::Long:

| participant                | time (ms) | mod_overhead_time (ms) |
|----------------------------+-----------+------------------------|
| Getopt::Long::Descriptive  |        36 |                   33.9 |
| Getopt::Long               |        15 |                   12.9 |
| Getopt::Std                |       3.8 |                    1.7 |
| perl -e1 (baseline)        |       2.1 |                      0 |

Not that this should be a concern for most people. If you use Getopt::Long, Getopt::Long::Descriptive comes recommended.

Tab completion. If you have a Getopt::Long::Descriptive-based CLI script, your users can now also use shcompgen to get tab completion, because shcompgen now supports detecting Getopt::Long::Descriptive-based scripts and activating tab completion for such scripts. In shells like fish and zsh, the description for each option will even be shown.


perlancar's blog: Getopt modules 02: Getopt::Std


About this mini-article series. Each day for 24 days, I will be reviewing a module that parses command-line options (such module is usually under the Getopt::* namespace). First article is here.

Getopt::Std is the other core module that comes with Perl for parsing command-line options. The problem is, it only supports one-letter options. Since Getopt::Long also supports one-letter options, there is really little or no reason for you to use Getopt::Std.

But Getopt::Std is still the second most popular CPAN module when it comes to parsing command-line options, if you look at the number of reverse dependencies it has (167, after Getopt::Long which has 1127). Maybe this is because some people prefer using only short options. Also note that since these two modules are core, some distributions do not specify them as dependencies, which means that the actual number of reverse dependencies for these two is higher.

One advantage of using Getopt::Std is its dead-simple API. You just call getopts() and user-specified options will be collected in $opt_* variables. Perfect if you have a short script that only accepts a few one-letter options.
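For example, a minimal sketch (the option letters here are hypothetical):

use strict;
use warnings;
use Getopt::Std;

# -o and -i are flags, -f takes a value
our ($opt_o, $opt_i, $opt_f);
getopts('oif:');

print "overwrite mode\n"       if $opt_o;
print "reading from: $opt_f\n" if defined $opt_f;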

Tab completion. If you have a Getopt::Std-based CLI script, your users can use shcompgen to get tab completion, because shcompgen now supports detecting Getopt::Std-based scripts and activating tab completion for such scripts. It only supports completing option names though. To be more useful (completing option values and arguments) you will need to use one of the other Getopt modules, e.g. Getopt::Long::Complete.


Sawyer X: Perl 5 Porters Mailing List Summary: November 21st-30th

Hey everyone,

Following is the p5p (Perl 5 Porters) mailing list summary for the past week and a half.

Enjoy!

November 21st-30th

News and highlights

New API function by Dave Mitchell: sv_set_undef, which can be used as a better version of sv_setsv(sv, &PL_sv_undef).

Andy Lester announced he will be removing the support he originally added to the Splint source code analyzer. It is not maintained and he could not even get it to compile.

Karl Williamson updated about new Emoji characters and skin tones.

Dan Kogai released Encode 2.88.

Grant reports

  • Tony Cook TPF Grant 8 report 5.
  • Tony Cook TPF Grant 8 report 6.
  • Dave Mitchell TPF Grant 2 reports #149, #150, and #151.

Issues

New issues

Resolved issues

  • Perl #126614: Assert fail/segfault in Perl_sv_pvn_force_flags.
  • Perl #129347: null pointer deref S_ft_return_false.
  • Perl #129953: lib/locale.t: Test failures and segfaulting on FreeBSD-11 and FreeBSD-12.
  • Perl #130101: Reduce memory needed for Data::Dumper.
  • Perl #130128: [PATCH] time64.c tweaks, take 2.
  • Perl #130132: Bleadperl v5.25.6-266-ga083329 breaks SBECK/Date-Manip-6.56.tar.gz.
  • Perl #130133: [PATCH] Configure leaves garbage in $Config{longdblinfbytes}.
  • Perl #130163: lib/locale.t fails on FreeBSD 10.
  • Perl #130169: Fix const correctness for header files.
  • Perl #130188: Crash on return from substitution in subroutine.

Suggested patches

John Lightsey (JD) mentioned ([perl #68348] Storable null pointer deref on truncated data) that the patch in Perl #130098 (Multiple segfaults in Storable) also fixes Perl #68348 (Storable null pointer deref on truncated data).

Andy Lester provided a patch in Perl #130194 to organize files in pod/.gitignore correctly.

Tony Cook provided a patch for Perl #130108 (Perl 5.24.1 fails to compile with DTrace enabled on FreeBSD).

Andy Lester also found several issues when running clang with -Weverything and provided a patch in Perl #130195 to resolve them.

Sullivan Beck provided a patch to upgrade Locale::Codes to version 3.42.

Discussion

Eric Brine asks if refaliasing will no longer be experimental in 5.26.

There is an interesting conversation around new syntax for a foreach with a loop index.

perlancar's blog: Getopt modules 01: Getopt::Long


About this mini-article series. Each day for 24 days, I will be reviewing a module that parses command-line options (such module is usually under the Getopt::* namespace).

Today, our module is Getopt::Long. I actually have reviewed it in a couple of my Perinci::CmdLine tutorial posts: Getopt::Long and What’s wrong with Getopt::Long. To recap: Getopt::Long is a core module that should be your go-to module for parsing command-line options.

Two things to remember. First, you should start your code with something like this:

Getopt::Long::Configure("bundling", "no_ignore_case", "permute", "no_getopt_compat");

bundling is to enable you to say -abc instead of -a -b -c if you happen to have these short options. Most Unix programs allow this. no_ignore_case is so that Getopt::Long differentiates -v and -V. Most Unix programs also behave like this, they are not case-insensitive when it comes to command-line options. Since there are only so many Latin letters, very often the lowercase letter and uppercase letter are used for different purposes.

permute is to allow you to say --option val --flag arg1 arg2 --another-flag. That is, you can intersperse command-line options and arguments. For convenience, many Unix programs behave like this. For example, you can say ls -l A* or ls A* -l. Under no_permute mode (which becomes the default when the environment variable POSIXLY_CORRECT is set), Getopt::Long requires the user to specify all options before any arguments, which is rather inconvenient.

no_getopt_compat is to disable interpreting +foo the same as --foo. Most programs nowadays do not interpret + as the start of command-line options anymore. Enabling getopt_compat (the default) only serves to interfere; for example, if you have a filename like +foo then you'll have to write it as ./+foo to avoid it being parsed as a command-line option.

These modes should be the default, right? Just like use strict and use warnings should be the default in Perl. But for the sake of backward compatibility, they aren’t.
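Putting the four modes together, a minimal sketch (the options themselves are made up for illustration):

#!/usr/bin/env perl
use strict;
use warnings;
use Getopt::Long;

Getopt::Long::Configure("bundling", "no_ignore_case", "permute", "no_getopt_compat");

my $verbose = 0;
GetOptions(
    'verbose|v+' => \$verbose,      # -vv works because of bundling
    'version|V'  => \my $version,   # distinct from -v because of no_ignore_case
    'output|o=s' => \my $output,
) or die "Usage: $0 [-v] [-V] [-o FILE] [FILE ...]\n";

print "version requested\n"    if $version;
print "output file: $output\n" if defined $output;

# because of permute, options and arguments can be interspersed,
# e.g.: myprog -v foo.txt -o out.txt bar.txt
print "remaining arguments: @ARGV\n";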

Tab completion. Another thing I want to add is: if you have a Getopt::Long-based CLI script, aside from modifying your script to use Getopt::Long::Complete instead, your users can now also use shcompgen to get tab completion, because shcompgen now supports detecting Getopt::Long-based scripts and activating tab completion for such scripts.


Perl Foundation News: Metacpan Upgraded

The team that supports meta::cpan had a busy weekend as they completed their long project to upgrade all of the metacpan infrastructure. Thanks to the team for all of their work and to the sponsors who made meta::hack possible. The site looks and performs great!

Dave's Free Press: Journal: Module pre-requisites analyser

Dave's Free Press: Journal: CPANdeps

Dave's Free Press: Journal: Perl isn't dieing

Dave's Free Press: Journal: YAPC::Europe 2007 report: day 3

Dave's Free Press: Journal: Devel::CheckLib can now check libraries' contents

Ocean of Awareness: Top-down parsing is guessing

Top-down parsing is guessing. Literally. Bottom-up parsing is looking.

The way you'll often hear that phrased is that top-down parsing is looking, starting at the top, and bottom-up parsing is looking, starting at the bottom. But that is misleading, because the input is at the bottom -- at the top there is nothing to look at. A usable top-down parser must have a bottom-up component, even if that component is just lookahead.

A more generous, but still accurate, way to describe the top-down component of parsers is "prediction". And prediction is, indeed, a very useful component of a parser, when used in combination with other techniques.

Of course, if a parser does nothing but predict, it can predict only one input. Top-down parsing must always be combined with a bottom-up component. This bottom-up component may be as modest as lookahead, but it must be there or else top-down parsing is really not parsing at all.

So why is top-down parsing used so much?

Top-down parsing may be unusable in its pure form, but from one point of view that is irrelevant. Top-down parsing's biggest advantage is that it is highly flexible -- there's no reason to stick to its "pure" form.

A top-down parser can be written as a series of subroutine calls -- a technique called recursive descent. Recursive descent allows you to hook in custom-written bottom-up logic at every top-down choice point, and it is a technique which is completely understandable to programmers with little or no training in parsing theory. When dealing with recursive descent parsers, it is more useful to be a seasoned, far-thinking programmer than it is to be a mathematician. This makes recursive descent very appealing to seasoned, far-thinking programmers, and they are the audience that counts.
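
To make that concrete, here is a minimal recursive descent sketch of my own (not from the post) for a toy grammar, with one subroutine per nonterminal and the bottom-up component reduced to a trivial token scanner:

    use strict;
    use warnings;
    use Data::Dumper;

    # Toy grammar:   program ::= stmt*    stmt ::= 'print' NAME ';'
    my @tokens;    # each token is [ type, text ], filled by scan() below

    sub next_token { $tokens[0] }
    sub expect {
        my ($type) = @_;
        my $t = shift @tokens or die "expected $type, got end of input\n";
        die "expected $type, got '$t->[1]'\n" unless $t->[0] eq $type;
        return $t;
    }

    # One subroutine per nonterminal -- the "series of subroutine calls".
    sub parse_stmt {
        expect('PRINT');
        my $name = expect('NAME');
        expect('SEMI');
        return [ print_stmt => $name->[1] ];
    }
    sub parse_program {
        my @stmts;
        push @stmts, parse_stmt() while next_token();
        return \@stmts;
    }

    # The bottom-up component: a scanner that produces the tokens.
    sub scan {
        my ($src) = @_;
        @tokens = map {
              $_ eq 'print' ? [ PRINT => $_ ]
            : $_ eq ';'     ? [ SEMI  => $_ ]
            :                 [ NAME  => $_ ]
        } $src =~ /([A-Za-z_]\w*|;)/g;
    }

    scan('print foo; print bar;');
    print Dumper( parse_program() );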

Switching techniques

You can even use the flexibility of top-down to switch away from top-down parsing. For example, you could claim that a top-down parser could do anything my own parser (Marpa) could do, because a recursive descent parser can call a Marpa parser.

A less dramatic switchoff, and one that still leaves the parser with a good claim to be basically top-down, is very common. Arithmetic expressions are essential for a computer language. But they are also among the many things top-down parsing cannot handle, even with ordinary lookahead. Even so, most computer languages these days are parsed top-down -- by recursive descent. These recursive descent parsers deal with expressions by temporarily handing control over to a bottom-up operator precedence parser. Neither of these parsers is extremely smart about the hand-over and hand-back -- it is up to the programmer to make sure the two play together nicely. But used with caution, this approach works.
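
Here is a sketch of my own (again, not from the post) of that hand-over: a recursive descent rule for a return statement that passes control to a small precedence climbing routine for the expression, then takes control back:

    use strict;
    use warnings;
    use Data::Dumper;

    my @tokens;    # pre-split tokens; a real parser would share a scanner
    my %prec = ( '+' => 1, '-' => 1, '*' => 2, '/' => 2 );

    sub parse_primary {    # numbers only, for brevity
        my $t = shift @tokens;
        die "expected a number\n" unless defined $t && $t =~ /^\d+$/;
        return 0 + $t;
    }

    # The bottom-up hand-off: classic precedence climbing for binary operators.
    sub parse_expression {
        my ($min_prec) = @_;
        my $lhs = parse_primary();
        while ( @tokens
            && exists $prec{ $tokens[0] }
            && $prec{ $tokens[0] } >= $min_prec )
        {
            my $op  = shift @tokens;
            my $rhs = parse_expression( $prec{$op} + 1 );    # left-associative
            $lhs = [ $op, $lhs, $rhs ];
        }
        return $lhs;
    }

    # The top-down side: a recursive descent rule that calls the
    # expression parser and then resumes in charge.
    sub parse_return_stmt {
        die "expected 'return'\n" unless @tokens && shift(@tokens) eq 'return';
        my $expr = parse_expression(1);
        die "expected ';'\n" unless @tokens && shift(@tokens) eq ';';
        return [ return_stmt => $expr ];
    }

    @tokens = ( 'return', 3, '+', 4, '*', 5, ';' );
    print Dumper( parse_return_stmt() );    # [ 'return_stmt', [ '+', 3, [ '*', 4, 5 ] ] ]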

Top-down parsing and language-oriented programming

But what about taking top-down methods into the future of language-oriented programming, extensible languages, and grammars which write grammars? Here we are forced to confront the reality -- that the effectiveness of top-down parsing comes entirely from the foreign elements that are added to it. Starting from a basis of top-down parsing is literally starting with nothing. As I have shown in more detail elsewhere, top-down techniques simply do not have enough horsepower to deal with grammar-driven programming.

Perl 6 grammars are top-down -- PEG with lots of extensions. These extensions include backtracking, backtracking control, a new style of tie-breaking and lots of opportunity for the programmer to intervene and customize everything. But behind it all is a top-down parse engine.

One aspect of Perl 6 grammars might be seen as breaking out of the top-down trap. That trick of switching over to a bottom-up operator precedence parser for expressions, which I mentioned above, is built into Perl 6 and semi-automated. (I say semi-automated because making sure the two parsers "play nice" with each other is not automated -- that's still up to the programmer.)

As far as I know, this semi-automation of expression handling is new with Perl 6 grammars, and it may prove handy for duplicating what is done in recursive descent parsers. But it adds no new technique to those already in use. And features like

  • multiple types of expression, which can be told apart based on their context,
  • n-ary expressions for arbitrary n, and
  • the autogeneration of multiple rules, each allowing a different precedence scheme, for expressions of arbitrary arity and associativity,

all of which are available and in current use in Marpa, are impossible for the technology behind Perl 6 grammars.

I am a fan of the Perl 6 effort. Obviously, I have doubts about one specific set of hopes for Perl 6 grammars. But these hopes have not been central to the Perl 6 effort, and I will be an eager student of the Perl 6 team's work over the coming months.

Comments

To learn more about Marpa, there's the official web site maintained by Ron Savage. I also have a Marpa web site. Comments on this post can be made in Marpa's Google group, or on our IRC channel: #marpa at freenode.net.

Dave's Free Press: Journal: I Love Github

Dave's Free Press: Journal: Palm Treo call db module

Dave's Free Press: Journal: Graphing tool

Dave's Free Press: Journal: Travelling in time: the CP2000AN

Dave's Free Press: Journal: XML::Tiny released

Dave's Free Press: Journal: YAPC::Europe 2007 report: day 1

Ocean of Awareness: Parsing: an expanded timeline

The fourth century BCE: In India, Panini creates a sophisticated description of the Sanskrit language, exact and complete, and including pronunciation. Sanskrit could be recreated using nothing but Panini's grammar. Panini's grammar is probably the first formal system of any kind, predating Euclid. Even today, nothing like it exists for any other natural language of comparable size or corpus. Panini is the object of serious study today. But in the 1940's and 1950's Panini is almost unknown in the West. His work has no direct effect on the other events in this timeline.

1943: Emil Post defines and studies a formal rewriting system using productions. With this, the process of reinventing Panini in the West begins.

1948: Claude Shannon publishes the foundation paper of information theory. Andrey Markov's finite state processes are used heavily.

1952: Grace Hopper writes a linker-loader and describes it as a "compiler". She seems to be the first person to use this term for a computer program. Hopper uses the term "compiler" in its original sense: "something or someone that brings other things together".

1954: At IBM, a team under John Backus begins working on the language which will be called FORTRAN. The term "compiler" is still being used in Hopper's looser sense, instead of its modern one. In particular, there is no implication that the output of a "compiler" is ready for execution by a computer. The output of one 1954 "compiler", for example, consists of relative addresses, which need to be translated by hand before a machine can execute them.

1955: Noam Chomsky is awarded a Ph.D. in linguistics and accepts a teaching post at MIT. MIT does not have a linguistics department and Chomsky, in his linguistics course, is free to teach his own approach, highly original and very mathematical.

1956: Chomsky publishes the paper which is usually considered the foundation of Western formal language theory. The paper advocates a natural language approach that involves

  • a bottom layer, using Markov's finite state processes;
  • a middle, syntactic layer, using context-free grammars and context-sensitive grammars; and
  • a top layer, which involves mappings or "transformations" of the output of the syntactic layer.

These layers resemble, and will inspire, the lexical, syntactic and AST transformation phases of modern parsers. For finite state processes, Chomsky acknowledges Markov. The other layers seem to be Chomsky's own formulations -- Chomsky does not cite Post's work.

1957: Stephen Kleene discovers regular expressions, a very handy notation for Markov's processes. Regular expressions turn out to describe exactly the mathematical objects being studied as finite state automata, as well as some of the objects being studied as neural nets.

1957: Noam Chomsky publishes Syntactic Structures, one of the most influential books of all time. The orthodoxy in 1957 is structural linguistics which argues, with Sherlock Holmes, that "it is a capital mistake to theorize in advance of the facts". Structuralists start with the utterances in a language, and build upward.

But Chomsky claims that without a theory there are no facts: there is only noise. The Chomskyan approach is to start with a grammar, and use the corpus of the language to check its accuracy. Chomsky's approach will soon come to dominate linguistics.

1957: Backus's team makes the first FORTRAN compiler available to IBM customers. FORTRAN is the first high-level language that will find widespread implementation. As of this writing, it is the oldest language that survives in practical use. FORTRAN is a line-by-line language and its parsing is primitive.

1958: John McCarthy's LISP appears. LISP goes beyond the line-by-line syntax -- it is recursively structured. But the LISP interpreter does not find the recursive structure: the programmer must explicitly indicate the structure herself, using parentheses.

1959: Backus invents a new notation to describe the IAL language (aka ALGOL). Backus's notation is influenced by his study of Post -- he seems not to have read Chomsky until later.

1960: Peter Naur improves the Backus notation and uses it to describe ALGOL 60. The improved notation will become known as Backus-Naur Form (BNF).

1960: The ALGOL 60 report specifies, for the first time, a block structured language. ALGOL 60 is recursively structured but the structure is implicit -- newlines are not semantically significant, and parentheses indicate syntax only in a few specific cases. The ALGOL compiler will have to find the structure. It is a case of 1960's optimism at its best. As the ALGOL committee is well aware, a parsing algorithm capable of handling ALGOL 60 does not yet exist. But the risk they are taking will soon pay off.

1960: A.E. Glennie publishes his description of a compiler-compiler. Glennie's "universal compiler" is more of a methodology than an implementation -- the compilers must be written by hand. Glennie credits both Chomsky and Backus, and observes that the two notations are "related". He also mentions Post's productions. Glennie may have been the first to use BNF as a description of a procedure instead of as the description of a Chomsky grammar. Glennie points out that the distinction is "important".

Chomskyan BNF and procedural BNF: BNF, when used as a Chomsky grammar, describes a set of strings, and does not describe how to parse strings according to the grammar. BNF notation, if used to describe a procedure, is a set of instructions, to be tried in some order, and used to process a string. Procedural BNF describes a procedure first, and a language only indirectly.

Both procedural and Chomskyan BNF describe languages, but usually not the same language. That is,

  • Suppose D is some BNF description.
  • Let P(D) be D interpreted as a procedure,
  • Let L(P(D)) be the language which the procedure P(D) parses.
  • Let G(D) be D interpreted as a Chomsky grammar.
  • Let L(G(D)) be the language which the grammar G(D) describes.
  • Then, usually, L(P(D)) != L(G(D)).
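
A tiny concrete case, of my own devising: let D be S ::= "a" | "a" "a". Read as a Chomsky grammar, L(G(D)) is { "a", "aa" }. Read as a procedure with ordered choice that commits to the first alternative that succeeds, "aa" is rejected, because the first alternative consumes the leading "a" and the leftover input then fails. Perl's atomic group happens to model exactly that commitment, so the difference can be demonstrated directly:

    use strict;
    use warnings;

    # D:  S ::= "a" | "a" "a"
    for my $input ( 'a', 'aa' ) {
        my $as_grammar   = $input =~ /^(?:a|aa)$/ ? 'yes' : 'no';  # backtracking alternation, like L(G(D))
        my $as_procedure = $input =~ /^(?>a|aa)$/ ? 'yes' : 'no';  # atomic group: ordered choice, like L(P(D))
        print "'$input'  in L(G(D)): $as_grammar   in L(P(D)): $as_procedure\n";
    }
    # 'a'  in L(G(D)): yes   in L(P(D)): yes
    # 'aa'  in L(G(D)): yes   in L(P(D)): no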

The pre-Chomskyan approach, using procedural BNF, is far more natural to someone trained as a computer programmer. The parsing problem appears to the programmer in the form of strings to be parsed, exactly the starting point of procedural BNF and pre-Chomsky parsing.

Even when the Chomskyan approach is pointed out, it does not at first seem very attractive. With the pre-Chomskyan approach, the examples of the language more or less naturally lead to a parser. In the Chomskyan approach the programmer has to search for an algorithm to parse strings according to his grammar -- and the search for good algorithms to parse Chomskyan grammars has proved surprisingly long and difficult. Handling semantics is more natural with a Chomskyan approach. But, using captures, semantics can be added to a pre-Chomskyan parser and, with practice, this seems natural enough.

Despite the naturalness of the pre-Chomskyan approach to parsing, we will find that the first fully-described automated parsers are Chomskyan. This is a testimony to Chomsky's influence at the time. We will also see that Chomskyan parsers have been dominant ever since.

1961: In January, Ned Irons publishes a paper describing his ALGOL 60 parser. It is the first paper to fully describe any parser. The Irons algorithm is Chomskyan and top-down with a "left corner" element. The Irons algorithm is general, meaning that it can parse anything written in BNF. It is syntax-driven (aka declarative), meaning that the parser is actually created from the BNF -- the parser does not need to be hand-written.

1961: Peter Lucas publishes the first description of a purely top-down parser. This can be considered to be recursive descent, though in Lucas's paper the algorithm has a syntax-driven implementation, useable only for a restricted class of grammars. Today we think of recursive descent as a methodology for writing parsers by hand. Hand-coded approaches became more popular in the 1960's due to three factors:

  • Memory and CPU were both extremely limited. Hand-coding paid off, even when the gains were small.
  • Non-hand coded top-down parsing, of the kind Lucas's syntax-driven approach allowed, is a very weak parsing technique. It was (and still is) often necessary to go beyond its limits.
  • Top-down parsing is intuitive -- it essentially means calling subroutines. It therefore requires little or no knowledge of parsing theory. This makes it a good fit for hand-coding.

1963: L. Schmidt, Howard Metcalf, and Val Schorre present papers on syntax-directed compilers at a Denver conference.

1964: Schorre publishes a paper on the Meta II "compiler writing language", summarizing the papers of the 1963 conference. Schorre cites both Backus and Chomsky as sources for Meta II's notation. Schorre notes that his parser is "entirely different" from that of Irons 1961 -- in fact it is pre-Chomskyan. Meta II is a template, rather than something that readers can use, but in principle it can be turned into a fully automated compiler-compiler.

1965: Don Knuth invents LR parsing. The LR algorithm is deterministic, Chomskyan and bottom-up, but it is not thought to be practical. Knuth is primarily interested in the mathematics.

1968: Jay Earley invents the algorithm named after him. Like the Irons algorithm, Earley's algorithm is Chomskyan, syntax-driven and fully general. Unlike the Irons algorithm, it does not backtrack. Earley's algorithm is both top-down and bottom-up at once -- it uses dynamic programming and keeps track of the parse in tables. Earley's approach makes a lot of sense and looks very promising indeed, but there are three serious issues:

  • First, there is a bug in the handling of zero-length rules.
  • Second, it is quadratic for right recursions.
  • Third, the bookkeeping required to set up the tables is, by the standards of 1968 hardware, daunting.

1969: Frank DeRemer describes a new variant of Knuth's LR parsing. DeRemer's LALR algorithm requires only a stack and a state table of quite manageable size. LALR looks practical.

1969: Ken Thompson writes the "ed" editor as one of the first components of UNIX. At this point, regular expressions are an esoteric mathematical formalism. Through the "ed" editor and its descendants, regular expressions will become an everyday part of the working programmer's toolkit.

Recognizers: In comparing algorithms, it can be important to keep in mind whether they are recognizers or parsers. A recognizer is a program which takes a string and produces a "yes" or "no" according to whether the string is in the language. Regular expressions are typically used as recognizers. A parser is a program which takes a string and produces a tree reflecting its structure according to a grammar. The algorithm for a compiler clearly must be a parser, not a recognizer. Recognizers can be, to some extent, used as parsers by introducing captures.
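
As a small illustration (mine, not the timeline's), the same Perl regular expression can be used as a recognizer, answering only yes or no, or, with captures added, as a crude parser that recovers a little structure:

    use strict;
    use warnings;

    my $date = '1968-10-02';

    # As a recognizer: the only answer is yes or no.
    print "looks like a date\n" if $date =~ /^\d{4}-\d{2}-\d{2}$/;

    # With captures: the same pattern now yields some structure.
    if ( my ( $y, $m, $d ) = $date =~ /^(\d{4})-(\d{2})-(\d{2})$/ ) {
        print "year=$y month=$m day=$d\n";
    }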

1972: Alfred Aho and Jeffrey Ullman publish a two volume textbook summarizing the theory of parsing. This book is still important. It is also distressingly up-to-date -- progress in parsing theory slowed dramatically after 1972. Aho and Ullman describe a straightforward fix to the zero-length rule bug in Earley's original algorithm. Unfortunately, this fix involves adding even more bookkeeping to Earley's.

1972: Under the names TDPL and GTDPL, Aho and Ullman investigate the non-Chomskyan parsers in the Schorre lineage. They note that "it can be quite difficult to determine what language is defined by a TDPL parser". That is, GTDPL parsers do whatever they do, and that whatever is something the programmer in general will not be able to describe. The best a programmer can usually do is to create a test suite and fiddle with the GTDPL description until it passes. Correctness cannot be established in any stronger sense. GTDPL is an extreme form of the old joke that "the code is the documentation" -- with GTDPL nothing documents the language of the parser, not even the code.

GTDPL's obscurity buys nothing in the way of additional parsing power. Like all non-Chomskyan parsers, GTDPL is basically an extremely powerful recognizer. Pressed into service as a parser, it is comparatively weak. As a parser, GTDPL is essentially equivalent to Lucas's 1961 syntax-driven algorithm, which was in turn a restricted form of recursive descent.

At or around this time, rumor has it that the main line of development for GTDPL parsers is classified secret by the US government. GTDPL parsers have the property that even small changes in GTDPL parsers can be very labor-intensive. For some government contractors, GTDPL parsing provides steady work for years to come. Public interest in GTDPL fades.

1975: Bell Labs converts its C compiler from hand-written recursive descent to DeRemer's LALR algorithm.

1977: The first "Dragon book" comes out. This soon-to-be classic textbook is nicknamed after the drawing on the front cover, in which a knight takes on a dragon. Emblazoned on the knight's lance are the letters "LALR". From here on out, to speak lightly of LALR will be to besmirch the escutcheon of parsing theory.

1979: Bell Laboratories releases Version 7 UNIX. V7 includes what is, by far, the most comprehensive, useable and easily available compiler writing toolkit yet developed.

1979: Part of the V7 toolkit is Yet Another Compiler Compiler (YACC). YACC is LALR-powered. Despite its name, YACC is the first compiler-compiler in the modern sense. For some useful languages, the process of going from Chomskyan specification to executable is fully automated. Most practical languages, including the C language and YACC's own input language, still require manual hackery. Nonetheless, after two decades of research, it seems that the parsing problem is solved.

1987: Larry Wall introduces Perl 1. Perl embraces complexity like no previous language. Larry uses YACC and LALR very aggressively -- to my knowledge more aggressively than anyone before or since.

1991: Joop Leo discovers a way of speeding up right recursions in Earley's algorithm. Leo's algorithm is linear for just about every unambiguous grammar of practical interest, and many ambiguous ones as well. In 1991 hardware is six orders of magnitude faster than 1968 hardware, so that the issue of bookkeeping overhead has receded in importance. This is a major discovery. When it comes to speed, the game has changed in favor of the Earley algorithm.

But Earley parsing is almost forgotten. Twenty years will pass before anyone writes a practical implementation of Leo's algorithm.

1990's: Earley's is forgotten. So everyone in LALR-land is content, right? Wrong. Far from it, in fact. Users of LALR are making unpleasant discoveries. While LALR automatically generates their parsers, debugging them is so hard they could just as easily write the parser by hand. Once debugged, their LALR parsers are fast for correct inputs. But almost all they tell the users about incorrect inputs is that they are incorrect. In Larry's words, LALR is "fast but stupid".

2000: Larry Wall decides on a radical reimplementation of Perl -- Perl 6. Larry does not even consider using LALR again.

2002: John Aycock and R. Nigel Horspool publish their attempt at a fast, practical Earley's parser. Missing from it is Joop Leo's improvement -- they seem not to be aware of it. Their own speedup is limited in what it achieves and the complications it introduces can be counter-productive at evaluation time. But buried in their paper is a solution to the zero-length rule bug. And this time the solution requires no additional bookkeeping.

2004: Bryan Ford publishes his paper on PEG. Implementers by now are avoiding YACC, and it seems as if there might soon be no syntax-driven algorithms in practical use. Ford fills this gap by repackaging the nearly-forgotten GTDPL. Ford adds packratting, so that PEG is always linear, and provides PEG with an attractive new syntax. But nothing has been done to change the problematic behaviors of GTDPL.

2006: GNU announces that the GCC compiler's parser has been rewritten. For three decades, the industry's flagship C compilers have used LALR as their parser -- proof of the claim that LALR and serious parsing are equivalent. Now, GNU replaces LALR with the technology that it replaced a quarter century earlier: recursive descent.

Today: After five decades of parsing theory, the state of the art seems to be back where it started. We can imagine someone taking Ned Irons's original 1961 algorithm from the first paper ever published describing a parser, and republishing it today. True, he would have to translate its code from the mix of assembler and ALGOL into something more fashionable, say Haskell. But with that change, it might look like a breath of fresh air.

Marpa: an afterword

The recollections of my teachers cover most of this timeline. My own begin around 1970. Very early on, as a graduate student, I became unhappy with the way the field was developing. Earley's algorithm looked interesting, and it was something I returned to on and off.

The original vision of the 1960's was a parser that was

  • efficient,
  • practical,
  • general, and
  • syntax-driven.

By 2010 this vision seemed to have gone the same way as many other 1960's dreams. The rhetoric stayed upbeat, but parsing practice had become a series of increasingly desperate compromises.

But, while nobody was looking for them, the solutions to the problems encountered in the 1960's had appeared in the literature. Aycock and Horspool had solved the zero-length rule bug. Joop Leo had found the speedup for right recursion. And the issue of bookkeeping overhead had pretty much evaporated on its own. Machine operations are now a billion times faster than in 1968, and are probably no longer relevant in any case -- cache misses are now the bottleneck.

The programmers of the 1960's would have been prepared to trust a fully declarative Chomskyan parser. With the experience with LALR in their collective consciousness, modern programmers might be more guarded. As Mark Twain said, "Once a cat's been burned, he won't even sit on a cold stove." But I found it straightforward to rearrange the Earley parse engine to allow efficient event-driven handovers between procedural and syntax-driven logic. And Earley tables provide the procedural logic with full knowledge of the state of the parse so far, so that Earley's algorithm is a better platform for hand-written procedural logic than recursive descent.

References, comments, etc.

My implementation of Earley's algorithm is called Marpa. For more about Marpa, there is the semi-official web site, maintained by Ron Savage. The official, but more limited, Marpa website is my personal one. Comments on this post can be made in Marpa's Google group, or on our IRC channel: #marpa at freenode.net.

Dave's Free Press: Journal: Thanks, Yahoo!

Dave's Free Press: Journal: YAPC::Europe 2007 report: day 2

Ocean of Awareness: What are the reasonable computer languages?

"You see things; and you say 'Why?' But I dream things that never were; and I say 'Why not?'" -- George Bernard Shaw

In the 1960's and 1970's computer languages were evolving rapidly. It was not clear which way they were headed. Would most programming be done with general-purpose languages? Or would programmers create a language for every task domain? Or even for every project? And, if lots of languages were going to be created, what kinds of languages would be needed?

It was in that context that Čulik and Cohen, in a 1973 paper, outlined what they thought programmers would want and should have. In keeping with the spirit of the time, it was quite a lot:

  • Programmers would want to extend their grammars with new syntax, including new kinds of expressions.
  • Programmers would also want to use tools that automatically generated new syntax.
  • Programmers would not want to, and especially in the case of auto-generated syntax would usually not be able to, massage the syntax into very restricted forms. Instead, programmers would create grammars and languages which required unlimited lookahead to disambiguate, and they would require parsers which could handle these grammars.
  • Finally, programmers would need to be able to rely on all of this parsing being done in linear time.

Today, we think we know that Čulik and Cohen's vision was naive, because we think we know that parsing technology cannot support it. We think we know that parsing is much harder than they thought.

The eyeball grammars

As a thought problem, consider the "eyeball" class of grammars. The "eyeball" class of grammars contains all the grammars that a human can parse at a glance. If a grammar is in the eyeball class, but a computer cannot parse it, it presents an interesting choice. Either,

  • your computer is not using the strongest practical algorithm; or
  • your mind is using some power which cannot be reduced to a machine computation.

There are some people out there (I am one of them) who don't believe that everything the mind can do reduces to a machine computation. But even those people will tend to go for the choice in this case: There must be some practical computer parsing algorithm which can do at least as well at parsing as a human can do by "eyeball". In other words, the class of "reasonable grammars" should contain the eyeball class.

Čulik and Cohen's candidate for the class of "reasonable grammars" were the grammars that a deterministic parse engine could parse if it had a lookahead that was infinite, but restricted to distinguishing between regular expressions. They called these the LR-regular, or LRR, grammars. And the LRR grammars do in fact seem to be a good first approximation to the eyeball class. They do not allow lookahead that contains things that you have to count, like palindromes. And, while I'd be hard put to eyeball every possible string for every possible regular expression, intuitively the concept of scanning for a regular expression does seem close to capturing the idea of glancing through a text looking for a telltale pattern.

So what happened?

Alas, the algorithm in the Čulik and Cohen paper turned out to be impractical. But in 1991, Joop Leo discovered a way to adapt Earley's algorithm to parse the LRR grammars in linear time, without doing the lookahead. And Leo's algorithm does have a practical implementation: Marpa.

References, comments, etc.

To learn more about Marpa, there's the official web site maintained by Ron Savage. I also have a Marpa web site. Comments on this post can be made in Marpa's Google group, or on our IRC channel: #marpa at freenode.net.

Dave's Free Press: Journal: YAPC::Europe 2007 travel plans

Dave's Free Press: Journal: Wikipedia handheld proxy

Dave's Free Press: Journal: Bryar security hole

Dave's Free Press: Journal: POD includes

Dave's Free Press: Journal: cgit syntax highlighting

Dave's Free Press: Journal: CPAN Testers' CPAN author FAQ

Ocean of Awareness: What parser do birds use?

"Here we provide, to our knowledge, the first unambiguous experimental evidence for compositional syntax in a non-human vocal system." -- "Experimental evidence for compositional syntax in bird calls", Toshitaka N. Suzuki, David Wheatcroft & Michael Griesser Nature Communications 7, Article number: 10986

In this post I look at a subset of the language of the Japanese great tit, also known as Parus major. The above cited article presents evidence that bird brains can parse this language. What about standard modern computer parsing methods? Here is the subset -- probably a tiny one -- of the language actually used by Parus major.

      S ::= ABC
      S ::= D
      S ::= ABC D
      S ::= D ABC
    

Classifying the Parus major grammar

Grammophone is a very handy new tool for classifying grammars. Its own parser is somewhat limited, so that it requires a period to mark the end of a rule. The above grammar is in Marpa's SLIF format, which is smart enough to use the "::=" operator to spot the beginning and end of rules, just as the human eye does. Here's the same grammar converted into a form acceptable to Grammophone:

      S -> ABC .
      S -> D .
      S -> ABC D .
      S -> D ABC .
    

Grammophone tells us that the Parus major grammar is not LL(1), but that it is LALR(1).

What does this mean?

LL(1) is the class of grammars parseable by top-down methods: it's the best class for characterizing most parsers in current use, including recursive descent, PEG, and Perl 6 grammars. All of these parsers fall short of dealing with the Parus major language.

LALR(1) is probably most well-known from its implementations in bison and yacc. While able to handle this subset of Parus's language, LALR(1) has its own, very strict, limits. Whether LALR(1) could handle the full complexity of the Parus language is a serious question. But it's a question that in practice would probably not arise. LALR(1) has horrible error handling properties.

When the input is correct and within its limits, an LALR-driven parser is fast and works well. But if the input is not perfectly correct, LALR parsers produce no useful analysis of what went wrong. If Parus hears "d abc d", a parser like Marpa, on the other hand, can produce something like this:

# * String before error: abc d\s
# * The error was at line 1, column 7, and at character 0x0064 'd', ...
# * here: d
    

Parus uses its language in predatory contexts, and one can assume that a Parus with a preference for parsers whose error handling is on an LALR(1) level will not be keeping its alleles in the gene pool for very long.
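
For readers who want to experiment before reaching for the gist mentioned below, here is a rough sketch of my own of the Parus subset in Marpa::R2's SLIF. It assumes Marpa::R2 is installed and writes the two call types as the literal strings 'abc' and 'd':

    use strict;
    use warnings;
    use Marpa::R2;

    my $dsl = q{
        :start ::= S
        lexeme default = latm => 1

        S   ::= ABC | D | ABC D | D ABC

        ABC ~ 'abc'
        D   ~ 'd'

        :discard ~ whitespace
        whitespace ~ [\s]+
    };

    my $grammar = Marpa::R2::Scanless::G->new( { source => \$dsl } );

    for my $call ( 'abc d', 'd abc', 'd abc d' ) {
        my $recce = Marpa::R2::Scanless::R->new( { grammar => $grammar } );
        # read() throws on a rejected lexeme; value() is undef if no full parse.
        my $ok = eval { $recce->read( \$call ); defined $recce->value() };
        print "'$call' ", $ok ? "is" : "is not", " in the Parus subset\n";
    }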

References, comments, etc.

Those readers content with sub-Parus parsing methods may stop reading here. Those with greater parsing ambitions, however, may wish to learn more about Marpa. A Marpa test script for parsing the Parus subset is in a Github gist. Marpa has a semi-official web site, maintained by Ron Savage. The official, but more limited, Marpa website is my personal one. Comments on this post can be made in Marpa's Google group, or on our IRC channel: #marpa at freenode.net.

Dave's Free Press: Journal: Thankyou, Anonymous Benefactor!

Dave's Free Press: Journal: Number::Phone release

Dave's Free Press: Journal: Ill

Dave's Free Press: Journal: CPANdeps upgrade

Ocean of Awareness: Introduction to Marpa Book in progress

What follows is a summary of the features of the Marpa algorithm, followed by a discussion of potential applications. It refers to itself as a "monograph", because it is a draft of part of the introduction to a technical monograph on the Marpa algorithm. I hope the entire monograph will appear in a few weeks.

The Marpa project

The Marpa project was intended to create a practical and highly available tool to generate and use general context-free parsers. Tools of this kind had long existed for LALR and regular expressions. But, despite an encouraging academic literature, no such tool had existed for context-free parsing. The first stable version of Marpa was uploaded to a public archive on Solstice Day 2011. This monograph describes the algorithm used in the most recent version of Marpa, Marpa::R2. It is a simplification of the algorithm presented in my earlier paper.

A proven algorithm

While the presentation in this monograph is theoretical, the approach is practical. The Marpa::R2 implementation has been widely available for some time, and has seen considerable use, including in production environments. Many of the ideas in the parsing literature satisfy theoretical criteria, but in practice turn out to face significant obstacles. An algorithm may be as fast as reported, but may turn out not to allow adequate error reporting. Or a modification may speed up the recognizer, but require additional processing at evaluation time, leaving no advantage to compensate for the additional complexity.

In this monograph, I describe the Marpa algorithm as it was implemented for Marpa::R2. In many cases, I believe there are better approaches than those I have described. But I treat these techniques, however solid their theory, as conjectures. Whenever I mention a technique that was not actually implemented in Marpa::R2, I will always explicitly state that that technique is not in Marpa as implemented.

Features

General context-free parsing

As implemented, Marpa parses all "proper" context-free grammars. The proper context-free grammars are those which are free of cycles, unproductive symbols, and inaccessible symbols. Worst case time bounds are never worse than those of Earley's algorithm, and therefore never worse than O(n**3).

Linear time for practical grammars

Currently, the grammars suitable for practical use are thought to be a subset of the deterministic context-free grammars. Using a technique discovered by Joop Leo, Marpa parses all of these in linear time. Leo's modification of Earley's algorithm is O(n) for LR-regular grammars. Leo's modification also parses many ambiguous grammars in linear time.

Left-eidetic

The original Earley algorithm kept full information about the parse --- including partial and fully recognized rule instances --- in its tables. At every parse location, before any symbols are scanned, Marpa's parse engine makes available its information about the state of the parse so far. This information is in useful form, and can be accessed efficiently.

Recoverable from read errors

When Marpa reads a token which it cannot accept, the error is fully recoverable. An application can try to read another token. The application can do this repeatedly as long as none of the tokens are accepted. Once the application provides a token that is accepted by the parser, parsing will continue as if the unsuccessful read attempts had never been made.

Ambiguous tokens

Marpa allows ambiguous tokens. These are often useful in natural language processing where, for example, the same word might be a verb or a noun. Use of ambiguous tokens can be combined with recovery from rejected tokens so that, for example, an application could react to the rejection of a token by reading two others.

Using the features

Error reporting

An obvious application of left-eideticism is error reporting. Marpa's abilities in this respect are ground-breaking. For example, users typically regard an ambiguity as an error in the grammar. Marpa, as currently implemented, can detect an ambiguity and report specifically where it occurred and what the alternatives were.

Event driven parsing

As implemented, Marpa::R2 allows the user to define "events". Events can be defined that trigger when a specified rule is complete, when a specified rule is predicted, when a specified symbol is nulled, when a user-specified lexeme has been scanned, or when a user-specified lexeme is about to be scanned. A mid-rule event can be defined by adding a nulling symbol at the desired point in the rule, and defining an event which triggers when the symbol is nulled.

Ruby slippers parsing

Left-eideticism, efficient error recovery, and the event mechanism can be combined to allow the application to change the input in response to feedback from the parser. In traditional parser practice, error detection is an act of desperation. In contrast, Marpa's error detection is so painless that it can be used as the foundation of new parsing techniques.

For example, if a token is rejected, the lexer is free to create a new token in the light of the parser's expectations. This approach can be seen as making the parser's "wishes" come true, and I have called it "Ruby Slippers Parsing".

One use of the Ruby Slippers technique is to parse with a clean but oversimplified grammar, programming the lexical analyzer to make up for the grammar's shortcomings on the fly. As part of Marpa::R2, the author has implemented an HTML parser, based on a grammar that assumes that all start and end tags are present. Such an HTML grammar is too simple even to describe perfectly standard-conformant HTML, but the lexical analyzer is programmed to supply start and end tags as requested by the parser. The result is a simple and cleanly designed parser that parses very liberal HTML and accepts all input files, in the worst case treating them as highly defective HTML.

Ambiguity as a language design technique

In current practice, ambiguity is avoided in language design. This is very different from the practice in the languages humans choose when communicating with each other. Human languages exploit ambiguity in order to design highly flexible, powerfully expressive languages. For example, the language of this monograph, English, is notoriously ambiguous.

Ambiguity of course can present a problem. A sentence in an ambiguous language may have undesired meanings. But note that this is not a reason to ban potential ambiguity --- it is only a problem with actual ambiguity.

Syntax errors, for example, are undesired, but nobody tries to design languages to make syntax errors impossible. A language in which every input was well-formed and meaningful would be cumbersome and even dangerous: all typos in such a language would be meaningful, and the parser would never warn the user about errors, because there would be no such thing.

With Marpa, ambiguity can be dealt with in the same way that syntax errors are dealt with in current practice. The language can be designed to be ambiguous, but any actual ambiguity can be detected and reported at parse time. This exploits Marpa's ability to report exactly where and what the ambiguity is. Marpa::R2's own parser description language, the SLIF, uses ambiguity in this way.

Auto-generated languages

In 1973, Čulik and Cohen pointed out that the ability to efficiently parse LR-regular languages opens the way to auto-generated languages. In particular, Čulik and Cohen note that a parser which can parse any LR-regular language will be able to parse a language generated using syntax macros.

Second order languages

In the literature, the term "second order language" is usually used to describe languages with features which are useful for second-order programming. True second-order languages --- languages which are auto-generated from other languages --- have not been seen as practical, since there was no guarantee that the auto-generated language could be efficiently parsed.

With Marpa, this barrier is lifted. As an example, Marpa::R2's own parser description language, the SLIF, allows "precedenced rules". Precedenced rules are specified in an extended BNF. The BNF extensions allow precedence and associativity to be specified for each RHS.
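
To make this concrete, here is a small sketch closely modeled on the calculator example in the Marpa::R2 documentation; treat it as illustrative rather than canonical. In a precedenced rule, '||' separates precedence levels, '|' separates alternatives at the same level, and assoc adverbs set associativity. The script only builds and dumps the parse tree, so the precedence and the right-associative '**' can be seen directly in the output:

    use strict;
    use warnings;
    use Marpa::R2;
    use Data::Dumper;

    my $dsl = q{
        :default ::= action => [name,values]
        :start ::= Expression
        Expression ::=
              Number                              action => ::first
            | '(' Expression ')'  assoc => group
           || Expression '**' Expression  assoc => right
           || Expression '*' Expression
            | Expression '/' Expression
           || Expression '+' Expression
            | Expression '-' Expression
        Number ~ [\d]+
        :discard ~ whitespace
        whitespace ~ [\s]+
    };

    my $grammar = Marpa::R2::Scanless::G->new( { source => \$dsl } );
    my $recce   = Marpa::R2::Scanless::R->new( { grammar => $grammar } );
    my $input   = '3 + 4 * 2 ** 2 ** 3';
    $recce->read( \$input );
    print Dumper( ${ $recce->value() } );   # nesting shows precedence; '**' groups to the right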

Marpa::R2's precedenced rules are implemented as a true second order language. The SLIF representation of the precedenced rule is parsed to create a BNF grammar which is equivalent, and which has the desired precedence. Essentially, the SLIF does a standard textbook transformation. The transformation starts with a set of rules, each of which has a precedence and an associativity specified. The result of the transformation is a set of rules in pure BNF. The SLIF's advantage is that it is powered by Marpa, and therefore the SLIF can be certain that the grammar that it auto-generates will parse in linear time.

Notationally, Marpa's precedenced rules are an improvement over similar features in LALR-based parser generators like yacc or bison. In the SLIF, there are two important differences. First, in the SLIF's precedenced rules, precedence is generalized, so that it does not depend on the operators: there is no need to identify operators, much less class them as binary, unary, etc. This more powerful and flexible precedence notation allows the definition of multiple ternary operators, and multiple operators with arity above three.

Second, and more important, a SLIF user is guaranteed to get exactly the language that the precedenced rule specifies. The user of the yacc equivalent must hope their syntax falls within the limits of LALR.

References, comments, etc.

Marpa has a semi-official web site, maintained by Ron Savage. The official, but more limited, Marpa website is my personal one. Comments on this post can be made in Marpa's Google group, or on our IRC channel: #marpa at freenode.net.

Dave's Free Press: Journal: YAPC::Europe 2006 report: day 3
