Sawyer X: Perl 5 Porters Mailing List Summary: March 22nd-27th

Hey everyone,

Following is the p5p (Perl 5 Porters) mailing list summary for the past week.

Enjoy!

March 22nd-27th

News and updates

Yves Orton provided a branch with new hash functions and tunable hash function support. Fascinating!

Kent Fredric requests testers to adjust PERL_USE_UNSAFE_INC=0 in order to surface real bugs in tests. Kent also provided a summary of the steps he takes to verify bugs.

Dave Mitchell shared his attempt (proof-of-concept short-string PVs) at a speed up by storing short strings in the SV head.

Grant reports

Issues

New issues

  • Perl #131050: Non-finite repeat count warning isn't emitted consistently.
  • Perl #131054: Perl 5.24 makes nama FTBFS due to segfault.

Resolved issues

  • Perl #130949: calling C function perl_clone of libperl causes segfault in perl 5.22 and 5.24 but fine on perl 5.20.
  • Perl #131031: t/run/runenv.t: Guard against possible presence of PERL_USE_UNSAFE_INC=1.
  • Perl #131033: t/op/range.t fails.

Ovid: Using the Perl debugger with DBIx::Class

I wrote about using the Perl debugger with Moose. In that post, I showed how to use DB::Skip to make it easier to use the Perl debugger with Moose and other modules whose guts you don't want to wade through.

Today's debugger hack will make using the debugger with DBIx::Class much easier.

If you're a fan of the debugger, one thing which makes life miserable is being in the debugger and accidentally doing something like this:

121:        local config(qw'logging LogEvent')->{Exchange} = 1;
122:    $DB::single = 1;
123==>      my $wallet = $character->wallet;
124 
125:        my $exchange = $class->new(
126             principal => $character,
127             slug      => 'test_check_success',
128             steps     => [ [ TestWallet => check => $character => $wallet ] ]
129         );
  DB<2> x $character

With that, you get hundreds of lines of output scrolling past, too fast to read. And forget about going back in your buffer trying to find the data you need. It's a nightmare.

So I hijacked the p command in the debugger. In perldoc perldebug, we see that p is for this:

p expr

Same as print {$DB::OUT} expr in the current package. In particular, because this is just Perl's own print function, this means that nested data structures and objects are not dumped, unlike with the x command. The DB::OUT filehandle is opened to /dev/tty, regardless of where STDOUT may be redirected to.

To be honest, I don't find that very useful. So I altered it to make a quick "dump" of a data structure, with one interesting caveat. Unless it's an array or hash reference, it's stringified. So the above debugging session becomes this (I've truncated some data to hide some important gameplay):

121:        local config(qw'logging LogEvent')->{Exchange} = 1;
122:    $DB::single = 1;
123==>      my $wallet = $character->wallet;
124 
125:        my $exchange = $class->new(
126             principal => $character,
127             slug      => 'test_check_success',
128             steps     => [ [ TestWallet => check => $character => $wallet ] ]
129         );
DB<2> p $character
Veure::Model::DB::Character=HASH(0x7fcd6feb5de8)
DB<3> p {%$character}
{
'_column_data' => {
    'agility' => '10',
    'arrival' => '2017-03-27 09:44:33.546391000+0000',
    'career_path_id' => '',
    'character_id' => '15538',
    'conscious' => '1964-01-22 00:00:00+0000',
    'curr_agility' => '1',
    'curr_intelligence' => '1',
    'curr_social' => '1',
    'curr_stamina' => '1',
    'curr_strength' => '1',
    'current_ship_id' => '',
    'experience' => '800',
    'focus' => '1',
    'genotype_id' => '1',
    'intelligence' => '10',
    'inventory_id' => '17762',
    'last_activity' => '2017-03-27 09:44:33+0000',
    'last_station_area_id' => '1199',
    'modifier_list_id' => '',
    'name' => 'winston',
    'slug' => 'winston',
    'social' => '10',
    'stamina' => '10',
    'station_area_id' => '1199',
    'strength' => '10',
    'user_id' => '14554',
    'wallet' => '10'
},
'_dirty_columns' => {},
'_ignore_messages' => '0',
'_in_storage' => '1',
'_inflated_column' => {
    'arrival' => '2017-03-27T09:44:33',
    'conscious' => '1964-01-22T00:00:00',
    'last_activity' => '2017-03-27T09:44:33'
},
'_rel_in_storage' => {},
'_relationship_data' => {
    'bank_account' => 'Veure::Model::DB::BankAccount=HASH(0x7fcd6fec3460)',
    'character_user' => 'Veure::Model::DB::User=HASH(0x7fcd6fed6f08)',
    'inventory' => 'Veure::Model::DB::Inventory=HASH(0x7fcd71137110)'
},
'_result_source' => 'DBIx::Class::ResultSource::Table=HASH(0x7fcd6b7ce350)',
'app' => 'Veure::App=HASH(0x7fcd6b79aa08)',
'related_resultsets' => {}
}

Now dumping that data structure is actually useful and it's trivial for me to find the information I need!

You can get the source of my debugger hack here. Save that as .perldb in your home directory.
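
If you just want a feel for the mechanics before grabbing the real thing, here is a minimal sketch of the same idea, assuming a ~/.perldb file and the %DB::alias hook documented in perldebug. It is not Ovid's actual code: to keep it short it adds a separate pp command instead of replacing p itself, the helper name DB::_quick_dump is made up, and blessed objects are only stringified one level deep.

# ~/.perldb -- a minimal sketch of the idea, not Ovid's actual hack.
# %DB::alias maps a debugger command to an s/// rewrite of the command
# line, as documented in perldebug.
use Data::Dumper ();
use Scalar::Util ();

sub DB::_quick_dump {
    my $thing = shift;
    my $type  = ref $thing;

    # Unless it's a plain array or hash reference, just stringify it.
    return "$thing"
        if !$type
        || Scalar::Util::blessed($thing)
        || ( $type ne 'ARRAY' && $type ne 'HASH' );

    # Stringify blessed values one level down, so that DBIx::Class rows
    # show up as 'Foo=HASH(0x...)' instead of being dumped in full.
    my $copy;
    if ( $type eq 'ARRAY' ) {
        $copy = [ map { Scalar::Util::blessed($_) ? "$_" : $_ } @$thing ];
    }
    else {
        $copy = {};
        for my $key ( keys %$thing ) {
            my $value = $thing->{$key};
            $copy->{$key} = Scalar::Util::blessed($value) ? "$value" : $value;
        }
    }

    local $Data::Dumper::Indent   = 1;
    local $Data::Dumper::Sortkeys = 1;
    local $Data::Dumper::Terse    = 1;
    return Data::Dumper::Dumper($copy);
}

# At the debugger prompt, "pp EXPR" is rewritten to "p DB::_quick_dump(EXPR)".
$DB::alias{pp} = 's/^pp\s+(.+)/p DB::_quick_dump($1)/';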

Hire Us!

As always, if you want top-notch software development or training, or help building your test suite, drop me a line at ovid at allaroundtheworld.fr. Most of us are Perl devs, but we also do front-end development, Golang, C++, Lua, and Python. Due to our extensive links in multiple software communities, if you need an expert in another language, let me know.

Perlgeek.de : Perl 6 By Example: Stacked Plots with Matplotlib

This blog post is part of my ongoing project to write a book about Perl 6.

If you're interested, either in this book project or any other Perl 6 book news, please sign up for the mailing list at the bottom of the article, or here. It will be low volume (less than an email per month, on average).


In a previous episode, we explored plotting git statistics in Perl 6 using matplotlib.

Since I wasn't quite happy with the result, I want to explore using stacked plots for presenting the same information. In a regular plot, the y coordinate of each plotted value is proportional to its value. In a stacked plot, it is the distance to the previous value that is proportional to its value: if, say, one author contributed 2 commits on a day and a second author contributed 3, the first curve is drawn at height 2 and the second on top of it at height 5, so the topmost curve shows the daily total. This is nice for values that add up to a total that is also interesting.

Matplotlib offers a method called stackplot for that. Contrary to multiple plot calls on the subplots object, it requires a shared x axis for all data series. So we must construct one array for each author of git commits, where dates with no commits come out as zero.

As a reminder, this is what the logic for extracting the stats looked like in the first place:

my $proc = run :out, <git log --date=short --pretty=format:%ad!%an>;
my (%total, %by-author, %dates);
for $proc.out.lines -> $line {
    my ( $date, $author ) = $line.split: '!', 2;
    %total{$author}++;
    %by-author{$author}{$date}++;
    %dates{$date}++;
}
my @top-authors = %total.sort(-*.value).head(5)>>.key;

And some infrastructure for plotting with matplotlib:

my $py = Inline::Python.new;
$py.run('import datetime');
$py.run('import matplotlib.pyplot');
sub plot(Str $name, |c) {
    $py.call('matplotlib.pyplot', $name, |c);
}
sub pydate(Str $d) {
    $py.call('datetime', 'date', $d.split('-').map(*.Int));
}

my ($figure, $subplots) = plot('subplots');
$figure.autofmt_xdate();

So now we have to construct an array of arrays, where each inner array has the values for one author:

my @dates = %dates.keys.sort;
my @stack = $[] xx @top-authors;

for @dates -> $d {
    for @top-authors.kv -> $idx, $author {
        @stack[$idx].push: %by-author{$author}{$d} // 0;
    }
}

Now plotting becomes a simple matter of a method call, followed by the usual commands adding a title and showing the plot:

$subplots.stackplot($[@dates.map(&pydate)], @stack);
plot('title', 'Contributions per day');
plot('show');

The result (again run on the zef source repository) is this:

Stacked plot of zef contributions over time

Comparing this to the previous visualization reveals a discrepancy: there were no commits in 2014, and yet the stacked plot makes it appear as though there were. In fact, the previous plots would have shown the same "alternative facts" if we had chosen lines instead of points. This comes from the fact that matplotlib (like nearly all plotting libraries) interpolates linearly between data points. But in our case, a date with no data points means zero commits happened on that date.

To communicate this to matplotlib, we must explicitly insert zero values for missing dates. This can be achieved by replacing

my @dates = %dates.keys.sort;

with the line

my @dates = %dates.keys.minmax;

The minmax method finds the minimal and maximal values, and returns them in a Range. Assigning the range to an array turns it into an array of all values between the minimal and the maximal value. The logic for assembling the @stack variable already maps missing values to zero.
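
As a quick standalone illustration (using integers rather than date strings) of how minmax and array assignment interact:

my $range = (3, 1, 4, 1, 5).minmax;
say $range;                            # 1..5, from the smallest to the largest value
my @values = (3, 1, 4, 1, 5).minmax;   # assigning the Range to an array expands it
say @values;                           # [1 2 3 4 5]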

The result looks a bit better, but still far from perfect:

Stacked plot of zef contributions over time, with missing dates mapped to zero

Thinking more about the problem, I realized that contributions from separate days should not be joined together, because doing so produces misleading results. Matplotlib doesn't support adding a legend automatically to stacked plots, so this seems to be a dead end.

Since a dot plot didn't work very well, let's try a different kind of plot that represents each data point separately: a bar chart, or more specifically, a stacked bar chart. Matplotlib offers the bar plotting method, and a named parameter bottom can be used to generate the stacking:

my @dates = %dates.keys.sort;
my @stack = $[] xx @top-authors;
my @bottom = $[] xx @top-authors;

for @dates -> $d {
    my $bottom = 0;
    for @top-authors.kv -> $idx, $author {
        @bottom[$idx].push: $bottom;
        my $value = %by-author{$author}{$d} // 0;
        @stack[$idx].push: $value;
        $bottom += $value;
    }
}

We need to supply color names ourselves, and set the edge color of the bars to the same color, otherwise the black edge color dominates the result:

my $width = 1.0;
my @colors = <red green blue yellow black>;
my @plots;

for @top-authors.kv -> $idx, $author {
    @plots.push: plot(
        'bar',
        $[@dates.map(&pydate)],
        @stack[$idx],
        $width,
        bottom => @bottom[$idx],
        color => @colors[$idx],
        edgecolor => @colors[$idx],
    );
}
plot('legend', $@plots, $@top-authors);

plot('title', 'Contributions per day');
plot('show');

This produces the first plot that's actually informative and not misleading (provided you're not color blind):

Stacked bar plot of zef contributions over time

If you want to improve the result further, you could experiment with limiting the number of bars by lumping together contributions by week or month (or maybe $n-day period).
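
For instance, here is a rough sketch (not from the original post) of lumping the per-day counts into per-month buckets, reusing the %by-author hash from above and relying on the dates being ISO-formatted strings:

my %by-month;
for %by-author.kv -> $author, %days {
    for %days.kv -> $date, $count {
        # '2017-03-27' truncated to '2017-03': one bucket per month
        %by-month{$author}{ $date.substr(0, 7) } += $count;
    }
}

From there, @dates, @stack and @bottom can be rebuilt from %by-month exactly as above, just with far fewer bars.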

Next, we'll investigate ways to make the matplotlib API more idiomatic to use from Perl 6 code.

Subscribe to the Perl 6 book mailing list


Gabor Szabo: Perl related crowdfunding on Kickstarter and Tilt

Fundraising or crowdfunding is not easy. Here I collect all the Perl-related crowdfunding campaigns I can find. (Still a work in progress.)

For the full article visit Perl related crowdfunding on Kickstarter and Tilt

perlbuzz.com: ack 2.18 has been released; ack 3 starting development

I've just uploaded ack 2.18 to CPAN and to https://beyondgrep.com/. ack 2.18 will probably be the final release in the ack 2.x series. I'm going to be starting work on ack 3.000 in earnest. Still, if you discover problems with ack 2, please report them to https://github.com/petdance/ack2/issues. If you're interested in ack 3 development, please […]

Perl Foundation News: Call for Grant Proposals (March 2017 Round)

The Grants Committee is accepting grant proposals all the time. We evaluate them every two months and another evaluation period has come. This month's round is a little later than usual, due to the selection of a new GC Secretary (myself). This round will be slightly compressed, and we'll strive to get back on track for the next round.

If you have an idea for doing some Perl work that will benefit the Perl community, consider sending a grant application. The application deadline for this round is 23:59 April 2nd UTC. We will publish the received applications, get community feedback and conclude acceptance by April 12th.

To apply, please read How to Write a Proposal. Rules of Operation and Running Grants List will also help you understand how the grant process works. We also got some grant ideas from the community. The format is the same as the previous rounds in 2014-2016.

We will confirm receipt of your application within 24 hours.

If you have further questions, please contact me at tpf-grants-secretary at perl-foundation.org.

Perl Foundation News: New Grant Committee Member & Secretary

Please join me in welcoming John SJ Anderson (genehack) as the newest voting member of The Perl Foundation's Grants Committee. John has helped organize several recent YAPCs, given talks and training at YAPCs, and maintains several modules on CPAN.

Additionally, as Makoto Nozaki is transitioning to the secretary of the TPF Board, he is vacating the position of GC Secretary, and I have been selected to fill the role.

My apologies for the delays in the transition. Look for a posting shortly about the March call for proposals.

Perl Foundation News: Perl 6 IO Grant: February 2017 Report

Zoffix Znet provided this report on February 26, 2017


This document is the February 2017 progress report for the TPF Standardization, Test Coverage, and Documentation of Perl 6 I/O Routines grant

Timing

I'm currently running slightly behind the schedule outlined in the grant. I expect to complete the Action Plan and have it ratified by other core members by March 18th, which is the date of the 2017.03 compiler release. Then, I'll implement all of the Action Plan (and complete the grant) by the 2017.04 compiler release on April 15th. This is also the release the next Rakudo Star distribution will be based on, and so the regular end users will receive better IO there and then.

Some members of the Core Team voiced concerns over implementing any changes that can break users' code, even if the changes do not break 6.c-errata specification tests. Once the full set of changes is known, they will be reviewed on a case-by-case basis, and some of them may be implemented under the 6.d.PREVIEW pragma, to be included in the 6.d language version, leaving the 6.c language version untouched. Note that changes decided to be 6.d material may delay the completion of this grant, because the infrastructure for supporting multiple language versions is not yet fully fleshed out. The April 15th deadline stated above applies only to changes to the 6.c language; a new deadline will be ascertained for completion of the 6.d changes.

User Communications

I wrote and disseminated an advance notice of the changes to be made due to this grant, to prepare users to expect some code to break (some routines were found to be documented, despite being absent entirely from the Specification and not officially part of the language).

The notice can be seen at: http://rakudo.org/2017/02/26/advance-notice-of-significant-changes/

It is possible the Core Team will decide to defer all breaking changes to 6.d language version, to be currently implemented under v6.d.PREVIEW pragma.

Bonus Deliverable

The bonus deliverable—The Map of Perl 6 Routines—is now usable. The code is available in perl6/routine-map repository, and the rendered version is available on map.perl6.party. Its current state is sufficient to serve the intended purpose for this grant, but I'll certainly add improvements to it sometime in the future, such as linking to docs, linking to routines' source code, having an IRC bot looking stuff up in it, etc.

It'll also be fairly easy to use the Map to detect undocumented routines or ones that are documented under the incorrect type.

Identified Issues/Deficiencies with IO Routines

These points, issues, and ideas were identified this month and will be included for consideration in the Action Plan.

  • Calling practically any method on a closed IO::Handle results in an LTA (Less Than Awesome) error message that reads <something> requires an object with REPR MVMOSHandle, where <something> is sometimes the name of the method called by the user and at other times the name of some internal method invoked indirectly. We need better errors for closed file handles, and not something that would require an is-fh-closed() type of conditional called in all the methods, which would be a hefty performance hit.
  • Several routines have been identified which in other languages return useful information: number of bytes actually written or current file position, whereas in Perl 6 they just return a Bool (.print, .say, .write) or a Mu type object (.seek). Inconsistently, .printf does appear to return the number of bytes written. It should be possible to make other routines similarly useful, although I suspect some of it may have to wait until 6.d language release.
  • The .seek routine takes the seek location as one of three Enum values. Not only are they quite lengthy to type, they're globally available for no good reason and .seek is virtually unique in using this calling convention. I will seek to standardize this routine to take mutually-exclusive named arguments instead, preferably with much shorter names, but those are yet to be bikeshed.
  • IO.umask routine simply shells out to umask. This fails terribly on OSes that don't have that command, especially since the code still tries to decode the received input as an octal string, even after the failure. Needs improvement.
  • link's implementation and documentation confuse what a "target" is. Luckily (or sadly?) there are exactly zero tests for this routine in the Perl 6 Specification, so we can change it to match the behaviour of the ln Linux command and the foo $existing-thing, $new-thing argument order of move, rename, and other similar routines.
  • When using run(:out, 'some-non-existant-command').out.slurp-rest it will silently succeed and return an empty string. If possible, this should be changed to return the failure or throw at some point.
  • chdir's :test parameter for directory permissions test is taken as a single string parameter. This makes it extremely easy to mistakenly write broken code: for example, "/".IO.chdir: "tmp", :test<r w> succeeds, while "/".IO.chdir: "tmp", :test<w r> fails with a misleading error message saying the directory is not readable/writable. I will propose for :test parameter to be deprecated in favour of using multiple named arguments to indicate desired tests. By extension, similar change will be applied to indir, tmpdir, and homedir routines (if they remain in the language).
  • Documentation: several inaccuracies in the documentation were found. I won't be identifying these in my reports/Action Plan, but will simply ensure the documentation matches the implementation once the Action Plan is fully implemented.

Discovered Bugs

The hunt for 6-legged friends has these finds so far:

Will (attempt to) fix as part of the grant

  • indir() has a race condition where the actual dir it runs in ends up being wrong. Using indir '/tmp/useless', { qx/rm -fr */ } in one thread and backing up your precious data in another has the potential to result in some spectacular failurage.
  • perl6 -ne '@ = lines' crashes after first iteration, crying about MVMOSHandle REPR. I suspect the code is failing to follow iterator protocol somewhere and is attempting to read on an already closed handle. I expect to be able to resolve this and the related RT#128047 as part of the grant.
  • .tell incorrectly always returns 0 on files opened in append mode
  • link mixes up target and link name in its error message

Don't think I will be able to fix these as part of the grant

  • seek() with SeekFromCurrent as location fails to seek correctly if called after .readchars, but only on MoarVM. This appears to occur due to some sort of buffering. I filed this as RT#130843.
  • On JVM, .readchars incorrectly assumes all chars are 2 bytes long. This appears to be just a naive substitute for nqp::readcharsfh op. I filed this as RT#130840.

Already Fixed

  • While making the Routine Map, I discovered that the .WHICH and .Str methods on IO::Special were defined only for the :D subtype, resulting in a crash when using, say, the infix:<eqv> operator on the type object, instead of the Mu.WHICH/.Str candidates getting invoked. This bug was easy to fix and I already committed the fix in rakudo/dd4dfb14d3 and tests to cover it in roast/63370fe054.

Auxiliary Bugs

While doing the work for this grant, I also discovered some non-IO related bugs (some of which I fixed):

Ovid: Saving your test suite history

If you saw my FOSDEM talk about Building a Universe in Perl, you saw some examples of the code we use to model behaviors in our universe. I start talking about our test suite at one point and that's probably the most exciting part. You see, we've been fixing our leaderboard system in part because I can do this:

That's a truncated version of our test failure report for tests failing on master. Leaderboard tests show up twice in there.

Ever have a failing test and try to remember if it failed before? Is it fragile code? Are the tests fragile? Has it ever failed on your master branch? What percentage of your tests fail? Well, now you can find out.

Test::Class::Moose::History works like this:

# Modules used below; generate_table() is assumed to come from Text::Table::Tiny.
use Test::Class::Moose::Runner;
use Test::Class::Moose::History;
use Test::Builder;
use Text::Table::Tiny 'generate_table';

my $runner = Test::Class::Moose::Runner->new( %opts );
$runner->runtests;

# get current branch and latest commit id
# example assumes `git`, but you can use any info you need here
chomp( my $branch = `git rev-parse --abbrev-ref HEAD` );
chomp( my $commit = `git rev-parse HEAD` );

my $history = Test::Class::Moose::History->new(
    runner => $runner,
    branch => $branch,
    commit => $commit,
);
$history->save;
my $report = $history->report;

my %report_methods = (
    last_test_status => [qw/Class Method Runtime Passed/],
    last_failures    => [qw/Class Method/],
    top_failures     => [qw/Class Method First Last Errors/],
);

my $builder = Test::Builder->new;
foreach my $method ( sort keys %report_methods ) {
    my $results = $report->$method;
    my @rows = ( $report_methods{$method}, @$results );
    $builder->diag("\nReport for $method");
    $builder->diag( generate_table( rows => \@rows, header_row => 1 ) );
}

Alternatively, if you've already saved report data, you can do this:

my $report = Test::Class::Moose::History::Report->new;
my $failures = $report->last_failures;
...

So on the Tau Station test suite, I can do things like:

prove t/tcm.t :: --failures

And that will rerun the last set of failing tests rather than the entire test suite.

Why Test::Class::Moose?

If you wanted to build a single page web site and you just printed the HTML in a large heredoc, no one's going to care. If you build a large-scale application like that, people are going to care. In fact, as projects get larger, many people reach for frameworks because many of the bits and pieces you need are already in place.

Similarly for test suites: if you're just writing a small test suite for a module, that's fine. If you're building a large test suite for a huge application, you should be more careful with your test suite. I may be biased as I wrote it, but I recommend Test::Class::Moose (now maintained by Dave Rolsky). More and more companies are using it because it makes test suites much easier to organize and maintain. Set up your test runner, your test base class, and start writing great tests.

It also has great reporting, but until now, that information hasn't been saved anywhere. Now with Test::Class::Moose::History, that's changed. It's not on the CPAN yet as I'd like to iron out the kinks, but if you use Test::Class::Moose, I think you'll love saving your test history.

Note: If you have trouble getting up and running with Test::Class::Moose, the Test::Class::Moose::History distribution has a small test suite that's a great example to start from. Also, you can download that module and run prove t/tcm.t :: --report to see some sample reports showing my test suite development for the code.

While the code is a proof of concept, we've been using it with the Tau Station team for over half a year. It's also running on our Jenkins box so we can see what tests fail there, too.

Hire Us!

As always, if you want top-notch software development or training, or help building your test suite, drop me a line at ovid at allaroundtheworld.fr. Most of us are Perl devs, but we also do front-end development, Golang, C++, Lua, and Python. Due to our extensive links in multiple software communities, if you need an expert in another language, let me know.

Perl Foundation News: TPF Committee Updates

We've been reviewing Perl Foundation committees over the last few months and I'm happy to report some new people have stepped into committee leadership roles.

David Oswald is the new conferences committee chair. This position had gone vacant for a period as TPF Treasurer Dan Wright along with others took a more active role in planning for The Perl Conference. The board is happy to once again have someone in this role to help spread the work and responsibilities.

David was one of the lead organizers of YAPC::NA 2015 in Salt Lake City. He is also on the organizing team for the OpenWest open source conference. We're confident David's experience will help with this year's Perl Conference and with planning going forward.

Will Coleda is the new Secretary of the Grants Committee. Will has been a member of the Grants Committee since 2008, and involved with the Perl 6 community and development for over a decade. Will's selection is a great example of the cycle of involvement for a TPF member, starting as a committee member and then taking on a bigger role over time.

The outgoing secretary, Makoto Nozaki, has held this position since February 2014 and he will continue his work as Perl Foundation board Secretary.

As part of our committee review, I have also been looking into other committees that have recently been less active or inactive. As a result, the Steering Committee is being sunsetted.

The Steering Committee was originally created as a forum to pull in more active community members to carry out TPF business in the earlier years of the foundation when the board was much less active. When Karen Pauley, who was Steering Committee Chair, took over as president, this activity moved to the board. New committees also were created to handle specific areas of activity like conferences, grants, and marketing. With TPF activity moving to these other groups, the Steering Committee has been largely dormant so we're dissolving it. Thanks to all of the members who contributed their time to TPF through this committee.

Another committee we are evaluating is the Community Advocacy Committee. This committee was driven for many years by Ya'akov Sloman and we thank him for all of his efforts. He stepped down last year and the committee has been mostly inactive, so we are considering dissolving it as well. However, the board is open to proposals to revive the committee if community members are available to come forward. The committee charter is available for review and could be revised by new members if desired.

Please join us in congratulating and thanking David and Will as they take on their new roles. You can expect to hear more from them in the near future.

Sawyer X: Perl 5 Porters Mailing List Summary: March 13th-21st

Hey everyone,

Following is the p5p (Perl 5 Porters) mailing list summary for the past week.

Enjoy!

March 13th-21st

News and updates

Perl 5.25.11 is now available!

Grant reports

Issues

New issues

Resolved issues

  • Perl #115810: $_[$ref] warnings discrepancy.
  • Perl #130126: Bleadperl v5.25.6-260-ga65dc09 breaks DCONWAY/IO-Prompt-0.997003.tar.gz.
  • Perl #130633: Bleadperl v5.25.8-223-ga0e213fcb5 breaks JV/EekBoek-2.02.05.1.tar.gz.
  • Perl #130640: Bleadperl v5.25.8-47-gd1f8d421df breaks SPROUT/JE-0.066.tar.gz.
  • Perl #130921: Bleadperl v5.25.5-100-g2b5e7bc2e6 breaks JDDPAUSE/re-engine-GNU-0.021.tar.gz.

NeilB: TVPM Tech Talks in Reading, UK

On Monday 27th March, the Thames Valley Perl Mongers (TVPM) are having a mini tech talks session in Reading. Talks are going to be about 15 minutes each. Speakers and topics are given below, along with details of the venue.

Any and all are welcome to join us.

  • Chad Hanna is talking about synchronising a MySQL database on a shop web site with a local Access database using a RESTful interface. Design considerations and implementation experiences. Limitations of the hosting environment, a reverse-engineered database schema and other perils.
  • Roger Bell_West will give an introduction to the PDF::API2 distribution: how to use it to create PDF files with text, vector graphics and embedded bitmaps, and for editing existing PDF documents.
  • Oliver Gorwits is giving a whistle-stop tour of Netdisco's Custom Reports implementation. This feature exploits the power in Perl's dynamic, run-time nature, allowing users to specify DB connections and SQL queries in YAML, which are rendered to HTML tables supporting CSV download. We will touch on Dancer, DBIx::Class, Safe, Template::Toolkit, and jQuery DataTables.
  • Dominic Hargreaves is presenting an introduction to the perl packaging efforts (both interpreter and CPAN dists) in Debian. We will discuss packaging workflows and tools, things that can make things easy and difficult for both Debian and upstreams, and what Debian can contribute to the wider Perl community.
  • Neil Bowers is giving a short overview of the River of CPAN as a model for talking about and analysing inter-distribution dependencies on CPAN. This model came out of discussions at the annual Toolchain Summit, where we also talked about how things can change for authors as more and more CPAN distributions are relying on yours.

We're meeting at the RISC in Reading, with talks starting from 8pm. We'll probably go for a pint at a local hostelry afterwards.

Perl.org NOC: Fast and secure Perl docs and CPAN modules with help from Fastly


Fastly

We'd like to thank Fastly for hosting www.cpan.org and perldoc.perl.org on their world class global CDN. 

As of last week, the sites are also available using https://!

So you can now securely read the Perl core documentation or download over 180,000 modules at lightning fast speeds.

Perlgeek.de : Perl 6 By Example: Plotting using Matplotlib and Inline::Python

This blog post is part of my ongoing project to write a book about Perl 6.

If you're interested, either in this book project or any other Perl 6 book news, please sign up for the mailing list at the bottom of the article, or here. It will be low volume (less than an email per month, on average).


Occasionally I come across git repositories, and want to know how active they are, and who the main developers are.

Let's develop a script that plots the commit history, and explore how to use Python modules in Perl 6.

Extracting the Stats

We want to plot the number of commits by author and date. Git makes it easy for us to get to this information by giving some options to git log:

my $proc = run :out, <git log --date=short --pretty=format:%ad!%an>;
my (%total, %by-author, %dates);
for $proc.out.lines -> $line {
    my ( $date, $author ) = $line.split: '!', 2;
    %total{$author}++;
    %by-author{$author}{$date}++;
    %dates{$date}++;
}

run executes an external command, and :out tells it to capture the command's output, and makes it available as $proc.out. The command is a list, with the first element being the actual executable, and the rest of the elements are command line arguments to this executable.

Here git log gets the options --date=short --pretty=format:%ad!%an, which instructs it to produce lines like 2017-03-01!John Doe. This line can be parsed with a simple call to $line.split: '!', 2, which splits on the !, and limits the result to two elements. Assigning it to a two-element list ( $date, $author ) unpacks it. We then use hashes to count commits by author (in %total), by author and date (%by-author) and finally by date. In the second case, %by-author{$author} isn't even a hash yet, and we can still hash-index it. This is due to a feature called autovivification, which automatically creates ("vivifies") objects where we need them. The use of ++ creates integers, {...} indexing creates hashes, [...] indexing and .push create arrays, and so on.
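
Here is a tiny standalone example of autovivification at work (the names are made up for illustration):

my %commits;
%commits<Alice><2017-03-01>++;   # both hash levels spring into existence
say %commits;                    # {Alice => {2017-03-01 => 1}}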

To get from these hashes to the top contributors by commit count, we can sort %total by value. Since this sorts in ascending order, sorting by the negative value gives the list in descending order. The list contains Pair objects, and we only want the first five of these, and only their keys:

my @top-authors = %total.sort(-*.value).head(5).map(*.key);

For each author, we can extract the dates of their activity and their commit counts like this:

my @dates  = %by-author{$author}.keys.sort;
my @counts = %by-author{$author}{@dates};

The last line uses slicing, that is, indexing a hash with a list of keys to return a list of values.
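
A small standalone example of such a slice, with made-up data:

my %count-for = '2017-03-01' => 2, '2017-03-02' => 5, '2017-03-03' => 1;
my @wanted    = '2017-03-01', '2017-03-03';
say %count-for{@wanted};         # (2 1)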

Plotting with Python

Matplotlib is a very versatile library for all sorts of plotting and visualization. It's written in Python and for Python programs, but that won't stop us from using it in a Perl 6 program.

But first, let's take a look at a basic plotting example that uses dates on the x axis:

import datetime
import matplotlib.pyplot as plt

fig, subplots = plt.subplots()
subplots.plot(
    [datetime.date(2017, 1, 5), datetime.date(2017, 3, 5), datetime.date(2017, 5, 5)],
    [ 42, 23, 42 ],
    label='An example',
)
subplots.legend(loc='upper center', shadow=True)
fig.autofmt_xdate()
plt.show()

To make this run, you have to install python 2.7 and matplotlib. You can do this on Debian-based Linux systems with apt-get install -y python-matplotlib. The package name is the same on RPM-based distributions such as CentOS or SUSE Linux. MacOS users are advised to install python 2.7 through homebrew or macports, and then use pip2 install matplotlib or pip2.7 install matplotlib to get the library. Windows installation is probably easiest through the conda package manager, which offers pre-built binaries of both python and matplotlib.

When you run this script with python2.7 dates.py, it opens a GUI window, showing the plot and some controls, which allow you to zoom, scroll, and write the plot graphic to a file:

Basic matplotlib plotting window

Bridging the Gap

The Rakudo Perl 6 compiler comes with a handy library for calling foreign functions, which allows you to call functions written in C, or anything with a compatible binary interface.

The Inline::Python library uses the native call functionality to talk to python's C API, and offers interoperability between Perl 6 and Python code. At the time of writing, this interoperability is still fragile in places, but can be worth using for some of the great libraries that Python has to offer.

To install Inline::Python, you must have a C compiler available, and then run

$ zef install Inline::Python

(or the same with panda instead of zef, if that's your module installer).

Now you can start to run Python 2 code in your Perl 6 programs:

use Inline::Python;

my $py = Inline::Python.new;
$py.run: 'print("Hello, Pyerl 6")';

Besides the run method, which takes a string of Python code and executes it, you can also use call to call Python routines by specifying the namespace, the routine to call, and a list of arguments:

use Inline::Python;

my $py = Inline::Python.new;
$py.run('import datetime');
my $date = $py.call('datetime', 'date', 2017, 1, 31);
$py.call('__builtin__', 'print', $date);    # 2017-01-31

The arguments that you pass to call are Perl 6 objects, like the three Int objects in this example. Inline::Python automatically translates them to the corresponding Python built-in data structure. It translates numbers, strings, arrays and hashes. Return values are also translated in the opposite direction, though since Python 2 does not distinguish properly between byte and Unicode strings, Python strings end up as buffers in Perl 6.

Objects that Inline::Python cannot translate are handled as opaque objects on the Perl 6 side. You can pass them back into python routines (as shown with the print call above), or you can also call methods on them:

say $date.isoformat().decode;               # 2017-01-31

Perl 6 exposes attributes through methods, so Perl 6 has no syntax for accessing attributes from foreign objects directly. If you try to access for example the year attribute of datetime.date through the normal method call syntax, you get an error.

say $date.year;

Dies with

'int' object is not callable

Instead, you have to use the getattr builtin:

say $py.call('__builtin__', 'getattr', $date, 'year');

Using the Bridge to Plot

We need access to two namespaces in python, datetime and matplotlib.pyplot, so let's start by importing them and writing some short helpers:

my $py = Inline::Python.new;
$py.run('import datetime');
$py.run('import matplotlib.pyplot');
sub plot(Str $name, |c) {
    $py.call('matplotlib.pyplot', $name, |c);
}

sub pydate(Str $d) {
    $py.call('datetime', 'date', $d.split('-').map(*.Int));
}

We can now call pydate('2017-03-01') to create a python datetime.date object from an ISO-formatted string, and call the plot function to access functionality from matplotlib:

my ($figure, $subplots) = plot('subplots');
$figure.autofmt_xdate();

my @dates = %dates.keys.sort;
$subplots.plot:
    $[@dates.map(&pydate)],
    $[ %dates{@dates} ],
    label     => 'Total',
    marker    => '.',
    linestyle => '';

The Perl 6 call plot('subplots') corresponds to the python code fig, subplots = plt.subplots(). Passing arrays to Python functions needs a bit of extra work, because Inline::Python flattens arrays. Using an extra $ sigil in front of an array puts it into an extra scalar, and thus prevents the flattening.
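
As a standalone illustration of that itemization (independent of Inline::Python), flat descends into a bare array but not into an itemized one:

my @dates     = <2017-03-01 2017-03-02 2017-03-03>;
my $flattened = (@dates, 'total').flat;
my $itemized  = ($@dates, 'total').flat;
say $flattened.elems;            # 4, the array's elements spill out
say $itemized.elems;             # 2, $@dates stays a single item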

Now we can actually plot the number of commits by author, add a legend, and plot the result:

for @top-authors -> $author {
    my @dates = %by-author{$author}.keys.sort;
    my @counts = %by-author{$author}{@dates};
    $subplots.plot:
        $[ @dates.map(&pydate) ],
        $@counts,
        label     => $author,
        marker    =>'.',
        linestyle => '';
}


$subplots.legend(loc=>'upper center', shadow=>True);

plot('title', 'Contributions per day');
plot('show');

When run in the zef git repository, it produces this plot:

Contributions to zef, a Perl 6 module installer

Summary

We've explored how to use the python library matplotlib to generate a plot from git contribution statistics. Inline::Python provides convenient functionality for accessing python libraries from Perl 6 code.

In the next installment, we'll explore ways to improve both the graphics and the glue code between Python and Perl 6.

Subscribe to the Perl 6 book mailing list


Perl Hacks: What Training Should I Run In Amsterdam?

The Perl Conference (formerly known as YAPC) in Amsterdam is getting closer. Oh, sure, it’s not imminent, but in five months time it will all be over. And there’s a lot to get done in those five months. I’m glad I’m not one of the organisers.

But there is something that I need to get organised over the next couple of months. It looks likely that there will be training courses running before or after the main conference and, assuming that happens, I’d like to be running one of those courses.

Last year, the “Modern Web Development with Perl” course that I ran in Cluj-Napoca seemed to be very successful (it certainly had the most attendees of any course I’ve run alongside a YAPC) and I think that was down to two factors:

  1. We planned and announced the course nice and early.
  2. I asked you what course I should run.

I’m not going to mess with a successful formula, so I’m going to take the same approach this year. Consider this my “what course should I run?” post.

This is how it will work. In this post I’ll make a few suggestions of courses. We can discuss them in the comments and you can add your own suggestions. In a few weeks time, I’ll pull out the most popular suggestions and put it to a public vote. I’ll run the course that gets the most votes.

So what courses could I run?

There are courses that I’ve run many times and that would only need light updating. I have a course on DBIx::Class (I ran that in Granada in 2015), one on Moose and one on testing. I’d be happy to do any of those.

At the LPW last year, I ran a “Modern Perl Update” session which seemed to go down pretty well. I went through the last few Perl releases and explained the new and changed features. It was only a couple of hours long, but I could expand it. Perhaps I could add some stuff about CPAN modules that people don’t seem to know about.

I could re-run the Dancer course from last year. In a day, the class went from nothing to writing a functional and useful Dancer application. Perhaps there’s a big enough audience to do that again.

Or, perhaps, some kind of extension to last year’s course. I don’t mean that you would need to have done the previous course in order to find it useful, but maybe something about integrating Perl web tools with a modern web development toolkit. Using Angular or React as the front end to a Perl backend. Or how about writing APIs in Perl?

I should point out that there are some things that I’m not the right person to teach. Perl 6 is top of that list. Not only have I not had the time to really explore Perl 6 yet, but Damian Conway is going to be at the conference and I fully expect him to clean up on the Perl 6 training front.

So there are half a dozen suggestions. What do you think? Are you coming to Amsterdam? Would you (or your company) pay extra for a training course? What course would you like to see?

Let me know your thoughts.

The post What Training Should I Run In Amsterdam? appeared first on Perl Hacks.

Dave's Free Press: Journal: Module pre-requisites analyser

Dave's Free Press: Journal: CPANdeps

Dave's Free Press: Journal: Perl isn't dieing

Dave's Free Press: Journal: YAPC::Europe 2007 report: day 3

Dave's Free Press: Journal: Devel::CheckLib can now check libraries' contents

Ocean of Awareness: Top-down parsing is guessing

Top-down parsing is guessing. Literally. Bottom-up parsing is looking.

The way you'll often hear that phrased is that top-down parsing is looking, starting at the top, and bottom-up parsing is looking, starting at the bottom. But that is misleading, because the input is at the bottom -- at the top there is nothing to look at. A usable top-down parser must have a bottom-up component, even if that component is just lookahead.

A more generous, but still accurate, way to describe the top-down component of parsers is "prediction". And prediction is, indeed, a very useful component of a parser, when used in combination with other techniques.

Of course, if a parser does nothing but predict, it can predict only one input. Top-down parsing must always be combined with a bottom-up component. This bottom-up component may be as modest as lookahead, but it must be there or else top-down parsing is really not parsing at all.

So why is top-down parsing used so much?

Top-down parsing may be unusable in its pure form, but from one point of view that is irrelevant. Top-down parsing's biggest advantage is that it is highly flexible -- there's no reason to stick to its "pure" form.

A top-down parser can be written as a series of subroutine calls -- a technique called recursive descent. Recursive descent allows you to hook in custom-written bottom-up logic at every top-down choice point, and it is a technique which is completely understandable to programmers with little or no training in parsing theory. When dealing with recursive descent parsers, it is more useful to be a seasoned, far-thinking programmer than it is to be a mathematician. This makes recursive descent very appealing to seasoned, far-thinking programmers, and they are the audience that counts.

Switching techniques

You can even use the flexibility of top-down to switch away from top-down parsing. For example, you could claim that a top-down parser could do anything my own parser (Marpa) could do, because a recursive descent parser can call a Marpa parser.

A less dramatic switchoff, and one that still leaves the parser with a good claim to be basically top-down, is very common. Arithmetic expressions are essential for a computer language. But they are also among the many things top-down parsing cannot handle, even with ordinary lookahead. Even so, most computer languages these days are parsed top-down -- by recursive descent. These recursive descent parsers deal with expressions by temporarily handing control over to a bottom-up operator precedence parser. Neither of these parsers is extremely smart about the hand-over and hand-back -- it is up to the programmer to make sure the two play together nicely. But used with caution, this approach works.

Top-down parsing and language-oriented programming

But what about taking top-down methods into the future of language-oriented programming, extensible languages, and grammars which write grammars? Here we are forced to confront the reality -- that the effectiveness of top-down parsing comes entirely from the foreign elements that are added to it. Starting from a basis of top-down parsing is literally starting with nothing. As I have shown in more detail elsewhere, top-down techniques simply do not have enough horsepower to deal with grammar-driven programming.

Perl 6 grammars are top-down -- PEG with lots of extensions. These extensions include backtracking, backtracking control, a new style of tie-breaking and lots of opportunity for the programmer to intervene and customize everything. But behind it all is a top-down parse engine.

One aspect of Perl 6 grammars might be seen as breaking out of the top-down trap. That trick of switching over to a bottom-up operator precedence parser for expressions, which I mentioned above, is built into Perl 6 and semi-automated. (I say semi-automated because making sure the two parsers "play nice" with each other is not automated -- that's still up to the programmer.)

As far as I know, this semi-automation of expression handling is new with Perl 6 grammars, and it may prove handy for duplicating what is done in recursive descent parsers. But it adds no new technique to those already in use. And features like

  • multiple types of expression, which can be told apart based on their context,
  • n-ary expressions for arbitrary n, and
  • the autogeneration of multiple rules, each allowing a different precedence scheme, for expressions of arbitrary arity and associativity,

all of which are available and in current use in Marpa, are impossible for the technology behind Perl 6 grammars.

I am a fan of the Perl 6 effort. Obviously, I have doubts about one specific set of hopes for Perl 6 grammars. But these hopes have not been central to the Perl 6 effort, and I will be an eager student of the Perl 6 team's work over the coming months.

Comments

To learn more about Marpa, there's the official web site maintained by Ron Savage. I also have a Marpa web site. Comments on this post can be made in Marpa's Google group, or on our IRC channel: #marpa at freenode.net.

Dave's Free Press: Journal: I Love Github

Dave's Free Press: Journal: Palm Treo call db module

Dave's Free Press: Journal: Graphing tool

Dave's Free Press: Journal: Travelling in time: the CP2000AN

Dave's Free Press: Journal: XML::Tiny released

Dave's Free Press: Journal: YAPC::Europe 2007 report: day 1

Ocean of Awareness: Parsing: an expanded timeline

The fourth century BCE: In India, Panini creates a sophisticated description of the Sanskrit language, exact and complete, and including pronunciation. Sanskrit could be recreated using nothing but Panini's grammar. Panini's grammar is probably the first formal system of any kind, predating Euclid. Even today, nothing like it exists for any other natural language of comparable size or corpus. Panini is the object of serious study today. But in the 1940's and 1950's Panini is almost unknown in the West. His work has no direct effect on the other events in this timeline.

1943: Emil Post defines and studies a formal rewriting system using productions. With this, the process of reinventing Panini in the West begins.

1948: Claude Shannon publishes the foundation paper of information theory. Andrey Markov's finite state processes are used heavily.

1952: Grace Hopper writes a linker-loader and describes it as a "compiler". She seems to be the first person to use this term for a computer program. Hopper uses the term "compiler" in its original sense: "something or someone that brings other things together".

1954: At IBM, a team under John Backus begins working on the language which will be called FORTRAN. The term "compiler" is still being used in Hopper's looser sense, instead of its modern one. In particular, there is no implication that the output of a "compiler" is ready for execution by a computer. The output of one 1954 "compiler", for example, produces relative addresses, which need to be translated by hand before a machine can execute them.

1955: Noam Chomsky is awarded a Ph.D. in linguistics and accepts a teaching post at MIT. MIT does not have a linguistics department and Chomsky, in his linguistics course, is free to teach his own approach, highly original and very mathematical.

1956: Chomsky publishes the paper which is usually considered the foundation of Western formal language theory. The paper advocates a natural language approach that involves

  • a bottom layer, using Markov's finite state processes;
  • a middle, syntactic layer, using context-free grammars and context-sensitive grammars; and
  • a top layer, which involves mappings or "transformations" of the output of the syntactic layer.

These layers resemble, and will inspire, the lexical, syntactic and AST transformation phases of modern parsers. For finite state processes, Chomsky acknowledges Markov. The other layers seem to be Chomsky's own formulations -- Chomsky does not cite Post's work.

1957: Stephen Kleene discovers regular expressions, a very handy notation for Markov's processes. Regular expressions turn out to describe exactly the mathematical objects being studied as finite state automata, as well as some of the objects being studied as neural nets.

1957: Noam Chomsky publishes Syntactic Structures, one of the most influential books of all time. The orthodoxy in 1957 is structural linguistics which argues, with Sherlock Holmes, that "it is a capital mistake to theorize in advance of the facts". Structuralists start with the utterances in a language, and build upward.

But Chomsky claims that without a theory there are no facts: there is only noise. The Chomskyan approach is to start with a grammar, and use the corpus of the language to check its accuracy. Chomsky's approach will soon come to dominate linguistics.

1957: Backus's team makes the first FORTRAN compiler available to IBM customers. FORTRAN is the first high-level language that will find widespread implementation. As of this writing, it is the oldest language that survives in practical use. FORTRAN is a line-by-line language and its parsing is primitive.

1958: John McCarthy's LISP appears. LISP goes beyond the line-by-line syntax -- it is recursively structured. But the LISP interpreter does not find the recursive structure: the programmer must explicitly indicate the structure herself, using parentheses.

1959: Backus invents a new notation to describe the IAL language (aka ALGOL). Backus's notation is influenced by his study of Post -- he seems not to have read Chomsky until later.

1960: Peter Naur improves the Backus notation and uses it to describe ALGOL 60. The improved notation will become known as Backus-Naur Form (BNF).

1960: The ALGOL 60 report specifies, for the first time, a block structured language. ALGOL 60 is recursively structured but the structure is implicit -- newlines are not semantically significant, and parentheses indicate syntax only in a few specific cases. The ALGOL compiler will have to find the structure. It is a case of 1960's optimism at its best. As the ALGOL committee is well aware, a parsing algorithm capable of handling ALGOL 60 does not yet exist. But the risk they are taking will soon pay off.

1960: A.E. Glennie publishes his description of a compiler-compiler. Glennie's "universal compiler" is more of a methodology than an implementation -- the compilers must be written by hand. Glennie credits both Chomsky and Backus, and observes that the two notations are "related". He also mentions Post's productions. Glennie may have been the first to use BNF as a description of a procedure instead of as the description of a Chomsky grammar. Glennie points out that the distinction is "important".

Chomskyan BNF and procedural BNF: BNF, when used as a Chomsky grammar, describes a set of strings, and does not describe how to parse strings according to the grammar. BNF notation, if used to describe a procedure, is a set of instructions, to be tried in some order, and used to process a string. Procedural BNF describes a procedure first, and a language only indirectly.

Both procedural and Chomskyan BNF describe languages, but usually not the same language. That is,

  • Suppose D is some BNF description.
  • Let P(D) be D interpreted as a procedure,
  • Let L(P(D)) be the language which the procedure P(D) parses.
  • Let G(D) be D interpreted as a Chomsky grammar.
  • Let L(G(D)) be the language which the grammar G(D) describes.
  • Then, usually, L(P(D)) != L(G(D)).

The pre-Chomskyan approach, using procedural BNF, is far more natural to someone trained as a computer programmer. The parsing problem appears to the programmer in the form of strings to be parsed, exactly the starting point of procedural BNF and pre-Chomsky parsing.

Even when the Chomskyan approach is pointed out, it does not at first seem very attractive. With the pre-Chomskyan approach, the examples of the language more or less naturally lead to a parser. In the Chomskyan approach the programmer has to search for an algorithm to parse strings according to his grammar -- and the search for good algorithms to parse Chomskyan grammars has proved surprisingly long and difficult. Handling semantics is more natural with a Chomskyan approach. But, using captures, semantics can be added to a pre-Chomskyan parser and, with practice, this seems natural enough.

Despite the naturalness of the pre-Chomskyan approach to parsing, we will find that the first fully-described automated parsers are Chomskyan. This is a testimony to Chomsky's influence at the time. We will also see that Chomskyan parsers have been dominant ever since.

1961: In January, Ned Irons publishes a paper describing his ALGOL 60 parser. It is the first paper to fully describe any parser. The Irons algorithm is Chomskyan and top-down with a "left corner" element. The Irons algorithm is general, meaning that it can parse anything written in BNF. It is syntax-driven (aka declarative), meaning that the parser is actually created from the BNF -- the parser does not need to be hand-written.

1961: Peter Lucas publishes the first description of a purely top-down parser. This can be considered to be recursive descent, though in Lucas's paper the algorithm has a syntax-driven implementation, useable only for a restricted class of grammars. Today we think of recursive descent as a methodology for writing parsers by hand. Hand-coded approaches became more popular in the 1960's due to three factors:

  • Memory and CPU were both extremely limited. Hand-coding paid off, even when the gains were small.
  • Non-hand coded top-down parsing, of the kind Lucas's syntax-driven approach allowed, is a very weak parsing technique. It was (and still is) often necessary to go beyond its limits.
  • Top-down parsing is intuitive -- it essentially means calling subroutines. It therefore requires little or no knowledge of parsing theory. This makes it a good fit for hand-coding.

1963: L. Schmidt, Howard Metcalf, and Val Schorre present papers on syntax-directed compilers at a Denver conference.

1964: Schorre publishes a paper on the Meta II "compiler writing language", summarizing the papers of the 1963 conference. Schorre cites both Backus and Chomsky as sources for Meta II's notation. Schorre notes that his parser is "entirely different" from that of Irons 1961 -- in fact it is pre-Chomskyan. Meta II is a template, rather than something that readers can use, but in principle it can be turned into a fully automated compiler-compiler.

1965: Don Knuth invents LR parsing. The LR algorithm is deterministic, Chomskyan and bottom-up, but it is not thought to be practical. Knuth is primarily interested in the mathematics.

1968: Jay Earley invents the algorithm named after him. Like the Irons algorithm, Earley's algorithm is Chomskyan, syntax-driven and fully general. Unlike the Irons algorithm, it does not backtrack. Earley's algorithm is both top-down and bottom-up at once -- it uses dynamic programming and keeps track of the parse in tables. Earley's approach makes a lot of sense and looks very promising indeed, but there are three serious issues:

  • First, there is a bug in the handling of zero-length rules.
  • Second, it is quadratic for right recursions.
  • Third, the bookkeeping required to set up the tables is, by the standards of 1968 hardware, daunting.

1969: Frank DeRemer describes a new variant of Knuth's LR parsing. DeRemer's LALR algorithm requires only a stack and a state table of quite manageable size. LALR looks practical.

1969: Ken Thompson writes the "ed" editor as one of the first components of UNIX. At this point, regular expressions are an esoteric mathematical formalism. Through the "ed" editor and its descendants, regular expressions will become an everyday part of the working programmer's toolkit.

Recognizers: In comparing algorithms, it can be important to keep in mind whether they are recognizers or parsers. A recognizer is a program which takes a string and produces a "yes" or "no" according to whether or not the string is in the language. Regular expressions are typically used as recognizers. A parser is a program which takes a string and produces a tree reflecting its structure according to a grammar. The algorithm for a compiler clearly must be a parser, not a recognizer. Recognizers can be, to some extent, used as parsers by introducing captures.
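
As a small Perl illustration of the distinction, here is a regular expression used first as a pure recognizer and then, via captures, pressed into service as a very shallow parser. The date format is simply an example of my own choosing.

    use strict;
    use warnings;

    my $input = '2017-03-27';

    # As a recognizer: the only output is yes or no.
    print "recognized\n"
        if $input =~ /\A[0-9]{4}-[0-9]{2}-[0-9]{2}\z/;

    # Pressed into service as a parser: captures recover a little
    # structure, though nothing like a full parse tree.
    if ( $input =~ /\A([0-9]{4})-([0-9]{2})-([0-9]{2})\z/ ) {
        my ( $year, $month, $day ) = ( $1, $2, $3 );
        print "year=$year month=$month day=$day\n";
    }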

1972: Alfred Aho and Jeffrey Ullman publish a two volume textbook summarizing the theory of parsing. This book is still important. It is also distressingly up-to-date -- progress in parsing theory slowed dramatically after 1972. Aho and Ullman describe a straightforward fix to the zero-length rule bug in Earley's original algorithm. Unfortunately, this fix involves adding even more bookkeeping to Earley's.

1972: Under the names TDPL and GTDPL, Aho and Ullman investigate the non-Chomskyan parsers in the Schorre lineage. They note that "it can be quite difficult to determine what language is defined by a TDPL parser". That is, GTDPL parsers do whatever they do, and that whatever is something the programmer in general will not be able to describe. The best a programmer can usually do is to create a test suite and fiddle with the GTDPL description until it passes. Correctness cannot be established in any stronger sense. GTDPL is an extreme form of the old joke that "the code is the documentation" -- with GTDPL nothing documents the language of the parser, not even the code.

GTDPL's obscurity buys nothing in the way of additional parsing power. Like all non-Chomskyan parsers, GTDPL is basically an extremely powerful recognizer. Pressed into service as a parser, it is comparatively weak. As a parser, GTDPL is essentially equivalent to Lucas's 1961 syntax-driven algorithm, which was in turn a restricted form of recursive descent.

At or around this time, rumor has it that the main line of development for GTDPL parsers is classified secret by the US government. GTDPL parsers have the property that even small changes to them can be very labor-intensive. For some government contractors, GTDPL parsing provides steady work for years to come. Public interest in GTDPL fades.

1975: Bell Labs converts its C compiler from hand-written recursive descent to DeRemer's LALR algorithm.

1977: The first "Dragon book" comes out. This soon-to-be classic textbook is nicknamed after the drawing on the front cover, in which a knight takes on a dragon. Emblazoned on the knight's lance are the letters "LALR". From here on out, to speak lightly of LALR will be to besmirch the escutcheon of parsing theory.

1979: Bell Laboratories releases Version 7 UNIX. V7 includes what is, by far, the most comprehensive, useable and easily available compiler writing toolkit yet developed.

1979: Part of the V7 toolkit is Yet Another Compiler Compiler (YACC). YACC is LALR-powered. Despite its name, YACC is the first compiler-compiler in the modern sense. For some useful languages, the process of going from Chomskyan specification to executable is fully automated. Most practical languages, including the C language and YACC's own input language, still require manual hackery. Nonetheless, after two decades of research, it seems that the parsing problem is solved.

1987: Larry Wall introduces Perl 1. Perl embraces complexity like no previous language. Larry uses YACC and LALR very aggressively -- to my knowledge more aggressively than anyone before or since.

1991: Joop Leo discovers a way of speeding up right recursions in Earley's algorithm. Leo's algorithm is linear for just about every unambiguous grammar of practical interest, and many ambiguous ones as well. In 1991 hardware is six orders of magnitude faster than 1968 hardware, so the issue of bookkeeping overhead has receded in importance. This is a major discovery. When it comes to speed, the game has changed in favor of the Earley algorithm.

But Earley parsing is almost forgotten. Twenty years will pass before anyone writes a practical implementation of Leo's algorithm.

1990's: Earley's is forgotten. So everyone in LALR-land is content, right? Wrong. Far from it, in fact. Users of LALR are making unpleasant discoveries. While LALR automatically generates their parsers, debugging them is so hard they could just as easily write the parser by hand. Once debugged, their LALR parsers are fast for correct inputs. But almost all they tell the users about incorrect inputs is that they are incorrect. In Larry's words, LALR is "fast but stupid".

2000: Larry Wall decides on a radical reimplementation of Perl -- Perl 6. Larry does not even consider using LALR again.

2002: John Aycock and R. Nigel Horspool publish their attempt at a fast, practical Earley's parser. Missing from it is Joop Leo's improvement -- they seem not to be aware of it. Their own speedup is limited in what it achieves and the complications it introduces can be counter-productive at evaluation time. But buried in their paper is a solution to the zero-length rule bug. And this time the solution requires no additional bookkeeping.

2004: Bryan Ford publishes his paper on PEG. Implementers by now are avoiding YACC, and it seems as if there might soon be no syntax-driven algorithms in practical use. Ford fills this gap by repackaging the nearly-forgotten GTDPL. Ford adds packratting, so that PEG is always linear, and provides PEG with an attractive new syntax. But nothing has been done to change the problematic behaviors of GTDPL.

2006: GNU announces that the GCC compiler's parser has been rewritten. For three decades, the industry's flagship C compilers have used LALR as their parser -- proof of the claim that LALR and serious parsing are equivalent. Now, GNU replaces LALR with the technology that it replaced a quarter century earlier: recursive descent.

Today: After five decades of parsing theory, the state of the art seems to be back where it started. We can imagine someone taking Ned Irons's original 1961 algorithm from the first paper ever published describing a parser, and republishing it today. True, he would have to translate its code from the mix of assembler and ALGOL into something more fashionable, say Haskell. But with that change, it might look like a breath of fresh air.

Marpa: an afterword

The recollections of my teachers cover most of this timeline. My own begin around 1970. Very early on, as a graduate student, I became unhappy with the way the field was developing. Earley's algorithm looked interesting, and it was something I returned to on and off.

The original vision of the 1960's was a parser that was

  • efficient,
  • practical,
  • general, and
  • syntax-driven.

By 2010 this vision seemed to have gone the same way as many other 1960's dreams. The rhetoric stayed upbeat, but parsing practice had become a series of increasingly desperate compromises.

But, while nobody was looking for them, the solutions to the problems encountered in the 1960's had appeared in the literature. Aycock and Horspool had solved the zero-length rule bug. Joop Leo had found the speedup for right recursion. And the issue of bookkeeping overhead had pretty much evaporated on its own. Machine operations are now a billion times faster than in 1968, and are probably no longer relevant in any case -- cache misses are now the bottleneck.

The programmers of the 1960's would have been prepared to trust a fully declarative Chomskyan parser. With the experience with LALR in their collective consciousness, modern programmers might be more guarded. As Lincoln said, "Once a cat's been burned, he won't even sit on a cold stove." But I found it straightforward to rearrange the Earley parse engine to allow efficient event-driven handovers between procedural and syntax-driven logic. And Earley tables provide the procedural logic with full knowledge of the state of the parse so far, so that Earley's algorithm is a better platform for hand-written procedural logic than recursive descent.

References, comments, etc.

My implementation of Earley's algorithm is called Marpa. For more about Marpa, there is the semi-official web site, maintained by Ron Savage. The official, but more limited, Marpa website is my personal one. Comments on this post can be made in Marpa's Google group, or on our IRC channel: #marpa at freenode.net.

Dave's Free Press: Journal: Thanks, Yahoo!

Dave's Free Press: Journal: YAPC::Europe 2007 report: day 2

Ocean of Awareness: What are the reasonable computer languages?

"You see things; and you say 'Why?' But I dream things that never were; and I say 'Why not?'" -- George Bernard Shaw

In the 1960's and 1970's computer languages were evolving rapidly. It was not clear which way they were headed. Would most programming be done with general-purpose languages? Or would programmers create a language for every task domain? Or even for every project? And, if lots of languages were going to be created, what kinds of languages would be needed?

It was in that context that Čulik and Cohen, in a 1973 paper, outlined what they thought programmers would want and should have. In keeping with the spirit of the time, it was quite a lot:

  • Programmers would want to extend their grammars with new syntax, including new kinds of expressions.
  • Programmers would also want to use tools that automatically generated new syntax.
  • Programmers would not want to, and especially in the case of auto-generated syntax would usually not be able to, massage the syntax into very restricted forms. Instead, programmers would create grammars and languages which required unlimited lookahead to disambiguate, and they would require parsers which could handle these grammars.
  • Finally, programmers would need to be able to rely on all of this parsing being done in linear time.

Today, we think we know that Čulik and Cohen's vision was naive, because we think we know that parsing technology cannot support it. We think we know that parsing is much harder than they thought.

The eyeball grammars

As a thought problem, consider the "eyeball" class of grammars. The "eyeball" class of grammars contains all the grammars that a human can parse at a glance. If a grammar is in the eyeball class, but a computer cannot parse it, it presents an interesting choice. Either,

  • your computer is not using the strongest practical algorithm; or
  • your mind is using some power which cannot be reduced to a machine computation.

There are some people out there (I am one of them) who don't believe that everything the mind can do reduces to a machine computation. But even those people will tend to go for the first choice in this case: There must be some practical computer parsing algorithm which can do at least as well at parsing as a human can do by "eyeball". In other words, the class of "reasonable grammars" should contain the eyeball class.

Čulik and Cohen's candidate for the class of "reasonable grammars" was the set of grammars that a deterministic parse engine could parse if it had a lookahead that was infinite, but restricted to distinguishing between regular expressions. They called these the LR-regular, or LRR, grammars. And the LRR grammars do in fact seem to be a good first approximation to the eyeball class. They do not allow lookahead that contains things that you have to count, like palindromes. And, while I'd be hard put to eyeball every possible string for every possible regular expression, intuitively the concept of scanning for a regular expression does seem close to capturing the idea of glancing through a text looking for a telltale pattern.

So what happened?

Alas, the algorithm in the Čulik and Cohen paper turned out to be impractical. But in 1991, Joop Leo discovered a way to adapt Earley's algorithm to parse the LRR grammars in linear time, without doing the lookahead. And Leo's algorithm does have a practical implementation: Marpa.

References, comments, etc.

To learn more about Marpa, there's the official web site maintained by Ron Savage. I also have a Marpa web site. Comments on this post can be made in Marpa's Google group, or on our IRC channel: #marpa at freenode.net.

Dave's Free Press: Journal: YAPC::Europe 2007 travel plans

Dave's Free Press: Journal: Wikipedia handheld proxy

Dave's Free Press: Journal: Bryar security hole

Dave's Free Press: Journal: POD includes

Dave's Free Press: Journal: cgit syntax highlighting

Dave's Free Press: Journal: CPAN Testers' CPAN author FAQ

Ocean of Awareness: What parser do birds use?

"Here we provide, to our knowledge, the first unambiguous experimental evidence for compositional syntax in a non-human vocal system." -- "Experimental evidence for compositional syntax in bird calls", Toshitaka N. Suzuki, David Wheatcroft & Michael Griesser Nature Communications 7, Article number: 10986

In this post I look at a subset of the language of the Japanese great tit, also known as Parus major. The above cited article presents evidence that bird brains can parse this language. What about standard modern computer parsing methods? Here is the subset -- probably a tiny one -- of the language actually used by Parus major.

      S ::= ABC
      S ::= D
      S ::= ABC D
      S ::= D ABC
    

Classifying the Parus major grammar

Grammophone is a very handy new tool for classifying grammars. Its own parser is somewhat limited, so that it requires a period to mark the end of a rule. The above grammar is in Marpa's SLIF format, which is smart enough to use the "::=" operator to spot the beginning and end of rules, just as the human eye does. Here's the same grammar converted into a form acceptable to Grammophone:

      S -> ABC .
      S -> D .
      S -> ABC D .
      S -> D ABC .
    

Grammophone tells us that the Parus major grammar is not LL(1), but that it is LALR(1).

What does this mean?

LL(1) is the class of grammars parseable by top-down methods: it's the best class for characterizing most parsers in current use, including recursive descent, PEG, and Perl 6 grammars. All of these parsers fall short of dealing with the Parus major language.

LALR(1) is probably most well-known from its implementations in bison and yacc. While able to handle this subset of Parus's language, LALR(1) has its own, very strict, limits. Whether LALR(1) could handle the full complexity of the Parus language is a serious question. But it's a question that in practice would probably not arise. LALR(1) has horrible error handling properties.

When the input is correct and within its limits, an LALR-driven parser is fast and works well. But if the input is not perfectly correct, LALR parsers produce no useful analysis of what went wrong. If Parus hears "d abc d", a parser like Marpa, on the other hand, can produce something like this:

# * String before error: abc d\s
# * The error was at line 1, column 7, and at character 0x0064 'd', ...
# * here: d
    

Parus uses its language in predatory contexts, and one can assume that a Parus with a preference for parsers whose error handling is on an LALR(1) level will not be keeping its alleles in the gene pool for very long.
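
For readers who want to try this at home, here is a minimal sketch of how the Parus subset might be handed to Marpa::R2. The lexeme spellings ('abc' and 'd') and the whitespace handling are my own assumptions for illustration; the actual test script is in the Github gist mentioned below.

    use strict;
    use warnings;
    use Marpa::R2;

    # The grammar from above, in SLIF form; lexeme spellings are assumed.
    my $dsl = q{
        :default ::= action => [name,values]
        lexeme default = latm => 1

        S ::= ABC | D | ABC D | D ABC

        ABC ~ 'abc'
        D   ~ 'd'

        :discard   ~ whitespace
        whitespace ~ [\s]+
    };

    my $grammar = Marpa::R2::Scanless::G->new( { source => \$dsl } );

    for my $call ( 'abc', 'd', 'abc d', 'd abc', 'd abc d' ) {
        my $recce = Marpa::R2::Scanless::R->new( { grammar => $grammar } );
        my $ok = eval { $recce->read( \$call ); 1 };
        my $verdict = ( $ok && defined $recce->value() ) ? 'parsed' : 'rejected';
        print "'$call': $verdict\n";
    }

The last test string, 'd abc d', is the faulty call discussed above; the read is rejected partway through, which is where a Marpa-style diagnostic would be produced.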

References, comments, etc.

Those readers content with sub-Parus parsing methods may stop reading here. Those with greater parsing ambitions, however, may wish to learn more about Marpa. A Marpa test script for parsing the Parus subset is in a Github gist. Marpa has a semi-official web site, maintained by Ron Savage. The official, but more limited, Marpa website is my personal one. Comments on this post can be made in Marpa's Google group, or on our IRC channel: #marpa at freenode.net.

Dave's Free Press: Journal: Thankyou, Anonymous Benefactor!

Dave's Free Press: Journal: Number::Phone release

Dave's Free Press: Journal: Ill

Dave's Free Press: Journal: CPANdeps upgrade

Ocean of Awareness: Introduction to Marpa Book in progress

What follows is a summary of the features of the Marpa algorithm, followed by a discussion of potential applications. It refers to itself as a "monograph", because it is a draft of part of the introduction to a technical monograph on the Marpa algorithm. I hope the entire monograph will appear in a few weeks.

The Marpa project

The Marpa project was intended to create a practical and highly available tool to generate and use general context-free parsers. Tools of this kind had long existed for LALR and regular expressions. But, despite an encouraging academic literature, no such tool had existed for context-free parsing. The first stable version of Marpa was uploaded to a public archive on Solstice Day 2011. This monograph describes the algorithm used in the most recent version of Marpa, Marpa::R2. It is a simplification of the algorithm presented in my earlier paper.

A proven algorithm

While the presentation in this monograph is theoretical, the approach is practical. The Marpa::R2 implementation has been widely available for some time, and has seen considerable use, including in production environments. Many of the ideas in the parsing literature satisfy theoretical criteria, but in practice turn out to face significant obstacles. An algorithm may be as fast as reported, but may turn out not to allow adequate error reporting. Or a modification may speed up the recognizer, but require additional processing at evaluation time, leaving no advantage to compensate for the additional complexity.

In this monograph, I describe the Marpa algorithm as it was implemented for Marpa::R2. In many cases, I believe there are better approaches than those I have described. But I treat these techniques, however solid their theory, as conjectures. Whenever I mention a technique that was not actually implemented in Marpa::R2, I will always explicitly state that that technique is not in Marpa as implemented.

Features

General context-free parsing

As implemented, Marpa parses all "proper" context-free grammars. The proper context-free grammars are those which are free of cycles, unproductive symbols, and inaccessible symbols. Worst case time bounds are never worse than those of Earley's algorithm, and therefore never worse than O(n**3).

Linear time for practical grammars

Currently, the grammars suitable for practical use are thought to be a subset of the deterministic context-free grammars. Using a technique discovered by Joop Leo, Marpa parses all of these in linear time. Leo's modification of Earley's algorithm is O(n) for LR-regular grammars. Leo's modification also parses many ambiguous grammars in linear time.

Left-eidetic

The original Earley algorithm kept full information about the parse --- including partial and fully recognized rule instances --- in its tables. At every parse location, before any symbols are scanned, Marpa's parse engine makes available its information about the state of the parse so far. This information is in useful form, and can be accessed efficiently.
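
For example, the Scanless interface can report, at any point, which rule instances are predicted, in progress, or completed, and where each one began. A minimal sketch, using a toy grammar of my own and assuming the show_progress() method behaves as I recall from the Marpa::R2 documentation:

    use strict;
    use warnings;
    use Marpa::R2;

    # Toy grammar chosen purely for illustration.
    my $dsl = q{
        lexeme default = latm => 1
        pair ::= A B
        A ~ 'a'
        B ~ 'b'
    };

    my $grammar = Marpa::R2::Scanless::G->new( { source => \$dsl } );
    my $recce   = Marpa::R2::Scanless::R->new( { grammar => $grammar } );
    my $input   = 'ab';
    $recce->read( \$input );

    # Dump the progress report for the current location: which rule
    # instances are complete, which are still in progress, and the
    # location at which each one originated.
    print $recce->show_progress();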

Recoverable from read errors

When Marpa reads a token which it cannot accept, the error is fully recoverable. An application can try to read another token. The application can do this repeatedly as long as none of the tokens are accepted. Once the application provides a token that is accepted by the parser, parsing will continue as if the unsuccessful read attempts had never been made.

Ambiguous tokens

Marpa allows ambiguous tokens. These are often useful in natural language processing where, for example, the same word might be a verb or a noun. Use of ambiguous tokens can be combined with recovery from rejected tokens so that, for example, an application could react to the rejection of a token by reading two others.

Using the features

Error reporting

An obvious application of left-eideticism is error reporting. Marpa's abilities in this respect are ground-breaking. For example, users typically regard an ambiguity as an error in the grammar. Marpa, as currently implemented, can detect an ambiguity and report specifically where it occurred and what the alternatives were.
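
As a sketch of what that looks like in practice, here is a deliberately ambiguous toy grammar of my own. I am assuming the ambiguous() method of the Scanless recognizer, which as I recall returns an empty string for an unambiguous parse and otherwise a description of where the parse forked and what the alternatives were.

    use strict;
    use warnings;
    use Marpa::R2;

    # '1+2+3' can group to the left or to the right, so this is ambiguous.
    my $dsl = q{
        lexeme default = latm => 1
        E ::= E plus E | number
        plus   ~ '+'
        number ~ [\d]+
    };

    my $grammar = Marpa::R2::Scanless::G->new( { source => \$dsl } );
    my $recce   = Marpa::R2::Scanless::R->new( { grammar => $grammar } );
    my $input   = '1+2+3';
    $recce->read( \$input );

    if ( my $description = $recce->ambiguous() ) {
        print "Ambiguous parse:\n$description";
    }
    else {
        print "Parse is unambiguous\n";
    }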

Event driven parsing

As implemented, Marpa::R2 allows the user to define "events". Events can be defined that trigger when a specified rule is complete, when a specified rule is predicted, when a specified symbol is nulled, when a user-specified lexeme has been scanned, or when a user-specified lexeme is about to be scanned. A mid-rule event can be defined by adding a nulling symbol at the desired point in the rule, and defining an event which triggers when the symbol is nulled.
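
Here is a minimal sketch of a completion event in the SLIF, using a toy grammar of my own. I am assuming the read()/events()/resume() pattern as I recall it from the Marpa::R2 documentation, in which a triggered event causes read() or resume() to return so that the application can inspect it.

    use strict;
    use warnings;
    use Marpa::R2;

    my $dsl = q{
        lexeme default = latm => 1
        sentence ::= word+
        word     ::= letters
        event 'word_done' = completed word
        letters  ~ [a-z]+
        :discard ~ [\s]+
    };

    my $grammar = Marpa::R2::Scanless::G->new( { source => \$dsl } );
    my $recce   = Marpa::R2::Scanless::R->new( { grammar => $grammar } );

    my $input  = 'tweet chirp call';
    my $length = length $input;

    # read() and resume() return whenever a named event triggers;
    # events() reports what fired, and resume() continues the scan.
    my $pos = $recce->read( \$input );
    while (1) {
        for my $event ( @{ $recce->events() } ) {
            my ($name) = @{$event};
            print "event '$name' triggered at position $pos\n";
        }
        last if $pos >= $length;
        $pos = $recce->resume();
    }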

Ruby slippers parsing

Left-eideticism, efficient error recovery, and the event mechanism can be combined to allow the application to change the input in response to feedback from the parser. In traditional parser practice, error detection is an act of desperation. In contrast, Marpa's error detection is so painless that it can be used as the foundation of new parsing techniques.

For example, if a token is rejected, the lexer is free to create a new token in the light of the parser's expectations. This approach can be seen as making the parser's "wishes" come true, and I have called it "Ruby Slippers Parsing".

One use of the Ruby Slippers technique is to parse with a clean but oversimplified grammar, programming the lexical analyzer to make up for the grammar's shortcomings on the fly. As part of Marpa::R2, the author has implemented an HTML parser, based on a grammar that assumes that all start and end tags are present. Such an HTML grammar is too simple even to describe perfectly standard-conformant HTML, but the lexical analyzer is programmed to supply start and end tags as requested by the parser. The result is a simple and cleanly designed parser that parses very liberal HTML and accepts all input files, in the worst case treating them as highly defective HTML.

Ambiguity as a language design technique

In current practice, ambiguity is avoided in language design. This is very different from the practice in the languages humans choose when communicating with each other. Human languages exploit ambiguity in order to design highly flexible, powerfully expressive languages. For example, the language of this monograph, English, is notoriously ambiguous.

Ambiguity of course can present a problem. A sentence in an ambiguous language may have undesired meanings. But note that this is not a reason to ban potential ambiguity --- it is only a problem with actual ambiguity.

Syntax errors, for example, are undesired, but nobody tries to design languages to make syntax errors impossible. A language in which every input was well-formed and meaningful would be cumbersome and even dangerous: all typos in such a language would be meaningful, and the parser would never warn the user about errors, because there would be no such thing.

With Marpa, ambiguity can be dealt with in the same way that syntax errors are dealt with in current practice. The language can be designed to be ambiguous, but any actual ambiguity can be detected and reported at parse time. This exploits Marpa's ability to report exactly where and what the ambiguity is. Marpa::R2's own parser description language, the SLIF, uses ambiguity in this way.

Auto-generated languages

In 1973, Čulik and Cohen pointed out that the ability to efficiently parse LR-regular languages opens the way to auto-generated languages. In particular, Čulik and Cohen note that a parser which can parse any LR-regular language will be able to parse a language generated using syntax macros.

Second order languages

In the literature, the term "second order language" is usually used to describe languages with features which are useful for second-order programming. True second-order languages --- languages which are auto-generated from other languages --- have not been seen as practical, since there was no guarantee that the auto-generated language could be efficiently parsed.

With Marpa, this barrier falls. As an example, Marpa::R2's own parser description language, the SLIF, allows "precedenced rules". Precedenced rules are specified in an extended BNF. The BNF extensions allow precedence and associativity to be specified for each RHS alternative.

Marpa::R2's precedenced rules are implemented as a true second order language. The SLIF representation of the precedenced rule is parsed to create a BNF grammar which is equivalent, and which has the desired precedence. Essentially, the SLIF does a standard textbook transformation. The transformation starts with a set of rules, each of which has a precedence and an associativity specified. The result of the transformation is a set of rules in pure BNF. The SLIF's advantage is that it is powered by Marpa, and therefore the SLIF can be certain that the grammar that it auto-generates will parse in linear time.

Notationally, Marpa's precedenced rules are an improvement over similar features in LALR-based parser generators like yacc or bison. In the SLIF, there are two important differences. First, in the SLIF's precedenced rules, precedence is generalized, so that it does not depend on the operators: there is no need to identify operators, much less class them as binary, unary, etc. This more powerful and flexible precedence notation allows the definition of multiple ternary operators, and multiple operators with arity above three.

Second, and more important, a SLIF user is guaranteed to get exactly the language that the precedenced rule specifies. The user of the yacc equivalent must hope their syntax falls within the limits of LALR.
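
As an illustration, here is a precedenced rule in SLIF syntax, loosely modeled on the calculator examples in the Marpa::R2 documentation. Alternatives separated by a single bar share a precedence level; a double bar starts a new, looser level; and associativity is an adverb on the individual alternative.

    Expression ::=
           Number
         | '(' Expression ')'          assoc => group
        || Expression '**' Expression  assoc => right
        || Expression '*' Expression
         | Expression '/' Expression
        || Expression '+' Expression
         | Expression '-' Expression
    Number ~ [\d]+

The operators here appear only as ordinary symbols on the right-hand side; nothing in the notation identifies them as operators, which is what is meant above by precedence being independent of the operators.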

References, comments, etc.

Marpa has a semi-official web site, maintained by Ron Savage. The official, but more limited, Marpa website is my personal one. Comments on this post can be made in Marpa's Google group, or on our IRC channel: #marpa at freenode.net.

Dave's Free Press: Journal: YAPC::Europe 2006 report: day 3
