Perl Foundation News: Perl 6 IO Grant: April 2017 Report

Zoffix Znet provided this report on April 19, 2017

Perl 6 IO TPF Grant: Monthly Report (April, 2017)

This document is the April, 2017 progress report for the TPF Standardization, Test Coverage, and Documentation of Perl 6 I/O Routines grant

Timing

As proposed to and approved by the Grant Manager, I've extended the due date for this grant by one extra month, in exchange for doing some extra optimization work on IO routines at no extra cost. The new completion date is May 22nd, right after the next Rakudo compiler release.

Communications

I've created and published three notices as part of this grant, informing users of what is changing and how best to upgrade their code, where needed:

IO Action Plan Progress

Most of the IO Action Plan has been implemented and shipped in Rakudo's 2017.04.2 release. The remaining items are:

  • Implement a better way to signal closed-handle status (omitted from the release because the original suggestion for doing this was not ideal; it is likely better to do this on the VM end)
  • Implement IO::CatHandle as a generalized IO::ArgFiles (omitted from the release because it was decided to make it mostly usable wherever an IO::Handle can be used, and IO::ArgFiles is far from that, having only a handful of methods implemented)
  • Optimization of the way we perform stat calls for multiple file tests (an entirely internal change that requires no changes to users' code)

Documentation and Coverage Progress

In my spreadsheet with all the IO routines and their candidates, the totals show that 40% have been documented and tested. Some of the remaining 60% may already have had tests/docs added while implementing the IO Action Plan or earlier, and merely need checking and verification.

Optimizations

Some of the optimizations I promised to deliver in exchange for the grant deadline extension have already been done on IO::Spec::Unix and IO::Path routines and made it into the 2017.04.2 release. Most of the optimizations to be done in the upcoming month will be in IO::Spec::Win32 and will largely benefit Windows users.

IO Optimizations in 2017.04.2 Done by Other Core Members:

  • Elizabeth Mattijsen made .slurp 2x faster rakudo/b4d80c0
  • Samantha McVey made nqp::index—which is used in path operations—2x faster rakudo/f1fc879
  • IO::Pipe.lines was made 3.2x faster by relying on work done by Elizabeth Mattijsen rakudo/0c62815

Tickets Resolved

The following tickets have been resolved as part of the grant:

Possibly more tickets were addressed by the IO Action Plan implementation, but they still need further review.

Bugs Fixed

  • Fixed a bug in IO::Path.resolve with combiners tucked on the path separator. Fix in rakudo/9d8e391f3b; tests in roast/92217f75ce. The bug was identified by Timo Paulssen while testing secure implementation of IO::Path.child

IO Bug Fixes in 2017.04.2 Done by Other Core Members:

  • Timo Paulssen fixed a bug with IO::Path types not being accepted by the "is native" NativeCall trait rakudo/99840804
  • Elizabeth Mattijsen fixed an issue in assignment to dynamics. This made it possible to temp the $*TMPDIR variable rakudo/1b9d53
  • Jonathan Worthington fixed a crash when slurping large files in binary mode with &slurp or IO::Path.slurp rakudo/d0924f1a2
  • Jonathan Worthington fixed a bug with binary slurp reading zero bytes when another thread is causing a lot of GC rakudo/756877e

Commits

So far, I've committed 192 IO grant commits to the rakudo/roast/doc repos.

Rakudo

69 IO grant commits:

  • c6fd736 Make IO::Spec::Win32.is-absolute about 63x faster
  • 7112a08 Add :D on invocant for file tests
  • 8bacad8 Implement IO::Path.sibling
  • a98b285 Remove IO::Path.child-secure
  • 0b5a41b Rename IO::Path.concat-with to .add
  • 9d8e391 Fix IO::Path.resolve with combiners; timotimo++
  • 1887114 Implement IO::Path.child-secure
  • b8458d3 Reword method child for cleaner code
  • 51e4629 Amend rules for last part in IO::Path.resolve
  • 9a2446c Move Bool return value to signature
  • 214198b Implement proper args for IO::Handle.lock
  • c95c4a7 Make IO::Path/IO::Special do IO role
  • fd503f8 Remove role IO and its .umask method
  • 0e36bb2 Make IO::Spec::Win32!canon-cat 2.3x faster
  • 40217ed Swap .child to .concat-with in all the guts
  • 490ffd1 Do not use self.Str in IO::Path errors
  • 1f689a9 Fix up IO::Handle.Str
  • c01ebea Make IO::Path.mkdir return invocant on success
  • d46e8df Add IO::Pipe .path and .IO methods
  • 6ee71c2 Coerce mode in IO::Path.mkdir to Int
  • 0d9ecae Remove multi-dir &mkdir
  • ff97083 Straighten up rename, move, and copy
  • 7f73f92 Make IO::Path.new-from-absolute-path private
  • da1dea2 Fix &symlink and &link
  • 8c09c84 Fix symlink and link routines
  • 90da80f Rework read methods in IO::Path/IO::Handle
  • c13480c IO::Path.slurp: make 12%-35% faster; propagate Failures
  • f1b4af7 Implement IO::Handle.slurp
  • 184d499 Make IO::Handle.Supply respect handle's mode
  • b6838ee Remove .f check in .z
  • 6a8d63d Implement :completely param in IO::Path.resolve
  • e681498 Make IO::Path throw when path contains NUL byte
  • b4358af Delete code for IO::Spec::Win32.catfile
  • 0a442ce Remove type constraint in IO::Spec::Cygwin.canonpath
  • 0c8bef5 Implement :parent in IO::Spec::Cygwin.canonpath
  • a0b82ed Make IO::Path::* actually instantiate a subclass
  • 67f06b2 Run S32-io/io-special.t test file
  • 954e69e Fix return value of IO::Special methods
  • a432b3d Remove IO::Path.abspath (part 2)
  • 94a6909 Clean up IO::Spec::Unix.abs2rel a bit
  • 966a7e3 Implement IO::Path.concat-with
  • 50aea2b Restore IO::Handle.IO
  • 15a25da Fix ambiguity in empty extension vs no extension
  • b1e7a01 Implement IO::Path.extension 2.0
  • b62d1a7 Give $*TMPDIR a container
  • 099512b Clean up and improve all spurt routines
  • cb27bce Clean up &open and IO::Path.open
  • 4c31903 Add S32-io/chdir-process.t to list of test files to run
  • 5464b82 Improve &*chdir
  • 2483d68 Fix regression in &chdir's failure mode
  • ca1acb7 Fix race in &indir(IO::Path …)
  • a0ef2ed Improve &chdir, &indir, and IO::Path.chdir
  • aa62cd5 Remove &tmpdir and &homedir
  • a5800a1 Implement IO::Handle.spurt
  • 36ad92a Remove 15 methods from IO::Handle
  • 87987c2 Remove role IO and its .umask method
  • 9d8d7b2 Log all changes to plan made during review period
  • 0c7e4a0 Do not capture args in .IO method
  • c360ac2 Fix smartmatch of Cool ~~ IO::Path
  • fa9aa47 Make R::I::SET_LINE_ENDING_ON_HANDLE 4.1x Faster
  • 0111f10 Make IO::Spec::Unix.catdir 3.9x Faster
  • 4fdebc9 Make IO::Spec::Unix.split 36x Faster
  • dcf1bb2 Make IO::Spec::Unix.rel2abs 35% faster
  • 55abc6d Improve IO::Path.child perf on *nix
  • 4032953 Make IO::Handle.open 75% faster
  • a01d679 Remove IO::Path.pipe
  • 212cc8a Remove IO::Path.Bridge
  • 76f7187 Do not cache IO::Path.e results
  • dd4dfb1 Fix crash in IO::Special .WHICH/.Str

Perl 6 Specification

47 IO grant commits:

  • 3b36d4d Test IO::Path.sibling
  • 7a063b5 Fudge .child-secure tests
  • 39677c4 IO::Path.concat-with got renamed to .add
  • 92217f7 Test IO::Path.child-secure with combiners
  • f3c5dae Test IO::Path.child-secure
  • a716962 Amend rules for last part in IO::Path.resolve
  • 4194755 Test IO::Handle.lock/.unlock
  • 64ff572 Cover IO::Path/IO::Pipe's .Str/.path/.IO
  • 637500d Spec IO::Pipe.path/.IO returns IO::Path type object
  • 8fa49e1 Test link routines
  • 416b746 Test symlink routines
  • d4353b6 Rewrite .l on broken symlinks test
  • 7e4a2ae Swap .slurp-rest to .slurp
  • a4c53b0 Use bin IO::Handle to test its .Supply
  • feecaf0 Expand file tests
  • a809f0f Expand IO::Path.resolve tests
  • ee7f05b Move is-path sub to top so it can be reused
  • b16fbd3 Add tests to check nul byte is rejected
  • 8f73ad8 Change \0 roundtrip test to \t roundtrip test
  • 7c7fbb4 Cover :parent arg in IO::Spec::Cygwin.canonpath
  • 896033a Cover IO::Spec::QNX.canonpath
  • c3c51ed Cover IO::Spec::Win32.basename
  • d8707e7 Cover IO::Spec::Unix.basename
  • bd8d167 Test IO::Path::* instantiate a subclass
  • 43ec543 Cover methods of IO::Special
  • e5dc376 Expand IO::Path.accessed tests
  • 0e47f25 Test IO::Path.concat-with
  • 305f206 Test empty-string extensions in IO::Path.extension
  • 2f09f18 Fix incorrect test
  • b23e53e Test IO::Path.extension
  • 1d4e881 Test $*TMPDIR can be temped
  • 79ff022 Expand &spurt and IO::Path.spurt tests
  • ba3e7be Merge S32-io/path.t and S32-io/io-path.t
  • 3c4e81b Test IO::Path.Str works as advertised
  • 86c5f9c Delete qp{} tests
  • 430ab89 Test &*chdir
  • 86f79ce Expand &chdir tests
  • 73a5448 Remove two fudged &chdir tests
  • 04333b3 Test &indir fails with non-existent paths by default
  • bd46836 Amend &indir race tests
  • f48198f Test &indir
  • 5a7a365 Expand IO::Spec::*.tmpdir tests
  • 14b6844 Use Numeric instead of IO role in dispatch test
  • 8d6ca7a Cover IO::Path.ACCEPTS
  • 091931a Expand &open tests
  • 465795c Test IO::Path.lines(*) does not crash
  • 63370fe Test IO::Special .WHICH/.Str do not crash

Documentation

76 IO grant commits:

  • 61cb776 Document IO::Path.sibling
  • b9c9117 Toss IO::Path.child-secure
  • 6ca67e4 Start sketching out Definitive IO Guide™
  • 81a5806 Amend IO::Path.resolve: :completely
  • c5524ef Rename IO::Path.concat-with to .add
  • 3145979 Document IO::Path.child-secure
  • 160c6a2 Document IO::Handle.lock/.unlock
  • 53f2b99 Document role IO's new purpose
  • 2aaf12a Document IO::Handle.Str
  • bd4fa68 Document IO::Handle/IO::Pipe.IO
  • fd8a5ed Document IO::Pipe.path
  • 8d95371 Expand IO::Handle/IO::Pipe.path docs
  • 60b9227 Change return value for mkdir
  • 47b0526 Explicitly spell out caveats of IO::Path.Str
  • 923ea05 Straighten up mkdir docs
  • aeeec94 Straighten up copy, move, rename
  • fff866f Fix docs for symlink/link routines
  • f83f78c Use idiomatic Perl 6 in example
  • e60da5c List utf-* alias examples too since they're common
  • 0f49bb5 List Rakudo-supported encodings in open()
  • 017acd4 Improve docs for IO::Path.slurp
  • 56b50fe Document IO::Handle.slurp
  • 2aa3c9f Document new behaviour of IO::Handle.Supply
  • a30fae6 Avoid potential confusion with use of word "object"
  • 372545c Straighten up file test docs
  • 1527d32 Document :completely arg to IO::Path.resolve
  • 4090446 Improve chmod docs
  • d436f3c Document IO::Spec::* don't do any validation
  • 69b2082 Document IO::Path.chdir
  • b9de84f Remove DateTime tutorial from IO::Path docs
  • 45e84ad Move IO::Path.path to attributes
  • 0ca2295 Reword/expand IO::Path intro prose
  • dbdc995 Document IO::Spec::*.catpath
  • 50e5565 Document IO::Spec::*.catdir and .catfile
  • 28b6283 Document IO::Spec::*.canonpath
  • a1cb80b Document IO::Spec::Win32.basename
  • 5c1d3b6 Document IO::Spec::Unix.basename
  • 9102b51 Fix up IO::Path.basename
  • e9b6809 Document IO::Path::* subclasses
  • a43ecb9 Document IO::Path's $.SPEC and $.CWD
  • 7afd9c4 Remove unrelated related classes
  • 6bd0f98 Dissuade readers from using IO::Spec*
  • 184342c Document IO::Special.what
  • 56256d0 Minor formatting improvements in IO::Special
  • 1cd7de0 Fix up type graph
  • b3a9324 Expand/fix up IO::Path.accessed
  • 1973010 Document IO::Path.ACCEPTS
  • cc62dd2 Kill IO::Path.abspath
  • 1f75ddc Document IO::Spec*.abs2rel
  • 24a6ea9 Toss all of the TODO methods in IO::Spec*
  • 65cc372 Document IO::Path.concat-with
  • b9e692e Document new IO::Path.extension
  • d5abceb Write docs for all spurt routines
  • b8fba97 Point out my $*CWD = chdir … is an error
  • b78d4fd Include type names in links to methods
  • bdd18f1 Fix desc of IO::Path.Str
  • a53015a Clarify value of IO::Path.path
  • 5aa614f Improve suggestion for Perl 5's opendir
  • bf377c7 Document &indir
  • e5225be Fix URL to &*chdir
  • e1a299c Reword "defined as" for &*chdir
  • 3fdc6dc Document &*chdir
  • 1d0e433 Document &chdir
  • d050d4b Remove IO::Path.chdir prose
  • 839a6b3 Expand docs for $*HOME and $*TMPDIR
  • db36655 Remove tip to use $*SPEC to detect OS
  • 0511e07 Document IO::Spec::*.tmpdir
  • cc6539b Remove 8 methods from IO::Handle
  • 335a98d Remove mention of role IO
  • cc496eb Remove mention of IO.umask
  • 3cf943d Expand IO::Path.relative
  • ccae74a Fix incorrect information for IO::Path.absolute
  • d02ae7d Remove IO::Handle.rw and .rwx
  • 69d32da Remove IO::Handle.z
  • 110efb4 No need for .ends-with
  • fd7a41b Improve code example

brian d foy: Can anyone help update Rakudo in Chocolatey?


I need some help updating the Rakudo package on Chocolatey. This package manager for Windows is easy to use from AppVeyor and would save me the shenanigans I went through to install Rakudo there.

Does anyone have some pull that could help move this process along? I've run out of steps in their package triage process this month. I could try to simply make a different distribution, but I'd rather fix up the existing one. And, once fixed up, I'd like to distribute access so other people can step up. I'd be completely happy not to maintain this package and have Jake keep doing it, as long as the most recent Rakudo is available. Or, if someone else would like to take on this responsibility, that's good too.


On its downloads page, Rakudo Star has a Microsoft Installer. That's what I use in my AppVeyor config. But I have to download it with PowerShell and jump through various hoops to install it, and it's a bit fragile. Managing it through Chocolatey would be not only easier, but also supported by AppVeyor.

Jake Russo did a great job getting the first package onto Chocolatey, and he merged my pull request to update the files to 2017.01, but 2017.04 needs another pull request. He merged my changes, but the next step is to get those onto Chocolatey itself; although my changes are in the GitHub repo, I haven't heard from Jake since.

The Chocolatey people have a process to take over unmaintained packages. This involves contacting the site admins through their webform, which I've done several times without response, and posting a note in the Chocolatey Google Groups. If anyone has experience with this or knows the people involved, I could use help nudging this along.

NeilB: Specifying the type of your CPAN dependencies

This is the third article in a series on CPAN distribution metadata. The first article was a general introduction, and the second looked at dependencies, in particular the different phases that you can specify dependencies for (configure, build, runtime, test, and develop). In this article, we'll cover the different types of dependencies and how you combine these with the phases (described in the previous article) to specify the dependencies (or prereqs) for your CPAN distribution.

This article is brought to you by MaxMind, a gold Sponsor for this year's Toolchain Summit, being held next month (May) in Lyon, France. The summit is only possible with the support of our sponsors.

Dependency Types

Most of the time, dependencies for CPAN distributions are required. This means that your distribution won't be installed unless all required dependencies can first be installed, as it's assumed that your module(s) won't work if one of the dependencies is missing.

But sometimes you can end up with optional dependencies. This might be to provide an optional feature, such as support for a number of export formats, or it might be an optional XS implementation for extra speed, falling back on a pure-perl implementation.

There are four different dependency types: requires, recommends, suggests, and conflicts. We'll look at each of them in turn. In the discussions below, we'll refer to the target distribution: the distribution being installed, and whose prereqs are being checked. To keep the examples below simple, we'll assume that the target distribution has a Makefile.PL based on ExtUtils::MakeMaker.

You should also be aware that the spec refers to these as dependency relationships, rather than types.

Requires

Required dependencies must be installed before the associated phase can be run for the target distribution. So if the target distribution has a required configure module, then that module must be installed before running

perl Makefile.PL

And if the required configure dependency can't be installed, then the installation of the target distribution must fail at that point.

Our example target module has a required configure dependency on ExtUtils::MakeMaker, so you'll see the following in META.json:

"prereqs" : {
    "configure" : {
        "requires" : {
            "ExtUtils::MakeMaker" : "6.3"
        }
    },
    ...
}

The great bulk of dependencies on CPAN have type "requires", so we'll not dwell on these any further.

Recommends

The spec says:

Recommended dependencies are strongly encouraged and should be satisfied except in resource constrained environments.

The interpretation of this is left up to individual CPAN clients:

  • For the CPAN module, and the cpan script front-end, it depends on whether you've set the configuration option "recommends_policy". If it's not set, or set to a false value, then any recommended prereqs are just ignored. If set to a true value, then CPAN will treat any recommended prereqs like requires: if they can't be installed, then the target distribution won't be installed. The default used to be for recommends_policy to not be set, which is the same as a false value. In recent versions of CPAN it now defaults to true, but only if you're installing CPAN for the first time. If you've already got CPAN installed, updating to a more recent version won't change this config setting.
  • By default, cpanm will similarly ignore any recommended prereqs, unless you give the --with-recommends command-line option. If you do give the option, cpanm will try to install recommended prereqs, but if they fail to install, installation of the target dist will continue. (The commands just after this list show how to enable both behaviours.)
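
For example, here is a quick sketch of how you could turn these options on, using the config option and command-line flag mentioned above (JSON::MaybeXS, discussed later in this article, is just an example target):

# from within the interactive CPAN shell (cpan):
o conf recommends_policy 1
o conf commit

# with cpanm, for a single installation:
cpanm --with-recommends JSON::MaybeXS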

So by default, recommended prerequisites are ignored. More on that below.

Let's have a look at some distributions with recommended dependencies.

JSON::MaybeXS is a Perl module which will use the best available module for processing JSON. First it will use Cpanel::JSON::XS if that's available, then it will try JSON::XS, and if neither of those are available, it will fall back on JSON::PP, a pure-perl implementation that has been a core module since Perl 5.14.0.
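
In code terms, the selection order described above amounts to something like the following; this is a simplified sketch, not JSON::MaybeXS's actual source:

# try the best backends first, falling back to the core JSON::PP
my $backend =
       eval { require Cpanel::JSON::XS; 'Cpanel::JSON::XS' }
    // eval { require JSON::XS;         'JSON::XS' }
    // do   { require JSON::PP;         'JSON::PP' };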

If you look at the runtime prereqs in META.json, you'll see:

"runtime" : {
   "recommends" : {
      "Cpanel::JSON::XS" : "2.3310"
   },
   "requires" : {
      ...
      "JSON::PP" : "2.27300",
      ...
   }
},

Cpanel::JSON::XS is a recommended, and thus optional, dependency, and JSON::PP is a hard requirement. JSON::XS isn't listed as any kind of dependency: Cpanel::JSON::XS is a fork of JSON::XS, so if Cpanel::JSON::XS can't be installed, it's unlikely a CPAN client will be able to install JSON::XS either (but if JSON::XS is already installed, then it will be used in preference to JSON::PP).

But this isn't the whole picture! If you look at META.json, you'll see

"dynamic_config": 1,

As we covered in the first article of this series, this tells the CPAN client that dependencies need to be generated on the target platform, so it must first run:

perl Makefile.PL

And then use the metadata generated in MYMETA.json, rather than the META.json included in the release. And if you look at Makefile.PL, you'll see that it checks to see whether the target platform can handle XS modules. If it can, and JSON::XS isn't already installed, then Cpanel::JSON::XS is changed to be a required dependency.
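
As an illustration of that pattern, here is a minimal sketch of such a Makefile.PL. It is not JSON::MaybeXS's actual code; the distribution name and the can_build_xs() helper are made up for the example:

use strict;
use warnings;
use ExtUtils::MakeMaker;

my %WriteMakefileArgs = (
    NAME       => 'My::JSON::Wrapper',            # hypothetical distribution
    VERSION    => '0.01',
    PREREQ_PM  => { 'JSON::PP' => '2.27300' },    # hard runtime requirement
    META_MERGE => {
        'meta-spec' => { version => 2 },
        prereqs     => {
            runtime => {
                recommends => { 'Cpanel::JSON::XS' => '2.3310' },
            },
        },
    },
);

# If this platform can build XS modules and JSON::XS isn't already
# installed, promote the recommendation to a hard requirement. The
# result ends up in MYMETA.*, which is why dynamic_config matters.
if ( can_build_xs() and not eval { require JSON::XS; 1 } ) {
    $WriteMakefileArgs{PREREQ_PM}{'Cpanel::JSON::XS'} = '2.3310';
}

WriteMakefile(%WriteMakefileArgs);

sub can_build_xs {
    # Illustrative probe only; real distributions often use a module
    # such as ExtUtils::HasCompiler for this check.
    return eval {
        require ExtUtils::HasCompiler;
        ExtUtils::HasCompiler::can_compile_loadable_object();
    };
}

On a machine without a working compiler, the generated MYMETA.* files still only recommend Cpanel::JSON::XS, which is exactly the behaviour described above.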

David Golden's CPAN::Visitor is an implementation of the visitor pattern, for iterating over a CPAN repository. In its test prereqs, we see:

"recommends" : {
    "CPAN::Meta" : "2.120900"
},

This is an optional requirement for the test t/00-report-prereqs.t, which will still run without CPAN::Meta, but won't do everything. This test is generated by the Test::ReportPrereqs plugin, which is included in DAGOLDEN's Dist::Zilla plugin bundle. When you run "make test", this test runs first and lists the distribution's prereqs in its output.

Suggests

The spec says:

These dependencies are optional, but are suggested for enhanced operation of the described distribution.

Again, the interpretation is left up to CPAN clients, with the two main clients having the same behaviour as for recommends:

  • By default, CPAN/cpan won't try and install these dependencies, but you can change that behaviour with the "suggests_policy" configuration option.
  • cpanm similarly won't try and install these dependencies, unless you give the --with-suggests command-line option.

Let's look at some suggested dependencies.

The Time::Stamp module provides functions for generating and parsing timestamps. If you have Time::HiRes installed, then you'll get fractional seconds in timestamps, but integer seconds otherwise. So Time::HiRes is specified as a suggested runtime prerequisite. It's been a core module since Perl 5.8.0, so for most people the suggested dependency is irrelevant (though an appropriate use of "suggests").

Module::Pluggable provides a mechanism for a module to easily support plugins. This works with App::FatPacker (a module which lets you roll up all the module dependencies for a script into a "packed" version of the script), and has a test that confirms this. So you can see in the metadata that App::FatPacker is a suggested test dependency.
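
If you are declaring prereqs yourself, the cpanfile format makes these phase/type combinations explicit. A minimal sketch follows; the module names and versions are purely illustrative rather than taken from the distributions above:

# cpanfile: each entry combines a phase (previous article) with a type
requires   'JSON::PP',         '2.27300';   # runtime, required
recommends 'Cpanel::JSON::XS', '2.3310';    # runtime, recommended

on 'test' => sub {
    requires   'Test::More', '0.96';
    recommends 'CPAN::Meta', '2.120900';
};

on 'develop' => sub {
    suggests 'Dist::Zilla';                  # only of interest to the author
};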

Conflicts

This one is different. All the previous relationships are for expressing a module that will, or may, be used by the target distribution. A positive assertion, if you like. But a conflicts prerequisite says that the module must not be present. Here's what the spec says:

These libraries cannot be installed when the phase is in operation. This is a very rare situation, and the conflicts relationship should be used with great caution, or not at all.

With the regular types of prerequisites, it's common to see no specific version (a "0" given instead of a version). But with conflicts, you will almost always see a version. It is generally used to identify versions of a dependency that had a critical bug.

A good example is DBI's metadata, where you'll see the following:

"runtime" : {
    "conflicts" : {
        "DBD::Amazon" : "0.10",
        "DBD::AnyData" : "0.110",
        "DBD::CSV" : "0.36",
        "DBD::Google" : "0.51",
        "DBD::PO" : "2.10",
        "DBD::RAM" : "0.072",
        "SQL::Statement" : "1.33"
    },
}

Normally when you see a version number in prereqs, it means "this version or later", but here it means "not this version or any earlier one".

The spec doesn't mention it, but you can also use expressions, rather than just a simple version number. For example, in META.json for Catmandu-PICA you'll see:

"runtime" : {
    "conflicts" : {
        "Catmandu::SRU" : "< 0.032"
    },
    ...
}

This says that versions of Catmandu::SRU earlier than 0.032 conflict, and that 0.032 and later are OK.

And in MooX-ClassAttribute's META.json you'll see:

"runtime" : {
    "conflicts" : {
        "Moo" : "== 1.001000",
        "MooseX::ClassAttribute" : "<= 0.26"
    }
},

This identifies a conflict with one specific release of Moo, and with all releases of MooseX::ClassAttribute up to and including 0.26.

There are a number of problems with the "conflicts" type, though:

  • In a prereqs statement, a version number normally means "this version or later", but here it's assumed to mean "this version or earlier".
  • When DBI is installed, the user might not have DBD::Google installed. If the user later installs DBD::Google 0.51, will the CPAN client let the user know about the conflict? I'm pretty sure the answer is currently "no".
  • I don't know if all CPAN clients handle expressions instead of version numbers.

The "conflicts" mechanism is now seen as an experiment that didn't work out. It was discussed at the Lancaster QA Hackathon, and again at the Berlin QA Hackathon. The currently proposed solution is an x_breaks key in distribution metadata. No CPAN client implements this as yet, though. No CPAN client implements this yet, but the [Test::CheckBreaks] plugin for Dist::Zilla lets a distribution enforce it, and the Moose distribution includes a moose-outdated script that checks for conflicts.

Conclusion

At the time of writing, there are just over 35,450 distributions on CPAN. Of those, only 2,994 (8.4%) have a recommended dependency, and just 436 (1.2%) have a suggested dependency. And a mere 24 distributions (0.07%) have a "conflicts" dependency.

Given the definition of "recommends" in CPAN::Meta::Spec ("strongly encouraged and should be satisfied except in resource constrained environments"), I think the behaviour of the two most popular CPAN clients (CPAN and cpanm) could be improved. A better default behaviour would be:

  • Try to install recommended dependencies, but don't abort the phase if they can't be installed.
  • Don't try to install suggested dependencies, unless the user has explicitly requested that.

I wondered whether there is some reason for the current behaviour, or whether it is just that, when this was spec'd out, no-one was quite sure exactly how these types would be used. I asked Tatsuhiko Miyagawa, the author of cpanm, and he said that circular dependencies were a problem, which he solved with this behaviour.

One approach to avoiding circular dependencies is to install recommended dependencies after the target distribution has been installed. This was also discussed at the Berlin QA Hackathon.

Writing this article has been a real journey. I discovered things I didn't know about, or didn't fully understand. And in some cases where I thought I had found out the full picture, it turned out I hadn't! And no doubt still haven't. I'll try to fill some of my remaining gaps at the Toolchain Summit, and possibly submit some pull requests for documentation updates!

My thanks to everyone who helped with this article, particularly David Golden, Andreas König, Tatsuhiko Miyagawa, and Karen Etheridge.

Specifying dependencies for a CPAN distribution

In the next article in this series, we'll look at the different ways you can specify dependencies for a distribution, depending on which builder you're using.

About MaxMind

Founded in 2002, MaxMind is an industry-leading provider of IP intelligence and online fraud detection services. Thousands of companies of all sizes rely on MaxMind's GeoIP geolocation and end-user data for fraud prevention, ad targeting, content customization, digital rights management, and more. MaxMind's minFraud Network leverages this data along with IP and email reputations established from over 100 million transactions per month to help merchants prevent fraudulent online transactions and account registrations.

A number of CPAN authors work at MaxMind, including Olaf Alders (OALDERS), Florian Ragwitz (FLORA), Mark Fowler (MARKF), Chris Weyl (RSRCHBOY), TJ Mather (TJMATHER), Mateu Hunter (MATEU), Greg Oschwald (OSCHWALD), Andy Jack (ANDYJACK), Pat Cronin (PCRONIN), and Ruben Navarro (RUBEN).

Olaf Alders, who was the founder of the MetaCPAN project, will be attending the summit.

Our thanks again to MaxMind for supporting the Toolchain Summit.

Perl Foundation News: Perl 6 Performance and Reliability Engineering: Grant Report

This is a grant report by Jonathan Worthington on his grant under the Perl 6 Core Development Fund. We thank the TPF sponsors for making this grant possible.

I have completed the second 200 hours of my Perl 6 performance and reliability engineering grant, funded by the Perl 6 core development fund. This report summarizes the work done during those 200 hours. In accordance with community feedback, the vast majority of effort has been put into reliability rather than performance.

Concurrency robustness

The main area of focus in this grant period has been making Perl 6's concurrency support more robust. While work remains to be done, the improvement over the last several months has been noticeable. It is also an important area for me to focus on, given the small number of people in the community with the skills, time, and patience (or, perhaps, stubbornness) to track down and resolve these problems. Here is a summary of the issues resolved.

  • Fixed a bug affecting use of callwith in multiple threads
  • Fixed RT #128809 (closure-related bug involving s/// construct, which showed up in concurrent scenarios)
  • Fixed RT #129213 (synchronous socket accept could block GC in other threads, thus blocking program progress)
  • Determined RT #128694 fixed, added test (zip-latest on two intervals would hang)
  • Eliminated use of in-place rope flattening, which violated the immutability of strings and could thus cause various crashes (especially in hash access); this resolved many failures, amongst them the one reported in RT #129781, and also made hash lookups using rope string keys more efficient as a bonus
  • Fixed warnings due to over-sharing of $/ between threads when using grammars in parallel (mostly fixed RT #128833)
  • Fixed a Lock.protect bug when the stack was unwound due to control exceptions (discovered independently of RT #126774, which it also resolved)
  • Fixed RT #129949 (GC crash resulting from missing rooting of sent value in concurrent blocking queue)
  • Fixed RT #129834 (sporadic behavior when concurrently creating Proc::Async objects and obtaining handles)
  • Audited and fixed vulnerable cases of the once construct
  • Fixed RT #129994 (long-lived native call on one thread could block GC in other threads)
  • Fixed RT #125782 (uninformative error reporting when a Promise is broken)
  • Examined RT #127960; concluded it is fixed, but added it as a stress test since it's a bit distinct from the other test for the same underlying bug
  • Fixed a bug where method caches could be revealed to other threads before they were fully deserialized, causing falsely missed lookups
  • Fixed a data race inside of the NativeCall setup code
  • Fixed RT #130064 (trying to rethrow an exception that was never thrown before leaked an internal error; this showed up in Promise.break("foo"))
  • Fixed scoping/cloning problem with LAST/NEXT/QUIT phasers in supply, react, whenever, and for constructs
  • Fixed a bug with QUIT phasers mishandling exceptions thrown synchronously with the .tap
  • Switched to using Supplier::Preserving on the taps of stdout/stderr in Proc::Async, to avoid various innocent-looking usage patterns losing output
  • Fixed RT #128991 (messages could seep out of a supply block even after it was considered done)
  • Fixed a GC corruption bug involving Proc::Async that caused occasional crashes
  • Tracked down and fixed two data races in the supply/whenever implementation and in Supply.interval
  • Fixed RT #130168 (Supply.interval(...) with very small interval would only ever emit 1 value)
  • Fixed interaction of native callbacks, GC blocking, and embedding, which afflicted Inline::Perl6
  • Fixed use-after-free that could occur as part of inlining fixups when in a multi-threaded program
  • Fixed precompilation of the OO::Monitors module
  • Fixed RT #130266 (premature frees of handles in various async socket error handling cases)
  • Fixed SEGVs when GC stressing was applied to S15-nfg/many-threads.t and S17-supply/syntax.t
  • Fixed incorrect reporting of some errors on threads, which could show up as if they were compile-time errors
  • Fixed thread safety issue in the >>.foo implementation
  • Fixed a miscompilation of ||=, &&=, and //=, making them a good bit more efficient along the way
  • Added various bits of missing concurrency control in precompilation management, thus fixing parallel use of precompilation (this will help towards faster p6doc builds)

String decoding improvements in asynchronous I/O

Previously, decoding of bytes to strings for both IO::Socket::Async and Proc::Async was done at the VM level. This created a number of fragilities with regard to decoding errors. Due to time constraints, different encodings besides UTF-8 had not been implemented for these classes either, leaving users of them to do decoding manually if they needed anything else.

To rectify these issues, I first made the VM-backed decoders directly available to userland. These will, in the future, be exposed as a Perl 6-level API, and we'll support user-space encodings. For now, it meant I could move the code that orchestrates the decoding of strings in async I/O into Perl 6 space, fixing the robustness issues. This also means that string decoding for different spawned processes and socket connections can be done in the thread pool, rather than using the event-processing thread. Along the way, I added support for different encodings.

Finally, there were some issues around the way async sockets and processes worked with regard to NFG. I resolved these issues and made sure there was test coverage of the various edge cases.

Non-blocking await and react support

I did the initial round of work to provide support for non-blocking await and react. At present, these constructs will block a real OS thread, even if used on a thread in the thread pool. The changes, available via use v6.d.PREVIEW, mean that thread-pool threads will be returned to the thread pool to do other work, and the code following the await and react will be scheduled once the result is available or processing is complete. This is implemented using continuations (much like gather/take, except in this case the continuation may be resumed on a different OS thread). The result is that Perl 6 programs will be able to have hundreds or thousands of outstanding reacts and awaits, with just a handful of real OS threads required to process them.

This is just the initial implementation; further work will be required to make this feature ready to be the default in Perl 6.d.
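
As a small illustration, here is a minimal sketch in Perl 6 using the v6.d.PREVIEW pragma described above (the numbers are arbitrary):

use v6.d.PREVIEW;

# 500 outstanding awaits, serviced by a handful of thread-pool threads
my @promises = do for ^500 -> $n {
    start {
        await Promise.in(1);    # the pool thread is released while waiting
        $n * 2;
    }
}

my @results = await @promises;  # the main (non-pool) thread blocks as before
say @results.elems;             # 500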

Memory leak fixes and memory use improvements

The highlight of the memory management improvements was a simplification to the lifetime management of register working sets in MoarVM. This resulted from the elimination of a couple of speculative features that were not yet being utilized by Rakudo, and in one case never would have been anyway. Coupled with a range of cleanups and some code streamlining, the result was a 10% reduction in peak memory use for CORE.setting compilation, and 20% off the compilation runtime. I also:

  • Fixed a bug that caused bogus multi-dispatch cache misses for calls with many named arguments, leading to the cache growing indefinitely with duplicate entries
  • Fixed a regex interpolation memory leak; it boiled down to unclaimed entries left behind in the serialization context weakhash
  • Fixed leaks of asynchronous task handles
  • Fixed a leak in decode stream cleanup
  • Improved memory allocation measurement in I/O, meaning that full GC collection decisions are made more accurately in I/O-heavy programs
  • Fixed a memory leak involving Proc::Async
  • Fixed a memory leak when a synchronous socket failed to connect
  • Tracked down and resolved the last remaining leaks that showed up in perl6-valgrind-m -e '', meaning it is now clean. (Previously, some cleanup was missed at VM shutdown)

Unicode-related work

I did a number of Unicode improvements, as well as discussing with and reviewing pull requests from a new contributor who is now doing a bunch of Unicode work for Perl 6. My own code contributions were:

  • Initial support for Unicode 9 (updating the character database, basic NFG tweaks)
  • A rewrite of the UTF8-C8 encoding to eliminate various bugs (some falling under RT #128184), including a buffer overrun and not properly round-tripping valid but non-NFC input

Other assorted bugs

I also took care of a range of other bugs, which don't fit into any of the previously mentioned areas of work.

  • Fixed RT #128703 (1 R, 2 R, 3 lost values)
  • Fixed RT #129088 (lack of backtraces for sprintf and friends)
  • Fixed RT #129249 (mis-compilation of /$<cat>=@(...)/)
  • Fixed RT #129306 (error reporting bug involving sub-signatures)
  • Partially fixed and tested RT #129278 (native attributive parameter binding broken) and noted on the RT the less common cases that remain to be fixed
  • Fixed RT #129430 (sigilless parameters were declared too late to use in where clauses)
  • Fixed RT #129827 (sub { 42.return }() ended up being code-gen'd without a return handler)
  • Fixed RT #129772 (poor error reporting when you tried to invoke a native parameter; it blew up code-gen and gave no location info)
  • Tracked down and fixed a pre-comp management bug on Windows due to a file not being closed and then trying to rename over it
  • Fixed RT #129968 (error-reporting crash in the case of redeclaration errors in nested packages)
  • Fixed a bug with augmenting nested classes
  • Fixed RT #129921 (internal warning when producing exception for my $!a)
  • Hunted down a bug in the JIT-compilation of the nqp::exception() op and fixed it
  • Fixed RT #130107 (accidentally treated nread == 0 as an error in a couple of places in MoarVM)
  • Fixed RT #130081 (did not backtrack into a regex TOP in a grammar to try and make it match until the end of the string)
  • Fixed RT #130294 (SEGV that occasionally occurred during some cases of deep recursion)
  • Fixed RT #128516 (SEGV when composing meta-object held in an attribute)
  • Fixed RT #130465 (ignoremark not applied with backslashed literals)
  • Fixed RT #130208 (putting multi-line Pod documentation on a role would pun it)
  • Fixed RT #130615 (code-gen of $a++ in sink context for native $a was a lot slower than in non-sink context)
  • Fixed RT #130637 (two grammar constructs produced malformed NFAs, which gave wrong results or could even SEGV in MoarVM; MoarVM was made to validate NFAs more strongly, which shook out the second issue besides the reported one)
  • Investigated RT #129291 (SEGV involving processes with output of one fed into input of the other); applied likely fix
  • Investigated and fixed an issue with $/ setting, arising from changes to the implementation of match
  • Fixed a bug that occasionally caused spectest crashes; was an interaction between dynamic lexical caching, inlining, deoptimization, and garbage collection
  • Fixed a rare crash related to assignment when the type constraint was a refinement type
  • Fixed MoarVM #120 and #426, in which a failed debug annotation lookup led to boxing a NULL string
  • Fixed a couple of places where dynamic optimization could accidentally trigger garbage collection; the optimizer assumes this can never happen
  • Fixed RT #123989 and RT #125135 (callsame et al could sometimes have the dispatcher stolen by the wrong target invokee)

Other tasks

On top of this, some time was spent reviewing pull requests to Rakudo, NQP, and MoarVM, providing feedback, and merging them when appropriate. I also commented on a range of RT tickets besides those I fixed myself. Various other small cleanups and additions resulted from this work, ranging from typo fixes in comments up to improvements to GC debugging macros added while finding bugs.

Perl Foundation News: Final Grant Report : Migrating blogs.perl.org - April 2017

Work on the blogs.perl.org grant, started in November 2015, has stalled. With no progress reports from the grantee since November 2016, and after a number of attempts on all sides to jumpstart the work, the Grants Committee has voted to cancel the grant, as provided in the rules of operation.

Many on the Committee and in the community would like to see a successful update of blogs.perl.org. With that in mind, the Grants Committee encourages interested parties to consider applying for blogs.perl.org improvement grants in upcoming rounds. The next round of decisions will happen in May. See How to write a proposal for tips, and feel free to reach out to the committee via tpf-grants-secretary at perl-foundation.org.

MAJ

Gabor Szabo: Programming language popularity - Stack Overflow

There are many ways to compare the popularity of programming languages. None of them is perfect. Let's look at one of those imperfect comparisons: tags on Stack Overflow.

For the full article visit Programming language popularity - Stack Overflow

Perl Foundation News: Grant Extension Request: Maintaining the Perl 5

Tony Cook has requested an extension of $20,000 for his Maintaining the Perl 5 grant. This will allow him to dedicate another 400 hours to this work. During this grant he sent regular reports to the p5p mailing list as well as providing monthly summary reports that have been published on this site, the most recent of which are linked below:

Before we make a decision on this extension, we would like to have a period of community consultation. Please leave feedback in the comments field below or, if you prefer, send email with your comments to makoto at perlfoundation.org. We request that all feedback be sent by April 25th.

If successful, this extension will be funded from the Perl 5 Core Maintenance Fund.

Note: The request was received in February and TPF's internal process took time to post this. Apologies.

Aristotle: Alan Kay’s critique of the TPF grants program

Alan Kay:

A few principles:

  1. Visions not goals
  2. Fund people not projects — the scientists find the problems not the funders. So, for many reasons, you have to have the best researchers.
  3. Problem Finding — not just Problem Solving
  4. Milestones not deadlines
  5. It’s “baseball” not “golf” — batting .350 is very good in a high aspiration high risk area. Not getting a hit is not failure but the overhead for getting hits. (As in baseball, an “error” is failing to pull off something that is technically feasible.)
  6. […]
  7. […]
  8. An important part of the research results are researchers. This extends the “baseball” idea to human development. The grad schools, especially, generally admitted people who “seemed interesting” and judgements weren’t made until a few years down the road. Many of the researchers who ultimately solved most of the many problems of personal computing and networking were created by the ARPA community.

(This is from an answer about what set Xerox PARC apart and allowed it to originate so many inventions that define what computing is today (e.g. Ethernet, laser printers, the mouse, GUIs, WYSIWYG). The entirety of the answer is good, and quoting selectively does not do it justice (as brief as it is), but I’m doing so here because it struck me just how precisely that list constitutes the opposite of our grants program. Food for thought.)

NeilB: The Perl Toolchain Summit Project List

The Perl Toolchain Summit (PTS) is the annual event where we assemble the people who are actively working on the Perl toolchain, and give them 4 days to work together. In this blog post, we'll look at how we decide what everyone will work on, and give you a chance to make suggestions.

This blog post is brought to you by Perl Jobs by Perl Careers, which, as well as helping Perl programmers find jobs, supports a number of community events, including the QA Hackathon last year.

Background

In one of Philippe's blog posts about the summit, he described how the attendees are selected: we pick the people who have been working on the most important toolchain projects over the last year or so. We start off with the core tools, and ask their developers who they've been working with. The focus is on tools that support Perl in Production.

As new tools emerge and become popular / important, the basis of selection evolves. Furthermore, people move on to new projects, or stop contributing to open source, and others step into their shoes. So over time the attendee list gradually changes. I think there's only one person who has been to every QA Hackathon / Toolchain Summit: H. Merijn Brand ("Tux"), who you may also know as the maintainer of Text::CSV_XS.

One of the benefits of the summit is having four days surrounded by others who have similar interests and motivations, with related and sometimes overlapping experience. As well as coding, this makes discussions one of the big values of attending: the chance to try and solve tricky problems that cross multiple systems, and to bounce around ideas for moving things forward.

The Project List

Everyone invited to the summit is expected to turn up with a pretty full backlog of things that need working on. You can see who's coming this year on the attendees page.

To facilitate preparation and collaboration, and to avoid any surprises, we set up a project list. This is on the wiki that is part of the summit's web site.

All attendees are asked to list the projects they would like to work on at the summit. Hopefully this has a number of benefits:

  • Other attendees may be interested in helping on one of your projects, perhaps having relevant experience, or even having worked on it already. This can lead to discussions before the summit, so people arrive raring to go. Over time you'll see multiple names getting added to some projects.
  • Seeing other people's ideas can remind you of things that are on your mental "must get around to" list.
  • Triggering "oh, while you're doing that, maybe you could also do this" type comments.
  • There are certain areas of the toolchain where multiple ideas get listed. As a result we can get everyone together for a chat at the start of the summit, to minimise people tripping over each other, ensure things are done in a sensible order, and so on.

Items on the list aren't necessarily coding-related tasks, they might be discussions, or even a presentation.

What do you think?

The project list is public, so the entire community can have a look and suggest new ideas. Are there specific things you'd like to see addressed in the toolchain?

  • Bugs in specific tools that bite you on a regular basis?
  • Features you'd like to see added to cpanm, MetaCPAN.
  • New tools you think we'd all benefit from.
  • Good ideas we could borrow from other programming language communities.

Take a look at the list, and let us know if you'd like to suggest something for it. You can add comments on this blog post, or on reddit. Or you can email one of the organisers: neilb at cpan dot org, or book at cpan dot org.

Obviously there's no guarantee that your idea, bug, or request will actually get worked on at the summit, but throwing it into the mix can't hurt, and may prompt others to chip in.

About Perl Careers

Perl Careers is a Perl-focused recruitment consultancy, run by Peter Sergeant, a CPAN contributor (PAUSE id SARGIE) with a recruitment background, rather than by a non-technical person. Peter works with clients and candidates in London, the US, and Australia. Peter is not only sponsoring the Toolchain Summit, he's attending it as well, to work on some of his Test modules.

Sawyer X: Perl 5 Porters Mailing List Summary: April 11th-16th

Hey everyone,

Following is the p5p (Perl 5 Porters) mailing list summary for the past week.

Enjoy!

April 11th-16th

News

  • Dave Mitchell published TPF Grant 2 reports #170 and #171.

Issues

New issues

Resolved issues

  • Perl #130801: SvNV() does not store computed value to NV slot.
  • Perl #131100: The ../ relative path misbehaving with regard to default_inc_excludes_dot.
  • Perl #131110: CORE/util.h breaks if multiply included (patch).
  • Perl #131124: commit 2e6f1ae9c: breaks blead.

Dave's Free Press: Journal: Module pre-requisites analyser

Dave's Free Press: Journal: CPANdeps

Dave's Free Press: Journal: Perl isn't dieing

Dave's Free Press: Journal: YAPC::Europe 2007 report: day 3

Dave's Free Press: Journal: Devel::CheckLib can now check libraries' contents

Ocean of Awareness: Top-down parsing is guessing

Top-down parsing is guessing. Literally. Bottom-up parsing is looking.

The way you'll often hear that phrased is that top-down parsing is looking, starting at the top, and bottom-up parsing is looking, starting at the bottom. But that is misleading, because the input is at the bottom -- at the top there is nothing to look at. A usable top-down parser must have a bottom-up component, even if that component is just lookahead.

A more generous, but still accurate, way to describe the top-down component of parsers is "prediction". And prediction is, indeed, a very useful component of a parser, when used in combination with other techniques.

Of course, if a parser does nothing but predict, it can predict only one input. Top-down parsing must always be combined with a bottom-up component. This bottom-up component may be as modest as lookahead, but it must be there or else top-down parsing is really not parsing at all.

So why is top-down parsing used so much?

Top-down parsing may be unusable in its pure form, but from one point of view that is irrelevant. Top-down parsing's biggest advantage is that it is highly flexible -- there's no reason to stick to its "pure" form.

A top-down parser can be written as a series of subroutine calls -- a technique called recursive descent. Recursive descent allows you to hook in custom-written bottom-up logic at every top-down choice point, and it is a technique which is completely understandable to programmers with little or no training in parsing theory. When dealing with recursive descent parsers, it is more useful to be a seasoned, far-thinking programmer than it is to be a mathematician. This makes recursive descent very appealing to seasoned, far-thinking programmers, and they are the audience that counts.
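
As a toy illustration (my own sketch, not code from any parser discussed here), here is a recursive descent parser in Perl for a tiny grammar, with a one-token lookahead standing in for the bottom-up component:

use strict;
use warnings;

# grammar:  list ::= '(' number { ',' number } ')'
my @tokens = '( 1, 22, 333 )' =~ /\d+|[(),]/g;    # trivial lexer

sub peek { $tokens[0] // '' }
sub take { shift(@tokens) // '' }

sub parse_number {
    my $t = take();
    die "expected a number, got '$t'\n" unless $t =~ /\A\d+\z/;
    return $t;
}

sub parse_list {
    die "expected '('\n" unless take() eq '(';
    my @items = parse_number();
    while ( peek() eq ',' ) {    # the lookahead is the bottom-up component
        take();                  # consume the ','
        push @items, parse_number();
    }
    die "expected ')'\n" unless take() eq ')';
    return \@items;
}

my $parsed = parse_list();
print "@$parsed\n";              # prints: 1 22 333

Each rule is just a subroutine, so the programmer can replace the body of any rule with arbitrary custom logic, which is exactly the flexibility described above.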

Switching techniques

You can even use the flexibility of top-down to switch away from top-down parsing. For example, you could claim that a top-down parser could do anything my own parser (Marpa) could do, because a recursive descent parser can call a Marpa parser.

A less dramatic switchoff, and one that still leaves the parser with a good claim to be basically top-down, is very common. Arithmetic expressions are essential for a computer language. But they are also among the many things top-down parsing cannot handle, even with ordinary lookahead. Even so, most computer languages these days are parsed top-down -- by recursive descent. These recursive descent parsers deal with expressions by temporarily handing control over to a bottom-up operator precedence parser. Neither of these parsers is extremely smart about the hand-over and hand-back -- it is up to the programmer to make sure the two play together nicely. But used with caution, this approach works.

Top-down parsing and language-oriented programming

But what about taking top-down methods into the future of language-oriented programming, extensible languages, and grammars which write grammars? Here we are forced to confront the reality -- that the effectiveness of top-down parsing comes entirely from the foreign elements that are added to it. Starting from a basis of top-down parsing is literally starting with nothing. As I have shown in more detail elsewhere, top-down techniques simply do not have enough horsepower to deal with grammar-driven programming.

Perl 6 grammars are top-down -- PEG with lots of extensions. These extensions include backtracking, backtracking control, a new style of tie-breaking and lots of opportunity for the programmer to intervene and customize everything. But behind it all is a top-down parse engine.

One aspect of Perl 6 grammars might be seen as breaking out of the top-down trap. That trick of switching over to a bottom-up operator precedence parser for expressions, which I mentioned above, is built into Perl 6 and semi-automated. (I say semi-automated because making sure the two parsers "play nice" with each other is not automated -- that's still up to the programmer.)

As far as I know, this semi-automation of expression handling is new with Perl 6 grammars, and it may prove handy for duplicating what is done in recursive descent parsers. But it adds no new technique to those already in use. And features like

  • multiple types of expression, which can be told apart based on their context,
  • n-ary expressions for arbitrary n, and
  • the autogeneration of multiple rules, each allowing a different precedence scheme, for expressions of arbitrary arity and associativity,

all of which are available and in current use in Marpa, are impossible for the technology behind Perl 6 grammars.

I am a fan of the Perl 6 effort. Obviously, I have doubts about one specific set of hopes for Perl 6 grammars. But these hopes have not been central to the Perl 6 effort, and I will be an eager student of the Perl 6 team's work over the coming months.

Comments

To learn more about Marpa, there's the official web site maintained by Ron Savage. I also have a Marpa web site. Comments on this post can be made in Marpa's Google group, or on our IRC channel: #marpa at freenode.net.

Dave's Free Press: Journal: I Love Github

Dave's Free Press: Journal: Palm Treo call db module

Dave's Free Press: Journal: Graphing tool

Dave's Free Press: Journal: Travelling in time: the CP2000AN

Dave's Free Press: Journal: XML::Tiny released

Dave's Free Press: Journal: YAPC::Europe 2007 report: day 1

Ocean of Awareness: Parsing: an expanded timeline

The fourth century BCE: In India, Panini creates a sophisticated description of the Sanskrit language, exact and complete, and including pronunciation. Sanskrit could be recreated using nothing but Panini's grammar. Panini's grammar is probably the first formal system of any kind, predating Euclid. Even today, nothing like it exists for any other natural language of comparable size or corpus. Panini is the object of serious study today. But in the 1940's and 1950's Panini is almost unknown in the West. His work has no direct effect on the other events in this timeline.

1943: Emil Post defines and studies a formal rewriting system using productions. With this, the process of reinventing Panini in the West begins.

1948: Claude Shannon publishes the foundation paper of information theory. Andrey Markov's finite state processes are used heavily.

1952: Grace Hopper writes a linker-loader and describes it as a "compiler". She seems to be the first person to use this term for a computer program. Hopper uses the term "compiler" in its original sense: "something or someone that brings other things together".

1954: At IBM, a team under John Backus begins working on the language which will be called FORTRAN. The term "compiler" is still being used in Hopper's looser sense, instead of its modern one. In particular, there is no implication that the output of a "compiler" is ready for execution by a computer. The output of one 1954 "compiler", for example, produces relative addresses, which need to be translated by hand before a machine can execute them.

1955: Noam Chomsky is awarded a Ph.D. in linguistics and accepts a teaching post at MIT. MIT does not have a linguistics department and Chomsky, in his linguistics course, is free to teach his own approach, highly original and very mathematical.

1956: Chomsky publishes the paper which is usually considered the foundation of Western formal language theory. The paper advocates a natural language approach that involves

  • a bottom layer, using Markov's finite state processes;
  • a middle, syntactic layer, using context-free grammars and context-sensitive grammars; and
  • a top layer, which involves mappings or "transformations" of the output of the syntactic layer.

These layers resemble, and will inspire, the lexical, syntactic and AST transformation phases of modern parsers. For finite state processes, Chomsky acknowledges Markov. The other layers seem to be Chomsky's own formulations -- Chomsky does not cite Post's work.

1957: Stephen Kleene discovers regular expressions, a very handy notation for Markov's processes. Regular expressions turn out to describe exactly the mathematical objects being studied as finite state automata, as well as some of the objects being studied as neural nets.

1957: Noam Chomsky publishes Syntactic Structures, one of the most influential books of all time. The orthodoxy in 1957 is structural linguistics which argues, with Sherlock Holmes, that "it is a capital mistake to theorize in advance of the facts". Structuralists start with the utterances in a language, and build upward.

But Chomsky claims that without a theory there are no facts: there is only noise. The Chomskyan approach is to start with a grammar, and use the corpus of the language to check its accuracy. Chomsky's approach will soon come to dominate linguistics.

1957: Backus's team makes the first FORTRAN compiler available to IBM customers. FORTRAN is the first high-level language that will find widespread implementation. As of this writing, it is the oldest language that survives in practical use. FORTRAN is a line-by-line language and its parsing is primitive.

1958: John McCarthy's LISP appears. LISP goes beyond the line-by-line syntax -- it is recursively structured. But the LISP interpreter does not find the recursive structure: the programmer must explicitly indicate the structure herself, using parentheses.

1959: Backus invents a new notation to describe the IAL language (aka ALGOL). Backus's notation is influenced by his study of Post -- he seems not to have read Chomsky until later.

1960: Peter Naur improves the Backus notation and uses it to describe ALGOL 60. The improved notation will become known as Backus-Naur Form (BNF).

1960: The ALGOL 60 report specifies, for the first time, a block structured language. ALGOL 60 is recursively structured but the structure is implicit -- newlines are not semantically significant, and parentheses indicate syntax only in a few specific cases. The ALGOL compiler will have to find the structure. It is a case of 1960's optimism at its best. As the ALGOL committee is well aware, a parsing algorithm capable of handling ALGOL 60 does not yet exist. But the risk they are taking will soon pay off.

1960: A.E. Glennie publishes his description of a compiler-compiler. Glennie's "universal compiler" is more of a methodology than an implementation -- the compilers must be written by hand. Glennie credits both Chomsky and Backus, and observes that the two notations are "related". He also mentions Post's productions. Glennie may have been the first to use BNF as a description of a procedure instead of as the description of a Chomsky grammar. Glennie points out that the distinction is "important".

Chomskyan BNF and procedural BNF: BNF, when used as a Chomsky grammar, describes a set of strings, and does not describe how to parse strings according to the grammar. BNF notation, if used to describe a procedure, is a set of instructions, to be tried in some order, and used to process a string. Procedural BNF describes a procedure first, and a language only indirectly.

Both procedural and Chomskyan BNF describe languages, but usually not the same language. That is, as the sketch after the following list illustrates,

  • Suppose D is some BNF description.
  • Let P(D) be D interpreted as a procedure,
  • Let L(P(D)) be the language which the procedure P(D) parses.
  • Let G(D) be D interpreted as a Chomsky grammar.
  • Let L(G(D)) be the language which the grammar G(D) describes.
  • Then, usually, L(P(D)) != L(G(D)).
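
To make the distinction concrete, here is a tiny illustration of my own (not from any of the papers in this timeline), written in Python. It assumes the procedural reading of a description commits to the first alternative that matches, PEG-style, and never reconsiders that choice:

    # D:   S ::= 'a' | 'a' 'a'

    def P_D(s):
        """D read as a procedure: ordered choice, no backtracking into it."""
        if s.startswith("a"):        # the first alternative matches, so commit
            return len(s) == 1       # ... and only then require all input used
        if s.startswith("aa"):       # never reached: the first branch already won
            return len(s) == 2
        return False

    def G_D(s):
        """D read as a Chomsky grammar: the set of strings it generates."""
        return s in {"a", "aa"}

    for s in ("a", "aa"):
        print(s, P_D(s), G_D(s))
    # a   True  True
    # aa  False True    -> L(P(D)) != L(G(D))

Read as a grammar, D describes both "a" and "aa"; read as a committed-choice procedure, it accepts only "a". The two readings of the same description define different languages.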

The pre-Chomskyan approach, using procedural BNF, is far more natural to someone trained as a computer programmer. The parsing problem appears to the programmer in the form of strings to be parsed, exactly the starting point of procedural BNF and pre-Chomsky parsing.

Even when the Chomskyan approach is pointed out, it does not at first seem very attractive. With the pre-Chomskyan approach, the examples of the language more or less naturally lead to a parser. In the Chomskyan approach the programmer has to search for an algorithm to parse strings according to his grammar -- and the search for good algorithms to parse Chomskyan grammars has proved surprisingly long and difficult. Handling semantics is more natural with a Chomskyan approach. But, using captures, semantics can be added to a pre-Chomskyan parser and, with practice, this seems natural enough.

Despite the naturalness of the pre-Chomskyan approach to parsing, we will find that the first fully-described automated parsers are Chomskyan. This is a testimony to Chomsky's influence at the time. We will also see that Chomskyan parsers have been dominant ever since.

1961: In January, Ned Irons publishes a paper describing his ALGOL 60 parser. It is the first paper to fully describe any parser. The Irons algorithm is Chomskyan and top-down with a "left corner" element. The Irons algorithm is general, meaning that it can parse anything written in BNF. It is syntax-driven (aka declarative), meaning that the parser is actually created from the BNF -- the parser does not need to be hand-written.

1961: Peter Lucas publishes the first description of a purely top-down parser. This can be considered to be recursive descent, though in Lucas's paper the algorithm has a syntax-driven implementation, usable only for a restricted class of grammars. Today we think of recursive descent as a methodology for writing parsers by hand (a minimal sketch follows the list below). Hand-coded approaches became more popular in the 1960's due to three factors:

  • Memory and CPU were both extremely limited. Hand-coding paid off, even when the gains were small.
  • Non-hand coded top-down parsing, of the kind Lucas's syntax-driven approach allowed, is a very weak parsing technique. It was (and still is) often necessary to go beyond its limits.
  • Top-down parsing is intuitive -- it essentially means calling subroutines. It therefore requires little or no knowledge of parsing theory. This makes it a good fit for hand-coding.
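
To illustrate the "calling subroutines" point, here is a minimal hand-written recursive descent parser in Python. It is a sketch of the general technique, not code from Lucas or any other paper cited here; the toy grammar and all names are my own:

    import re

    def tokenize(text):
        return re.findall(r"\d+|[()+*]", text)

    class Parser:
        # Each nonterminal of the toy grammar becomes one subroutine.
        def __init__(self, tokens):
            self.tokens, self.pos = tokens, 0

        def peek(self):
            return self.tokens[self.pos] if self.pos < len(self.tokens) else None

        def expect(self, tok):
            if self.peek() != tok:
                raise SyntaxError(f"expected {tok!r}, got {self.peek()!r}")
            self.pos += 1

        def expr(self):                  # expr ::= term ('+' term)*
            node = self.term()
            while self.peek() == "+":
                self.expect("+")
                node = ("+", node, self.term())
            return node

        def term(self):                  # term ::= factor ('*' factor)*
            node = self.factor()
            while self.peek() == "*":
                self.expect("*")
                node = ("*", node, self.factor())
            return node

        def factor(self):                # factor ::= NUMBER | '(' expr ')'
            tok = self.peek()
            if tok == "(":
                self.expect("(")
                node = self.expr()
                self.expect(")")
                return node
            if tok and tok.isdigit():
                self.pos += 1
                return int(tok)
            raise SyntaxError(f"unexpected token {tok!r}")

    print(Parser(tokenize("2+3*(4+5)")).expr())
    # ('+', 2, ('*', 3, ('+', 4, 5)))

Note that the grammar lives only in the comments; the code itself is the parser, which is exactly the procedural, pre-Chomskyan point of view.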

1963: L. Schmidt, Howard Metcalf, and Val Schorre present papers on syntax-directed compilers at a Denver conference.

1964: Schorre publishes a paper on the Meta II "compiler writing language", summarizing the papers of the 1963 conference. Schorre cites both Backus and Chomsky as sources for Meta II's notation. Schorre notes that his parser is "entirely different" from that of Irons 1961 -- in fact it is pre-Chomskyan. Meta II is a template, rather than something that readers can use, but in principle it can be turned into a fully automated compiler-compiler.

1965: Don Knuth invents LR parsing. The LR algorithm is deterministic, Chomskyan and bottom-up, but it is not thought to be practical. Knuth is primarily interested in the mathematics.

1968: Jay Earley invents the algorithm named after him. Like the Irons algorithm, Earley's algorithm is Chomskyan, syntax-driven and fully general. Unlike the Irons algorithm, it does not backtrack. Earley's algorithm is both top-down and bottom-up at once -- it uses dynamic programming and keeps track of the parse in tables. Earley's approach makes a lot of sense and looks very promising indeed (a minimal recognizer sketch follows the list below), but there are three serious issues:

  • First, there is a bug in the handling of zero-length rules.
  • Second, it is quadratic for right recursions.
  • Third, the bookkeeping required to set up the tables is, by the standards of 1968 hardware, daunting.
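
For readers who want to see the shape of the algorithm, here is a minimal Earley recognizer in Python. It is an illustrative sketch of my own, not anyone's production code, and the toy grammar is made up (it has no zero-length rules, so the bug described above does not arise). Its brute-force fixed-point loop re-processes each table entry until nothing changes -- the kind of bookkeeping that looked daunting in 1968:

    from collections import namedtuple

    # Toy grammar: nonterminal -> list of alternatives (tuples of symbols).
    GRAMMAR = {
        "Sum":     [("Sum", "+", "Product"), ("Product",)],
        "Product": [("Product", "*", "Factor"), ("Factor",)],
        "Factor":  [("num",)],
    }
    START = "Sum"

    # An Earley item: a rule, a dot position in it, and the origin -- the
    # input position at which this instance of the rule started.
    Item = namedtuple("Item", "lhs rhs dot origin")

    def recognize(tokens):
        n = len(tokens)
        chart = [set() for _ in range(n + 1)]       # one item set per position
        for rhs in GRAMMAR[START]:
            chart[0].add(Item(START, rhs, 0, 0))
        for i in range(n + 1):
            changed = True
            while changed:                          # process set i to a fixed point
                changed = False
                for item in list(chart[i]):
                    if item.dot < len(item.rhs):
                        sym = item.rhs[item.dot]
                        if sym in GRAMMAR:          # PREDICT a nonterminal
                            for rhs in GRAMMAR[sym]:
                                new = Item(sym, rhs, 0, i)
                                if new not in chart[i]:
                                    chart[i].add(new)
                                    changed = True
                        elif i < n and tokens[i] == sym:    # SCAN a terminal
                            chart[i + 1].add(Item(item.lhs, item.rhs,
                                                  item.dot + 1, item.origin))
                    else:                           # COMPLETE a finished rule
                        for parent in list(chart[item.origin]):
                            if (parent.dot < len(parent.rhs)
                                    and parent.rhs[parent.dot] == item.lhs):
                                new = Item(parent.lhs, parent.rhs,
                                           parent.dot + 1, parent.origin)
                                if new not in chart[i]:
                                    chart[i].add(new)
                                    changed = True
        return any(item.lhs == START and item.dot == len(item.rhs)
                   and item.origin == 0 for item in chart[n])

    print(recognize(["num", "+", "num", "*", "num"]))   # True
    print(recognize(["num", "+", "*"]))                 # False

The three actions -- predict, scan, complete -- are all there is to it; the cost is in the tables.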

1969: Frank DeRemer describes a new variant of Knuth's LR parsing. DeRemer's LALR algorithm requires only a stack and a state table of quite manageable size. LALR looks practical.

1969: Ken Thompson writes the "ed" editor as one of the first components of UNIX. At this point, regular expressions are an esoteric mathematical formalism. Through the "ed" editor and its descendants, regular expressions will become an everyday part of the working programmer's toolkit.

Recognizers: In comparing algorithms, it can be important to keep in mind whether they are recognizers or parsers. A recognizer is a program which takes a string and produces a "yes" or "no" according to whether the string is in the language. Regular expressions are typically used as recognizers. A parser is a program which takes a string and produces a tree reflecting its structure according to a grammar. The algorithm for a compiler clearly must be a parser, not a recognizer. Recognizers can be, to some extent, used as parsers by introducing captures.
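
A small Python example of the distinction: the same regular expression used first as a recognizer, then pressed into service as a crude parser by adding captures (the pattern is a made-up example):

    import re

    date = re.compile(r"(\d{4})-(\d{2})-(\d{2})")

    print(bool(date.fullmatch("2017-04-19")))   # recognizer: True or False
    match = date.fullmatch("2017-04-19")
    print(match.groups())                       # captures: ('2017', '04', '19')

The captures recover a little structure, but only as a flat tuple -- nothing like the tree a parser would build.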

1972: Alfred Aho and Jeffrey Ullman publish a two volume textbook summarizing the theory of parsing. This book is still important. It is also distressingly up-to-date -- progress in parsing theory slowed dramatically after 1972. Aho and Ullman describe a straightforward fix to the zero-length rule bug in Earley's original algorithm. Unfortunately, this fix involves adding even more bookkeeping to Earley's.

1972: Under the names TDPL and GTDPL, Aho and Ullman investigate the non-Chomskyan parsers in the Schorre lineage. They note that "it can be quite difficult to determine what language is defined by a TDPL parser". That is, GTDPL parsers do whatever they do, and that whatever is something the programmer in general will not be able to describe. The best a programmer can usually do is to create a test suite and fiddle with the GTDPL description until it passes. Correctness cannot be established in any stronger sense. GTDPL is an extreme form of the old joke that "the code is the documentation" -- with GTDPL nothing documents the language of the parser, not even the code.

GTDPL's obscurity buys nothing in the way of additional parsing power. Like all non-Chomskyan parsers, GTDPL is basically an extremely powerful recognizer. Pressed into service as a parser, it is comparatively weak. As a parser, GTDPL is essentially equivalent to Lucas's 1961 syntax-driven algorithm, which was in turn a restricted form of recursive descent.

At or around this time, rumor has it that the main line of development for GTDPL parsers is classified secret by the US government. GTDPL parsers have the property that even small changes to them can be very labor-intensive. For some government contractors, GTDPL parsing provides steady work for years to come. Public interest in GTDPL fades.

1975: Bell Labs converts its C compiler from hand-written recursive descent to DeRemer's LALR algorithm.

1977: The first "Dragon book" comes out. This soon-to-be classic textbook is nicknamed after the drawing on the front cover, in which a knight takes on a dragon. Emblazoned on the knight's lance are the letters "LALR". From here on out, to speak lightly of LALR will be to besmirch the escutcheon of parsing theory.

1979: Bell Laboratories releases Version 7 UNIX. V7 includes what is, by far, the most comprehensive, usable and easily available compiler writing toolkit yet developed.

1979: Part of the V7 toolkit is Yet Another Compiler Compiler (YACC). YACC is LALR-powered. Despite its name, YACC is the first compiler-compiler in the modern sense. For some useful languages, the process of going from Chomskyan specification to executable is fully automated. Most practical languages, including the C language and YACC's own input language, still require manual hackery. Nonetheless, after two decades of research, it seems that the parsing problem is solved.

1987: Larry Wall introduces Perl 1. Perl embraces complexity like no previous language. Larry uses YACC and LALR very aggressively -- to my knowledge more aggressively than anyone before or since.

1991: Joop Leo discovers a way of speeding up right recursions in Earley's algorithm. Leo's algorithm is linear for just about every unambiguous grammar of practical interest, and for many ambiguous ones as well. In 1991 hardware is six orders of magnitude faster than 1968 hardware, so that the issue of bookkeeping overhead has receded in importance. This is a major discovery. When it comes to speed, the game has changed in favor of the Earley algorithm.

But Earley parsing is almost forgotten. Twenty years will pass before anyone writes a practical implementation of Leo's algorithm.

1990's: Earley's is forgotten. So everyone in LALR-land is content, right? Wrong. Far from it, in fact. Users of LALR are making unpleasant discoveries. While LALR automatically generates their parsers, debugging them is so hard they could just as easily write the parser by hand. Once debugged, their LALR parsers are fast for correct inputs. But almost all they tell users about an incorrect input is that it is incorrect. In Larry's words, LALR is "fast but stupid".

2000: Larry Wall decides on a radical reimplementation of Perl -- Perl 6. Larry does not even consider using LALR again.

2002: John Aycock and R. Nigel Horspool publish their attempt at a fast, practical Earley's parser. Missing from it is Joop Leo's improvement -- they seem not to be aware of it. Their own speedup is limited in what it achieves and the complications it introduces can be counter-productive at evaluation time. But buried in their paper is a solution to the zero-length rule bug. And this time the solution requires no additional bookkeeping.

2004: Bryan Ford publishes his paper on PEG. Implementers by now are avoiding YACC, and it seems as if there might soon be no syntax-driven algorithms in practical use. Ford fills this gap by repackaging the nearly-forgotten GTDPL. Ford adds packratting, so that PEG is always linear, and provides PEG with an attractive new syntax. But nothing has been done to change the problematic behaviors of GTDPL.

2006: GNU announces that the GCC compiler's parser has been rewritten. For three decades, the industry's flagship C compilers have used LALR as their parser -- proof of the claim that LALR and serious parsing are equivalent. Now, GNU replaces LALR with the technology that it replaced a quarter century earlier: recursive descent.

Today: After five decades of parsing theory, the state of the art seems to be back where it started. We can imagine someone taking Ned Irons's original 1961 algorithm from the first paper ever published describing a parser, and republishing it today. True, he would have to translate its code from the mix of assembler and ALGOL into something more fashionable, say Haskell. But with that change, it might look like a breath of fresh air.

Marpa: an afterword

The recollections of my teachers cover most of this timeline. My own begin around 1970. Very early on, as a graduate student, I became unhappy with the way the field was developing. Earley's algorithm looked interesting, and it was something I returned to on and off.

The original vision of the 1960's was a parser that was

  • efficient,
  • practical,
  • general, and
  • syntax-driven.

By 2010 this vision seemed to have gone the same way as many other 1960's dreams. The rhetoric stayed upbeat, but parsing practice had become a series of increasingly desperate compromises.

But, while nobody was looking for them, the solutions to the problems encountered in the 1960's had appeared in the literature. Aycock and Horspool had solved the zero-length rule bug. Joop Leo had found the speedup for right recursion. And the issue of bookkeeping overhead had pretty much evaporated on its own. Machine operations are now a billion times faster than in 1968, and are probably no longer relevant in any case -- cache misses are now the bottleneck.

The programmers of the 1960's would have been prepared to trust a fully declarative Chomskyan parser. With the experience with LALR in their collective consciousness, modern programmers might be more guarded. As Lincoln said, "Once a cat's been burned, he won't even sit on a cold stove." But I found it straightforward to rearrange the Earley parse engine to allow efficient event-driven handovers between procedural and syntax-driven logic. And Earley tables provide the procedural logic with full knowledge of the state of the parse so far, so that Earley's algorithm is a better platform for hand-written procedural logic than recursive descent.

References, comments, etc.

My implementation of Earley's algorithm is called Marpa. For more about Marpa, there is the semi-official web site, maintained by Ron Savage. The official, but more limited, Marpa website is my personal one. Comments on this post can be made in Marpa's Google group, or on our IRC channel: #marpa at freenode.net.

Dave's Free Press: Journal: Thanks, Yahoo!

Dave's Free Press: Journal: YAPC::Europe 2007 report: day 2

Ocean of Awareness: What are the reasonable computer languages?

"You see things; and you say 'Why?' But I dream things that never were; and I say 'Why not?'" -- George Bernard Shaw

In the 1960's and 1970's computer languages were evolving rapidly. It was not clear which way they were headed. Would most programming be done with general-purpose languages? Or would programmers create a language for every task domain? Or even for every project? And, if lots of languages were going to be created, what kinds of languages would be needed?

It was in that context that Čulik and Cohen, in a 1973 paper, outlined what they thought programmers would want and should have. In keeping with the spirit of the time, it was quite a lot:

  • Programmers would want to extend their grammars with new syntax, including new kinds of expressions.
  • Programmers would also want to use tools that automatically generated new syntax.
  • Programmers would not want to, and especially in the case of auto-generated syntax would usually not be able to, massage the syntax into very restricted forms. Instead, programmers would create grammars and languages which required unlimited lookahead to disambiguate, and they would require parsers which could handle these grammars.
  • Finally, programmers would need to be able to rely on all of this parsing being done in linear time.

Today, we think we know that Čulik and Cohen's vision was naive, because we think we know that parsing technology cannot support it. We think we know that parsing is much harder than they thought.

The eyeball grammars

As a thought problem, consider the "eyeball" class of grammars. The "eyeball" class of grammars contains all the grammars that a human can parse at a glance. If a grammar is in the eyeball class, but a computer cannot parse it, it presents an interesting choice. Either,

  • your computer is not using the strongest practical algorithm; or
  • your mind is using some power which cannot be reduced to a machine computation.

There are some people out there (I am one of them) who don't believe that everything the mind can do reduces to a machine computation. But even those people will tend to go for the first choice in this case: There must be some practical computer parsing algorithm which can do at least as well at parsing as a human can do by "eyeball". In other words, the class of "reasonable grammars" should contain the eyeball class.

Čulik and Cohen's candidate for the class of "reasonable grammars" were the grammars that a deterministic parse engine could parse if it had a lookahead that was infinite, but restricted to distinguishing between regular expressions. They called these the LR-regular, or LRR, grammars. And the LRR grammars do in fact seem to be a good first approximation to the eyeball class. They do not allow lookahead that contains things that you have to count, like palindromes. And, while I'd be hard put to eyeball every possible string for every possible regular expression, intuitively the concept of scanning for a regular expression does seem close to capturing the idea of glancing through a text looking for a telltale pattern.

So what happened?

Alas, the algorithm in the Čulik and Cohen paper turned out to be impractical. But in 1991, Joop Leo discovered a way to adapt Earley's algorithm to parse the LRR grammars in linear time, without doing the lookahead. And Leo's algorithm does have a practical implementation: Marpa.

References, comments, etc.

To learn more about Marpa, there's the official web site maintained by Ron Savage. I also have a Marpa web site. Comments on this post can be made in Marpa's Google group, or on our IRC channel: #marpa at freenode.net.

Dave's Free Press: Journal: YAPC::Europe 2007 travel plans

Dave's Free Press: Journal: Wikipedia handheld proxy

Dave's Free Press: Journal: Bryar security hole

Dave's Free Press: Journal: POD includes

Dave's Free Press: Journal: cgit syntax highlighting

Dave's Free Press: Journal: CPAN Testers' CPAN author FAQ

Ocean of Awareness: What parser do birds use?

"Here we provide, to our knowledge, the first unambiguous experimental evidence for compositional syntax in a non-human vocal system." -- "Experimental evidence for compositional syntax in bird calls", Toshitaka N. Suzuki, David Wheatcroft & Michael Griesser Nature Communications 7, Article number: 10986

In this post I look at a subset of the language of the Japanese great tit, also known as Parus major. The above cited article presents evidence that bird brains can parse this language. What about standard modern computer parsing methods? Here is the subset -- probably a tiny one -- of the language actually used by Parus major.

      S ::= ABC
      S ::= D
      S ::= ABC D
      S ::= D ABC
    

Classifying the Parus major grammar

Grammophone is a very handy new tool for classifying grammars. Its own parser is somewhat limited, so that it requires a period to mark the end of a rule. The above grammar is in Marpa's SLIF format, which is smart enough to use the "::=" operator to spot the beginning and end of rules, just as the human eye does. Here's the same grammar converted into a form acceptable to Grammophone:

      S -> ABC .
      S -> D .
      S -> ABC D .
      S -> D ABC .
    

Grammophone tells us that the Parus major grammar is not LL(1), but that it is LALR(1).

What does this mean?

LL(1) is the class of grammars parseable by top-down methods: it's the best class for characterizing most parsers in current use, including recursive descent, PEG, and Perl 6 grammars. All of these parsers fall short of dealing with the Parus major language.

LALR(1) is probably best known from its implementations in bison and yacc. While able to handle this subset of Parus's language, LALR(1) has its own, very strict, limits. Whether LALR(1) could handle the full complexity of the Parus major language is a serious question. But it's a question that in practice would probably not arise. LALR(1) has horrible error handling properties.

When the input is correct and within its limits, an LALR-driven parser is fast and works well. But if the input is not perfectly correct, LALR parsers produce no useful analysis of what went wrong. If Parus hears "d abc d", a parser like Marpa, on the other hand, can produce something like this:

# * String before error: abc d\s
# * The error was at line 1, column 7, and at character 0x0064 'd', ...
# * here: d
    

Parus uses its language in predatory contexts, and one can assume that a Parus with a preference for parsers whose error handling is on an LALR(1) level will not be keeping its alleles in the gene pool for very long.

References, comments, etc.

Those readers content with sub-Parus parsing methods may stop reading here. Those with greater parsing ambitions, however, may wish to learn more about Marpa. A Marpa test script for parsing the Parus subset is in a Github gist. Marpa has a semi-official web site, maintained by Ron Savage. The official, but more limited, Marpa website is my personal one. Comments on this post can be made in Marpa's Google group, or on our IRC channel: #marpa at freenode.net.

Dave's Free Press: Journal: Thankyou, Anonymous Benefactor!

Dave's Free Press: Journal: Number::Phone release

Dave's Free Press: Journal: Ill

Dave's Free Press: Journal: CPANdeps upgrade

Dave's Free Press: Journal: YAPC::Europe 2006 report: day 3

Ocean of Awareness: Introduction to Marpa Book in progress

What follows is a summary of the features of the Marpa algorithm, followed by a discussion of potential applications. It refers to itself as a "monograph", because it is a draft of part of the introduction to a technical monograph on the Marpa algorithm. I hope the entire monograph will appear in a few weeks.

The Marpa project

The Marpa project was intended to create a practical and highly available tool to generate and use general context-free parsers. Tools of this kind had long existed for LALR and regular expressions. But, despite an encouraging academic literature, no such tool had existed for context-free parsing. The first stable version of Marpa was uploaded to a public archive on Solstice Day 2011. This monograph describes the algorithm used in the most recent version of Marpa, Marpa::R2. It is a simplification of the algorithm presented in my earlier paper.

A proven algorithm

While the presentation in this monograph is theoretical, the approach is practical. The Marpa::R2 implementation has been widely available for some time, and has seen considerable use, including in production environments. Many of the ideas in the parsing literature satisfy theoretical criteria, but in practice turn out to face significant obstacles. An algorithm may be as fast as reported, but may turn out not to allow adequate error reporting. Or a modification may speed up the recognizer, but require additional processing at evaluation time, leaving no advantage to compensate for the additional complexity.

In this monograph, I describe the Marpa algorithm as it was implemented for Marpa::R2. In many cases, I believe there are better approaches than those I have described. But I treat these techniques, however solid their theory, as conjectures. Whenever I mention a technique that was not actually implemented in Marpa::R2, I will always explicitly state that that technique is not in Marpa as implemented.

Features

General context-free parsing

As implemented, Marpa parses all "proper" context-free grammars. The proper context-free grammars are those which are free of cycles, unproductive symbols, and inaccessible symbols. Worst case time bounds are never worse than those of Earley's algorithm, and therefore never worse than O(n**3).

Linear time for practical grammars

Currently, the grammars suitable for practical use are thought to be a subset of the deterministic context-free grammars. Using a technique discovered by Joop Leo, Marpa parses all of these in linear time. Leo's modification of Earley's algorithm is O(n) for LR-regular grammars. Leo's modification also parses many ambiguous grammars in linear time.

Left-eidetic

The original Earley algorithm kept full information about the parse --- including partially and fully recognized rule instances --- in its tables. At every parse location, before any symbols are scanned, Marpa's parse engine makes available its information about the state of the parse so far. This information is in useful form, and can be accessed efficiently.

Recoverable from read errors

When Marpa reads a token which it cannot accept, the error is fully recoverable. An application can try to read another token. The application can do this repeatedly as long as none of the tokens are accepted. Once the application provides a token that is accepted by the parser, parsing will continue as if the unsuccessful read attempts had never been made.

Ambiguous tokens

Marpa allows ambiguous tokens. These are often useful in natural language processing where, for example, the same word might be a verb or a noun. Use of ambiguous tokens can be combined with recovery from rejected tokens so that, for example, an application could react to the rejection of a token by reading two others.

Using the features

Error reporting

An obvious application of left-eideticism is error reporting. Marpa's abilities in this respect are ground-breaking. For example, users typically regard an ambiguity as an error in the grammar. Marpa, as currently implemented, can detect an ambiguity and report specifically where it occurred and what the alternatives were.

Event driven parsing

As implemented, Marpa::R2 allows the user to define "events". Events can be defined that trigger when a specified rule is complete, when a specified rule is predicted, when a specified symbol is nulled, when a user-specified lexeme has been scanned, or when a user-specified lexeme is about to be scanned. A mid-rule event can be defined by adding a nulling symbol at the desired point in the rule, and defining an event which triggers when the symbol is nulled.

Ruby slippers parsing

Left-eideticism, efficient error recovery, and the event mechanism can be combined to allow the application to change the input in response to feedback from the parser. In traditional parser practice, error detection is an act of desperation. In contrast, Marpa's error detection is so painless that it can be used as the foundation of new parsing techniques.

For example, if a token is rejected, the lexer is free to create a new token in the light of the parser's expectations. This approach can be seen as making the parser's "wishes" come true, and I have called it "Ruby Slippers Parsing".

One use of the Ruby Slippers technique is to parse with a clean but oversimplified grammar, programming the lexical analyzer to make up for the grammar's shortcomings on the fly. As part of Marpa::R2, the author has implemented an HTML parser, based on a grammar that assumes that all start and end tags are present. Such an HTML grammar is too simple even to describe perfectly standard-conformant HTML, but the lexical analyzer is programmed to supply start and end tags as requested by the parser. The result is a simple and cleanly designed parser that parses very liberal HTML and accepts all input files, in the worst case treating them as highly defective HTML.
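
Here is a conceptual sketch of the Ruby Slippers idea in Python. It is emphatically not Marpa's HTML parser: in Marpa the wished-for tokens are driven by the Earley parse engine's own expectations, while in this made-up sketch a simple tag stack stands in for the parser, and the driver synthesizes whatever end tag the "parser" says it wants:

    def parse_liberal(tokens):
        """Tokens are start tags like '<p>' and end tags like '</p>'."""
        stack, events, queue = [], [], list(tokens)
        while queue:
            tok = queue[0]
            if tok.startswith("</"):
                want = tok[2:-1]
                if stack and stack[-1] == want:            # end tag accepted
                    stack.pop()
                    events.append(f"close {want}")
                    queue.pop(0)
                elif want in stack:                        # rejected: wish for the
                    wished = f"</{stack[-1]}>"             # innermost open element's
                    events.append(f"wished-for {wished}")  # end tag instead
                    queue.insert(0, wished)
                else:                                      # stray end tag: drop it
                    events.append(f"dropped {tok}")
                    queue.pop(0)
            else:                                          # start tag
                stack.append(tok[1:-1])
                events.append(f"open {tok[1:-1]}")
                queue.pop(0)
        while stack:                                       # wish for end tags at EOF
            events.append(f"wished-for </{stack.pop()}>")
        return events

    for event in parse_liberal(["<html>", "<p>", "<b>", "</p>", "</html>"]):
        print(event)
    # open html / open p / open b / wished-for </b> / close b / close p / close html

The input never closes its <b> element, yet the "parser" is never forced into an error: the missing token is simply wished into existence.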

Ambiguity as a language design technique

In current practice, ambiguity is avoided in language design. This is very different from the practice in the languages humans choose when communicating with each other. Human languages exploit ambiguity in order to design highly flexible, powerfully expressive languages. For example, the language of this monograph, English, is notoriously ambiguous.

Ambiguity of course can present a problem. A sentence in an ambiguous language may have undesired meanings. But note that this is not a reason to ban potential ambiguity --- it is only a problem with actual ambiguity.

Syntax errors, for example, are undesired, but nobody tries to design languages to make syntax errors impossible. A language in which every input was well-formed and meaningful would be cumbersome and even dangerous: all typos in such a language would be meaningful, and the parser would never warn the user about errors, because there would be no such thing.

With Marpa, ambiguity can be dealt with in the same way that syntax errors are dealt with in current practice. The language can be designed to be ambiguous, but any actual ambiguity can be detected and reported at parse time. This exploits Marpa's ability to report exactly where and what the ambiguity is. Marpa::R2's own parser description language, the SLIF, uses ambiguity in this way.

Auto-generated languages

In 1973, Čulik and Cohen pointed out that the ability to efficiently parse LR-regular languages opens the way to auto-generated languages. In particular, Čulik and Cohen note that a parser which can parse any LR-regular language will be able to parse a language generated using syntax macros.

Second order languages

In the literature, the term "second order language" is usually used to describe languages with features which are useful for second-order programming. True second-order languages --- languages which are auto-generated from other languages --- have not been seen as practical, since there was no guarantee that the auto-generated language could be efficiently parsed.

With Marpa, this barrier is raised. As an example, Marpa::R2's own parser description language, the SLIF, allows "precedenced rules". Precedenced rules are specified in an extended BNF. The BNF extensions allow precedence and associativity to be specified for each RHS.

Marpa::R2's precedenced rules are implemented as a true second order language. The SLIF representation of the precedenced rule is parsed to create a BNF grammar which is equivalent, and which has the desired precedence. Essentially, the SLIF does a standard textbook transformation. The transformation starts with a set of rules, each of which has a precedence and an associativity specified. The result of the transformation is a set of rules in pure BNF. The SLIF's advantage is that it is powered by Marpa, and therefore the SLIF can be certain that the grammar that it auto-generates will parse in linear time.
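
The transformation itself is easy to sketch. The following Python snippet (my own sketch of the textbook construction, not the SLIF's actual implementation; all names are made up) expands a table of operators with precedence and associativity into tiered BNF, one nonterminal per precedence level:

    # Loosest-binding level first; each level lists its operators and associativity.
    LEVELS = [
        ("+ -", "left"),
        ("* /", "left"),
        ("^",   "right"),
    ]
    PRIMARY = "Number"                # the operand at the tightest-binding level

    def precedenced_to_bnf(levels, primary):
        rules = []
        for i, (ops, assoc) in enumerate(levels):
            lhs = f"Expr{i}"
            tighter = f"Expr{i + 1}" if i + 1 < len(levels) else primary
            for op in ops.split():
                if assoc == "left":   # left recursion encodes left associativity
                    rules.append(f"{lhs} ::= {lhs} '{op}' {tighter}")
                else:                 # right recursion encodes right associativity
                    rules.append(f"{lhs} ::= {tighter} '{op}' {lhs}")
            rules.append(f"{lhs} ::= {tighter}")   # fall through to the next level
        return rules

    for rule in precedenced_to_bnf(LEVELS, PRIMARY):
        print(rule)
    # Expr0 ::= Expr0 '+' Expr1
    # Expr0 ::= Expr0 '-' Expr1
    # Expr0 ::= Expr1
    # Expr1 ::= Expr1 '*' Expr2
    # ... and so on down to  Expr2 ::= Number

The resulting pure-BNF grammar is left recursive for the left-associative levels, which is no obstacle for a general parser like Marpa.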

Notationally, Marpa's precedenced rules are an improvement over similar features in LALR-based parser generators like yacc or bison. In the SLIF, there are two important differences. First, in the SLIF's precedenced rules, precedence is generalized, so that it does not depend on the operators: there is no need to identify operators, much less class them as binary, unary, etc. This more powerful and flexible precedence notation allows the definition of multiple ternary operators, and multiple operators with arity above three.

Second, and more important, a SLIF user is guaranteed to get exactly the language that the precedenced rule specifies. The user of the yacc equivalent must hope their syntax falls within the limits of LALR.

References, comments, etc.

Marpa has a semi-official web site, maintained by Ron Savage. The official, but more limited, Marpa website is my personal one. Comments on this post can be made in Marpa's Google group, or on our IRC channel: #marpa at freenode.net.

Header image by Tambako the Jaguar. Some rights reserved.