Perl.com: Perl foreach loops

A foreach loop runs a block of code for each element of a list. No big whoop, yet “perl foreach” remains one of the most popular Google searches for the language. So we thought we’d see what’s happened in 20 years. I expand on Tom Christiansen’s slide, part of his longer presentation, and then add a new but experimental feature at the end. If you want more, there’s plenty to read in perlsyn or my book Learning Perl.

Going through a list

Unless you say otherwise, foreach aliases the topic variable $_ to the current element. You can specify the list directly in the parentheses after foreach, use an array variable, or use the result of a subroutine call (amongst other ways to get a list):

foreach ( 1, 3, 7 ) {
	print "\$_ is $_";
	}
my @numbers = ( 1, 3, 7 );
foreach ( @numbers ) {
	print "\$_ is $_";
	}
sub numbers{ return ( 1, 3, 7 ) }
foreach ( numbers() ) {
	print "\$_ is $_";
	}
sub numbers{ keys %some_hash }
foreach ( numbers() ) {
	print "\$_ is $_";
	}

Some people like to use the synonym for. There’s a proper C-style for that has three semicolon-separated parts in the parentheses. If Perl doesn’t see the two semicolons it treats for just like a foreach:

for ( my $i = 0; $i < 5; $i++ ) {  # C style
	print "\$i is $i";
	}

for ( 0 .. 4 ) {  # foreach synonym
	print "\$_ is $_";
	}

Element source gotchas

The aliasing is only temporary. After the foreach the topic variable returns to its original value:

$_ = "Original value";
my @numbers = ( 1, 3, 7 );
print "\$_ before: $_\n";
foreach ( @numbers ) {
	print "\$_ is $_\n";
	$_ = $_ * 2;
	}
print "\$_ after: $_\n";

The output shows that $_ appears unaffected by the foreach:

$_ before: Original value
$_ is 1
$_ is 3
$_ is 7
$_ after: Original value

The topic is an alias instead of a copy, a shortcut that lets your program be a little faster by not moving data around. If you change the topic variable you change the original value, as long as the list source is an array (otherwise the values are read-only and you’ll get an error):

my @numbers = ( 1, 3, 7 );
print "Before: @numbers\n";  # Before: 1 3 7
foreach ( @numbers ) {
	print "\$_ is $_\n";
	$_ = $_ * 2;
	}
print "After: @numbers\n";   # After: 2 6 14

Not only that, but if you change the source by adding or removing elements you can screw up the foreach. This example loops forever, processing the same element, because each pass through the block shifts the array elements over one position; when the iterator moves on to the next position it finds the element it just saw:

my @numbers = ( 1, 3, 7 );
foreach my $number ( @numbers ) {
	print "\$number is $number\n";
	unshift @numbers, "Added later";
	}

This output will go on forever:

$number is 1
$number is 1
$number is 1
$number is 1

Naming your own topic variable

The $_ is often handy because it’s the default variable for several Perl functions, such as chomp or split. You can use your own name by specifying a scalar variable between the foreach and the parentheses. Usually you don’t want to use that variable for something other than the loop so the usual style declares it inline with the foreach:

foreach my $number ( 1, 3, 7 ) {
	print "\$number is $number";
	}
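
Going back to $_ for a moment: since it’s the default argument for chomp and split, a quick loop can leave out the variable entirely. A minimal sketch, with made-up data:

my @lines = ( "alpha beta\n", "gamma delta\n" );
foreach ( @lines ) {
	chomp;                 # chomps $_ (and the array element it aliases)
	my @words = split;     # splits $_ on whitespace
	print "first word is $words[0]\n";
	}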

Since Perl flattens lists into one big list, you can use more than one list source in the parentheses:

my @numbers      = ( 1, 3, 7 );
my @more_numbers = ( 5, 8, 13 );
foreach my $number ( @numbers, @more_numbers ) {
	print "\$number is $number";
	}

Or a mix of source types:

my @numbers      = ( 1, 3, 7 );
my @more_numbers = ( 5, 8, 13 );
foreach my $number ( @numbers, numbers(), keys %hash ) {
	print "\$number is $number";
	}

Using your own named topic variable acts just like what you saw with $_:

my @numbers      = ( 1, 3, 7 );

my $number = 'Original value';
say "Before: $number";
foreach $number ( @numbers ) {
	say "\$number is $number";
	}
say "After: $number";

The output shows the aliasing effect and that the original value is restored after the foreach:

Before: Original value
$number is 1
$number is 3
$number is 7
After: Original value

Controlling

There are three keywords that let you control the operation of the foreach (and other looping structures): last, next, and redo.

The last stops the current iteration and quits the loop. It’s as if you immediately went past the last statement in the block and then broke out of the loop. It does not go on to the next item. You often use this with a postfix conditional:

foreach my $number ( 0 .. 5 ) {
	say "Starting $number";
	last if $number > 2;
	say "\$number is $number";
	say "Ending $number";
	}
say 'Past the loop';

You start the block for element 3 but end the loop there and continue the program after the loop:

Starting 0
$number is 0
Ending 0
Starting 1
$number is 1
Ending 1
Starting 2
$number is 2
Ending 2
Starting 3
Past the loop

The next stops the current iteration and moves on to the next one. This makes it easy to skip elements that you don’t want to process:

foreach my $number ( 0 .. 5 ) {
	say "Starting $number";
	next if $number % 2;
	say "\$number is $number";
	say "Ending $number";
	}

The output shows that you run the block with each element but only the even numbers make it past the next:

Starting 0
$number is 0
Ending 0
Starting 1
Starting 2
$number is 2
Ending 2
Starting 3
Starting 4
$number is 4
Ending 4
Starting 5

The redo restarts the current iteration of a block. You can use it with a foreach although it’s more commonly used with looping structures that aren’t meant to go through a list of items.

Here’s an example where you want to get three “good” lines of input. You iterate through the number of lines that you want and read standard input each time. If you get a blank line, you restart the same iteration with redo:

my $lines_needed = 3;
my @lines;
foreach my $line_number ( 1 .. $lines_needed ) {
	say "Reading line $line_number";
	chomp( my $line = <STDIN> );
	redo if $line =~ /\A \s* \z/x;  # try the same line number again on "blank" input
	push @lines, $line;
	}

say "Lines are:\n\t", join "\n\t", @lines;

The output shows that the loop effectively ignores the blank lines and goes back to the top of the loop. It does not move on to the next item in the list though. After getting a blank line while trying to read the second line, it tries the second line again:

Reading line 1
First line
Reading line 2

Reading line 2

Reading line 2
Second line
Reading line 3

Reading line 3

Reading line 3
Third line
Lines are:
    First line
    Second line
    Third line

That’s not very Perly, but this is an article about foreach. A better style might be to read lines with while until @lines is large enough:

my $lines_needed = 3;
my @lines;
while( <STDIN> ) {
	next if /\A \s* \z/x;
	chomp;
	push @lines, $_;
	last if @lines == $lines_needed;
	}
say "Lines are:\n\t", join "\n\t", @lines;

There’s more that you can do with these. They work with labels and nested loops. You can read more about them in perlsyn or Learning Perl.
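
For instance, a label names a loop so that last or next can act on an outer loop from inside a nested one. A minimal sketch:

OUTER: foreach my $i ( 1 .. 3 ) {
	foreach my $j ( 1 .. 3 ) {
		next OUTER if $j > $i;      # move on to the next $i
		last OUTER if $i * $j > 4;  # leave both loops at once
		print "$i x $j\n";
		}
	}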

A common file-reading gotcha

Since foreach goes through each element of a list, some people reach for it when they want to go through each line in a file:

foreach my $line ( <STDIN> ) { ... }

This is usually not a good idea. The foreach needs to have the entire list all at once. This isn’t a lazy construct like you’d see in some other languages. This means that the foreach reads in all of standard input before it does anything. And, if standard input doesn’t close, the program appears to hang. Or worse, it tries to completely read terabytes of data from that filehandle. Memory is cheap, but not that cheap.

A suitable replacement is the while idiom that reads and processes one line at a time:

while( <STDIN> ) { ... }

This is really a shortcut for an assignment in scalar context, which reads only one line from the filehandle:

while( defined( $_ = <STDIN> ) ) { ... }

An experimental convenience

Perl v5.22 added an experimental refaliasing feature. Assigning one reference to another makes the variable behind the reference on the left an alias for the thing the reference on the right points to. Here’s a small demonstration where you assign an anonymous hash to a reference to a named hash variable. Now %h is another name (an alias) for that anonymous hash:

use v5.22;
use feature qw(refaliasing);
use Data::Dumper;

\my %h = { qw(a 1 b 2) };
say Dumper( \%h );

This is handy in a foreach where the elements of the list are hash references. First, here’s how you might do this without the feature. Inside the block you interact with $hash as a reference; you must dereference it to get to a value:

my @mascots = (
	{
		type => 'camel',
		name => 'Amelia',
	},
	{
		type => 'butterfly',
		name => 'Camelia',
	},
	{
		type  => 'go',
		name  => 'Go Gopher',
	},
	{
		type  => 'python',
		name  => 'Monty',
	},
	);
foreach my $hash ( @mascots ) {
	say $hash->{'name'}
	}

With v5.22’s refaliasing feature you can use a named hash variable as the topic. Inside the block you interact with the current element as a named hash. There’s no -> for a dereference:

use v5.22;
use feature qw(refaliasing);
use Data::Dumper;

my @mascots = (
	{
		type => 'camel',
		name => 'Amelia',
	},
	{
		type => 'butterfly',
		name => 'Camelia',
	},
	{
		type  => 'go',
		name  => 'Go Gopher',
	},
	{
		type  => 'python',
		name  => 'Monty',
	},
	);

foreach \my %hash ( @mascots ) {
	say $hash{'name'}
	}

The output is the same in both programs, although the refaliasing version also emits a warning:

Amelia
Camelia
Go Gopher
Monty
Aliasing via reference is experimental at ...

There’s a warning from this experimental feature (as there is for all such features). The feature might change or even disappear according to Perl’s experimental feature policy. Disable the warning if you are comfortable with that:

no warnings qw(experimental::refaliasing);

Conclusion

The foreach is a handy way to go through a list an element at a time. Use it when you already have the list completely constructed (and not to process a filehandle). Define your own topic variable to choose a descriptive name.

Perl Foundation News: Maintaining the Perl 5 Core (Dave Mitchell): Grant Report for August 2018

This is a monthly report by Dave Mitchell on his grant under Perl 5 Core Maintenance Fund. We thank the TPF sponsors for making this grant possible.

I didn't spend all that many hours during August on perl work.

I spent most of my time looking at a bug related to restoring of captures
within regex repeats. During the course of that, I took the opportunity to
simplify and cleanup some of the code in S_regmatch() which deals with
captures, and in particular, make it consistently use macros which produce
debugging output when opening or closing or restoring capture indices.

SUMMARY:
     18:03 RT #133352 Ancient Regex Regression
      1:30 RT #133429 Time-HiRes/t/itimer.t: intermittent failures
      1:36 RT #133441 no assignment to "my" variable  
    ------
     21:09 TOTAL (HH::MM)  


 254.7 weeks
3151.4 total hours
  12.4 average hours per week

There are 315 hours left on the grant

Sebastian Riedel about Perl and the Web: Mojolicious 8.0 released: Perl real-time web framework

I’m excited to announce the release of Mojolicious 8.0 (Supervillain).

This release marks the culmination of a 2 year development cycle, reaching its conclusion last week at Mojoconf in Norway, where we were fortunate enough to have the whole core team present for many very productive discussions and some crazy fun experiments to get async/await working with Perl and Mojolicious.

The project has been growing steadily in the past few years, with many companies relying on Mojolicious to develop new code bases, and even 20 year old behemoths like Bugzilla getting ported to Mojolicious. To support the continued growth we’ve decided to make a few organizational changes. From now on all new development will be consolidated in a single GitHub organization. And our official IRC channel (say hi!) with almost 200 regulars will be moving to Freenode (#mojo on irc.freenode.net), to make it easier for people not yet part of the Perl community to get involved.

When it comes to new features, this is probably our biggest release yet with 26 major features. For a complete overview you can take a look at the slides from my Mojoconf talk or watch the video. Here’s a list of the highlights:

As usual there is a lot more to discover, see Changes on GitHub for the full list of improvements.

Have fun!

Perl Foundation News: Call for Grant Proposals (September 2018 Round)

The Grants Committee is accepting grant proposals all the time. We evaluate them every two months and another evaluation period is upon us.

If you have an idea for doing some Perl work that will benefit the Perl community, please consider submitting a grant application. The application deadline for this round is 23:59 September 30th UTC. We will publish the received applications, get community feedback through October 7th, and conclude acceptance by October 14th.

To apply, please read How to Write a Proposal. GC Charter, Rules of Operation and Running Grants List will also help you understand how the grant process works. We also got some grant ideas from the community.

We will confirm receipt of applications by October 1st.

If you have further questions, please contact me at tpf-grants-secretary at perl-foundation.org.

Update! Many of the links on this page were just fixed to match the recent TPF site refresh, sorry about that!

Perl Foundation News: Rakudo Perl 6 performance analysis tooling - Grant Report #4

The first public release! Code is now hosted in GitHub. Please see the instructions on how to install and run.

The release features a renewed "Routines" tab. Please read Timo's blog post, "The first public release!", to see how it compares to the previous profiler.

Perl Foundation News: Maintaining Perl 5 (Tony Cook): August 2018 Grant Report

This is a monthly report by Tony Cook on his grant under Perl 5 Core Maintenance Fund. We thank the TPF sponsors for making this grant possible.

Approximately 44 tickets were reviewed, and 4 patches were applied.

[Hours]         [Activity]
  1.23          #p5p discussion with khw on shared hash problem
  4.73          #132158 testing, debugging, work on a fix, more testing
                #132158 consider other fixes, testing, comment with
                patches
  1.60          #132655 (sec) work on test code, produce a fix and a
                patch, comment with patch
  0.72          #132683 re-check patch, test and apply to blead
  0.35          #133002 review, make public, merge into 132609
  1.71          #133204 (sec) research, comment
                #133204 (sec) comment
  0.30          #133250 (sec) review, comment
  0.37          #133292 review and briefly comment
  1.87          #133314 work on test code
                #133314 more testing, apply to blead
  8.73          #133326 debugging, work on fixes
                #133326 more work on fixes, testing
                #133326 debugging, clean up, testing
                #133326 polish, commit message, comment with patch
                #133326 re-test, apply to blead
  0.40          #133331 (sec) review, ask for more information
                #133331 (sec) briefly comment
  2.45          #133334 (sec) debugging
                #133334 (sec) debugging
                #133334 (sec) comment
  0.05          #133335 (sec) brief testing and comment
  0.95          #133345 (sec) testing, comment
  0.95          #133376 review patch, test build, testing, apply to blead
  0.07          #133406 (sec) comment and close
  0.55          #133417 review patches, original discussion, comment about
                the commit messages
                #133417 review, research and comment
  2.29          #133422 review ticket, look over code
                #133422 work on a fix
                #133422 more work on fix, testing
  0.38          #133423 (sec) private IRC discussion with khw
  0.30          #133431 review and briefly comment
  0.55          apm821xx cross build thread, research, comment
  0.47          Briefly review khw’s khw-core branch
  0.95          hacking on feature.h thread, review code, testing, comment
  2.07          reply sawyer email on unicode filenames on Win32
  1.88          review maint-votes, add some commits, cherry pick/test two
                commits
  0.37          review security issues, make 133345 public
  0.77          review security queue
  1.53          security queue review/clean up
  0.33          security ticket round up
  1.17          utf8_readline: consider optimizations
  1.38          utf8_readline: croak tests, add another test and find a
                bug
  1.45          utf8_readline: debug/fix surrogate bug, more tests
  2.98          utf8_readline: debugging, fix short handling bug, more
                testing, find surrogate handling issue, debugging, croak
                on surrogate bug
  1.87          utf8_readline: debugging, possible fix, testing
  1.72          utf8_readline: debugging, testing, debug a test failure
                and consider possible fixes
  1.87          utf8_readline: more optimizing non-mutating stream
  2.45          utf8_readline: more optimizing, building/testing + #p5p
                discussion of tainted \p{} property names with khw
  2.08          utf8_readline: more tests
  1.27          utf8_readline: polish, testing
  0.58          utf8_readline: rebase, testing, review
  1.83          utf8_readline: simplify some code based (oldish)
                discussions with khw, testing
  1.90          utf8_readline: work on fix for optimized code consuming
                input incorrectly
  1.53          utf8_readline: work on optimizing non-mutating stream
  2.63          utf8_readline: write more tests, clean up some of the
                polish
======
 65.63 hours total

Dave's Free Press: Journal: Module pre-requisites analyser

Dave's Free Press: Journal: CPANdeps

Dave's Free Press: Journal: Perl isn't dieing

Ocean of Awareness: Parsing left recursions

Left recursion

A lot has been written about parsing left recursion. Unfortunately, much of it simply adds to the mystery. In this post, I hope to frame the subject clearly and briefly.

I expect the reader has some idea of what left recursion is, and perhaps some experience of it as an issue. Informally, left recursion occurs when a symbol expands to something with itself on the left. This can happen directly, for example, if production (1) is in a grammar.


    (1) A ::= A B
    
Indirect left recursion happens when, for example, (2) and (3) are productions in a grammar.

    (2) A ::= B C
    (3) B ::= A D
    
A grammar with productions (4) and (5) has a "hidden" left recursion.

    (4) A ::= B A C
    (5) B ::= # empty
    
This is because <A> ends up leftmost in derivations like:

    (6) A  ⟶ B A C ⟶ A C
    
In derivation (6), production (4) was applied, then production (5).

For those into notation, a grammar is left recursive if and only if it allows a derivation of the form


    (7) A  ⟶+ β A γ  where  β = ε
    
In (7) ε is the empty string, and α ⟶+ β indicates that α derives β in one or more rule applications.

So, OK, what is the problem?

The problem with parsing left recursions is that if you are parsing using a derivation like


    (8) A  ⟶ A B 
    
then you have defined <A> in terms of <A>. All recursions can be a problem, but left recursions are a particular problem because almost all practical parsing methods[1] proceed left to right, and derivations like (8) will lead many of the most popular algorithms straight into an infinite regress.

Why do some algorithms not have a problem?

In a sense, all algorithms which solve the left recursion problem do it in the same way. It is just that in some, the solution appears in a much simpler form.

The solution is at its simplest in Earley's algorithm. That is no coincidence -- as Pingali and Bilardi[2] show, Earley's, despite its daunting reputation, is actually the most basic Chomskyan context-free parsing algorithm, the one from which all others derive.

Earley's builds a table. The Earley table contains an initial Earley set and an Earley set for each token. The Earley set for each token describes the state of the parse after consuming that token. The basic idea is not dissimilar to that of the Might/Darais/Spiewak (MDS) idea of parsing by derivatives, and the logic for building the Earley sets resembles that of MDS.[3]

For the purpose of studying left recursion, what matters is that each Earley set contains Earley "items". Some of the items are called predictions because they predict the occurrence of a symbol at that location in the input.

To record a left recursion in an Earley set, the program adds a prediction item for the left recursive symbol. It is that simple.[4]

Multiple occurrences of a prediction item would be identical, and therefore useless. Therefore subsequent attempts to add the same prediction item are ignored, and recursion does not occur.
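
Here is a minimal Perl sketch of that idea (a toy illustration for this post, not Marpa's own code): predictions go into the current Earley set through a seen-hash, so predicting the left-recursive symbol <A> of "A ::= A B" terminates instead of recursing.

    use strict;
    use warnings;

    # A toy grammar with a direct left recursion:  A ::= A B | 'x'
    my %rules_by_lhs = (
        A => [ [ 'A', 'B' ], ['x'] ],
        B => [ ['y'] ],
    );

    # Compute the prediction closure for an Earley set that predicts <A>.
    my %predicted;      # seen-hash: a prediction is added at most once
    my @work = ('A');
    while (@work) {
        my $symbol = shift @work;
        next if $predicted{$symbol}++;    # duplicate prediction -- ignored
        for my $rhs ( @{ $rules_by_lhs{$symbol} } ) {
            my $leftmost = $rhs->[0];
            push @work, $leftmost if exists $rules_by_lhs{$leftmost};
        }
    }
    print "predicted: ", join( ' ', sort keys %predicted ), "\n";   # predicted: A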

If some have no problem, why do others?

Besides Earley's, a number of other algorithms handle left recursion without any issue -- notably LALR (aka yacc or bison) and LR. This re-raises the original question: why do some algorithms have a left recursion problem?

The worst afflicted algorithms are the "top-down" parsers. The best known of these is recursive descent -- a parsing methodology which, essentially, does parsing by calling a subroutine to handle each symbol. In the traditional implementation of recursive descent, left recursion is very problematic. Suppose that, as part of a recursive descent implementation, you are writing the function to parse the symbol <A>, which you are calling parse_A(). If your grammar has a rule


    (9) A ::= A B
    
the first thing you need to do in a naive implementation of parse_A() is to call parse_A(). And parse_A() will then call parse_A(). And so, in the naive implementation, on and on forever.
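
A minimal Perl sketch of that naive structure (the grammar and the token handling are stand-ins for illustration) makes the regress plain: parse_A() recurses before it ever consumes a token.

    # Naive recursive descent for the left-recursive rule  A ::= A B.
    # parse_A() calls parse_A() before consuming any input, so it
    # recurses until the program runs out of memory or stack.
    sub parse_A {
        my ($tokens) = @_;
        my $a = parse_A($tokens);    # leftmost RHS symbol is <A> itself
        my $b = parse_B($tokens);
        return [ 'A', $a, $b ];
    }
    sub parse_B { my ($tokens) = @_; return ['B'] }    # stand-in

    # parse_A( [ 'x', 'y' ] );   # uncomment to watch the infinite regress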

The fixed-point solution to left recursion

Over the years, many ways to solve the top-down left recursion issue have been announced. The MDS solution is one of the more interesting -- interesting because it actually works[5], and because it describes all the others, including the Earley algorithm solution. MDS reduces the problem to the more general one of finding a "fixed point" of the recursion.

In math, the "fixed point" of a function is an argument of the function which is equal to its value for that argument -- that is, an x such that f(x) = x. MDS describes an algorithm which "solves" the left recursion for its fixed point. That "fixed point" can then be memoized. For example the value of parse_A can be the memoized "fixed point" value of <A>.

The Earley solution of left recursion was, in fact, an optimized "fixed point". The computation of an Earley set is the application of a set of rules for adding Earley items. This continues until no more Earley items can be added. In other words, the rules for building an Earley set are applied until they find their "fixed point".[6]

Other top-down solutions

The MDS fixed point solution does the job, but as described in their paper it requires a functional programming language to implement, and it is expensive. In the worst case, the MDS approach is exponential, although they conjecture that it is linear for a large class of practical grammars.

Top-down algorithms can take an "in-between strategy" -- they can tackle those left recursions that are cheap to find, without computing the full "fixed point". Here a well-defined boundary is crucial: A programmer wants to know if their particular grammar will work, and whether small tweaks to their grammar will break it.

Top-down can be seen as a "guessing" strategy with hacks. Hacks are always needed in top-down, because the input is at the bottom, not the top, and a useful top-down algorithm needs to look at the input. But the hacks can be as simple as lookahead, and lookahead can be implemented without seriously compromising the simplicity and flexibility of the original top-down approach.

With detection and fixing of left-recursion, the "hack" part of the top-down strategy becomes very complicated. The attraction of top-down is its simplicity, and its resulting adaptability to procedural logic. The point can be reached where the original strategy comes into question.

After all, a recursive descent parser can straightforwardly take care of left recursion issues by calling an Earley parser. But in that case, why not simply use Earley's?

Comments, etc.

Marpa is my own implementation of an Earley parser.[7] Comments on this post can be made in Marpa's Google group, or on its IRC channel: #marpa at freenode.net.

Footnotes

1. I probably could have said "all practical parsing methods" instead of "almost all". Right-to-left parsing methods exist, but they see little use. In any case, they only reverse the problem. Parsing in both directions is certainly possible but, as I will show, we do not have to go to quite that much trouble.

2. Keshav Pingali and Gianfranco Bilardi, UTCS tech report TR-2012. 2012. PDF accessed 9 June 2018. Video accessed 9 June 2018. Less accessible is Keshav Pingali and Gianfranco Bilardi, "A graphical model for context-free grammar parsing." Compiler Construction - 24th International Conference, CC 2015. Lecture Notes in Computer Science, Vol. 9031, pp. 3-27, Springer Verlag, 2015. PDF accessed 9 June 2018.

3. Matthew Might, David Darais and Daniel Spiewak. "Functional Pearl: Parsing with Derivatives." International Conference on Functional Programming 2011 (ICFP 2011). Tokyo, Japan. September, 2011. pages 189-195. PDF accessed 9 Jun 2018. Slides accessed 9 June 2018. Video accessed 9 June 2018.

4. Those familiar with Earley's algorithm may note that Earley items are traditionally in terms of productions, not symbols. Symbol predictions are therefore recorded indirectly.

Specifically, Earley items are traditionally duples of (dr, origin), where dr is a dotted production -- a production with a "dot location" marked; and origin is a location in the input. In all predictions origin is the current location, and the dot location is at the start of the production, so there can be at most one prediction per rule. A prediction of a symbol <A> is recorded as a prediction of every production which has <A> on its LHS.

The argument in the main text is made the way it is because it is simpler to speak of "predicted symbols" than to repeatedly refer to "sets of predictions of productions sharing a common LHS".

5. There have been many more attempts than implementations over the years, and even some of the most-widely used implementations have their issues.

6. Recall that potential left recursions are recorded as "predictions" in Earley's algorithm. Predictions recurse, but since they do not depend on the input, they can be precomputed. This means that Earley implementations can bring each Earley set to its fixed point very quickly.

7. Marpa has a stable implementation. For it, and for more information on Marpa, there are these resources:
Marpa website, accessed 25 April 2018.
Kegler's website, accessed 25 April 2018.
Github repo, accessed 25 April 2018.
MetaCPAN, accessed 30 April 2018.
There is also a theory paper for Marpa: Kegler, Jeffrey. "Marpa, A Practical General Parser: The Recognizer.", 2013. PDF accessed 24 April 2018.

Dave's Free Press: Journal: YAPC::Europe 2007 report: day 3

Dave's Free Press: Journal: Devel::CheckLib can now check libraries' contents

Dave's Free Press: Journal: I Love Github

Dave's Free Press: Journal: Palm Treo call db module

Ocean of Awareness: Undershoot: Parsing theory in 1965

The difference between theory and practice is that in theory there is no difference between theory and practice, but in practice, there is.[1]

Once it was taken seriously that humans might have the power to, for example, "read" a chessboard in a way that computers could not beat. This kind of "computational mysticism" has taken a beating. But it survives in one last stronghold -- parsing theory.

In a previous post, I asked "Why is parsing considered solved?" If the state of the art of computer parsing is taken as anything close to its ultimate solution, then it is a case of "human exceptionalism" -- the human brain has some power that makes it much better at parsing than computers can be. It is very unlikely that resorting to human exceptionalism as an explanation would be accepted for any other problem in computer science. Why is it accepted for parsing theory?[2]

The question really requires two separate answers:

  • "Why do practitioners accept the current state of the art as the solution?" and
  • "Why do the theoreticians accept the current state of the art as the solution?"

In one sense, the answer to both questions is the same -- because of the consensus created by Knuth's 1965 paper "On the translation of languages from left to right". In a previous post, I looked at Knuth 1965 and I answered the practitioner question in detail. But, for the sake of brevity, I answered the question about the theoreticians in outline. This post expands on that outline.

The Practitioners

To summarize, in 1965, practitioners accepted the parsing problem as solved for the following reasons.

  • In 1965, every practical parser was stack-driven.
  • As of 1965, stacks themselves were quite leading edge. As recently as 1961, a leading edge article[3] could not assume that its readers knew what "pop" and "push" operations were.
  • An algorithm that combined state transitions and stack operations was already a challenge to existing machines. In 1965, any more complicated algorithm was likely to be unuseable in practice.
  • Last, but not least, the theoreticians assured the practitioners that LR-parsing was either state-of-the-art or beyond, so making more aggressive use of hardware would be futile.

What about the theorists?

The practitioners of 1965, then, were quite reasonable in feeling that LR-parsing was as good as anything they were likely to be able to implement any time soon. And they were being told by the theorists that, in fact, it never would get any better -- there were theoretical limits on parsers that faster hardware could not overcome.

We now know that the theorists were wrong -- there are non-LR parsers which are better than the LR parsers are at LR grammars. What made the theorists go astray?

How theorists work

As the epigraph for this post reminds us, theorists who hope to guide practitioners have to confront a big problem -- theory is practice only in theory. Theoreticians (or at least the better ones, like Knuth) know this, but they try to make theory as reliable a guide to practice as possible.

One of the most important examples of the theoretician's successes is asymptotic notation, which we owe to Knuth[4]. Asymptotic notation is more commonly referred to as big-O notation. The term "asymptotic notation" emphasizes its most dangerous aspect from a practical point of view: Asymptotic notation assumes that the behavior of most interest is the behavior for arbitrarily large inputs.

Practical inputs can be very large but, by definition, they are never arbitrarily large. Results in asymptotic terms might be what is called "galactic" -- they might have relevance only in situations which cannot possibly occur in practice.

Fortunately for computer science, asymptotic results usually are not "galactic". Most often asymptotic results are not only relevant to practice -- they are extremely relevant. Wikipedia pages for algorithms put the asymptotic complexities in special displays, and these displays are one of the first things that some practitioners look at.

Bracketing

Since coming up with a theoretical model that is equivalent to "practical" is impossible, theoreticians often work like artillerists. Artillerists often deliberately overshoot and undershoot, before they "fire for effect". "Bracketing" their target in this way has disadvantages -- it reduces the element of surprise, and can even allow the enemy to get their counter-fire in first. But, nasty as these consequences could be, the advantage in accuracy is usually held to outweigh them.

The practice of theoretical computer science is less risky, which makes "bracketing" a very attractive approach to tricky problems. Theoreticians often try to "bracket" practice between an "undershoot" and an "overshoot". The undershoots are models simple and efficient enough to be practical, but too weak to capture all the needs of practice. The overshoots are models which capture everything a practitioner needs, but which are too complicated and/or too resource-intensive for practice.

The P vs. NP problem is an active example of a bracketing technique. You will sometimes read that the P/NP boundary is expected to be that between practical and impractical, but this is an extreme simplification. P includes complexities like O(n^1000000), where the complexity for even n == 2 is a number which, in decimal form, fills many pages. Modulo bold advances in quantum computing, I cannot imagine that O(n^1000000) will ever be practical. And you can make the complexities much harder than O(n^1000000) without ever reaching P-hard.

So P-hard is beyond any reasonable definition of "practical" -- it is an "overshoot". But the P vs. NP question is almost certainly very relevant to what is "practical". Resolving the P vs. NP question is likely to be an important or even necessary step. It is a mystery that such a seemingly obvious question has resisted the best efforts of the theoreticians for so long, and the solution of P vs. NP is likely to bring new insights into asymptotic complexity.

Bracketing practical parsing

When Knuth published his 1965, "practical parsing" was already bracketed. On the overshoot side, Irons had already published a parser for context-free grammars. Worst case, this ran in exponential time, and it was, and remains, expected that general context-free parsing was not going to be practical.[5]

On the undershoot side, there were regular expressions and recursive descent. Regular expressions are fast and very practical, but parse a very limited set of grammars. Recursive descent is also fast and, since it parses a larger set of grammars, was the closest undershoot.

Mistake 1: The misdefinition of "language"

To curry respect from the behaviourists, American linguistics for many years banned any reference to meaning. Behaviorists looked down on hypothesized mental states as not worthy of "science", and it's hard to have a theory of meaning without conjectures about mental states. Without mental states, language was just a set of utterances. So in 1926 the linguist Leonard Bloomfield dutifully defined a "language" as a set of "utterances" (for our purposes, "strings"), and through the 30s and 40s most American linguists followed him.

After a brief nod to this tradition, Noam Chomsky restored sanity to linguistics. But it was too late for computer science. Automata theory adopted the semantics-free definition. In 1965, Knuth inherited a lot of prior work, almost all of which ignored, not just meaning or semantics, but even syntax and structure.[6]

Language extension versus language intension

Knuth, of course, wanted to make contact with prior art. The definition he had inherited seemed to work well enough and Knuth's 1965 defines a language as a set of strings. Most subsequent work has refused to breach this tradition.

In most people's idea of what a language is, the utterances/strings mean something -- you cannot take just any set of meaningless strings and call it a language. So the parsing theorists and everybody else had two different definitions of language.

But parsing theory also hoped to produce results relevant to practice, and few people are interested in recognizing meaningless strings -- almost everybody who parses is interested in (at a minimum) finding some kind of structure in what they parse, in order to do something with the result of the parse. Parsing theorists ended up using the word "language" in one sense, but implying that results they found worked for the word "language" in the usual sense.

At this point both senses of the word "language" have gotten entrenched in parsing theory. Instead of making up a new terminology for this blog post, I will borrow a distinction from linguistics and speak of the extension of a language and the intension of a language. The extension of a language is the Bloomfieldian defintion -- the set of utterances/strings in the language. The intension of a language, for our purposes here, can be regarded as its BNF grammar. Each language intension will have (if it is well-defined) exactly one extension. But multiple language intensions can have the same extension.

Red Herring 1: The stack machine model as a natural boundary

The temptation to use language extensions as a proxy for LR-grammars must have been overwhelming. It turns out that the language extension of deterministic stack machines is exactly that of the LR grammars. Further, the language extension of the context-free grammars is exactly that of the non-deterministic stack machines. (Non-deterministic stack machines are stack machines which can "fork" new instances of themselves on the fly.)

If you take language extensions as the proxy for grammars, things fall into place very neatly: the LR-parsers are the deterministic subset of the context-free parsers. And "deterministic" seemed like a very good approximation of practical. Certainly non-deterministic parsing is probably not practical. And the best practical parsers in 1965 were deterministic stack parsers.

Viewed this way, LR-parsing looked like the equivalent of practical parsing. It was a "direct hit", or as close to an exact equivalent of practical parsing as theory was going to get.

As we shall see, with this red herring, the reasoning went astray. But disaster was not inevitable. The whole point of bracketing, after all, is that it allows you to correct errors. Another red herring, however, resulted in parsing theory going on a decades-long wrong turn.

Red Herring 2: LR parsers are not good at LR grammars

The second red herring led to the mis-bracketing of practical parsing. Having seemingly established that LR-parsing is a natural boundary in the hierarchy of languages, Knuth discovered that general LR-parsers were very far from practical. LR parsing goes out to LR(k) for arbitrary k, but even LR(1) parsing was impractical in 1965 -- in fact, it is rare in practical use today. As the k in LR(k) grows, the size of the tables grows exponentially, while the value of the additional lookahead rapidly diminishes. It is not likely that LR(2) parsing will ever see much practical use, never mind LR(k) for any k greater than 2.

From this it was concluded that LR-parsing is an overshoot. In reality, as Joop Leo was to show, it is an undershoot, and in practical terms a very large one. If you mistake an undershoot for an overshoot, bracketing no longer works, and you are not likely to hit your target.

The Wrong Turn

Summing up, parsing theorists concluded, based on the results of Knuth 1965, that

  • LR-parsing is a good approximation to practical parsing -- it brackets it closely.
  • LR-parsing is an overshoot.
  • A subset of LR-parsing will be the solution to the parsing problem.

Signs of trouble ignored

There were, in hindsight, clear signs that LR language extensions were not a good proxy for LR grammars. LR grammars form a hierarchy -- for every k≥0, there is an LR grammar which is LR(k+1), but which is not LR(k).

But if you look at extensions instead of grammars, the hierarchy immediately collapses -- every LR(k) language extension is also an LR(1) language extension, as long as k≥1. Only LR(0) remains distinct.

It gets worse. In most practical applications, you can add an end-of-input marker to a grammar. If you do this the LR extension hierarchy collapses totally -- every LR(k) language extension is also an LR(0) language extension.

In short, it seems that, as a proxy for LR grammars, LR language extensions are likely to be completely worthless.

Why didn't Knuth see the problem?

Why didn't Knuth see the problem? Knuth certainly noted the strange behavior of the LR hierarchy in extensional terms -- he discovered it, and devoted several dense pages of his 1965 to laying out the complicated mathematics involved.

So why did Knuth expect to get away with punning intension and extension, even in the face of some very unsettling results? Here, the answer is very simple -- "punning" had always worked before.

Regular expressions are easily turned into parsers[7], so the language extension of a regular grammar is an adequate approximation to its intension. Context-free recognition has the same complexity, and in practice uses the same algorithms, as context-free parsing, so here again, language extension is a good approximation of language intension.

And the LL language extensions follow a strict hierarchy -- for every k≥0, LL(k+1) is a proper superset of LL(k). This fact forces LL grammars to follow the same hierarchy[8]. So, when studying complexity, LL language extensions are an excellent proxy for LL grammars.

Based on past experience, Knuth had reason to believe he could use language extensions as a proxy for grammars, and that the result would be a theory that was a reliable guide to practice.

Aftermath

In my timeline of parsing, I describe what happened next. Briefly, theory focused on finding a useful subset of LR(1). One, LALR, became the favorite and the basis of the yacc and bison tools.

Research into parsing of supersets of LR became rare. The theorists were convinced that LR parsing was the solution. They were so convinced that when, in 1991, Joop Leo discovered a practical way to parse an LR superset, the result went unimplemented for decades.

In 1965, the theoreticians gave a lot of weight to the evidence from the world of practice, but probably not undue weight. Going forward, it was a different story.

Leo had, in essence, disproved the implied conjecture of Knuth 1965. But the question is not an explicit mathematical question, like that of P vs. NP. It is a slipperier one -- capturing practice. Practitioners left it to the theoreticians to keep up with the literature. But theoreticians, as long as LR-superset methods did not come into use in the world of practice, felt no need to revisit their conclusions.

Comments, etc.

I encourage those who want to know more about the story of Parsing Theory to look at my Parsing: a timeline 3.0. To learn about Marpa, my Earley/Leo-based parsing project, there is the semi-official web site, maintained by Ron Savage. The official, but more limited, Marpa website is my personal one. Comments on this post can be made in Marpa's Google group, or on our IRC channel: #marpa at freenode.net.

Footnotes

1. Attributed to Jan L. A. van de Snepscheut and Yogi Berra. See https://en.wikiquote.org/wiki/Jan_L._A._van_de_Snepscheut, accessed 1 July 2018. I quote my preferred form of this -- the one it takes in Doug Rosenberg and Matt Stephens, Use Case Driven Object Modeling with UML: Theory and Practice, 2007, p. xxvii. Rosenberg and Stephens is also the accepted authority for its attribution.

2. As an aside, I am open to the idea that the human mind has abilities that Turing machines cannot improve on or even duplicate. When it comes to survival heuristics tied to the needs of human bodies, for example, it seems very reasonable to at least entertain the conjecture that the human mind might be near-optimal, particularly in big-O terms. But when it comes to ability to solve problems which can be formalized as "puzzles" -- and syntactic analysis is one of these -- I think that resort to human exceptionalism is a sign of desperation.

3. Oettinger, Anthony. "Automatic Syntactic Analysis and the Pushdown Store", Proceedings of Symposia in Applied Mathematics, Volume 12, American Mathematical Society, 1961. Oettinger describes "push" and "pop" stack operations in "Iversion notation" -- what later became APL. See the discussion of Oettinger in my "Why is parsing considered solved?" post.

4. Knuth did not invent asymptotic notation -- it comes from calculus -- but he introduced it to computer science and motivated its use.

5. The best lower bound for context-free parsing is still O(n). So it is even possible that there is a practical linear-time general context-free parser. But its discovery would be a big surprise.

6. In another blog post, I talk about the use of the word "language" in parsing theory in much more detail.

7. For example, regular expressions can be extended with "captures". Captures cannot handle recursion, but neither can regular expressions, so captures are usually sufficient to provide all the structure an application wants.

8. The discussion of the LL(k) hierarchy is in a sense anachronistic -- the LL(k) hierarchy was not studied until after 1965. But Knuth certainly was aware of recursive descent, and it seems reasonable to suppose that, even in 1965, he had a sense of what the LL hierarchy would look like.

Dave's Free Press: Journal: Graphing tool

Dave's Free Press: Journal: Travelling in time: the CP2000AN

Dave's Free Press: Journal: XML::Tiny released

Dave's Free Press: Journal: YAPC::Europe 2007 report: day 1

Dave's Free Press: Journal: Thanks, Yahoo!

Dave's Free Press: Journal: YAPC::Europe 2007 report: day 2

Dave's Free Press: Journal: YAPC::Europe 2007 travel plans

Dave's Free Press: Journal: Wikipedia handheld proxy

Dave's Free Press: Journal: Bryar security hole

Dave's Free Press: Journal: POD includes

Dave's Free Press: Journal: cgit syntax highlighting

Ocean of Awareness: A Haskell challenge

The challenge

A recent blog post by Michael Arntzenius ended with a friendly challenge to Marpa. Haskell list comprehensions are something that Haskell's own parser handles only with difficulty. A point of Michael's critique of Haskell's parsing was that Haskell's list comprehension could be even more powerful if not for these syntactic limits.

Michael wondered aloud if Marpa could do better. It can.

The problem syntax occurs with the "guards", a very powerful facility of Haskell's list comprehension. Haskell allows several kinds of "guards". Two of these "guards" can have the same prefix, and these ambiguous prefixes can be of arbitrary length. In other words, parsing Haskell's list comprehension requires either lookahead of arbitrary length, or its equivalent.

To answer Michael's challenge, I extended my Haskell subset parser to deal with list comprehension. That parser, with its test examples, is online.[1] I have run it for examples thousands of tokens long and, more to the point, have checked the Earley sets to ensure that Marpa will stay linear, no matter how long the ambiguous prefix gets.[2]

Earley parsing, which Marpa uses, accomplishes the seemingly impossible here. It does the equivalent of infinite lookahead efficiently, without actually doing any lookahead or backtracking. That Earley's algorithm can do this has been a settled fact in the literature for some time. But today Earley's algorithm is little known even among those well acquainted with parsing, and to many claiming the equivalent of infinite lookahead, without actually doing any lookahead at all, sounds like a boast of magical powers.

In the rest of this blog post, I hope to indicate how Earley parsing follows more than one potential parse at a time. I will not describe Earley's algorithm in full.[3] But I will show that no magic is involved, and that in fact the basic ideas behind Earley's method are intuitive and reasonable.

A quick cheat sheet on list comprehension

List comprehension in Haskell is impressive. Haskell allows you to build a list using a series of "guards", which can be of several kinds. The parsing issue arises because two of the guard types -- generators and boolean expressions -- must be treated quite differently, but can look the same over an arbitrarily long prefix.

Generators

Here is one example of a Haskell generator, from the test case for this blog post:


          list = [ x | [x, 1729,
		      -- insert more here
		      99
		   ] <- xss ] [4]

This says to build a list of x's such that the guard [x, 1729, 99 ] <- xss holds. The clue that this guard is a generator is the <- operator. The <- operator will appear in every generator, and means "draw from".

The LHS of the <- operator is a pattern and the RHS is an expression. This generator draws all the elements from xss which match the pattern [x, 1729, 99 ]. In other words, it draws out all the elements of xss, and tests that they are lists of length 3 whose last two subelements are 1729 and 99.

The variable x is set to the 1st subelement. list will be a list of all those x's. In the test suite, we have


    xss = [ [ 42, 1729, 99 ] ] [5]

so that list becomes [42] -- a list of one element whose value is 42.

Boolean guards

Generators can share very long prefixes with Boolean guards.


	list2 = [ x | [x, 1729, 99] <- xss,
               [x, 1729,
                  -- insert more here
                  99
               ] == ys,
             [ 42, 1729, 99 ] <- xss
             ] [6]

The expression defining list2 has 3 comma-separated guards: The first guard is a generator, the same one as in the previous example. The last guard is also a generator.

The middle guard is of a new type: it is a Boolean: [x, 1729, 99 ] == ys. This guard insists that x be such that the triple [x, 1729, 99 ] is equal to ys.

In the test suite, we have


    ys = [ 42, 1729, 99 ] [7]
so that list2 is also [42].

Boolean guards versus generators

From the parser's point of view, Boolean guards and generators start out looking the same -- in the examples above, three of our guards start out the same -- with the string [x, 1729, 99 ], but

  • in one case (the Boolean guard), [x, 1729, 99 ] is the beginning of an expression; and
  • in the other two cases (the generators), [x, 1729, 99 ] is a pattern.
Clearly patterns and expressions can look identical. And they can look identical for an arbitrarily long time -- I tested the Glasgow Haskell Compiler (GHC) with identical expression/pattern prefixes thousands of tokens in length. My virtual memory eventually gives out, but GHC itself never complains.[8] (The comments "insert more here" show the points at which the comma-separated lists of integers can be extended.)

The problem for parsers

So Haskell list comprehension presents a problem for parsers. A parser must determine whether it is parsing an expression or a pattern, but it cannot know this for an arbitrarily long time. A parser must keep track of two possibilities at once -- something traditional parsing has refused to do. As I have pointed out[9], belief that traditional parsing "solves" the parsing problem is belief in human exceptionalism -- that humans have calculating abilities that Turing machines do not. Keeping two possibilities in mind for a long time is trivial for human beings -- in one form we call it worrying, and try to prevent ourselves from doing it obsessively. But it has been the orthodoxy that practical parsing algorithms cannot do this.

Arntzenius has a nice summary of the attempts to parse this construct while only allowing one possibility at a time -- that is, deterministically. Lookahead clearly cannot work -- it would have to be arbitrarily long. Backtracking can work, but can be very costly and is a major obstacle to quality error reporting.

GHC avoids the problems with backtracking by using post-processing. At parsing time, GHC treats an ambiguous guard as a Boolean. Then, if it turns out that the guard is a generator, it rewrites it in post-processing. This inelegance incurs some real technical debt -- either a pattern must always be a valid expression, or even more trickery must be resorted to.[10]

The Earley solution

Earley parsing deals with this issue by doing what a human would do -- keeping both possibilities in mind at once. Jay Earley's innovation was to discover a way for a computer to track multiple possible parses that is compact, efficient to create, and efficient to read.

Earley's algorithm maintains an "Earley table" which contains "Earley sets", one for each token. Each Earley set contains "Earley items". Here are some Earley items from Earley set 25 in one of our test cases:


	origin = 22; <atomic expression> ::=   '[' <expression> '|' . <guards> ']'
	origin = 25; <guards> ::= . <guard>
	origin = 25; <guards> ::= . <guards> ',' <guard>
	origin = 25; <guard>  ::= . <pattern> '<-' <expression>
	origin = 25; <guard>  ::= . <expression> [11]

In the code, these represent the state of the parse just after the pipe symbol ("|") on line 4 of our test code.

Each Earley item describes progress in one rule of the grammar. There is a dot (".") in each rule, which indicates how far the parse has progressed inside the rule. One of the rules has the dot just after the pipe symbol, as you would expect, since we have just seen a pipe symbol.

The other four rules have the dot at the beginning of the RHS. These four rules are "predictions" -- none of their symbols have been parsed yet, but we know that these rules might occur, starting at the location of this Earley set.

Each item also records an "origin": the location in the input where the rule described in the item began. For predictions the origin is always the same as the Earley set. For the first Earley item, the origin is 3 tokens earlier, in Earley set 22.
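
As a concrete (and much simplified) illustration, here's one way to represent such items in Perl -- a sketch for this post, not Marpa's own representation. Each item is a rule, a dot position, and an origin, and a string key makes duplicate items easy to reject:

    use strict;
    use warnings;

    # An Earley item: which rule, how far the dot has advanced, and the
    # Earley set where the rule's recognition started (the origin).
    sub make_item {
        my ( $rule, $dot, $origin ) = @_;
        return { rule => $rule, dot => $dot, origin => $origin };
    }

    # One Earley set, with a seen-hash so identical items are added only once.
    my ( @set, %seen );
    sub add_item {
        my ($item) = @_;
        my $key = join "\0", @{$item}{qw(rule dot origin)};
        return if $seen{$key}++;    # already present -- adding it again is useless
        push @set, $item;
    }

    # Two live possibilities at location 25, as in the display above.
    add_item( make_item( 'guard ::= pattern <- expression', 0, 25 ) );
    add_item( make_item( 'guard ::= expression',            0, 25 ) );
    add_item( make_item( 'guard ::= expression',            0, 25 ) );   # ignored
    print scalar(@set), " items in the set\n";    # 2 items in the set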

The "secret" of non-determinism

And now we have come to the secret of efficient non-deterministic parsing -- a "secret" which I hope to convince the reader is not magic, or even much of a mystery. Here, again, are two of the items from Earley set 25:


	origin = 25; <guard>  ::= . <pattern> '<-' <expression>
	origin = 25; <guard>  ::= . <expression>  [12]

At this point there are two possibilities going forward -- a generator guard or a Boolean expression guard. And there is an Earley item for each of these possibilities in the Earley set.

That is the basic idea -- that is all there is to it. Going forward in the parse, for as long as both possibilities stay live, Earley items for both will appear in the Earley sets.

From this point of view, it should now be clear why the Earley algorithm can keep track of several possibilities without lookahead or backtracking. No lookahead is needed because all possibilities are in the Earley set, and selection among them will take place as the rest of the input is read. And no backtracking is needed because every possibility was already recorded -- there is nothing new to be found by backtracking.

It may also be clearer why I claim that Marpa is left-eidetic, and how the Ruby Slippers work.[13] Marpa has perfect knowledge of everything in the parse so far, because it is all in the Earley tables. And, given left-eidetic knowledge, Marpa also knows what terminals are expected at the current location, and can "wish" them into existence as necessary.

The code, comments, etc.

A permalink to the full code and a test suite for this prototype, as described in this blog post, is on Github. In particular, the permalink of the test suite file for list comprehension is here. I expect to update this code, and the latest commit can be found here.

To learn more about Marpa, a good first stop is the semi-official web site, maintained by Ron Savage. The official, but more limited, Marpa website is my personal one. Comments on this post can be made in Marpa's Google group, or on our IRC channel: #marpa at freenode.net.

Footnotes

1. If you are interested in my Marpa-driven Haskell subset parser, this blog post may be the best introduction. The code is on Github.

2. The Earley sets for the ambiguous prefix immediately reach a size of 46 items, and then stay at that level. This is experimental evidence that the Earley set sizes stay constant.

And, if the Earley items are examined, and their derivations traced, it can be seen that they must repeat the same Earley item count for as long as the ambiguous prefix continues. The traces I examined are here, and the code which generated them is here, for the reader who wants to convince himself.

The guard prefixes of Haskell are ambiguous, but (modulo mistakes in the standards) the overall Haskell grammar is not. In the literature on Earley's, it has been shown that for an unambiguous grammar, each Earley item has a constant amortized cost in time. Therefore, if a parse produces Earley sets that are all below a constant size, it must have linear time complexity.

3. There are many descriptions of Earley's algorithm out there. The Wikipedia page on Earley's algorithm (accessed 27 August 2018) is one good place to start. I did another very simple introduction to Earley's in an earlier blog post, which may be worth looking at. Note that Marpa contains improvements to Earley's algorithm. Particularly, to fulfill Marpa's claim of linear time for all LR-regular grammars, Marpa uses Joop Leo's speed-up. But Joop's improvement is not necessary or useful for parsing Haskell list comprehension, is not used in this example, and will not be described in this post.

4. Permalink to this code, accessed 27 August 2018.

5. Permalink to this code, accessed 27 August 2018.

6. Permalink to this code, accessed 27 August 2018.

7. Permalink to this code, accessed 27 August 2018.

8. Note that if the list is extended, the pattern matches and Boolean tests fail, so that 42 is no longer the answer. From the parsing point of view, this is immaterial.

9. In several places, including this blog post.

10. This account of the state of the art summarizes Arntzenius's recent post, which should be consulted for the details.

11. Adapted from this trace output, accessed 27 August 2018.

12. Adapted from this trace output, accessed 27 August 2018.

13. For more on the Ruby Slippers, see my immediately previous blog post.

Dave's Free Press: Journal: CPAN Testers' CPAN author FAQ

Ocean of Awareness: Marpa and combinator parsing 2

In a previous post, I outlined a method for using the Marpa algorithm as the basis for better combinator parsing. This post follows up with a trial implementation.

For this trial, I chose the most complex example from the classic 1996 tutorial on combinator parsing by Hutton and Meijer[1]. Their example implements the offside-rule parsing of a functional language -- parsing where whitespace is part of the syntax.[2] The Hutton and Meijer example is for Gofer, a now obsolete implementation of Haskell. To make the example more relevant, I wrote a parser for Haskell layout according to the Haskell 2010 Language Report[3].

For tests, I used the five examples (2 long, 3 short) provided in the 2010 Report[4], and the four examples given in the "Gentle Introduction" to Haskell[5]. I implemented only enough of the Haskell syntax to run these examples, but this was enough to include a substantial subset of Haskell's syntax.

This description of the implementation includes many extracts from the code. For those looking for more detail, the full code and test suite for this example are on Github. While the comments in the code do not amount to a tutorial, they are extensive. Readers who like to "peek ahead" are encouraged to do so.

Layout parsing and the off-side rule

It won't be necessary to know Haskell to follow this post. This section will describe Haskell's layout informally. Briefly, these two code snippets should have the same effect:


       let y   = a*b
           f x = (x+y)/y
       in f c + f d [6]

       let { y   = a*b
           ; f x = (x+y)/y
           } [7]

In my test suite, both code snippets produce the same AST. The first code display uses Haskell's implicit layout parsing, and the second code display uses explicit layout. In each, the "let" is followed by a block of declarations (symbol <decls>). Each <decls> contains one or more declarations (symbol <decl>). For the purposes of determining layout, Haskell regards <decls> as a "block", and each <decl> as a block "item". In both displays, there are two items in the block. The first <decl> item is y = a*b, and the second is f x = (x+y)/y.

In explicit layout, curly braces surround the block, and semicolons separate each item. Implicit layout follows the "offside rule": The first element of the laid out block determines the "block indent". The first non-whitespace character of every subsequent non-empty line determines the line indent. The line indent is compared to the block indent.

  • If the line indent is deeper than the block indent, then the line continues the current block item.
  • If the line indent is equal to the block indent, then the line starts a new block item.
  • If the line indent is less than the block indent (an "outdent"), then the line ends the block. An end of file also ends the block.
Lines containing only whitespace are ignored. Comments count as whitespace.
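
The comparison of line indent against block indent can be sketched as a small Perl function. This is schematic only -- the actual parser makes this decision inside its read loop, not in a separate subroutine:

      sub offside_action {
          my ( $block_indent, $line_indent ) = @_;
          return 'continue_item' if $line_indent > $block_indent;     # deeper: line continues the current item
          return 'new_item'      if $line_indent == $block_indent;    # equal: line starts a new item
          return 'end_block';                                         # outdent (or EOF): block ends
      }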

Explicit semicolons can be used in implicit layout: If a semicolon occurs in implicit layout, it separates block items. In our test suite, the example


       let y   = a*b;  z = a/b
           f x = (x+y)/z
       in f c + f d [8]
contains three <decl> items.

The examples in the displays above are simple. The two long examples from the 2010 Report are more complicated: 6 blocks of 4 different kinds, with nesting twice reaching a depth of 4. The two long examples are the same program, except that one uses implicit layout and the other uses explicit layout. In the test of my Haskell subset parser, both produce identical ASTs.

There are additional rules, including for tabs, Unicode characters and multi-line comments. These rules are not relevant in the examples I took from the Haskell literature; they present no theoretical challenge to this parsing method; and they are not implemented in this prototype Haskell parser.

The strategy

To tackle Haskell layout parsing, I use a separate combinator for each layout block. Every combinator, therefore, has its own block and item symbols, and its own block indent; and each combinator implements exactly one method of layout -- explicit or implicit.

From the point of view of its parent combinator, a child combinator is a lexeme, and the parse tree it produces is the value of the lexeme. Marpa can automatically produce an AST, and it adds lexeme values to the AST as leaves. The effect is that Marpa automatically assembles a parse tree for us from the tree segments returned by the combinators.
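
Schematically, the hand-off looks like this in Perl. Here parse_block(), $input_ref, $child_start and $child_length are invented names standing in for the child combinator and the span of input it consumed; only lexeme_read() is the real Marpa::R2 call:

      my ( $child_ast, $child_length ) = parse_block( $input_ref, $child_start );
      # The child's parse tree becomes the value of a single lexeme in the parent.
      $recce->lexeme_read( 'ruby_i_decls', $child_start, $child_length, $child_ast );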

Ruby Slippers semicolons

In outlining this algorithm, I will start by explaining where the "missing" semicolons come from in the implicit layout. Marpa allows various kinds of "events", including on discarded tokens. ("Discards" are tokens thrown away, and not used in the parse. The typical use of token discarding in Marpa is for the handling of whitespace and comments.)

The following code sets an event named 'indent', which happens when Marpa finds a newline followed by zero or more whitespace characters.[9] This does not capture the indent of the first line of a file, but that is not an issue -- the 2010 Report requires that the first indent be treated as a special case anyway.

      :discard ~ indent event => indent=off
      indent ~ newline whitechars [10]
      

Indent events, like others, occur in the main read loop of each combinator. Outdents and EOFs are dealt with by terminating the read loop.[11] Line indents deeper than the current block indent are dealt with by resuming the read loop. [12] Line indents equal to the block indent trigger the reading of a Ruby Slippers semicolon as shown in the following:


	$recce->lexeme_read( 'ruby_semicolon', $indent_start,
	    $indent_length, ';' ) [13]
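
Putting the pieces of this section together, the per-combinator read loop can be sketched roughly as follows. This is not the actual code: indent_span() is a hypothetical helper standing in for however the real code recovers the span of the discarded indent, and $block_indent and $input_ref are assumed to be set up by the enclosing combinator:

      READ: while (1) {
          my $pos = $recce->resume();
          last READ if $pos >= length ${$input_ref};        # end of file ends the block
          EVENT: for my $event ( @{ $recce->events() } ) {
              my ($event_name) = @{$event};
              next EVENT if $event_name ne 'indent';
              my ( $indent_start, $indent_length ) = indent_span( $recce, $event );
              my $line_indent = $indent_length - 1;         # the indent is a newline plus whitechars
              last READ  if $line_indent < $block_indent;   # outdent: end of block
              next EVENT if $line_indent > $block_indent;   # deeper: current item continues
              # equal indent: wish a semicolon into existence
              $recce->lexeme_read( 'ruby_semicolon', $indent_start, $indent_length, ';' );
          }
      }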
    
    

Ruby Slippers

In Marpa, a "Ruby Slippers" symbol is one which does not actually occur in the input. Ruby Slippers parsing is new with Marpa, and made possible because Marpa is left-eidetic. By left-eidetic, I mean that Marpa knows, in full detail, about the parse to the left of its current position, and can provide that information to the parsing app. This implies that Marpa also knows which tokens are acceptable to the parser at the current location, and which are not.

Ruby Slippers parsing enables a very important trick which is useful in "liberal" parsing -- parsing where certain elements might be in some sense "missing". With the Ruby Slippers you can design a "liberal" parser with a "fascist" grammar. This is, in fact, how the Haskell 2010 Report's context-free grammar is designed -- the official syntax requires explicit layout, but Haskell programmers are encouraged to omit most of the explicit layout symbols, and Haskell implementations are required to "dummy up" those symbols in some way. Marpa's method for doing this is left-eideticism and Ruby Slippers parsing.

The term "Ruby Slippers" refers to a widely-known scene in the "Wizard of Oz" movie. Dorothy is in the fantasy world of Oz, desperate to return to Kansas. But, particularly after a shocking incident in which orthodox Oz wizardry is exposed as an affable fakery, she is completely at a loss as to how to escape. The "good witch" Glenda appears and tells Dorothy that in fact she's always had what she's been wishing for. The Ruby Slippers, which she had been wearing all through the movie, can return her to Kansas. All Dorothy needs to do is wish.

In Ruby Slippers parsing, the "fascist" grammar "wishes" for lots of things that may not be in the actual input. Procedural logic here plays the part of a "good witch" -- it tells the "fascist" grammar that what it wants has been there all along, and supplies it. To do this, the procedural logic has to have a reliable way of knowing what the parser wants. Marpa's left-eideticism provides this.

Ruby Slippers combinators

This brings us to a question I've postponed -- how do we know which combinator to call when? The answer is Ruby Slippers parsing. First, here are some lexer rules for "unicorn" symbols. We use unicorns when symbols need to appear in Marpa's lexer, but must never be found in actual input.


      :lexeme ~ L0_unicorn
      L0_unicorn ~ unicorn
      unicorn ~ [^\d\D]
      ruby_i_decls ~ unicorn
      ruby_x_decls ~ unicorn [14]
    
    

<unicorn> is defined to match [^\d\D]. This pattern is all the symbols which are not digits and not non-digits -- in other words, it's impossible that this pattern will ever match any character. The rest of the statements declare other unicorn lexemes that we will need. <unicorn> and <L0_unicorn> are separate, because we need to use <unicorn> on the RHS of some lexer rules, and a Marpa lexeme can never occur on the RHS of a lexer rule.[15]
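
As a quick sanity check (not part of the parser), you can verify in Perl that the unicorn character class never matches anything:

      my $matches = grep { /[^\d\D]/ } map { chr } 0 .. 255;
      print "unicorn matches: $matches\n";    # always prints "unicorn matches: 0"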

In the Marpa rule below,

  • <decls> is the symbol from the 2010 Report;
  • <ruby_i_decls> is a Ruby Slippers symbol for a block of declarations with implicit layout;
  • <ruby_x_decls> is a Ruby Slippers symbol for a block of declarations with explicit layout; and
  • <laidout_decls> is a symbol (not in the 2010 Report) covering all the possibilities for a block of declarations.

      laidout_decls ::= ('{') ruby_x_decls ('}')
	       | ruby_i_decls
	       | L0_unicorn decls L0_unicorn [16]
    

It is the expectation of a <laidout_decls> symbol that causes child combinators to be invoked. Because <L0_unicorn> will never be found in the input, the <decls> alternative will never match -- it is there for documentation and debugging reasons.[17] Therefore Marpa, when it wants a <laidout_decls>, will look for a <ruby_x_decls> if an open curly brace is read; and a <ruby_i_decls> otherwise. Neither <ruby_x_decls> nor <ruby_i_decls> will ever be found in the input, and Marpa will reject the input, causing a "rejected" event.

Rejected events

In this code, as often, the "good witch" of Ruby Slippers does her work through "rejected" events. These events can be set up to happen when, at some parse location, none of the tokens that Marpa's internal lexer finds are acceptable.

In the "rejected" event handler, we can use Marpa's left eideticism to find out what lexemes Marpa would consider acceptable. Specifically, there is a terminals_expected() method which returns a list of the symbols acceptable at the current location.


            my @expected =
              grep { /^ruby_/xms; } @{ $recce->terminals_expected() }; [18]
    

Once we "grep" out all but the symbols with the "ruby_" prefix, there are only 4 non-overlapping possibilities:

  • Marpa expects a <ruby_i_decls> lexeme;
  • Marpa expects a <ruby_x_decls> lexeme;
  • Marpa expects a <ruby_semicolon> lexeme;
  • Marpa does not expect any of the Ruby Slippers lexemes.

If Marpa does not expect any of the Ruby Slippers lexemes, there was a syntax error in the Haskell code.[19]

If a <ruby_i_decls> or a <ruby_x_decls> lexeme is expected, a child combinator is invoked. The Ruby Slippers symbol determines whether the child combinator looks for implicit or explicit layout. In the case of implicit layout, the location of the rejection determines the block indent.[20]

If a <ruby_semicolon> is expected, then the parser is at the point where a new block item could start, but none was found. Whether the block was implicit or explicit, this indicates we have reached the end of the block, and should return control to the parent combinator.[21]
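
A rough sketch of this dispatch, in Perl, might look like the following. The helper names child_combinator() and rejection_column() are invented for the sketch; terminals_expected() is the real Marpa::R2 call:

      my %expected = map { $_ => 1 }
          grep { /^ruby_/xms } @{ $recce->terminals_expected() };
      if ( $expected{ruby_i_decls} ) {
          # implicit layout: the column of the rejection becomes the block indent
          child_combinator( 'implicit', rejection_column($recce) );
      }
      elsif ( $expected{ruby_x_decls} ) {
          child_combinator('explicit');
      }
      elsif ( $expected{ruby_semicolon} ) {
          return;    # end of this block: hand control back to the parent combinator
      }
      else {
          die 'Syntax error in the Haskell input';
      }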

To explain why <ruby_semicolon> indicates end-of-block, we look at both cases. In the case of an explicit layout combinator, the rejection should have been caused by a closing curly brace, and we return to the parent combinator and retry it. In the parent combinator, the closing curly brace will be acceptable.

If we experience a "rejected" event while expecting a <ruby_semicolon> in an implicit layout combinator, it means we did not find an explicit semicolon; and we also never found the right indent for creating a Ruby semicolon. In other words, the indentation is telling us that we are at the end of the block. We therefore return control to the parent combinator.

Conclusion

With this, we've covered the major points of this Haskell prototype parser. It produces an AST whose structure and node names are those of the 2010 Report. (The Marpa grammar introduces non-standard node names and rules, but these are pruned from the AST in post-processing.)

In the code, the grammars from the 2010 Report are included for comparison, so a reader can easily determine what syntax we left out. It might be tedious to add the rest, but I believe it would be unproblematic, with one interesting exception: fixity. To deal with fixity, we may haul out the Ruby Slippers again.

The code, comments, etc.

A permalink to the full code and a test suite for this prototype, as described in this blog post, is on Github. I expect to update this code, and the latest commit can be found here. Links for specific lines of code in this post are usually static permalinks to earlier commits.

To learn more about Marpa, a good first stop is the semi-official web site, maintained by Ron Savage. The official, but more limited, Marpa website is my personal one. Comments on this post can be made in Marpa's Google group, or on our IRC channel: #marpa at freenode.net.

Footnotes

1. Graham Hutton and Erik Meijer, Monadic parser combinators, Technical Report NOTTCS-TR-96-4. Department of Computer Science, University of Nottingham, 1996, pp 30-35. http://eprints.nottingham.ac.uk/237/1/monparsing.pdf. Accessed 19 August 2018.

2. I use whitespace-significant parsing as a convenient example for this post, for historical reasons and because it sits at a suitable level of complexity. This should not be taken to indicate that I recommend it as a language feature.

3. Simon Marlow, Haskell 2010 Language Report, 2010. Online version accessed 21 August 2018. For layout, see in particular section 2.7 (pp. 12-14) and section 10.3 (pp. 131-134).

4. 2010 Report. The short examples are on p. 13 and p. 134. The long examples are on p. 14.

5. Paul Hudak, John Peterson and Joseph Fasel, A Gentle Introduction to Haskell, version 98. Revised June 2000 by Reuben Thomas. Online version accessed 21 August 2018. The examples are in section 4.6, which is on pp. 20-21 of the October 1999 PDF.

6. Github Permalink.

7. Github Permalink.

8. Github Permalink.

9. Single-line comments are dealt with properly by lexing them as a different token and discarding them separately. Handling multi-line comments is not yet implemented -- it is easy in principle but tedious in practice and the examples drawn from the Haskell literature did not provide any test cases.

10. Github Permalink.

11. Github Permalink.

12. Github Permalink.

13. Github Permalink.

14. Github Permalink.

15. The reason for this is that by default a Marpa grammar determines which of its symbols are lexemes using the presence of those symbols on the LHS and RHS of the rules in its lexical and context-free grammars. A typical Marpa grammar requires a minimum of explicit lexeme declarations. (Lexeme declarations are statements with the :lexeme pseudo-symbol on their LHS.) As an aside, the Haskell 2010 Report is not always careful about the lexer/context-free boundary, and adopting its grammar required more use of Marpa's explicit lexeme declarations than usual.

16. Github Permalink.

17. Specifically, the presence of a <decls> alternative silences the usual warnings about symbols inaccessible from the start symbol. These warnings can be silenced in other ways, but at the prototype stage it is convenient to check that all symbols supposed to be accessible through <decls> are in fact accessible. There is a small startup cost to allowing the extra symbols in the grammars, but the runtime cost is probably not measurable.

18. Github Permalink.

19. Currently the handling of these is simplistic. A practical implementation of this method would want better reporting. In fact, Marpa's left eideticism allows some interesting things to be done in this respect.

20. Github Permalink.

21. Github Permalink.

Dave's Free Press: Journal: Thankyou, Anonymous Benefactor!

Ocean of Awareness: Marpa and combinator parsing

The missing part

A previous post described how to use the current stable Marpa implementation as a better procedural parser. This post describes how the Marpa algorithm can be used as the basis of better combinator parsers.

In the post on procedural parsing, the subparsers[1] were like combinators, in that they could be called recursively, so that a parse could be built up from components. Like combinators, each child could return, not just a parse, but a set of parses. And, as in combinators, once a child combinator returned its value, the parent parser could resume parsing at a location specified by the child combinator. So what was missing?

A combinator, in order to handle ambiguity, returns not a subparse, but a set of subparses. In the full combinator model, each subparse can have its own "resume location".[2] The procedural parsing post did not provide for multiple resume locations. We will now proceed to make up for that.
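
In Perl terms, the difference is in the shape of the value a combinator hands back. A full combinator returns a list of subparses, each paired with its own resume location (the names here are invented for illustration):

      my @full_combinator_result = (
          [ $parse_tree_A, 17 ],    # one reading of the prefix, resuming at location 17
          [ $parse_tree_B, 23 ],    # another reading, resuming at location 23
      );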

How it works

The Marpa parser has the ability to accept multiple subparses, each with its own length. This allows child subparses to overlap in any fashion, forming a mosaic as complex as the application needs.

An Earley parser is table-driven -- its parse table consists of Earley sets, with an initial Earley set and one Earley set per token. This makes for a very simple idea of location. Location 0 is the location of the initial Earley set. Location N is the location of the Earley set after the N'th token has been consumed.

Simplicity is great, but unfortunately this won't work for variable-length tokens. To handle those, Marpa introduces another idea of location: the earleme. Like Earley set locations, the earlemes begin at 0, and advance in integer sequence. Earley set 0 is always at earleme 0. Every Earley set has an earleme location. On the other hand, not every earleme has a corresponding Earley set -- there can be "empty" earlemes.

The lower-level interface for Marpa is Libmarpa. Every time Libmarpa adds a token, a length in earlemes must be specified. In the most-used higher level Marpa interfaces, this "earleme length" is always 1, which makes the Libmarpa location model collapse into the traditional one.

The Libmarpa recognizer advances earleme-by-earleme. In the most-used higher level Marpa interfaces, a token ends at every earleme (unless of course that earleme is after end-of-input). This means that the most-used Marpa interfaces create a new Earley set every time they advance one earleme. Again, in this case, the Libmarpa model collapses into the traditional one.

In Libmarpa and other lower-level interfaces, there may be cases where

  • one or more tokens end after the current earleme, but
  • no tokens end at the current earleme.
In such cases the current earleme will be empty.
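
A small made-up example may help. Suppose three tokens are read starting at earleme 0, with earleme lengths 3, 2 and 2:

      # Illustration only.
      my %earley_set_at_earleme = (
          0 => 'Earley set 0',    # start of input; all three tokens begin here
          2 => 'Earley set 1',    # the two length-2 tokens end here
          3 => 'Earley set 2',    # the length-3 token ends here
      );
      # No token ends at earleme 1, so earleme 1 is "empty" and has no Earley set.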

This is only an outline of the basic concepts behind the Marpa input model. The formalisms are in the Marpa theory paper.[3] The documentation for Libmarpa and Marpa's other low-level interfaces contains more accessible, but detailed, descriptions.[4]

Value added

Left-eidetic information

As readers of my previous posts[5] will know, Marpa is "left-eidetic" -- the application has access to everything to its left. This is an advantage over the traditional implementation of combinator parsing, where parse information about the left context may be difficult or impossible to access.[6]

More powerful linear-time combinators

Marpa parses a superset of LR-regular grammars in linear time, which makes it a more powerful "building block" than traditionally available for combinator parsing. This gives the programmer of a combinator parser more options.

State of the art worse-than-linear combinators

In special circumstances, programmers may want to use subparsers which are worse than linear -- for example, they may know that the string is very short. Marpa parses context-free grammars in state of the art time.[7]

The code, comments, etc.

To learn more about Marpa, a good first stop is the semi-official web site, maintained by Ron Savage. The official, but more limited, Marpa website is my personal one. Comments on this post can be made in Marpa's Google group, or on our IRC channel: #marpa at freenode.net.

Footnotes

1. In some of the descriptions of Marpa's procedural parsing, these subparsers are called "lexers". This emphasizes the usual case in current practice, where the subparsers are the bottom layer of the parsing application, and do not invoke their own child subparsers.

2. In notational terms, a full combinator is a function of the form
         A* → ℙ( P × A* ),
where A is the alphabet of the grammar; P is a representation of a single parse (for example, a parse tree); ℙ(X) is the power set of a set X; and X × Y is the Cartesian product of sets X and Y. The subparsers of the procedural parsing post were of the form
         A* → ℙ( P ) × A*.

3. Kegler, Jeffrey. "Marpa, a Practical General Parser: The Recognizer". 2013. Section 12, "The Marpa input model", pp. 39-40.

4. Libmarpa API document, the "Input" section. Marpa::R2's NAIF interface allows access to the full Libmarpa input model and its documentation contains a higher-level description of Marpa's alternative input models. There is also a thin Perl interface to Libmarpa, the THIF interface, which allows full access to the alternative input models.

5. For example, the post on procedural parsing contains a good, simple, example of the use of Marpa's left-eideticism.

6. For best effect, left-eidetism and functional purity probably should be used in combination. For the moment at least, I am focusing on explaining the capabilities, and leaving it to others to find the monadic or other solutions that will allow programmers to leverage this power in functionally pure ways.

7. Specifically O(n^2) for unambiguous grammars, and O(n^3) for ambiguous grammars.

Dave's Free Press: Journal: Number::Phone release

Dave's Free Press: Journal: Ill

Dave's Free Press: Journal: CPANdeps upgrade

Dave's Free Press: Journal: YAPC::Europe 2006 report: day 3
