Perl Hacks: Web Site Tune-Up: A Case Study

I thought it might be interesting to talk about some of the topics I’ll be covering at my workshops at The Perl Conference in Glasgow in August. Today I’ll be talking about the Web Site Tune-Up workshop and in my next post, I’ll cover The Professional Programmer.

And I thought it would be most useful to show you a case study of where I’ve done some work to tune up a web site. So here’s the story of some work I’ve done on the web site for The Perl Conference in Glasgow itself.

When the site first went live, I noticed that it didn’t have any Open Graph tags. Open Graph tags are a series of HTML elements that you can add to the <head> of a web page to tell sites like Facebook interesting things about the page. Most usefully, they can tell Facebook and Twitter which image and text to use when someone shares a link to your site. Obviously, we all want people to share our URLs as widely as possible, and having a nice image and a useful description show up when the site is shared is a good way to encourage more sharing (and, as I’ve been typing that, I’ve just realised that having obvious sharing buttons on the page is another good idea – more work to do there!)

So I needed to find the right place to add these tags. Most web sites are generated by a content management system, and most Perl conference sites use A Conference Toolkit (ACT), so I just needed to look through the conference repo to find the template that generates the header of the page and edit that. Here’s the first commit I made, which just added hard-coded values for the tags. With this in place, the tags looked like this:

<meta property="og:title" content="The Perl Conference - Glasgow 2018" />
<meta property="og:type" content="website" />
<meta property="og:url" content="http://act.perlconference.org/tpc-2018-glasgow/" />
<meta property="og:image" content="http://act.perlconference.org/tpc-2018-glasgow/css/logos/lrg-conf-logo.png" />

That was an improvement, but there were a few problems. Firstly, it was missing a couple of tags that Twitter likes to use, so I added those in this commit. Then I noticed I had forgotten the description (which prevented Twitter from parsing the data correctly). This commit fixed that.
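
To give an idea of what the finished set of tags might look like, here is a rough sketch of generating the extra tags from Perl. The description and the Twitter card values below are invented for illustration; they are not necessarily the exact values used on the conference site:

#!/usr/bin/perl
use strict;
use warnings;

# Illustrative values only -- the description is made up.
my %meta = (
    'og:title'            => 'The Perl Conference - Glasgow 2018',
    'og:type'             => 'website',
    'og:url'              => 'http://act.perlconference.org/tpc-2018-glasgow/',
    'og:image'            => 'http://act.perlconference.org/tpc-2018-glasgow/css/logos/lrg-conf-logo.png',
    'og:description'      => 'The Perl Conference in Glasgow, August 2018',
    'twitter:card'        => 'summary',
    'twitter:description' => 'The Perl Conference in Glasgow, August 2018',
);

for my $key (sort keys %meta) {
    # Open Graph tags use the "property" attribute; Twitter card tags use "name".
    my $attr = $key =~ /^twitter:/ ? 'name' : 'property';
    printf qq{<meta %s="%s" content="%s" />\n}, $attr, $key, $meta{$key};
}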

And there it sat until quite recently. But last weekend I decided I needed to fix those hard-coded values. I noticed the problem when I shared a link to the workshops page on Facebook and my post contained information about the home page.

This took a bit more digging. I had to understand a little more about the internals of ACT. But over a series of small commits last weekend, I got it working as I wanted. Actually, not quite as I wanted – the Wiki URLs are still not working properly; I’ll get back to those later on. I also want to change the description on every page – but I’m not sure if that’s possible in ACT.

This weekend we published an initial version of the schedule for the conference – one that only covers the workshops (as those are the only talks with firm dates yet). Initially, it didn’t look very nice as the standard ACT template for the schedule page shows unscheduled talks before scheduled ones. That didn’t make much sense to me as there is a huge list of unscheduled talks and it seemed unlikely that anyone would ever scroll past that to find the scheduled talks. You should also bear in mind that Google is the most important visitor to your page and Google assumes that the most important content on your page comes first. So changing the order is likely to give us an SEO boost.

So I wanted to find a way to fix that, and it turned out to be harder than expected. ACT is built on layers of templates: if your ACT instance doesn’t have a particular template, then the default one is used. And it looks like most people just use the default schedule template. But once I had copied that template into our repository, I was free to edit it any way I wanted. I started by doing the re-ordering that I mentioned above, then moved on to consider other options.

Firstly, the default formatting of the schedule on an ACT site is a little ugly. But I knew that the TPC Glasgow site was built using Bootstrap, so I knew that I could use Bootstrap’s table classes to make the schedule table look a little nicer. That was just a case of adding some classes to the template that generates the table (and, actually, removing quite a lot of unnecessary classes and presentation mark-up – removing presentation mark-up is another good tip for SEO).

Finally, I wanted to change the order of the data in each cell of the schedule table. Remember when I said above that the most important data should come first? Well, if you’re presenting data about a conference talk, what’s the most important piece of information? The default template showed the speaker’s name before the title of the talk. I wanted to reverse that (I also wanted to split the data across several lines). It turned out that this mark-up was in another template, which contained a number of “utility” macros that I had to copy into our repo. But once I had done that, it was simple to make the changes I wanted. The current version of the schedule layout is in the image at the top of this post. I hope you agree it looks nicer than the old version.

So that’s where I’ve got to. There are a few other fixes I’d like to make:

  • Fix the Open Graph tags for the Wiki (they’re broken currently because the URL includes parameters)
  • Add variable descriptions to the Open Graph tags (if ACT can be made to support it)
  • Add more prominent (and more varied) sharing buttons to the pages
  • Add structured data to the schedule pages, so that Google has a better chance of knowing what the pages are about (see the sketch just after this list)
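
To give an idea of what that last item involves, here is a minimal, hypothetical sketch of producing schema.org Event markup for one talk using core Perl's JSON::PP. The talk details are placeholders, and the exact properties would need to be checked against the schema.org documentation:

#!/usr/bin/perl
use strict;
use warnings;
use JSON::PP;

# Hypothetical talk data -- not a real entry from the TPC schedule.
my $talk = {
    '@context' => 'http://schema.org',
    '@type'    => 'Event',
    name       => 'Web Site Tune-Up',
    startDate  => '2018-08-14T09:30',
    location   => { '@type' => 'Place', name => 'Glasgow' },
};

# The output would be embedded in the page in a <script type="application/ld+json"> element.
print JSON::PP->new->canonical->pretty->encode($talk);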

But that’s a pretty good example of the kinds of things I’ll be talking about in my Web Site Tune-Up workshop. To summarise what I’ve done:

  • Added Open Graph tags to all pages
  • Restructured the schedule pages so the most important information comes first
  • Moved towards using more semantic mark-up and removed presentation mark-up
  • Reordered the data for each talk so readers (including Google) know what the most important data item is

This only touches on the kind of information I’ll be covering in the workshop. There will be dozens more practical tips you can use to improve Google’s understanding of your web site.

The half-day workshop takes place on Tuesday 14th August from 09:30-13:00. Tickets are available when you book your ticket for the main conference and cost £75.

Hope to see you there.


NeilB: COED:ETHICS 2018

A group of Perl companies are sponsoring the COED:ETHICS conference, a one-day conference on ethics for developers and technologists, which is in London on July 13th.

"Hey, I'm just a programmer!", "I didn't know my code would be used to do that", "it's not my job to worry about that", "everyone (else) does it!", "my manager told me to do it".

There have been a number of widely published ethical failures of the software industry recently, some of the most reported being Facebook's inappropriate release of user data, and VW cheating on emissions tests. Software has "unexpectedly" been used to discriminate against groups of people, trick them into doing something, manipulate the stock market, cheat them out of something, or worse.

It's easy for developers to get tunnel vision, and just focus on the challenge, interest, frustration, and joy associated with the act of developing itself, ignoring the bigger issues. As with quality and security, ethical considerations are not something you deal with at the end of a project; unlike bugs, they have to be considered up front, and they should be the concern of everyone. That said, programmers shouldn't wait for others to take the lead.

COED:ETHICS is an inexpensive one-day conference about ethics for developers and technologists, in London on July 13th. There is a great lineup of speakers, and topics include: insights from US drone strikes; ethical treatment of large data sets; empowering users; ethical anti-patterns; ethical decision-making; and chatbots.

We believe this is an important and timely topic, so a group of Perl companies, both large and small, decided to sponsor the event. We particularly want to reinforce the message that this is an issue for programmers, and not something for others to deal with.

I will be attending, and hope to give a talk (at the London Perl Workshop, at least) with distilled insights from COED:ETHICS, and on what ethics might mean for Perl developers and for CPAN authors.

perlbuzz.com: ack 2.24 is released, speeding up common cases

I've just uploaded a new version of ack, version 2.24, to the CPAN and posted it to beyondgrep.com. This version of ack introduces an optimization to not search most files line-by-line until after ack has read the whole file and determined that the pattern matches in the file at all. This speeds things up quite […]

Perl Foundation News: Grant Report: Perl 6: Bugfixing and Performance of Rationals

Zoffix has posted his June Report.

Most of the work on constants has been completed, and some bad math on zero-denominator rationals has been fixed. As Zoffix works through these issues, some work may find its way into ecosystem modules instead of core Perl 6.

You can read all the details at his June 2018 blogs.perl.org posting.

Note that Zoffix will be taking the next month off to focus on some other Perl 6 issues, but should return the following month.

Dave's Free Press: Journal: Module pre-requisites analyser

Dave's Free Press: Journal: CPANdeps

Dave's Free Press: Journal: Perl isn't dieing

Ocean of Awareness: Parsing left recursions

Left recursion

A lot has been written about parsing left recursion. Unfortunately, much of it simply adds to the mystery. In this post, I hope to frame the subject clearly and briefly.

I expect the reader has some idea of what left recursion is, and perhaps some experience of it as an issue. Informally, left recursion occurs when a symbol expands to something with itself on the left. This can happen directly, for example, if production (1) is in a grammar.


    (1) A ::= A B
    
Indirect left recursion happens when, for example, (2) and (3) are productions in a grammar.

    (2) A ::= B C
    (3) B ::= A D
    
A grammar with productions (4) and (5) has a "hidden" left recursion.

    (4) A ::= B A C
    (5) B ::= # empty
    
This is because <A> ends up leftmost in derivations like:

    (6) A  ⟶ B A C ⟶ A C
    
In derivation (6), production (4) was applied, then production (5).

For those into notation, a grammar is left recursive if and only if it allows a derivation of the form


    (7) A  ⟶+ β A γ  where  β = ε
    
In (7) ε is the empty string, and α ⟶+ β indicates that α derives β in one or more rule applications.

So, OK, what is the problem?

The problem with parsing left recursions is that if you are parsing using a derivation like


    (8) A  ⟶ A B 
    
then you have defined <A> in terms of <A>. All recursions can be a problem, but left recursions are a particular problem because almost all practical parsing methods[1] proceed left to right, and derivations like (8) will lead many of the most popular algorithms straight into an infinite regress.

Why do some algorithms not have a problem?

In a sense, all algorithms which solve the left recursion problem do it in the same way. It is just that in some, the solution appears in a much simpler form.

The solution is at its simplest in Earley's algorithm. That is no coincidence -- as Pingali and Bilardi[2] show, Earley's, despite its daunting reputation, is actually the most basic Chomskyan context-free parsing algorithm, the one from which all others derive.

Earley's builds a table. The Earley table contains an initial Earley set and an Earley set for each token. The Earley set for each token describes the state of the parse after consuming that token. The basic idea is not dissimilar to that of the Might/Darais/Spiewak (MDS) idea of parsing by derivatives, and the logic for building the Earley sets resembles that of MDS.[3]

For the purpose of studying left recursion, what matters is that each Earley set contains Earley "items". Some of the items are called predictions because they predict the occurrence of a symbol at that location in the input.

To record a left recursion in an Earley set, the program adds a prediction item for the left recursive symbol. It is that simple.[4]

Multiple occurrences of a prediction item would be identical, and therefore useless. Therefore subsequent attempts to add the same prediction item are ignored, and recursion does not occur.
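
A toy sketch may make this concrete. The code below is not Marpa's -- it is just a worklist that adds prediction items for a directly left-recursive grammar. Because a prediction item is recorded at most once per Earley set, the recursion stops immediately:

    use strict;
    use warnings;

    # Toy illustration, not Marpa's code.  Rules: A ::= A B | 'a',  B ::= 'b'
    my %rules = ( A => [ [ 'A', 'B' ], ['a'] ], B => [ ['b'] ] );

    sub predict {
        my ( $earley_set, $symbol ) = @_;
        my @work = ($symbol);
        while ( defined( my $lhs = shift @work ) ) {
            next if $earley_set->{$lhs}++;    # duplicate prediction -- ignored
            for my $rhs ( @{ $rules{$lhs} } ) {
                # predict the leftmost symbol of each rule, if it is a nonterminal
                push @work, $rhs->[0] if exists $rules{ $rhs->[0] };
            }
        }
    }

    my %set0;
    predict( \%set0, 'A' );                       # A ::= A B does not loop
    print join( ', ', sort keys %set0 ), "\n";    # prints "A"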

If some have no problem, why do others?

Besides Earley's, a number of other algorithms handle left recursion without any issue -- notably LALR (aka yacc or bison) and LR. This re-raises the original question: why do some algorithms have a left recursion problem?

The worst afflicted algorithms are the "top-down" parsers. The best known of these is recursive descent -- a parsing methodology which, essentially, does parsing by calling a subroutine to handle each symbol. In the traditional implementation of recursive descent, left recursion is very problematic. Suppose that, as part of a recursive descent implementation, you are writing the function to parse the symbol <A>, which you are calling parse_A(). If your grammar has a rule


    (9) A ::= A B
    
the first thing you need to do in a naive implementation of parse_A() is to call parse_A(). And parse_A() will then call parse_A(). And so, in the naive implementation, on and on forever.
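
In Perl-ish terms (a sketch, not taken from any real parser), the naive parse_A() looks something like this. Its very first action is to call itself, so it never even looks at the input:

    # Do not run this -- it recurses until the stack is exhausted.
    sub parse_A {
        my ( $input, $pos ) = @_;
        my ( $a_node, $pos_after_a ) = parse_A( $input, $pos );    # immediate self-call
        my ( $b_node, $pos_after_b ) = parse_B( $input, $pos_after_a );
        return ( [ 'A', $a_node, $b_node ], $pos_after_b );
    }

    sub parse_B { ... }    # never reached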

The fixed-point solution to left recursion

Over the years, many ways to solve the top-down left recursion issue have been announced. The MDS solution is one of the more interesting -- interesting because it actually works[5], and because it describes all the others, including the Earley algorithm solution. MDS reduces the problem to the more general one of finding a "fixed point" of the recursion.

In math, the "fixed point" of a function is an argument of the function which is equal to its value for that argument -- that is, an x such that f(x) = x. MDS describes an algorithm which "solves" the left recursion for its fixed point. That "fixed point" can then be memoized. For example the value of parse_A can be the memoized "fixed point" value of <A>.

The Earley solution of left recursion was, in fact, an optimized "fixed point". The computation of an Earley set is the application of a set of rules for adding Earley items. This continues until no more Earley items can be added. In other words, the rules for building an Earley set are applied until they find their "fixed point".[6]

Other top-down solutions

The MDS fixed point solution does the job, but as described in their paper it requires a functional programming language to implement, and it is expensive. In the worst case, the MDS approach is exponential, although they conjecture that it is linear for a large class of practical grammars.

Top-down algorithms can take an "in-between strategy" -- they can tackle those left recursions that are cheap to find, without computing the full "fixed point". Here a well-defined boundary is crucial: A programmer wants to know if their particular grammar will work, and whether small tweaks to their grammar will break it.

Top-down can be seen as a "guessing" strategy with hacks. Hacks are always needed in top-down, because the input is at the bottom, not the top, and a useful top-down algorithm needs to look at the input. But the hacks can be as simple as lookahead, and lookahead can be implemented without seriously compromising the simplicity and flexibility of the original top-down approach.

With detection and fixing of left-recursion, the "hack" part of the top-down strategy becomes very complicated. The attraction of top-down is its simplicity, and its resulting adaptability to procedural logic. The point can be reached where the original strategy comes into question.

After all, a recursive descent parser can straightforwardly take care of left recursion issues by calling an Earley parser. But in that case, why not simply use Earley's?

Comments, etc.

Marpa is my own implementation of an Earley parser.[7] Comments on this post can be made in Marpa's Google group, or on its IRC channel: #marpa at freenode.net.

Footnotes

1. I probably could have said "all practical parsing methods" instead of "almost all". Right-to-left parsing methods exist, but they see little use. In any case, they only reverse the problem. Parsing in both directions is certainly possible but, as I will show, we do not have to go to quite that much trouble.

2. Keshav Pingali and Gianfranco Bilardi, UTCS tech report TR-2012. 2012. PDF accessed 9 June 2018. Video accessed 9 June 2018. Less accessible is Keshav Pingali and Gianfranco Bilardi, "A graphical model for context-free grammar parsing." Compiler Construction - 24th International Conference, CC 2015. Lecture Notes in Computer Science, Vol. 9031, pp. 3-27, Springer Verlag, 2015. PDF accessed 9 June 2018.

3. Matthew Might, David Darais and Daniel Spiewak. "Functional Pearl: Parsing with Derivatives." International Conference on Functional Programming 2011 (ICFP 2011). Tokyo, Japan. September, 2011. pages 189-195. PDF accessed 9 Jun 2018. Slides accessed 9 June 2018. Video accessed 9 June 2018.

4. Those familiar with Earley's algorithm may note that Earley items are traditionally in terms of productions, not symbols. Symbol predictions are therefore recorded indirectly.

Specifically, Earley items are traditionally duples of (dr, origin), where dr is a dotted production -- a production with a "dot location" marked; and origin is a location in the input. In all predictions origin is the current location, and the dot location is at the start of the production, so there can be at most one prediction per rule. A prediction of a symbol <A> is recorded as a prediction of every production which has <A> on its LHS.

The argument in the main text is made the way it is because it is simpler to speak of "predicted symbols" than to repeatedly refer to "sets of predictions of productions sharing a common LHS".

5. There have been many more attempts than implementations over the years, and even some of the most-widely used implementations have their issues.

6. Recall that potential left recursions are recorded as "predictions" in Earley's algorithm. Predictions recurse, but since they do not depend on the input, they can be precomputed. This means that Earley implementations can bring each Earley set to its fixed point very quickly.

7. Marpa has a stable implementation. For it, and for more information on Marpa, there are these resources:
Marpa website, accessed 25 April 2018.
Kegler's website, accessed 25 April 2018.
Github repo, accessed 25 April 2018.
MetaCPAN, accessed 30 April 2018.
There is also a theory paper for Marpa: Kegler, Jeffrey. "Marpa, A Practical General Parser: The Recognizer.", 2013. PDF accessed 24 April 2018.

Dave's Free Press: Journal: YAPC::Europe 2007 report: day 3

Dave's Free Press: Journal: Devel::CheckLib can now check libraries' contents

Dave's Free Press: Journal: I Love Github

Dave's Free Press: Journal: Palm Treo call db module

Dave's Free Press: Journal: Graphing tool

Dave's Free Press: Journal: Travelling in time: the CP2000AN

Dave's Free Press: Journal: XML::Tiny released

Dave's Free Press: Journal: YAPC::Europe 2007 report: day 1

Ocean of Awareness: Parsing with pictures

Derivatives == Earley?

In a cordial Twitter exchange with Matt Might and David Darais, Prof. Might asked if I was interested in looking at their derivatives-based approach. I answered that I was looking at it -- Marpa is an optimization of the Might/Darais approach.

This may sound strange. At first glance, our two algorithms seem about as different as parsing algorithms can get. My Marpa parser is an Earley parser, table-driven, and its parse engine is written in the C language.[1]

The MDS (Might/Darais/Spiewak) parser is an extension of regular expressions which constructs states on the fly. MDS uses combinators and has implementations in several functional programming languages.[2]

Grammar Flow Graphs

Why then do I imagine that Marpa is an optimized version of the MDS approach? The reason is a paper sent to me by Prof. Keshav Pingali of the University of Texas at Austin: "Parsing with Pictures".[3] The title is a little misleading: their approach is not that easy, and the paper requires a considerable amount of math. But it is a lot easier than the traditional way of learning the various approaches to parsing.

The basis of the Pingali-Bilardi approach is the Grammar Flow Graph (GFG). These GFGs are the "pictures" of their title. GFGs are NFAs with recursion added. As has long been known, adding recursion to NFAs allows them to represent any context-free language.

Pingali and Bilardi's next step is new. A GFG can be "simulated" using the same algorithm used to simulate an NFA. However, the result is not immediately impressive. The simulation does produce a recognizer, but not a good recognizer: Some of the strings recognized by the GFG simulator are not in the context-free language.[4]

To repair their "simulation", Pingali and Bilardi add a tag to each state. This tag tracks where the recursion began.[5] This not only fixes the recognizer, but the added information is enough to allow the set of parse trees to be efficiently recovered from the sets of GFG states. In other words, with tags, the GFG recognizer now is a parser.

It turns out that this recognizer-turned-parser is not new. In fact, it is exactly Earley's algorithm.

Pingali and Bilardi do not stop there. Using their new framework, they go on to show that all LL-based and LR-based algorithms are simplifications of their Earley parser. From this point of view, Earley parsing is the foundation of all context-free parsing, and LL- and LR-based algorithms are Earley optimizations.[6]

Step 1: The MDS algorithm

To show that Marpa is an optimization of the MDS approach, I will start with the MDS algorithm, and attempt to optimize it. For its functional programming language, the MDS paper uses Racket. The MDS parser is described directly, in the usual functional language manner, as a matching operation.

In the MDS paper, the MDS parser is optimized with laziness and memoization. Nulls are dealt with by computing their fixed points on the fly. Even with these three optimizations, the result is still highly inefficient. So, as an additional step, MDS also implements "deep recursive simplification" -- in effect, strategically replacing laziness with eagerness. With this, the MDS paper conjectures that the algorithm's time is linear for a large class of practical grammars.

Step 2: Extended regular expressions

Next, we notice that the context-free grammars of the MDS algorithm are regular expressions extended to allow recursion. Essentially, they are GFG's translated into Racket match expressions. The equivalence is close enough that you could imagine the MDS paper using GFG's for its illustrations.

Step 3: Simulating an NFA

Unsurprisingly, then, the MDS and GFG approaches are similar in their first step. Each consumes a single character to produce a "partial parse". A partial parse, for both of these algorithms, can be represented as a duple. One element of the duple is a string representing the unconsumed input. The other is a representation of a set of parse trees. In the case of MDS, in keeping with its functional approach, the set of parse trees is represented directly. In the case of the GFG-simulator, the set of parse trees is compressed into a sequence of GFG-state-sets. There is one GFG-state-set for the start of parsing, and one for each consumed character.
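
In rough Perl terms (my sketch, not code from either implementation), the two representations of a partial parse might look like this:

    # Sketch only -- the contents are invented for illustration.

    # MDS-style partial parse: the set of parse trees is represented directly.
    my $mds_partial = {
        unconsumed => 'bc',
        parses     => [ [ A => ['a'] ] ],
    };

    # GFG-simulator-style partial parse: the parse trees are compressed into
    # one state-set for the start of parsing plus one per consumed character.
    my $gfg_partial = {
        unconsumed => 'bc',
        state_sets => [
            ['states at the start of parsing'],
            ['states after consuming "a"'],
        ],
    };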

Step 4: Earley's Algorithm

At this point, with the introduction of the GFG state-sets to represent parse-trees, the MDS algorithm and its optimized GFG equivalent have "forked". Recall from above that this GFG simulator has a bug -- it is over-liberal.

The bug is the one already described, and our fix is the one already described: Each GFG state either starts a recursion or is part of one. We fix the bug by tagging each GFG state with the index of the GFG state-set that starts its recursion. Once these tags are added, the GFG state-sets are exactly Earley sets.

Step 5: The Leo optimization

Next we incorporate an optimization by Joop Leo, which makes Earley parsing linear for all LR-regular grammars, without using lookahead. Since LR-regular is a superset of LR(k) and LL(k), including LL(*), we do not bother with lookahead.

Step 6: Marpa

To get from an Earley/Leo parser to a Marpa parser, we need to address one more major point. In modern parsing practice, programmers expect the ability to introduce procedural logic, even to the point of switching parsers. We accomplish this by ensuring that processing of each Earley set is complete before processing of the next Earley set begins.

This means that Marpa has available full information about the parse so far -- it is left-eidetic. Error reporting is unexcelled, and procedural logic can use this information as well. For example, full parse information implies full knowledge of which tokens are expected next.

This means that you can write a liberal HTML parser, starting with a very illiberal HTML grammar. When the illiberal parser encounters a point where it cannot continue because of a missing token, procedural logic can ask what the expected token is; concoct that token on the fly; supply that token to the illiberal parser; and then ask the illiberal parser to continue. This is called the "Ruby Slippers" technique, and an HTML parser based on it has been implemented.[7]

The way forward

As mentioned, the MDS algorithm has its own approach to optimization, one which takes maximum advantage of functional programming. In contrast, Marpa relies on C-level coding.

One example of the contrast in optimization techniques is null handling. Recall from above that, to deal with null processing, the MDS algorithm uses a fixed-point algorithm on the fly. Marpa, on the other hand, precomputes a list of nullable symbols using bit vectors before parsing begins. In this particular case, Marpa will usually be the winner.
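
As an illustration, here is a toy version of that precomputation, using a plain hash instead of Marpa's bit vectors (my sketch, not Marpa's code). The rule for marking symbols nullable is applied repeatedly until a pass adds nothing new -- that is, until it reaches its fixed point:

    use strict;
    use warnings;

    # Toy sketch, not Marpa's code.
    # Rules:  A ::= B A C | 'a',   B ::= (empty),   C ::= 'c'
    my %rules = (
        A => [ [ 'B', 'A', 'C' ], ['a'] ],
        B => [ [] ],
        C => [ ['c'] ],
    );

    my %nullable;
    my $changed = 1;
    while ($changed) {
        $changed = 0;
        RULE: for my $lhs ( keys %rules ) {
            next RULE if $nullable{$lhs};
            for my $rhs ( @{ $rules{$lhs} } ) {
                # an LHS is nullable if some RHS contains only nullable symbols
                next if grep { !$nullable{$_} } @{$rhs};
                $nullable{$lhs} = $changed = 1;
                next RULE;
            }
        }
    }
    print join( ', ', sort keys %nullable ), "\n";    # prints "B"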

A second case in which hand optimization wins is the all-important Leo optimization. Simon Peyton-Jones is a very smart man, but nonetheless I believe that the day that GHC will look at a functional specification of an Earley parser and rediscover the Leo optimization at compile time is far off.

On the other hand, I do not imagine that hand optimization and/or the C language is the winner in all cases. A programmer is deceiving himself if he imagines that he can spot all the cases where lazy evaluation or memoization will be effective. And of course, even an omniscient programmer is not going to be there at run-time to do "just in time" optimization. Perhaps the optimal parser of the future will combine important hand optimizations with functional programming.

Comments, etc.

To learn about Marpa, my Earley/Leo-based parsing project, there is the semi-official web site, maintained by Ron Savage. The official, but more limited, Marpa website is my personal one. Comments on this post can be made in Marpa's Google group, or on our IRC channel: #marpa at freenode.net.

Footnotes

1. Kegler, Jeffrey. "Marpa, A Practical General Parser: The Recognizer.", 2013. PDF accessed 24 April 2018.

Marpa has a stable implementation. For it, and for more information on Marpa, there are these resources: Marpa website, accessed 25 April 2018. Kegler's website, accessed 25 April 2018. Github repo, accessed 25 April 2018. MetaCPAN, accessed 30 April 2018.

2. Matthew Might, David Darais and Daniel Spiewak. "Functional Pearl: Parsing with Derivatives." International Conference on Functional Programming 2011 (ICFP 2011). Tokyo, Japan. September, 2011. pages 189--195. PDF accessed 9 Jun 2018. Slides accessed 9 June 2018. Video accessed 9 June 2018.

3. Keshav Pingali and Gianfranco Bilardi, UTCS tech report TR-2012. 2012. PDF accessed 9 June 2018. Video accessed 9 June 2018.

Less accessible, but with more details about GFGs and GFG-based Earley parsing is Keshav Pingali and Gianfranco Bilardi, "A graphical model for context-free grammar parsing." Compiler Construction - 24th International Conference, CC 2015. Lecture Notes in Computer Science, Vol. 9031, pp. 3-27, Springer Verlag, 2015. PDF accessed 9 June 2018.

4. Pingali and Bilardi 2012, section 3.1.

5. Pingali and Bilardi 2015, p. 11.

6. Pingali and Bilardi 2012, Sections 4-7.

7. I have based a liberal HTML pretty-printer on that parser, one which I use quite frequently. I used it, for example, when writing this blog post. To find out more about Ruby Slippers parsing see the Marpa FAQ, questions 122 and 123; my blog series on parsing html; and my blog post "Marpa and the Ruby Slippers".

Dave's Free Press: Journal: Thanks, Yahoo!

Dave's Free Press: Journal: YAPC::Europe 2007 report: day 2

Dave's Free Press: Journal: YAPC::Europe 2007 travel plans

Dave's Free Press: Journal: Wikipedia handheld proxy

Dave's Free Press: Journal: Bryar security hole

Dave's Free Press: Journal: POD includes

Dave's Free Press: Journal: cgit syntax highlighting

Dave's Free Press: Journal: CPAN Testers' CPAN author FAQ

Dave's Free Press: Journal: Thankyou, Anonymous Benefactor!

Ocean of Awareness: Marpa and combinator parsing

The missing part

A previous post described how to use the current stable Marpa implementation as a better procedural parser. This post describes how the Marpa algorithm can be used as the basis of better combinator parsers.

In the post on procedural parsing, the subparsers[1] were like combinators, in that they could be called recursively, so that a parse could be built up from components. Like combinators, each child could return, not just a parse, but a set of parses. And, as in combinators, once a child combinator returned its value, the parent parser could resume parsing at a location specified by the child combinator. So what was missing?

A combinator, in order to handle ambiguity, returns not a subparse, but a set of subparses. In the full combinator model, each subparse can have its own "resume location".[2] The procedural parsing post did not provide for multiple resume locations. We will now proceed to make up for that.

How it works

The Marpa parser has the ability to accept multiple subparses, each with its own length. This allows child subparses to overlap in any fashion, forming a mosaic as complex as the application needs.

An Earley parser is table-driven -- its parse table consists of Earley sets, with an initial Earley set and one Earley set per token. This makes for a very simple idea of location. Location 0 is the location of the initial Earley set. Location N is the location of the Earley set after the N'th token has been consumed.

Simplicity is great, but unfortunately this won't work for variable-length tokens. To handle those, Marpa introduces another idea of location: the earleme. Like Earley set locations, the earlemes begin at 0, and advance in integer sequence. Earley set 0 is always at earleme 0. Every Earley set has an earleme location. On the other hand, not every earleme has a corresponding Earley set -- there can be "empty" earlemes.

The lower-level interface for Marpa is Libmarpa. Every time Libmarpa adds a token, a length in earlemes must be specified. In the most-used higher level Marpa interfaces, this "earleme length" is always 1, which makes the Libmarpa location model collapse into the traditional one.

The Libmarpa recognizer advances earleme-by-earleme. In the most-used higher level Marpa interfaces, a token ends at every earleme (unless of course that earleme is after end-of-input). This means that the most-used Marpa interfaces create a new Earley set every time they advance one earleme. Again, in this case, the Libmarpa model collapses into the traditional one.

In Libmarpa and other lower-level interfaces, there may be cases where

  • one or more tokens end after the current earleme, but
  • no tokens end at the current earleme.
In such cases the current earleme will be empty.
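
To make the location model concrete, here is a toy sketch (mine, not Libmarpa code) that takes tokens with earleme lengths and works out which earlemes get an Earley set and which are empty:

    use strict;
    use warnings;

    # Toy sketch, not Libmarpa code.  Each token is [ name, start, length ],
    # with start and length measured in earlemes.
    my @tokens = (
        [ 'A', 0, 1 ],    # an ordinary one-earleme token
        [ 'B', 1, 3 ],    # a variable-length token, ending at earleme 4
        [ 'C', 1, 1 ],    # overlaps B, ending at earleme 2
    );

    # Earley set 0 always exists; otherwise an Earley set exists only at
    # earlemes where at least one token ends.
    my %has_set = ( 0 => 1 );
    $has_set{ $_->[1] + $_->[2] } = 1 for @tokens;

    my ($last) = sort { $b <=> $a } keys %has_set;
    for my $earleme ( 0 .. $last ) {
        printf "earleme %d: %s\n", $earleme,
            $has_set{$earleme} ? 'Earley set' : 'empty';
    }
    # Earleme 3 is empty -- no token ends there.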

This is only an outline of the basic concepts behind the Marpa input model. The formalisms are in the Marpa theory paper.[3] The documentation for Libmarpa and Marpa's other low-level interfaces contains more accessible, but detailed, descriptions.[4]

Value added

Left-eidetic information

As readers of my previous posts[5] will know, Marpa is "left-eidetic" -- the application has access to everything to its left. This is an advantage over the traditional implementation of combinator parsing, where parse information about the left context may be difficult or impossible to access.[6]

More powerful linear-time combinators

Marpa parses a superset of LR-regular grammars in linear time, which makes it a more powerful "building block" than traditionally available for combinator parsing. This gives the programmer of a combinator parser more options.

State of the art worse-than-linear combinators

In special circumstances, programmers may want to use subparsers which are worse than linear -- for example, they may know that the string is very short. Marpa parses context-free grammars in state of the art time.[7]

The code, comments, etc.

To learn more about Marpa, a good first stop is the semi-official web site, maintained by Ron Savage. The official, but more limited, Marpa website is my personal one. Comments on this post can be made in Marpa's Google group, or on our IRC channel: #marpa at freenode.net.

Footnotes

1. In some of the descriptions of Marpa's procedural parsing, these subparsers are called "lexers". This emphasizes the usual case in current practice, where the subparsers are the bottom layer of the parsing application, and do not invoke their own child subparsers.

2. In notational terms, a full combinator is a function of the form
         A* → ℙ( P × A* ),
where A is the alphabet of the grammar; P is a representation of a single parser (for example, a parse tree); ℙ(X) is the power set of a set X; and X × Y is the Cartesian product of sets X and Y. The subparsers of the procedural parsing post were of the form
         A* → ℙ( P ) × A*.

3. Kegler, Jeffrey. "Marpa, a Practical General Parser: The Recognizer". 2013. Section 12, "The Marpa input model", pp. 39-40.

4. Libmarpa API document, the "Input" section. Marpa::R2's NAIF interface allows access to the full Libmarpa input model and its documentation contains a higher-level description of Marpa's alternative input models. There is also a thin Perl interface to Libmarpa, the THIF interface, which allows full access to the alternative input models.

5. For example, the post on procedural parsing contains a good, simple, example of the use of Marpa's left-eideticism.

6. For best effect, left-eidetism and functional purity probably should be used in combination. For the moment at least, I am focusing on explaining the capabilities, and leaving it to others to find the monadic or other solutions that will allow programmers to leverage this power in functionally pure ways.

7. Specifically O(n^2) for unambiguous grammars, and O(n^3) for ambiguous grammars.

Dave's Free Press: Journal: Number::Phone release

Ocean of Awareness: Why is parsing considered solved?

It is often said that parsing is a "solved problem". Given the level of frustration with the state of the art, the underuse of the very powerful technique of Language-Oriented Programming due to problematic tools[1], and the vast superiority of human parsing ability over computers, this requires explanation.

On what grounds would someone say that parsing is "solved"? To understand this, we need to look at the history of Parsing Theory.[2] In fact, we'll have to start decades before computer Parsing Theory exists, with a now nearly-extinct school of linguistics, and its desire to put the field on a strictly scientific basis.

1929: Bloomfield redefines "language"

In 1929 Leonard Bloomfield, as part of his effort to create a linguistics that would be taken seriously as a science, published his "Postulates".[3] The "Postulates" include his definition of language:

The totality of utterances that can be made in a speech community is the language of that speech-community.[4]

There is no reference in this definition to the usual view, that the utterances of a language "mean" something. This omission is not accidental:

The statement of meanings is therefore the weak point in language-study, and will remain so until human knowledge advances very far beyond its present state. In practice, we define the meaning of a linguistic form, wherever we can, in terms of some other science.[5]

Bloomfield is passing the buck, because the behaviorist science of his time rejects any claims about mental states as unverifiable statements -- essentially, as claims to be able to read minds. "Hard" sciences like physics, chemistry and even biology avoid dealing with unverifiable mental states. Bloomfield and the behaviorists want to make the methods of linguistics as close to hard science as possible.

Draconian as Bloomfield's exclusion of meaning is, it is a big success. Known as structural linguistics, Bloomfield's approach dominates linguistics for the next couple of decades.

1955: Noam Chomsky graduates

Noam Chomsky earns his PhD at the University of Pennsylvania. His teacher, Zellig Harris, is a prominent Bloomfieldian, and Chomsky's early work is thought to be in the Bloomfield school.[6] Chomsky becomes a professor at MIT. MIT does not have a linguistics department, and Chomsky is free to teach his own approach to the subject.

The term "language" as of 1956

Chomsky publishes his "Three models" paper, one of the most important papers of all time. His definition of language uses the terminology of set theory:

By a language then, we shall mean a set (finite or infinite) of sentences, each of finite length, all constructed from a finite alphabet of symbols.[7]

This definition is pure Bloomfield in substance, but signs of departure from the behaviorist orthodoxy are apparent in "Three Models" -- Chomsky is quite willing to talk about what sentences mean, when it serves his purposes. For an utterance with multiple meanings, Chomsky's new model produces multiple syntactic derivations. Each of these syntactic derivations "looks" like the natural representation of one of the meanings. Chomsky points out that the insight into semantics that his new model provides is a very desirable property to have.[8]

1959: Chomsky reviews Skinner

In 1959, Chomsky reviews a book on linguistics by B.F. Skinner.[9] Skinner is the most prominent behaviorist of the time.

Chomsky's review removes all doubt about where he stands on behaviorism or on the relevance of linguistics to the study of meaning.[10] His review galvanizes the opposition to behaviorism, and Chomsky establishes himself as behaviorism's most prominent and effective critic.

In later years, Chomsky will make it clear that he had had no intention of avoiding semantics:

[...] it would be absurd to develop a general syntactic theory without assigning an absolutely crucial role to semantic considerations, since obviously the necessity to support semantic interpretation is one of the primary requirements that the structures generated by the syntactic component of a grammar must meet.[11]

1961: Oettinger discovers pushdown automata

While the stack itself goes back to Turing[12], its significance for parsing becomes an object of interest in itself with Samelson and Bauer's 1959 paper[13]. Mathematical study of stacks as models of computing begins with Anthony Oettinger's 1961 paper.[14]

Oettinger 1961 is full of evidence that stacks (which he calls "pushdown stores") are still very new. For example, Oettinger does not use the terms "push" or "pop", but instead describes operations on his pushdown stores using a set of vector operations which will later form the basis of the APL language.

Oettinger defines 4 languages. Oettinger's definitions all follow the behaviorist model -- they are sets of strings.[15] Oettinger's pushdown stores will eventually be called deterministic pushdown automata (DPDA's) and become the basis of a model of language and the subject of a substantial literature, all of which will use the behaviorist definition of "language".

Oettinger hopes that DPDA's will be an adequate basis for the study of both computer and natural language translation. (Oettinger's own field is Russian translation.) DPDA's soon prove totally inadequate for natural languages.

But for dealing with computing languages, DPDA's will have a much longer life. As of 1961, all algorithms with acceptable speed are using stacks with various modifications.

The development of a theory of pushdown algorithms should hopefully lead to systematic techniques for generating algorithms satisfying given requirements to replace the ad hoc invention of each new algorithm.[16]

The search for a comprehensive theory of stack-based parsing quickly becomes identified with the search for a theoretical basis for practical parsing.

1965: Knuth discovers LR(k)

Donald Knuth reports his new results on stack-based parsing. In a pivotal paper[17], Knuth sets out a theory that encompasses all the "tricks"[18] used for efficient parsing up to that time. With this, Oettinger's hope for a theory to replace "ad hoc invention" is fulfilled. In an exhilarating (and exhausting) 39-page demonstration of mathematical virtuosity, Knuth shows that stack-based parsing is equivalent to a new and unexpected class of grammars. Knuth calls these LR(k), and provides a parsing algorithm for them.

Knuth's new algorithm might be expected to be "the one to rule them all". Unfortunately, while deterministic and linear, it is not practical -- it requires huge tables well beyond the memory capabilities of the time.

The impracticality of his LR(k) algorithm does not suggest to Knuth that the stack-based model is inappropriate as a model of practical parsing. Instead it suggests to him, and to the field, that the boundary of practical parsing lies in a subclass of the LR(k) grammars.

To be sure, Knuth, in his program for further research[19], does suggest investigation of parsers for superclasses of LR(k). He even describes a new superclass of his own: LR(k,t), which is LR(k) with more aggressive lookahead. But he is clearly unenthusiastic about LR(k,t).[20] It is reasonable to suppose that Knuth is even more negative about the more general approaches that he does not bother to mention.[21]

In any case, those reading Knuth's LR(k) paper focus almost exclusively on his suggestions for research within the stack-based model. These include grammar rewrites; streamlining of the LR(k) tables; and research into LR(k) subclasses. It is LR(k) subclassing that will receive the most attention.[22]

The idea that the solution to the parsing problem must be stack-based is not without foundation. In 1965, the limits of computer technology are severe. For practitioners, any parsing technique that requires much more than a reasonably-sized state machine and a stack is not likely to happen. After all, only four years earlier, stacks were bleeding edge.

The practitioners of 1965 are inclined to believe that, like it or not, they are stuck with stack-based parsing. But why do the theoreticians feel compelled to follow them? The answer is that theoreticians talk themselves into it, using a misleading equivalence based on the behaviorist definition of language.

"Language" as of 1965

Knuth defines language as follows:

The language defined by G is
         { α | S => α and α is a string over T }
namely, the set of all terminal strings derivable from S by using the productions of G as substitution rules.[23]

Here G is a grammar whose start symbol is S and whose set of terminals is T. This is the behaviorist definition of language translated into set-theoretic terms.

Knuth proves, to the satisfaction of the profession, the "equivalence" of LR(k) and DPDA's. LR(k) is a class of grammars and the DPDA model is of languages -- sets of strings. At first glance, this is an "apples and oranges" comparison -- how do you prove the equivalence of a class of languages and a class of grammars?

Knuth does this by reducing the class of DPDA languages and the class of grammars to their lowest common denominator, which is the language. And, of course, the "language" in the usage of Parsing Theory is a set of strings, without consideration of their syntax.

Every grammar, when stripped of its syntax, defines a language. So Knuth compares the language which results from stripping down the LR(k) grammars, to the language of DPDA's. After some very impressive mathematics, Knuth is able to show that the two languages are equivalent.

In theoretical mathematics, of course, you can define "equivalent" however you like. But if the purpose is to suggest limits in practice, you have to be much more careful. And in fact, as Knuth's paper shows, if you equate languages and grammars, you get into a very serious degree of magical thinking. Using the Knuth algorithm,

  • parsing LR(k) grammars for arbitrary k is hopelessly impractical;
  • parsing LR(1) grammars is impractical, but close to the boundary[24]; and
  • parsing LR(0) grammars is very practical.

A problem for the relevance of Knuth's proof of equivalence is that, if you just look at sets of strings without regard to syntax, LR(1) and LR(k) are equivalent. That means that from the sets-of-strings point of view, hopelessly impractical and borderline impractical are the same thing.

Worse, both LR(1) and LR(k) are equivalent to LR(0) for most applications. If you add an explicit end marker to an LR(1) language, which in most applications is easy to do[25], your LR(1) language becomes LR(0). Therefore, for most applications,

LR(k) = LR(1) = LR(0)

This means that, in the world of sets-of-strings, extremely impractical and very practical are usually the same thing.

Clearly the world of sets of strings is a magical one, in which we can easily transport ourselves across the boundary between practical and impractical. We can take visions of a magical world back into the world of practice, but we cannot assume they will be helpful. In that light, it is no surprise that Joop Leo will show how to extend practical parsing well beyond LR(k).[26]

Comments, etc.

I encourage those who want to know more about the story of Parsing Theory to look at my Parsing: a timeline 3.0. In particular, "Timeline 3.0" tells the story of the search for a good LR(k) subclass, and what happened afterwards.

To learn about Marpa, my Earley/Leo-based parsing project, there is the semi-official web site, maintained by Ron Savage. The official, but more limited, Marpa website is my personal one. Comments on this post can be made in Marpa's Google group, or on our IRC channel: #marpa at freenode.net.

Footnotes

1. The well-known Design Patterns book (aka "the Gang of 4 book") has a section on this. The Gang of 4 call Language-Oriented Programming their "Interpreter pattern". That section amply illustrates the main obstacle to use of the pattern -- lack of adequate parsing tools. I talk more about this in my two blog posts on the Interpreter pattern: BNF to AST and The Interpreter Design Pattern.

2. This post takes the form of a timeline, and is intended to be incorporated in my Parsing: a timeline. The earlier entries in this post borrow heavily from a previous blog post.

3. Bloomfield, Leonard, "A set of Postulates for the Science of Language", Language, Vol. 2, No. 3 (Sep., 1926), pp. 153-164.

4. Bloomfield 1926, definition 4 on p. 154.

5. Bloomfield, Leonard. Language. Holt, Rinehart and Winston, 1933, p. 140.

6. Harris, Randy Allen, The Linguistics Wars, Oxford University Press, 1993, pp 31-34, p. 37.

7. The quote is on p. 114 of Chomsky, Noam. "Three models for the description of language." IRE Transactions on information theory, vol. 2, issue 3, September 1956, pp. 113-124. In case there is any doubt Chomsky's "strings" are Bloomfield's utterances, Chomsky also calls his strings, "utterances". For example in Chomsky, Noam, Syntactic Structures, 2nd ed., Mouton de Gruyter, 2002, on p. 15: "Any grammar of a language will project the finite and somewhat accidental corpus of observed utterances to a set (presumably infinite) of grammatical utterances."

8. Chomsky 1956, p. 118, p. 123.

9. Chomsky, Noam. “A Review of B. F. Skinner’s Verbal Behavior”. Language, Volume 35, No. 1, 1959, 26-58. https://chomsky.info/1967____/ accessed on 3 June 2018.

10. See in particular, Section IX of Chomsky 1959.

11. Chomsky, Noam. Topics in the Theory of Generative Grammar. De Gruyter, 1978, p. 20. (The quote occurs in footnote 7 starting on p. 19.)

12. Carpenter, Brian E., and Robert W. Doran. "The other Turing machine." The Computer Journal, vol. 20, issue 3, 1 January 1977, pp. 269-279.

13. Samelson, Klaus, and Friedrich L. Bauer. "Sequentielle formelübersetzung." it-Information Technology 1.1-4 (1959): 176-182.

14. Oettinger, Anthony. "Automatic Syntactic Analysis and the Pushdown Store" Proceedings of Symposia in Applied Mathematics, Volume 12, American Mathematical Society, 1961.

15. Oettinger 1961, p. 106.

16. Oettinger 1961, p. 127.

17. Knuth, Donald E. "On the translation of languages from left to right." Information and Control, Volume 8, Issue 6, December 1965, pp. 607-639. https://ac.els-cdn.com/S0019995865904262/1-s2.0-S0019995865904262-main.pdf?_tid=dcf0f8a0-d312-475e-a559-be7714206374&acdnat=1524066529_64987973992d3a5fffc1b0908fe20b1d, accessed 24 April 2018.

18. Knuth 1965, p. 607, in the abstract.

19. Knuth 1965, pp. 637-639.

20. "Finally, we might mention another generalization of LR(k)" (Knuth 1965, p. 638); and "One might choose to call this left-to-right translation, although we had to back up a finite amount." (p. 639).

21. Knuth's skepticism of more general Chomskyan approaches is suggested by his own plans for his (not yet released) Chapter 12 of the Art of Computer Programming, in which he planned to use pre-Chomskyan bottom-up methods. (See Knuth, Donald E., "The genesis of attribute grammars", Attribute Grammars and Their Applications, Springer, September 1990, p. 3.)

22. The story of the research followup to Knuth's LR(k) paper is told in my Parsing: a timeline 3.0.

23. Knuth 1965, p. 608.

24. Given the capacity of computer memories in 1965, LR(1) was clearly impractical. With the huge computer memories of 2018, that could be reconsidered, but LR(1) is still restrictive and has poor error-handling, and few have looked at the possibility.

25. Some parsing applications, such as those which receive their input "on-line", can not determine the size of their input in advance. For these applications adding an end marker to their input is inconvenient or impossible.

26. Leo, Joop M. I. M. "A general context-free parsing algorithm running in linear time on every LR(k) grammar without using lookahead." Theoretical computer science, Volume 82, Issue 1, 22 May 1991, pp. 165-176. https://www.sciencedirect.com/science/article/pii/030439759190180A, accessed 24 April 2018.

Dave's Free Press: Journal: Ill

Dave's Free Press: Journal: CPANdeps upgrade

Dave's Free Press: Journal: YAPC::Europe 2006 report: day 3

Ocean of Awareness: Marpa and procedural parsing

Procedural parsing

Marpa is an Earley-based parser, and Earley parsers are typically not good at procedural parsing. Many programmers are used to recursive descent (RD), which has been state-of-the-art in terms of its procedural programming capabilities -- it was these capabilities which led to RD's triumph over the now-forgotten Irons algorithm.

Marpa, however, has a parse engine expressly redesigned[1] to handle procedural logic well. In fact, Marpa is better at procedural logic than RD.

A context-sensitive grammar

Marpa parses all LR-regular grammars in linear time, so the first challenge is to find a grammar that illustrates a need for procedural logic, even when Marpa is used. The following is the canonical example of a grammar that is context-sensitive, but not context-free:

          a^n . b^n . c^n : n >= 1
    
I will call this the "ABC grammar". It is a sequence of a's, b's, and c's, in alphabetical order, where the character counts are all equal to each other and greater than zero.

The ABC "grammar" is really a counting problem more than a natural parsing problem, and parsing is not the fastest or easiest way to solve it. Three tight loops, with counters, would do the same job nicely, and would be much faster. But I chose the ABC grammar for exactly this reason. It is simple in itself, but it is tricky when treated as a parsing problem.[2]
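
For comparison, here is the three-loops-with-counters solution just mentioned (a throwaway sketch, not part of the Marpa example that follows):

    use strict;
    use warnings;

    # The non-parsing solution: three tight loops with counters.
    sub is_abc {
        my ($string) = @_;
        my $pos   = 0;
        my %count = ( a => 0, b => 0, c => 0 );
        for my $letter ( 'a', 'b', 'c' ) {
            while ( $pos < length($string)
                && substr( $string, $pos, 1 ) eq $letter )
            {
                $count{$letter}++;
                $pos++;
            }
        }
        return $pos == length($string)    # nothing left over
            && $count{a} >= 1             # n >= 1
            && $count{a} == $count{b}
            && $count{b} == $count{c};
    }

    print is_abc('aabbcc') ? "match\n" : "no match\n";    # match
    print is_abc('aabbc')  ? "match\n" : "no match\n";    # no match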

In picking the strategy below, I opted for one that illustrates a nice subset of Marpa's procedural parsing capabilities. Full code is on-line, and readers are encouraged to "peek ahead".

Step 1: the syntax

Our strategy will be to start with a context-free syntax, and then extend it with procedural logic. Here is the context-free grammar:


    lexeme default = latm => 1
    :default ::= action => [name,start,length,values]
    S ::= prefix ABC trailer
    ABC ::= ABs Cs
    ABs ::= A ABs B | A B
    prefix ::= A*
    trailer ::= C_extra*
    A ~ 'a'
    B ~ 'b'
    :lexeme ~ Cs pause => before event => 'before C'
    Cs ~ 'c' # dummy -- procedural logic reads <Cs>
    C_extra ~ 'c'
    

The first line is boiler-plate: It turns off a default which was made pointless by a later enhancement to Marpa::R2. Marpa::R2 is stable, and backward-compatibility is a very high priority.


    :default ::= action => [name,start,length,values]
    

We will produce a parse tree. The second line defines its format -- each node is an array whose elements are, in order, the node name, its start position, its length and its child nodes.


    S ::= prefix ABC trailer
    

The symbol <ABC> is our "target" -- the counted a's, b's, and c's. To make things a bit more interesting, and to make the problem more like a parsing problem instead of a counting problem, we allow a prefix of a's and a trailer of c's.


    ABC ::= ABs Cs
    

We divide the <ABC> target into two parts: <ABs>, which contains the a's, and b's; and <Cs>, which contains the c's.

The string


    a^n . b^n
    

is context free, so that we can handle it without procedural logic, as follows:


    ABs ::= A ABs B | A B
    

The line above recognizes a non-empty string of a's, followed by an equal number of b's.


    prefix ::= A*
    trailer ::= C_extra*
    

As stated above, <prefix> is a series of a's and <trailer> is a series of c's.


    A ~ 'a'
    B ~ 'b'
    

Marpa::R2 has a separate lexical and syntactic phase. Here we define our lexemes. The first two are simple enough: <A> is the character "a"; and <B> is the character "b".


    :lexeme ~ Cs pause => before event => 'before C'
    Cs ~ 'c' # dummy -- procedural logic reads <Cs>
    C_extra ~ 'c'
    

For the character "c", we need procedural logic. As hooks for procedural logic, Marpa allows a full range of events. Events can occur on prediction and completion of symbols; when symbols are nulled; before lexemes; and after lexemes. The first line in the above display declares a "before lexeme" event on the symbol <Cs>. The name of the event is "before C".

The second line is a dummy entry, which is needed to allow the "before C" event to trigger. The entry says that <Cs> is a single character "c". This is false -- <Cs> is a series of one or more c's, which needs to be counted. But when the "before C" event triggers, the procedural logic will make things right.

The third line defines <C_extra>, which is another lexeme for the character "c". We have two different lexemes for the character c, because we want some c's (those in the target) to trigger events; and we want other c's (those in the trailer) not to trigger events, but to be consumed by Marpa directly.

The procedural logic

At this point, we have solved part of the problem with context-free syntax, and set up a Marpa event named "before C", which will solve the rest of it.


    my $input_length = length ${$input};
    for (
        my $pos = $recce->read($input);
        $pos < $input_length;
        $pos = $recce->resume()
      )
    {
      ... Process events ...
    }
    

Processing of events takes place inside a Marpa read loop. This is initialized with a read() method, and is continued with a resume() method. The read() and resume() methods both return the current position in the input. If the current position is end-of-input, we are done. If not, we were interrupted by an event, which we must process.


    Process events

    EVENT:
      for (
	  my $event_ix = 0 ;
	  my $event    = $recce->event($event_ix) ;
	  $event_ix++
	)
      {
	  my $name = $event->[0];
	  if ( $name eq 'before C' ) {
	      ... Process "before C" event ...
	  }
	  die qq{Unexpected event: name="$name"};
      }
    

In this application, only one event can occur at any location, so the above loop is "overkill". It loops through the events, one by one. The event method returns a reference to an array of event data. The only element we care about is the event name. In fact, if we weren't being careful about error checking, we would not even care about the event name, since there can be only one.

If, as expected, the event name is "before C", we process it. In any other case, we die with an error message.


    Process "before C" event

    my ( $start, $length ) = $recce->last_completed_span('ABs');
    my $c_length = ($length) / 2;
    my $c_seq = ( 'c' x $c_length );
    if ( substr( ${$input}, $pos, $c_length ) eq $c_seq ) {
	$recce->lexeme_read( 'Cs', $pos, $c_length, $c_seq );
	next EVENT;
    }
    die qq{Too few C's};
    

This is the core part of our procedural logic, where we have a "before C" event. We must

  • determine the right number of c characters;
  • check that the input has the right number of c characters;
  • put together a lexeme to feed the Marpa parser; and
  • return control to Marpa.
There is a lot going on, and some of Marpa's most powerful capabilities for assisting procedural logic are shown here. So we will go through the above display in detail.

Left-eidetic


    my ( $start, $length ) = $recce->last_completed_span('ABs');
    my $c_length = ($length) / 2;
    

Marpa claims to be "left-eidetic", that is, to have full knowledge of the parse so far, and to make this knowledge available to the programmer. How does a programmer cash in on this promise?

Of course, there is a fully general interface, which allows you to go through the Earley tables and extract the information in any form necessary. But another, more convenient interface, fits our purposes here. Specifically,

  • we want to determine how many c characters we are looking for.
  • How many c characters we are looking for depends on the number of a and b characters that we have already seen in the target.
  • The a and b characters that we have already seen in the target are in the <ABs> symbol instance.
  • So, what we want to know is the length of the most recent <ABs> symbol instance.

Marpa has a last_completed_span() method, and that is just what we need. This finds the most recent instance of a symbol. (If there had been more than one most recent instance, it would have found the longest.) The last_completed_span() method returns the start of the symbol instance (which we do not care about) and its length. The desired number of c characters, $c_length, is half the length of the <ABs> instance.

External parsing


    my $c_seq = ( 'c' x $c_length );
    if ( substr( ${$input}, $pos, $c_length ) eq $c_seq ) { ... }
    

Marpa allows external parsing. You can pause Marpa, as we have done, and hand control over to another parser -- including another instance of Marpa.

Here external parsing is necessary to make our parser context-sensitive, but the external parser does not have to be fancy. All it needs to do is some counting -- not hard, but something that a context-free grammar cannot do.

$pos is the current position in the input, as returned by the read() or resume() method in the outer loop. Our input is the string referred to by $input. We just calculated $c_length as the number of c characters required. The above code checks to see that the required number of c characters is at $pos in the input.

Communicating with Marpa


	$recce->lexeme_read( 'Cs', $pos, $c_length, $c_seq );
    

Our external logic is doing the parsing, but we need to let Marpa know what we are finding. We do this with the lexeme_read() method. lexeme_read() needs to know what symbol we are reading (Cs in our case); and its value ($c_seq in our case).

Marpa requires that every symbol be tied in some way to the input. The tie-in is only for error reporting, and it can be hack-ish or completely artificial, if necessary. In this application, our symbol instance is tied into the input in a very natural way -- it is the stretch of the input that we compared to $c_seq in the display before last. We therefore tell Marpa that the symbol is at $pos in the input, and of length $c_length.

Passing control back to Marpa


	next EVENT;
    

External parsing can go on quite a long time. In fact, an external parser never has to hand control back to Marpa. But in this case, we are done very quickly.

We ask for the next iteration of the EVENT loop. (In this code, there will not be a next iteration, unless there is an error.) Once done, the EVENT loop will hand control over to the outer loop. The outer loop will call the resume() method to return control back to Marpa.

The code, comments, etc.

The full code for this example is on-line. There is a lot more to Marpa, including more facilities for adding procedural logic to your Marpa parsers. To learn more about Marpa, a good first stop is the semi-official web site, maintained by Ron Savage. The official, but more limited, Marpa website is my personal one. Comments on this post can be made in Marpa's Google group, or on our IRC channel: #marpa at freenode.net.

Footnotes

1. To handle procedural logic well, an Earley engine needs to complete its Earley sets in strict order -- that is, Earley set N cannot change after work on Earley set N+1 has begun. I have not looked at every Earley parse engine, and some may have had this strict-sequencing property. And many of the papers are agnostic about the order of operations. But Marpa is the first Earley parser to recognize and exploit strict-sequencing as a feature.

2. The ABC grammar, in fact, is not all that easy or natural to describe even with a context-sensitive phrase structure description. A solution is given on Wikipedia: https://en.wikipedia.org/wiki/Context-sensitive_grammar#Examples.

Header image by Tambako the Jaguar. Some rights reserved.