Perl Hacks: Selenium and Perl

Some of you might remember the lightning talk I gave at the London Perl Workshop last year (it’s available on YouTube, I’ll wait if you want to watch it). In it, I said I planned to resurrect the Perl School brand, using it to publish Perl ebooks. One book, Perl Taster, was already available and I had plans to write and publish several more. Those plans are still ongoing…

Also in the talk, I asked if anyone else wanted to write a book for the series. I offered to help out with the hard parts of getting your text into the Amazon system (it’s actually nowhere near as hard as you might think). Three people approached me later to discuss the possibility of writing something, but only one followed through with something more concrete. That was John Davies, who has been a regular attendee at London Perl Mongers for a pretty long time. At the LPW, John had helped Martin Berends to run a training course on using Selenium with Perl. As part of that help, John had written some notes on the course which had been distributed to the attendees. John wondered if those notes would be worth publishing as a Perl School ebook. I said that it was certainly worth pursuing the idea.

Over the last few months, John has expanded his original notes considerably and I’ve been doing the work needed to convert his input into an ebook. And I’m happy to say that the book was published on Amazon yesterday. It’s called Selenium and Perl and you should be able to find it on your local Kindle store. If you want to test your Perl web applications using Selenium, then I hope that you find it useful.

This is the first time I’ve edited someone else’s work and converted it into an ebook. I think the process has gone well (perhaps someone should ask John for his opinion!).

But I’m confident enough of the process to renew the offer I made at the LPW. If you’re interested in writing an ebook as part of the Perl School range, please get in touch and we can discuss it.

The post Selenium and Perl appeared first on Perl Hacks.

Perl Foundation News: Maintaining the Perl 5 Core (Dave Mitchell): Grant Report for April 2018

This is a monthly report by Dave Mitchell on his grant under the Perl 5 Core Maintenance Fund. We thank the TPF sponsors for making this grant possible.

I spent the last month mainly working on blocking issues for the 5.28.0 release.

SUMMARY:

       4:54 File::Copy test failure under quadmath
       5:00 RT #132863 BBC LWP-Protocol-https-6.07
       0:46 RT #132955 USECPLUSPLUS build broken in 5.27 blead and all 5.26 stables
       8:00 [perl #131577] heap-use-after-free (READ of size 1)
      12:36 [perl #131648] Out-of-bounds read in S_regmatch
       2:00 look at failing smokes
       0:30 process p5p mailbox
      11:28 review 5.28 blocker tickets
       0:30 silence truncated_utf8.t TODO passing
      ------
      45:44 TOTAL (HH::MM)

      237.1 weeks
      3076.6 total hours
      13.0 average hours per week

There are 56.4 hours left on the grant

Ocean of Awareness: Parsers and Useful Power

What do parser users want? What makes a parser[1] successful? In this post I will look at one aspect of that question, in light of an episode in the history of parsing.

Irons 1961

The first paper fully describing a parser was Irons 1961[2]. The Irons parser was what is called "general", meaning that it can parse all of the "context-free grammars". That makes it far more powerful than most parsers in practical use today.

But the Irons algorithm was not always fast in the general case. Irons 1961 used backtracking to achieve its power, so it would go exponential for many useful grammars.

Among the grammars Irons 1961 could not parse quickly were those containing the all-important arithmetic expressions. Irons 1961 gave way to recursive descent.

Recursive descent (RD), in its pure form, could not parse arithmetic expressions at all, but it could be customized with procedural code. That is, it could call specialized parsers which were reliably fast for specific sections of the input. The Irons parser was declarative, and not easy to customize.
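A minimal sketch of this kind of customization, in Python and with every name hypothetical (it is not taken from any historical parser): a pure RD routine for a rule like "expr ::= expr + term" would call itself forever before consuming any input, so the hand-written version below replaces the left recursion with a loop.

```python
import re

def tokenize(text):
    # Split into integers and single-character operators/parentheses.
    return re.findall(r"\d+|[-+*/()]", text)

class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self):
        tok = self.peek()
        self.pos += 1
        return tok

    def expr(self):
        # expr ::= term (('+' | '-') term)*  ; the loop stands in for left recursion
        value = self.term()
        while self.peek() in ("+", "-"):
            op = self.take()
            rhs = self.term()
            value = value + rhs if op == "+" else value - rhs
        return value

    def term(self):
        # term ::= factor (('*' | '/') factor)*
        value = self.factor()
        while self.peek() in ("*", "/"):
            op = self.take()
            rhs = self.factor()
            value = value * rhs if op == "*" else value / rhs
        return value

    def factor(self):
        # factor ::= NUMBER | '(' expr ')'
        tok = self.take()
        if tok == "(":
            value = self.expr()
            if self.take() != ")":
                raise SyntaxError("expected ')'")
            return value
        if tok is None or not tok.isdigit():
            raise SyntaxError(f"unexpected token {tok!r}")
        return int(tok)

def evaluate(text):
    parser = Parser(tokenize(text))
    value = parser.expr()
    if parser.peek() is not None:
        raise SyntaxError(f"unexpected token {parser.peek()!r}")
    return value
```

The loops in expr() and term() are the procedural part: they are ordinary code rather than grammar rules, and they give left-associative operators in a single left-to-right pass.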

Raw power versus useful power

The contest between Irons parsing and recursive descent took place before the theory for analyzing algorithms was fully formed.[3] In retrospect, we can say that, except in specialized uses, an acceptable parser for most practical uses must be linear or quasi-linear.[4] That is, the "useful power" of a parser is the class of grammars that it will parse in quasi-linear time.[5]

Useful power turns out to be more important, in practice, than raw power. Recursive descent won out over the Irons algorithm because, while the Irons algorithm had vastly more raw power, RD had slightly more "useful power".

It is nice to have raw power as well -- it means an algorithm can take on some specialized tasks. And raw power provides a kind of "soft failure" debugging mode for grammars with, for example, unintended ambiguities. But, in the eyes of the programming community, the more important measure of a parser is its useful power -- the class of grammars that it will parse at quasi-linear speed.

Stating the obvious?

That useful power is more important than raw power may seem, in retrospect, obvious. But in fact, it remains a live issue. In practice raw power and useful power are often confused. The parsing literature is not always as helpful as it could be: it can be hard to determine what the useful power of an algorithm is.

And the Irons experiment with raw power is often repeated, in hopes of a different result. Very often, a new algorithm is a hybrid of two others: an algorithm with a lot of raw power, but which can go quadratic or worse; and a fast algorithm which lacks power. When the power of the fast algorithm fails, the hybrid algorithm switches over to the algorithm with raw power.

It is a sort of cross-breeding of algorithms. The hope is that the hybrid algorithm has the best features of each of its parents. This works a lot better in botany than it does in parsing. Once you have a successful cross in a plant, you can breed from the successful hybrid and expect good things to happen. In botany, the individual crosses can have an extremely high failure rate, and cross-breeding can still succeed. But it's different when you cross algorithms: Even after you've succeeded with one parse, the next parse from your hybrid is a fresh new toss of the dice.

References, comments, etc.

To learn about my own parsing project, Marpa[6], there is the semi-official web site, maintained by Ron Savage. The official, but more limited, Marpa website is my personal one. Comments on this post can be made in Marpa's Google group, or on our IRC channel: #marpa at freenode.net.

Footnotes

1. By "parser" in this post, I will mean a programmer's most powerful toolbox parser -- what might be called the "flagship" parser. No parser will ever be the right one for all uses.

2. For the reference to Irons, see V3 of my "Parsing: A Timeline". The "Timeline" contains the background material for this post.

3. Even the term "analysis of algorithms" did not exist until 1969: see Knuth, "Recent News".

4. For more about "linear" and "quasi-linear", including definitions, see V3 of my "Parsing: A Timeline", in particular its 'Term: linear' section.

5. While it is clearly the consensus among practitioners and theoreticians that, for parsing, practical time is quasi-linear or better, there are those who argue that worse-than-quasi-linear parsers are often the right ones for the job, and that research on them has been unwisely neglected. The dissenters are not without a case: For example, in natural language, while sentences are in theory infinite in length, in practice their average size is fixed. And while very long difficult-to-parse sentences do occur in some texts, such as older ones, it is normal for a human reader to have to spend extra time on them. So it may be unreasonable to insist that a parsing algorithm be quasi-linear in this application.

6. Marpa's useful power is LR-regular, which properly contains every class of grammar in practical use: regular expressions, LALR, LL(k) for all k, LR(k) for all k, and the LL-regular grammars.

Dave's Free Press: Journal: Module pre-requisites analyser

Dave's Free Press: Journal: YAPC::Europe 2007 report: day 2

Dave's Free Press: Journal: CPANdeps

Dave's Free Press: Journal: Thanks, Yahoo!

Dave's Free Press: Journal: YAPC::Europe 2007 travel plans

Dave's Free Press: Journal: Perl isn't dieing

Dave's Free Press: Journal: Wikipedia handheld proxy

Dave's Free Press: Journal: YAPC::Europe 2007 report: day 3

Dave's Free Press: Journal: POD includes

Dave's Free Press: Journal: Bryar security hole

Dave's Free Press: Journal: cgit syntax highlighting

Dave's Free Press: Journal: Devel::CheckLib can now check libraries' contents

Dave's Free Press: Journal: I Love Github

Dave's Free Press: Journal: CPAN Testers' CPAN author FAQ

Ocean of Awareness: What parser do birds use?

"Here we provide, to our knowledge, the first unambiguous experimental evidence for compositional syntax in a non-human vocal system." -- "Experimental evidence for compositional syntax in bird calls", Toshitaka N. Suzuki, David Wheatcroft & Michael Griesser Nature Communications 7, Article number: 10986

In this post I look at a subset of the language of the Japanese great tit, also known as Parus major. The above cited article presents evidence that bird brains can parse this language. What about standard modern computer parsing methods? Here is the subset -- probably a tiny one -- of the language actually used by Parus major.

      S ::= ABC
      S ::= D
      S ::= ABC D
      S ::= D ABC
    

Classifying the Parus major grammar

Grammophone is a very handy new tool for classifying grammars. Its own parser is somewhat limited, so that it requires a period to mark the end of a rule. The above grammar is in Marpa's SLIF format, which is smart enough to use the "::=" operator to spot the beginning and end of rules, just as the human eye does. Here's the same grammar converted into a form acceptable to Grammophone:

      S -> ABC .
      S -> D .
      S -> ABC D .
      S -> D ABC .
    

Grammophone tells us that the Parus major grammar is not LL(1), but that it is LALR(1).

What does this mean?

LL(1) is the class of grammar parseable by top-down methods: it's the best class for characterizing most parsers in current use, including recursive descent, PEG, and Perl 6 grammars. All of these parsers fall short of dealing with the Parus major language.
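One way to see why the grammar as written is not LL(1) is to group the alternatives for S by their first token. The Python sketch below does only that; it is a simplification that works because every right-hand side here starts with a terminal, and it is not how Grammophone itself classifies grammars. The names RULES and ll1_conflicts are invented for the illustration.

```python
from collections import defaultdict

# The Parus major subset; ABC and D are treated as single tokens (terminals).
RULES = [
    ("S", ["ABC"]),
    ("S", ["D"]),
    ("S", ["ABC", "D"]),
    ("S", ["D", "ABC"]),
]

def ll1_conflicts(rules):
    """Group each nonterminal's alternatives by their first terminal.

    Valid only for grammars like this one, where every right-hand side
    begins with a terminal, so FIRST of an alternative is its first symbol.
    """
    by_first = defaultdict(list)
    for lhs, rhs in rules:
        by_first[(lhs, rhs[0])].append(rhs)
    # A conflict is any lookahead token shared by two or more alternatives.
    return {key: alts for key, alts in by_first.items() if len(alts) > 1}

conflicts = ll1_conflicts(RULES)
for (lhs, token), alts in conflicts.items():
    print(f"{lhs}: on lookahead {token!r}, cannot choose between {alts}")
```

With one token of lookahead, a top-down parser that sees ABC cannot tell "S ::= ABC" from "S ::= ABC D", and the same holds for D.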

LALR(1) is probably best known from its implementations in bison and yacc. While able to handle this subset of Parus's language, LALR(1) has its own, very strict, limits. Whether LALR(1) could handle the full complexity of the Parus language is a serious question. But it's a question that in practice would probably not arise. LALR(1) has horrible error handling properties.

When the input is correct and within its limits, an LALR-driven parser is fast and works well. But if the input is not perfectly correct, LALR parsers produce no useful analysis of what went wrong. If Parus hears "d abc d", a parser like Marpa, on the other hand, can produce something like this:

# * String before error: abc d\s
# * The error was at line 1, column 7, and at character 0x0064 'd', ...
# * here: d
    

Parus uses its language in predatory contexts, and one can assume that a Parus with a preference for parsers whose error handling is on an LALR(1) level will not be keeping its alleles in the gene pool for very long.

References, comments, etc.

Those readers content with sub-Parus parsing methods may stop reading here. Those with greater parsing ambitions, however, may wish to learn more about Marpa. A Marpa test script for parsing the Parus subset is in a Github gist. Marpa has a semi-official web site, maintained by Ron Savage. The official, but more limited, Marpa website is my personal one. Comments on this post can be made in Marpa's Google group, or on our IRC channel: #marpa at freenode.net.

Ocean of Awareness: Version 3 of "Parsing: a timeline"

My most popular blog posts by far have been my two versions of "Parsing: a timeline". I have just created a 3rd version, which has so many changes that it might be considered a new work. The new version is less Marpa-centric and several times as long. It covers new topics, including combinator and monadic parsing, and operator expression parsing. And sources are now provided for all material.

References, comments, etc.

For more about Marpa, my own parsing project, there is the semi-official web site, maintained by Ron Savage. The official, but more limited, Marpa website is my personal one. Comments on this post can be made in Marpa's Google group, or on our IRC channel: #marpa at freenode.net.

Dave's Free Press: Journal: Thankyou, Anonymous Benefactor!

Dave's Free Press: Journal: Number::Phone release

Dave's Free Press: Journal: Palm Treo call db module

Dave's Free Press: Journal: Ill

Dave's Free Press: Journal: CPANdeps upgrade

Ocean of Awareness: Introduction to Marpa Book in progress

What follows is a summary of the features of the Marpa algorithm, followed by a discussion of potential applications. It refers to itself as a "monograph", because it is a draft of part of the introduction to a technical monograph on the Marpa algorithm. I hope the entire monograph will appear in a few weeks.

The Marpa project

The Marpa project was intended to create a practical and highly available tool to generate and use general context-free parsers. Tools of this kind had long existed for LALR and regular expressions. But, despite an encouraging academic literature, no such tool had existed for context-free parsing. The first stable version of Marpa was uploaded to a public archive on Solstice Day 2011. This monograph describes the algorithm used in the most recent version of Marpa, Marpa::R2. It is a simplification of the algorithm presented in my earlier paper.

A proven algorithm

While the presentation in this monograph is theoretical, the approach is practical. The Marpa::R2 implementation has been widely available for some time, and has seen considerable use, including in production environments. Many of the ideas in the parsing literature satisfy theoretical criteria, but in practice turn out to face significant obstacles. An algorithm may be as fast as reported, but may turn out not to allow adequate error reporting. Or a modification may speed up the recognizer, but require additional processing at evaluation time, leaving no advantage to compensate for the additional complexity.

In this monograph, I describe the Marpa algorithm as it was implemented for Marpa::R2. In many cases, I believe there are better approaches than those I have described. But I treat these techniques, however solid their theory, as conjectures. Whenever I mention a technique that was not actually implemented in Marpa::R2, I will always explicitly state that that technique is not in Marpa as implemented.

Features

General context-free parsing

As implemented, Marpa parses all "proper" context-free grammars. The proper context-free grammars are those which are free of cycles, unproductive symbols, and inaccessible symbols. Worst case time bounds are never worse than those of Earley's algorithm, and therefore never worse than O(n**3).

Linear time for practical grammars

Currently, the grammars suitable for practical use are thought to be a subset of the deterministic context-free grammars. Using a technique discovered by Joop Leo, Marpa parses all of these in linear time. Leo's modification of Earley's algorithm is O(n) for LR-regular grammars. Leo's modification also parses many ambiguous grammars in linear time.

Left-eidetic

The original Earley algorithm kept full information about the parse --- including partial and fully recognized rule instances --- in its tables. At every parse location, before any symbols are scanned, Marpa's parse engine makes available its information about the state of the parse so far. This information is in useful form, and can be accessed efficiently.

Recoverable from read errors

When Marpa reads a token which it cannot accept, the error is fully recoverable. An application can try to read another token. The application can do this repeatedly as long as none of the tokens are accepted. Once the application provides a token that is accepted by the parser, parsing will continue as if the unsuccessful read attempts had never been made.

Ambiguous tokens

Marpa allows ambiguous tokens. These are often useful in natural language processing where, for example, the same word might be a verb or a noun. Use of ambiguous tokens can be combined with recovery from rejected tokens so that, for example, an application could react to the rejection of a token by reading two others.

Using the features

Error reporting

An obvious application of left-eideticism is error reporting. Marpa's abilities in this respect are ground-breaking. For example, users typically regard an ambiguity as an error in the grammar. Marpa, as currently implemented, can detect an ambiguity and report specifically where it occurred and what the alternatives were.

Event driven parsing

As implemented, Marpa::R2 allows the user to define "events". Events can be defined that trigger when a specified rule is complete, when a specified rule is predicted, when a specified symbol is nulled, when a user-specified lexeme has been scanned, or when a user-specified lexeme is about to be scanned. A mid-rule event can be defined by adding a nulling symbol at the desired point in the rule, and defining an event which triggers when the symbol is nulled.

Ruby slippers parsing

Left-eideticism, efficient error recovery, and the event mechanism can be combined to allow the application to change the input in response to feedback from the parser. In traditional parser practice, error detection is an act of desperation. In contrast, Marpa's error detection is so painless that it can be used as the foundation of new parsing techniques.

For example, if a token is rejected, the lexer is free to create a new token in the light of the parser's expectations. This approach can be seen as making the parser's "wishes" come true, and I have called it "Ruby Slippers Parsing".

One use of the Ruby Slippers technique is to parse with a clean but oversimplified grammar, programming the lexical analyzer to make up for the grammar's short-comings on the fly. As part of Marpa::R2, the author has implemented an HTML parser, based on a grammar that assumes that all start and end tags are present. Such an HTML grammar is too simple even to describe perfectly standard-conformant HTML, but the lexical analyzer is programmed to supply start and end tags as requested by the parser. The result is a simple and cleanly designed parser that parses very liberal HTML and accepts all input files, in the worst case treating them as highly defective HTML.
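The idea can be sketched with a toy in Python. Everything here is hypothetical: ToyParser is a stand-in for a parser that can report what it expects, and neither it nor ruby_slippers reflects Marpa's actual API or the Marpa::R2 HTML parser. The strict "grammar" insists that every start tag has an end tag; when a token is rejected, the driver asks the parser what it wants and supplies a missing end tag.

```python
class ToyParser:
    """A hypothetical stand-in for a parser that can report its expectations."""

    def __init__(self):
        self.open = []      # stack of currently open element names
        self.accepted = []  # every token accepted so far, real or supplied

    def expected(self):
        # What the strict grammar would accept next.
        if not self.open:
            return ["<ul>"]
        if self.open[-1] == "ul":
            return ["<li>", "</ul>"]
        return ["text", "<ul>", "</li>"]  # inside <li>

    def read(self, token):
        """Try to read one token; return False, changing nothing, if rejected."""
        kind = token if token.startswith("<") else "text"
        if kind not in self.expected():
            return False
        if kind.startswith("</"):
            self.open.pop()
        elif kind.startswith("<"):
            self.open.append(kind[1:-1])
        self.accepted.append(token)
        return True

def ruby_slippers(tokens):
    parser = ToyParser()
    for token in tokens:
        while not parser.read(token):
            # Rejected: see whether the parser is wishing for an end tag.
            wished = [t for t in parser.expected() if t.startswith("</")]
            if not wished:
                raise SyntaxError(f"cannot recover at {token!r}")
            parser.read(wished[0])  # grant the wish, then retry the token
    while parser.open:              # end of input: close whatever is still open
        parser.read(f"</{parser.open[-1]}>")
    return parser.accepted
```

Given the input tokens <ul>, <li>, one, <li>, two, the second <li> is rejected, the driver supplies </li>, and the retry succeeds. A rejected read leaves the toy parser's state untouched, which is what makes the retry safe.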

Ambiguity as a language design technique

In current practice, ambiguity is avoided in language design. This is very different from the practice in the languages humans choose when communicating with each other. Human languages exploit ambiguity in order to design highly flexible, powerfully expressive languages. For example, the language of this monograph, English, is notoriously ambiguous.

Ambiguity of course can present a problem. A sentence in an ambiguous language may have undesired meanings. But note that this is not a reason to ban potential ambiguity --- it is only a problem with actual ambiguity.

Syntax errors, for example, are undesired, but nobody tries to design languages to make syntax errors impossible. A language in which every input was well-formed and meaningful would be cumbersome and even dangerous: all typos in such a language would be meaningful, and the parser would never warn the user about errors, because there would be no such thing.

With Marpa, ambiguity can be dealt with in the same way that syntax errors are dealt with in current practice. The language can be designed to be ambiguous, but any actual ambiguity can be detected and reported at parse time. This exploits Marpa's ability to report exactly where and what the ambiguity is. Marpa::R2's own parser description language, the SLIF, uses ambiguity in this way.

Auto-generated languages

In 1973, Čulik and Cohen pointed out that the ability to efficiently parse LR-regular languages opens the way to auto-generated languages. In particular, Čulik and Cohen note that a parser which can parse any LR-regular language will be able to parse a language generated using syntax macros.

Second order languages

In the literature, the term "second order language" is usually used to describe languages with features which are useful for second-order programming. True second-order languages --- languages which are auto-generated from other languages --- have not been seen as practical, since there was no guarantee that the auto-generated language could be efficiently parsed.

With Marpa, this barrier is lifted. As an example, Marpa::R2's own parser description language, the SLIF, allows "precedenced rules". Precedenced rules are specified in an extended BNF. The BNF extensions allow precedence and associativity to be specified for each RHS.

Marpa::R2's precedenced rules are implemented as a true second order language. The SLIF representation of the precedenced rule is parsed to create a BNF grammar which is equivalent, and which has the desired precedence. Essentially, the SLIF does a standard textbook transformation. The transformation starts with a set of rules, each of which has a precedence and an associativity specified. The result of the transformation is a set of rules in pure BNF. The SLIF's advantage is that it is powered by Marpa, and therefore the SLIF can be certain that the grammar that it auto-generates will parse in linear time.
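A rough Python sketch of a textbook transformation of this kind follows. It is an illustration under stated assumptions, not the SLIF's actual code: the function name, the input format (levels[0] binds tightest, each alternative tagged "left", "right" or "group") and the generated symbol names E_0, E_1, ... are all invented here, and the tightest level is assumed to hold only operands and "group" alternatives.

```python
def precedence_to_bnf(symbol, levels):
    """Turn precedence levels into plain BNF rules.

    levels[0] binds tightest; each alternative is (assoc, rhs), where rhs is
    a list of symbols and occurrences of `symbol` mark the operands.
    """
    top = len(levels) - 1

    def name(i):
        return f"{symbol}_{i}"

    rules = [(symbol, [name(top)])]          # start at the loosest level
    for i in range(top, -1, -1):
        tighter = name(max(i - 1, 0))
        if i > 0:
            rules.append((name(i), [tighter]))   # fall through to tighter level
        for assoc, rhs in levels[i]:
            spots = [k for k, s in enumerate(rhs) if s == symbol]
            new_rhs = list(rhs)
            for k in spots:
                if assoc == "group":
                    new_rhs[k] = name(top)       # e.g. inside parentheses
                elif (assoc == "left" and k == spots[0]) or \
                     (assoc == "right" and k == spots[-1]):
                    new_rhs[k] = name(i)         # the side that may recurse
                else:
                    new_rhs[k] = tighter
            rules.append((name(i), new_rhs))
    return rules

LEVELS = [
    [("group", ["Number"]), ("group", ["(", "E", ")"])],
    [("right", ["E", "**", "E"])],
    [("left", ["E", "*", "E"]), ("left", ["E", "/", "E"])],
    [("left", ["E", "+", "E"]), ("left", ["E", "-", "E"])],
]

for lhs, rhs in precedence_to_bnf("E", LEVELS):
    print(lhs, "::=", " ".join(rhs))
```

Note that nothing in the transformation needs to know which symbols are operators or how many operands an alternative has: it only looks at where the precedenced symbol occurs in each right-hand side.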

Notationally, Marpa's precedenced rules are an improvement over similar features in LALR-based parser generators like yacc or bison. In the SLIF, there are two important differences. First, in the SLIF's precedenced rules, precedence is generalized, so that it does not depend on the operators: there is no need to identify operators, much less class them as binary, unary, etc. This more powerful and flexible precedence notation allows the definition of multiple ternary operators, and multiple operators with arity above three.

Second, and more important, a SLIF user is guaranteed to get exactly the language that the precedenced rule specifies. The user of the yacc equivalent must hope their syntax falls within the limits of LALR.

References, comments, etc.

Marpa has a semi-official web site, maintained by Ron Savage. The official, but more limited, Marpa website is my personal one. Comments on this post can be made in Marpa's Google group, or on our IRC channel: #marpa at freenode.net.

Dave's Free Press: Journal: YAPC::Europe 2006 report: day 3

Dave's Free Press: Journal: Graphing tool

Dave's Free Press: Journal: Travelling in time: the CP2000AN

Dave's Free Press: Journal: XML::Tiny released

Dave's Free Press: Journal: YAPC::Europe 2007 report: day 1

Ocean of Awareness: Parsing: an expanded timeline

The fourth century BCE: In India, Panini creates a sophisticated description of the Sanskrit language, exact and complete, and including pronunciation. Sanskrit could be recreated using nothing but Panini's grammar. Panini's grammar is probably the first formal system of any kind, predating Euclid. Even today, nothing like it exists for any other natural language of comparable size or corpus. Panini is the object of serious study today. But in the 1940's and 1950's Panini is almost unknown in the West. His work has no direct effect on the other events in this timeline.

1943: Emil Post defines and studies a formal rewriting system using productions. With this, the process of reinventing Panini in the West begins.

1948: Claude Shannon publishes the foundation paper of information theory. Andrey Markov's finite state processes are used heavily.

1952: Grace Hopper writes a linker-loader and describes it as a "compiler". She seems to be the first person to use this term for a computer program. Hopper uses the term "compiler" in its original sense: "something or someone that brings other things together".

1954: At IBM, a team under John Backus begins working on the language which will be called FORTRAN. The term "compiler" is still being used in Hopper's looser sense, instead of its modern one. In particular, there is no implication that the output of a "compiler" is ready for execution by a computer. The output of one 1954 "compiler", for example, produces relative addresses, which need to be translated by hand before a machine can execute them.

1955: Noam Chomsky is awarded a Ph.D. in linguistics and accepts a teaching post at MIT. MIT does not have a linguistics department and Chomsky, in his linguistics course, is free to teach his own approach, highly original and very mathematical.

1956: Chomsky publishes the paper which is usually considered the foundation of Western formal language theory. The paper advocates a natural language approach that involves

  • a bottom layer, using Markov's finite state processes;
  • a middle, syntactic layer, using context-free grammars and context-sensitive grammars; and
  • a top layer, which involves mappings or "transformations" of the output of the syntactic layer.

These layers resemble, and will inspire, the lexical, syntactic and AST transformation phases of modern parsers. For finite state processes, Chomsky acknowledges Markov. The other layers seem to be Chomsky's own formulations -- Chomsky does not cite Post's work.

1957: Stephen Kleene discovers regular expressions, a very handy notation for Markov's processes. Regular expressions turn out to describe exactly the mathematical objects being studied as finite state automata, as well as some of the objects being studied as neural nets.

1957: Noam Chomsky publishes Syntactic Structures, one of the most influential books of all time. The orthodoxy in 1957 is structural linguistics which argues, with Sherlock Holmes, that "it is a capital mistake to theorize in advance of the facts". Structuralists start with the utterances in a language, and build upward.

But Chomsky claims that without a theory there are no facts: there is only noise. The Chomskyan approach is to start with a grammar, and use the corpus of the language to check its accuracy. Chomsky's approach will soon come to dominate linguistics.

1957: Backus's team makes the first FORTRAN compiler available to IBM customers. FORTRAN is the first high-level language that will find widespread implementation. As of this writing, it is the oldest language that survives in practical use. FORTRAN is a line-by-line language and its parsing is primitive.

1958: John McCarthy's LISP appears. LISP goes beyond the line-by-line syntax -- it is recursively structured. But the LISP interpreter does not find the recursive structure: the programmer must explicitly indicate the structure herself, using parentheses.

1959: Backus invents a new notation to describe the IAL language (aka ALGOL). Backus's notation is influenced by his study of Post -- he seems not to have read Chomsky until later.

1960: Peter Naur improves the Backus notation and uses it to describe ALGOL 60. The improved notation will become known as Backus-Naur Form (BNF).

1960: The ALGOL 60 report specifies, for the first time, a block structured language. ALGOL 60 is recursively structured but the structure is implicit -- newlines are not semantically significant, and parentheses indicate syntax only in a few specific cases. The ALGOL compiler will have to find the structure. It is a case of 1960's optimism at its best. As the ALGOL committee is well aware, a parsing algorithm capable of handling ALGOL 60 does not yet exist. But the risk they are taking will soon pay off.

1960: A.E. Glennie publishes his description of a compiler-compiler. Glennie's "universal compiler" is more of a methodology than an implementation -- the compilers must be written by hand. Glennie credits both Chomsky and Backus, and observes that the two notations are "related". He also mentions Post's productions. Glennie may have been the first to use BNF as a description of a procedure instead of as the description of a Chomsky grammar. Glennie points out that the distinction is "important".

Chomskyan BNF and procedural BNF: BNF, when used as a Chomsky grammar, describes a set of strings, and does not describe how to parse strings according to the grammar. BNF notation, if used to describe a procedure, is a set of instructions, to be tried in some order, and used to process a string. Procedural BNF describes a procedure first, and a language only indirectly.

Both procedural and Chomskyan BNF describe languages, but usually not the same language. That is,

  • Suppose D is some BNF description.
  • Let P(D) be D interpreted as a procedure,
  • Let L(P(D)) be the language which the procedure P(D) parses.
  • Let G(D) be D interpreted as a Chomsky grammar.
  • Let L(G(D)) be the language which the grammar G(D) describes.
  • Then, usually, L(P(D)) != L(G(D)).
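A small Python sketch makes the inequality concrete. It assumes one particular procedural reading, in which alternatives are tried in order and the first one that matches is committed to; other procedural readings are possible. The description D and both function names are invented for the example.

```python
# D: S ::= 'a' | 'a' 'b'
D = {"S": [["a"], ["a", "b"]]}

def procedural_accepts(desc, text, start="S"):
    """P(D): try alternatives in order and commit to the first that matches."""
    def match(symbol, pos):
        if symbol not in desc:                  # terminal: compare literally
            return pos + len(symbol) if text.startswith(symbol, pos) else None
        for alternative in desc[symbol]:        # ordered choice
            end = pos
            for part in alternative:
                end = match(part, end)
                if end is None:
                    break
            else:
                return end                      # commit: later alternatives are never tried
        return None
    return match(start, 0) == len(text)

def grammar_language(desc, start="S"):
    """L(G(D)) for a non-recursive description: every derivable string."""
    def expand(symbol):
        if symbol not in desc:
            return {symbol}
        results = set()
        for alternative in desc[symbol]:
            strings = {""}
            for part in alternative:
                strings = {s + t for s in strings for t in expand(part)}
            results |= strings
        return results
    return expand(start)
```

Read as a Chomsky grammar, D describes the two strings "a" and "ab". Read as this procedure, D accepts "a" but rejects "ab": the first alternative succeeds after one character, the procedure commits to it, and the leftover "b" makes the overall match fail.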

The pre-Chomskyan approach, using procedural BNF, is far more natural to someone trained as a computer programmer. The parsing problem appears to the programmer in the form of strings to be parsed, exactly the starting point of procedural BNF and pre-Chomsky parsing.

Even when the Chomskyan approach is pointed out, it does not at first seem very attractive. With the pre-Chomskyan approach, the examples of the language more or less naturally lead to a parser. In the Chomskyan approach the programmer has to search for an algorithm to parse strings according to his grammar -- and the search for good algorithms to parse Chomskyan grammars has proved surprisingly long and difficult. Handling semantics is more natural with a Chomskyan approach. But, using captures, semantics can be added to a pre-Chomskyan parser and, with practice, this seems natural enough.

Despite the naturalness of the pre-Chomskyan approach to parsing, we will find that the first fully-described automated parsers are Chomskyan. This is a testimony to Chomsky's influence at the time. We will also see that Chomskyan parsers have been dominant ever since.

1961: In January, Ned Irons publishes a paper describing his ALGOL 60 parser. It is the first paper to fully describe any parser. The Irons algorithm is Chomskyan and top-down with a "left corner" element. The Irons algorithm is general, meaning that it can parse anything written in BNF. It is syntax-driven (aka declarative), meaning that the parser is actually created from the BNF -- the parser does not need to be hand-written.

1961: Peter Lucas publishes the first description of a purely top-down parser. This can be considered to be recursive descent, though in Lucas's paper the algorithm has a syntax-driven implementation, useable only for a restricted class of grammars. Today we think of recursive descent as a methodology for writing parsers by hand. Hand-coded approaches became more popular in the 1960's due to three factors:

  • Memory and CPU were both extremely limited. Hand-coding paid off, even when the gains were small.
  • Non-hand coded top-down parsing, of the kind Lucas's syntax-driven approach allowed, is a very weak parsing technique. It was (and still is) often necessary to go beyond its limits.
  • Top-down parsing is intuitive -- it essentially means calling subroutines. It therefore requires little or no knowledge of parsing theory. This makes it a good fit for hand-coding.

1963: L. Schmidt, Howard Metcalf, and Val Schorre present papers on syntax-directed compilers at a Denver conference.

1964: Schorre publishes a paper on the Meta II "compiler writing language", summarizing the papers of the 1963 conference. Schorre cites both Backus and Chomsky as sources for Meta II's notation. Schorre notes that his parser is "entirely different" from that of Irons 1961 -- in fact it is pre-Chomskyan. Meta II is a template, rather than something that readers can use, but in principle it can be turned into a fully automated compiler-compiler.

1965: Don Knuth invents LR parsing. The LR algorithm is deterministic, Chomskyan and bottom-up, but it is not thought to be practical. Knuth is primarily interested in the mathematics.

1968: Jay Earley invents the algorithm named after him. Like the Irons algorithm, Earley's algorithm is Chomskyan, syntax-driven and fully general. Unlike the Irons algorithm, it does not backtrack. Earley's algorithm is both top-down and bottom-up at once -- it uses dynamic programming and keeps track of the parse in tables. Earley's approach makes a lot of sense and looks very promising indeed, but there are three serious issues:

  • First, there is a bug in the handling of zero-length rules.
  • Second, it is quadratic for right recursions.
  • Third, the bookkeeping required to set up the tables is, by the standards of 1968 hardware, daunting.
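The table-driven idea can be sketched in a few dozen lines. What follows is a simplified Earley recognizer, not Earley's original code and not Marpa's. The grammar representation (a dict from nonterminal to a list of right-hand-side tuples) and the function names are assumptions of this sketch. It already includes the nullable-symbol adjustment to prediction that Aycock and Horspool later described, so it does not show the zero-length rule bug. It does not include Leo's optimization for right recursion.

```python
def nullable_symbols(grammar):
    # A nonterminal is nullable if some right-hand side consists only of nullables.
    nullable = set()
    changed = True
    while changed:
        changed = False
        for lhs, alternatives in grammar.items():
            if lhs not in nullable and any(
                all(sym in nullable for sym in rhs) for rhs in alternatives
            ):
                nullable.add(lhs)
                changed = True
    return nullable

def earley_recognize(grammar, start, tokens):
    # An Earley item is (lhs, rhs, dot, origin); sets[i] is the table for position i.
    nullable = nullable_symbols(grammar)
    n = len(tokens)
    sets = [[] for _ in range(n + 1)]
    seen = [set() for _ in range(n + 1)]

    def add(i, item):
        if item not in seen[i]:
            seen[i].add(item)
            sets[i].append(item)

    for rhs in grammar[start]:
        add(0, (start, rhs, 0, 0))

    for i in range(n + 1):
        j = 0
        while j < len(sets[i]):  # sets[i] may grow while it is being processed
            lhs, rhs, dot, origin = sets[i][j]
            j += 1
            if dot == len(rhs):
                # Completion: advance every item that was waiting for lhs at origin.
                for l2, r2, d2, o2 in list(sets[origin]):
                    if d2 < len(r2) and r2[d2] == lhs:
                        add(i, (l2, r2, d2 + 1, o2))
            elif rhs[dot] in grammar:
                # Prediction: start every rule for the expected nonterminal.
                for alternative in grammar[rhs[dot]]:
                    add(i, (rhs[dot], alternative, 0, i))
                if rhs[dot] in nullable:
                    # Nullable adjustment: the symbol may match the empty string.
                    add(i, (lhs, rhs, dot + 1, origin))
            elif i < n and tokens[i] == rhs[dot]:
                # Scanning: the expected terminal matches the next token.
                add(i + 1, (lhs, rhs, dot + 1, origin))

    return any(
        lhs == start and dot == len(rhs) and origin == 0
        for lhs, rhs, dot, origin in sets[n]
    )
```

With `{"S": [("S", "+", "S"), ("n",)]}` as the grammar, the recognizer accepts `"n+n+n"` even though the grammar is ambiguous and both left- and right-recursive. No backtracking is involved: every possibility is carried forward in the tables.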

1969: Frank DeRemer describes a new variant of Knuth's LR parsing. DeRemer's LALR algorithm requires only a stack and a state table of quite manageable size. LALR looks practical.

1969: Ken Thompson writes the "ed" editor as one of the first components of UNIX. At this point, regular expressions are an esoteric mathematical formalism. Through the "ed" editor and its descendants, regular expressions will become an everyday part of the working programmer's toolkit.

Recognizers: In comparing algorithms, it can be important to keep in mind whether they are recognizers or parsers. A recognizer is a program which takes a string and produces a "yes" or "no" according to whether the string is in a language. Regular expressions are typically used as recognizers. A parser is a program which takes a string and produces a tree reflecting its structure according to a grammar. The algorithm for a compiler clearly must be a parser, not a recognizer. Recognizers can be, to some extent, used as parsers by introducing captures.
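As a small illustration of the difference, here is a sketch using Python's standard `re` module. The date pattern and the function names `recognize` and `parse_with_captures` are made up for this example. The first function only answers yes or no; the second uses captures to recover a very shallow structure.

```python
import re

# Hypothetical pattern for ISO-style dates such as 1961-01-15.
DATE = re.compile(r"(\d{4})-(\d{2})-(\d{2})")

def recognize(text):
    # Recognizer: is the whole string in the language? Yes or no, nothing more.
    return DATE.fullmatch(text) is not None

def parse_with_captures(text):
    # Captures recover some structure, but only a fixed, flat set of fields.
    match = DATE.fullmatch(text)
    if match is None:
        return None
    year, month, day = match.groups()
    return ("date", ("year", year), ("month", month), ("day", day))
```

The "tree" returned here has a fixed depth. Captures cannot follow nesting of arbitrary depth, which is one reason the text says recognizers serve as parsers only "to some extent".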

1972: Alfred Aho and Jeffrey Ullman publish a two volume textbook summarizing the theory of parsing. This book is still important. It is also distressingly up-to-date -- progress in parsing theory slowed dramatically after 1972. Aho and Ullman describe a straightforward fix to the zero-length rule bug in Earley's original algorithm. Unfortunately, this fix involves adding even more bookkeeping to Earley's.

1972: Under the names TDPL and GTDPL, Aho and Ullman investigate the non-Chomskyan parsers in the Schorre lineage. They note that "it can be quite difficult to determine what language is defined by a TDPL parser". That is, GTDPL parsers do whatever they do, and that whatever is something the programmer in general will not be able to describe. The best a programmer can usually do is to create a test suite and fiddle with the GTDPL description until it passes. Correctness cannot be established in any stronger sense. GTDPL is an extreme form of the old joke that "the code is the documentation" -- with GTDPL nothing documents the language of the parser, not even the code.

GTDPL's obscurity buys nothing in the way of additional parsing power. Like all non-Chomskyan parsers, GTDPL is basically an extremely powerful recognizer. Pressed into service as a parser, it is comparatively weak. As a parser, GTDPL is essentially equivalent to Lucas's 1961 syntax-driven algorithm, which was in turn a restricted form of recursive descent.

At or around this time, rumor has it that the main line of development for GTDPL parsers is classified secret by the US government. GTDPL parsers have the property that even small changes to them can be very labor-intensive. For some government contractors, GTDPL parsing provides steady work for years to come. Public interest in GTDPL fades.

1975: Bell Labs converts its C compiler from hand-written recursive descent to DeRemer's LALR algorithm.

1977: The first "Dragon book" comes out. This soon-to-be classic textbook is nicknamed after the drawing on the front cover, in which a knight takes on a dragon. Emblazoned on the knight's lance are the letters "LALR". From here on out, to speak lightly of LALR will be to besmirch the escutcheon of parsing theory.

1979: Bell Laboratories releases Version 7 UNIX. V7 includes what is, by far, the most comprehensive, useable and easily available compiler writing toolkit yet developed.

1979: Part of the V7 toolkit is Yet Another Compiler Compiler (YACC). YACC is LALR-powered. Despite its name, YACC is the first compiler-compiler in the modern sense. For some useful languages, the process of going from Chomskyan specification to executable is fully automated. Most practical languages, including the C language and YACC's own input language, still require manual hackery. Nonetheless, after two decades of research, it seems that the parsing problem is solved.

1987: Larry Wall introduces Perl 1. Perl embraces complexity like no previous language. Larry uses YACC and LALR very aggressively -- to my knowledge more aggressively than anyone before or since.

1991: Joop Leo discovers a way of speeding up right recursions in Earley's algorithm. Leo's algorithm is linear for just about every unambiguous grammar of practical interest, and many ambiguous ones as well. In 1991, hardware is six orders of magnitude faster than 1968 hardware, so the issue of bookkeeping overhead has receded in importance. This is a major discovery. When it comes to speed, the game has changed in favor of the Earley algorithm.

But Earley parsing is almost forgotten. Twenty years will pass before anyone writes a practical implementation of Leo's algorithm.

1990's: Earley's is forgotten. So everyone in LALR-land is content, right? Wrong. Far from it, in fact. Users of LALR are making unpleasant discoveries. While LALR automatically generates their parsers, debugging them is so hard they could just as easily write the parser by hand. Once debugged, their LALR parsers are fast for correct inputs. But almost all they tell the users about incorrect inputs is that they are incorrect. In Larry's words, LALR is "fast but stupid".

2000: Larry Wall decides on a radical reimplementation of Perl -- Perl 6. Larry does not even consider using LALR again.

2002: John Aycock and R. Nigel Horspool publish their attempt at a fast, practical Earley's parser. Missing from it is Joop Leo's improvement -- they seem not to be aware of it. Their own speedup is limited in what it achieves and the complications it introduces can be counter-productive at evaluation time. But buried in their paper is a solution to the zero-length rule bug. And this time the solution requires no additional bookkeeping.

2004: Bryan Ford publishes his paper on PEG. Implementers by now are avoiding YACC, and it seems as if there might soon be no syntax-driven algorithms in practical use. Ford fills this gap by repackaging the nearly-forgotten GTDPL. Ford adds packratting, so that PEG is always linear, and provides PEG with an attractive new syntax. But nothing has been done to change the problematic behaviors of GTDPL.
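One behavior often cited in this connection is PEG's ordered choice. Tying it to the "problematic behaviors" above is my reading, not something the Ford paper is quoted as saying here. The sketch below uses a few hypothetical combinators (`lit`, `seq`, `ordered_choice`, `matches`) rather than any real PEG library. Read as BNF, the rule S ::= ("a" | "ab") "c" describes the two strings "ac" and "abc". Read as a PEG, the choice commits to the first alternative that succeeds and never comes back to try the second.

```python
def lit(s):
    # Match the literal string s at pos; return the new position, or None on failure.
    def parser(text, pos):
        return pos + len(s) if text.startswith(s, pos) else None
    return parser

def seq(*parsers):
    # Match each parser in turn, threading the position through.
    def parser(text, pos):
        for p in parsers:
            pos = p(text, pos)
            if pos is None:
                return None
        return pos
    return parser

def ordered_choice(*parsers):
    # PEG-style choice: commit to the first alternative that succeeds.
    def parser(text, pos):
        for p in parsers:
            end = p(text, pos)
            if end is not None:
                return end
        return None
    return parser

def matches(parser, text):
    return parser(text, 0) == len(text)

# ("a" / "ab") "c": the "ab" alternative can never be reached.
rule = seq(ordered_choice(lit("a"), lit("ab")), lit("c"))
# ("ab" / "a") "c": same alternatives, different order, different language.
swapped = seq(ordered_choice(lit("ab"), lit("a")), lit("c"))
```

Here `matches(rule, "abc")` is false while `matches(swapped, "abc")` is true. Swapping two alternatives changes the language, which gives some feel for why it is hard to say what language a given description defines.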

2006: GNU announces that the GCC compiler's parser has been rewritten. For three decades, the industry's flagship C compilers have used LALR as their parser -- proof of the claim that LALR and serious parsing are equivalent. Now, GNU replaces LALR with the technology that LALR itself replaced a quarter century earlier: recursive descent.

Today: After five decades of parsing theory, the state of the art seems to be back where it started. We can imagine someone taking Ned Irons's original 1961 algorithm from the first paper ever published describing a parser, and republishing it today. True, he would have to translate its code from the mix of assembler and ALGOL into something more fashionable, say Haskell. But with that change, it might look like a breath of fresh air.

Marpa: an afterword

The recollections of my teachers cover most of this timeline. My own begin around 1970. Very early on, as a graduate student, I became unhappy with the way the field was developing. Earley's algorithm looked interesting, and it was something I returned to on and off.

The original vision of the 1960's was a parser that was

  • efficient,
  • practical,
  • general, and
  • syntax-driven.

By 2010 this vision seemed to have gone the same way as many other 1960's dreams. The rhetoric stayed upbeat, but parsing practice had become a series of increasingly desperate compromises.

But, while nobody was looking for them, the solutions to the problems encountered in the 1960's had appeared in the literature. Aycock and Horspool had solved the zero-length rule bug. Joop Leo had found the speedup for right recursion. And the issue of bookkeeping overhead had pretty much evaporated on its own. Machine operations are now a billion times faster than in 1968, and are probably no longer relevant in any case -- cache misses are now the bottleneck.

The programmers of the 1960's would have been prepared to trust a fully declarative Chomskyan parser. With the experience with LALR in their collective consciousness, modern programmers might be more guarded. As Lincoln said, "Once a cat's been burned, he won't even sit on a cold stove." But I found it straightforward to rearrange the Earley parse engine to allow efficient event-driven handovers between procedural and syntax-driven logic. And Earley tables provide the procedural logic with full knowledge of the state of the parse so far, so that Earley's algorithm is a better platform for hand-written procedural logic than recursive descent.

References, comments, etc.

My implementation of Earley's algorithm is called Marpa. For more about Marpa, there is the semi-official web site, maintained by Ron Savage. The official, but more limited, Marpa website is my personal one. Comments on this post can be made in Marpa's Google group, or on our IRC channel: #marpa at freenode.net.

Header image by Tambako the Jaguar. Some rights reserved.