NEILB: csvgrep - grep a CSV and display results in a table

Prompted by a recent blog post on CSV tools, I've released csvgrep, a simple command-line script that I use at work to quickly find things in CSV files. It's simplistic, but in a lot of situations it does what I need.

Gabor Szabo: Extending the Bailador crowdfunding campaign by another month

I have been contemplating the idea for quite some time, but finally I've decided to extend the Bailador crowdfunding campaign by another month.

For the full article visit Extending the Bailador crowdfunding campaign by another month

Gabor Szabo: Less than 24 hours in the Bailador campaign

Finally I had some time to work on the Bailador book and made the following changes:
  • Added first section of the Ajax code in Chapter 7.
  • Added first section about using SQLite in Chapter 3.
  • Added instructions to use Vagrant with Ubuntu in Chapter 1.
Given that the campaign won't reach its original goal, you might have been wondering what you will get if you back the book. I've been wondering that myself. So let me clarify: you are indeed going to get all the chapters of the book as outlined on the campaign page. But why should you join the 125 people who have already supported the book instead of waiting until it is finished in December? Because this way you can either get a deep discount ($25 instead of $55), or you can gain early access to the book, which will give you an advantage over those who buy it later (and still get a $5 discount). You will also be mentioned in the book among the supporters and, trust me, it is a great feeling. So don't miss the deadline, support the book now!

For the full article visit Less than 24 hours in the Bailador campaign

Sawyer X: Perl 5 Porters Mailing List Summary: June 12th-19th

Hey everyone,

Following is the p5p (Perl 5 Porters) mailing list summary for the past week.

Enjoy!

June 12th-19th

News

Eric Herman released Perl 5.27.1!

Tony Cook added a new Travis CI configuration, based on work by c9s and Dennis Kaarsemaker. You can read more at Perl #123981.

Grant Reports

  • Dave Mitchell TPF Grant 2 reports #179, #180.

Issues

New Issues

Resolved Issues

  • Perl #123981: [PATCH] f8210d2 Add a simple .travis.yml.
  • Perl #129183: perl -S erroneously allows \ escapes in PATH.
  • Perl #131336: Encode CPAN version 2.78 broke g++ build on FreeBSD-10.3.
  • Perl #131487: VS, Makefile, makefile.mk.
  • Perl #131522: Spurious "Assuming NOT a POSIX class" warning.
  • Perl #131526: heap-buffer-overflow (READ of size 1) in Perl_my_atof2().
  • Perl #131575: s// on utf8 string occasionally crashes with "Malformed UTF-8 character".
  • Perl #131577: heap-use-after-free (READ of size 1) in S_reghop4().
  • Perl #131594: Bleadperl v5.27.0-141-g5d09ee1cb7 breaks DAVIDO/JSON-Tiny-0.56.tar.gz.
  • Perl #131603: Wrong bracket type in perlre.pod.

Patches

Tony Cook provided a patch to resolve Perl #131597: NULL pointer reference in Perl_newRV().

Discussion

Paul (LeoNerd) Evans asked (lib/_charnames.pm puts references in %^H) for a clarification about %^H, which is leading to an improvement in the documentation on the topic.

In new versions you may not use bitwise operations on characters above 0xFF, but as Graham Knop raises (Behavior of bitwise ops on unencountered wide characters), this breaks a possible optimized way of determining whether a value is a string or a number.
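For illustration only (this example is not from the thread itself, and the strings are made up), this is roughly the kind of operation at issue; on recent perls the wide-character case is refused outright, while on 5.22-5.26 it is a deprecation warning, made fatal here so that both behave the same way:

    use strict;
    use warnings FATAL => 'deprecated';

    my $narrow = "mask";
    my $wide   = "mask\x{100}";   # contains a code point above 0xFF

    # Bitwise string OR on purely narrow strings is still fine.
    print "narrow | narrow: ", ( eval { my $x = $narrow | $narrow; 1 } ? "ok\n" : "refused\n" );

    # On newer perls this dies; on 5.22-5.26 it is a (here fatalized) deprecation warning.
    print "wide   | narrow: ", ( eval { my $x = $wide | $narrow; 1 } ? "ok\n" : "refused\n" );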

Dave Mitchell shows that open with encoding(UTF-8) could mask an OS error, and asks what we would consider correct behavior here.

perlancar's blog: Upgrading to perl 5.26

This morning I decided to purge my 5.24 installation on my development PC and move to 5.26. Bad mistake. Many distributions still fail to build without PERL_USE_UNSAFE_INC=1; I might as well put the environment variable in my ~/.bashrc. These are the ones that hit me (after a few of these, I just built with PERL_USE_UNSAFE_INC=1): Compiler::Lexer (GOCCY), OrePAN (TOKUHIROM), Task::Weaken (ADAMK), Test::SubCalls (ADAMK), Test::Object (ADAMK), Term::Encoding (MIYAGAWA). Apparently, for many of these dists tickets/issues were posted as long as three months ago, but the authors/maintainers simply have not responded. Many authors have perhaps moved on to greener pastures; we need more adopters.


Gabor Szabo: Batteries included - the biggest impact on programming language popularity

In business, and especially in real estate, it is said that the most important thing is location, location, location. As the popularity of JavaScript shows, this is also true of programming languages, though I think the biggest impact on a programming language's popularity is how easy it is to get things done in it, which is directly affected by the batteries included.

For the full article visit Batteries included - the biggest impact on programming language popularity

brian d foy: 20 Years of Perl mongers

It was 20 years ago (almost) that a group of New Yorkers started Perl mongers. I think at the time we were eating something the caterer for The Perl Conference called "California sandwich" that had sprouts and avocado. In 1997, there was a small core of a Perl community that worked on perl, but there was still room for regular user engagement.

I didn't see this coming. While I was in Amsterdam a couple of weeks ago, Wendy noted that 1997 and 2017 end in the same digit (as does 1967). I don't think we've ever bothered to celebrate an anniversary. I went to check on the date. I registered pm.org on August 22, 1997, which I think makes August 21 the birthday. That was the last day of that conference.

We'll have to do something special, I guess.

I know that NY.pm will have its anniversary this year, and there may be some other groups. London.pm, when did you start? I think Boston and Seattle were in there very quickly. Who else?

:: Luca Ferrari ::: Perl and regexps are too much verbose (!)

A colleague asked me for a quick Perl way to check whether a string contains only a predefined set of characters, excluding a few others.
Easy pal!
Regexp to the rescue!

Then I was asked for a way that did not use a regular expression, because "you know, regular expressions can be somewhat verbose".
Gosh!
My answer was simply "you don't use regexps a lot!".

I don't believe I have to explain to any (Perl) programmer that regexps were designed to express a whole set of choices concisely,
or more generally an alphabet (in the broadest sense).

But let's move on. After showing a quick and dirty one-liner such as:


perl -nlE 'say "Wrong $_" unless /^[0-9]+$/'

I got another strange reply: "yeah, but I don't like Perl too much, it is somehow verbose as a language".
And again, my answer was: "you don't use it regularly!".

It is clear that Perl was being seen in the wrong way, since both the language and its regular expressions can be really concise, as the one-liner shows…
No further comment required!
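To make the verbosity contrast concrete, here is a small illustrative sketch (the allowed-character set is just an example) that does the same check both with a regexp and with a hand-rolled character loop:

    use strict;
    use warnings;
    use feature 'say';

    my $allowed = '0123456789';    # illustrative set of permitted characters

    for my $string (qw( 12345 12a45 )) {
        # With a regexp: one expression says "only these characters, at least one of them".
        my $regex_ok = $string =~ /^[0-9]+$/;

        # Without a regexp: walk the string and check every character by hand.
        my $loop_ok = length($string) > 0;
        for my $char ( split //, $string ) {
            $loop_ok = 0 if index( $allowed, $char ) < 0;
        }

        say "$string: regexp says ", ( $regex_ok ? 'ok' : 'wrong' ),
            ", manual loop says ", ( $loop_ok ? 'ok' : 'wrong' );
    }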

perlancar's blog: List of new CPAN distributions – May 2017

dist author version date abstract
AI-XGBoost PABLROD 0.001 2017-05-29T19:06:04 Perl wrapper for XGBoost library https://github.com/dmlc/xgboost
Acme-DotDotGone LNATION 0.01 2017-05-21T16:33:02 The great new Acme::DotDotGone!
Acme-Helloworld TOMCHA 0.01 2017-05-27T03:16:27 output 'Hello, world'
Acme-Study-perl NEILB 1.00 2017-05-11T13:35:15 turns baubles into trinkets
Acme-Testing-Permissions BOOK 0.001 2017-05-31T07:49:51 Test target for PAUSE permissions
Acme-Want5000trillion ANATOFUZ 0.01 2017-05-31T14:46:00 I want 5000trillion yen.
Alien-LibBigWig AYATES v0.3.3 2017-05-11T10:52:42 Installation of libBigWig for Perl
Alien-XPA DJERIUS 0.01 2017-05-12T21:14:48 Find or Build libxpa
AnyEvent-Net-MPD JJATRIA 0.001 2017-05-29T11:57:29 A non-blocking interface to MPD
App-Aliyun FAYLAND 0.01 2017-05-11T10:28:03 Aliyun Command Tools
App-CISetup DROLSKY 0.01 2017-05-26T16:48:34 Command line tools to generate and update Travis and AppVeyor configs for Perl libraries
App-DumpChromeHistory PERLANCAR 0.001 2017-05-30T10:33:44 Dump Chrome history
App-DumpOperaHistory PERLANCAR 0.001 2017-05-30T10:30:55 Dump Opera history
App-EvalServerAdvanced SIMCOP 0.001 2017-05-27T02:14:18 A more featured update to App::EvalServer
App-EvalServerAdvanced-Protocol SIMCOP 0.100 2017-05-26T21:14:11 Protocol abstraction for App::EvalServerAdvanced
App-EvalServerAdvanced-REPL SIMCOP 0.001 2017-05-26T23:02:41 Example client for App::EvalServerAdvanced
App-ExtractLinks RORYRJB 0.0.1 2017-05-29T14:01:58 extract href's in HTML docs to stdout
App-Git-Autofixup TORBIAK 0.001 2017-05-25T11:13:16 create fixup commits for topic branches
App-Repo ZDENEK 0.11 2017-05-09T16:40:56 create debian repository
App-TOMLUtils PERLANCAR 0.001 2017-05-09T12:49:14 TOML utilities
App-btcindo PERLANCAR 0.001 2017-05-31T12:39:16 CLI for bitcoin.co.id
App-dt PERLANCAR 0.001 2017-05-18T08:35:39 CLI data transformer
App-nodie ORKUN 1.00 2017-05-29T22:05:00 runs immortal processes
App-unicomb PERLANCAR 0.001 2017-05-04T09:28:52 Produce Unicode character with combining character
Beam-Service PREACTION 0.001 2017-05-05T05:01:51 Role for services to access Beam::Wire features
Bencher-Scenario-SortHashKeys PERLANCAR 0.001 2017-05-23T23:37:34 Benchmark Sort::HashKeys
Bencher-Scenario-TOMLParsingModules PERLANCAR 0.001 2017-05-09T12:49:25 Benchmark TOML parsing modules
Benchmark-Featureset-ParamCheck TOBYINK 0.001 2017-05-15T18:25:13 compare different parameter validation modules
Bio-DB-Big AYATES v1.0.0 2017-05-11T10:54:09 Perl interface to bigWigLib for accessing the kent big formats
BlankOnDev YUSRIDEB 0.1000 2017-05-31T00:53:26 BlankOnDev – Several Development tools for BlankOn GNU/Linux.
CPAN-Testers-Backend PREACTION 0.001 2017-05-13T10:10:11 Backend processes for CPAN Testers data and operations
CPAN-Uploader-Tiny SKAJI v0.0.1 2017-05-13T16:52:47
CPP-panda-lib SYBER v1.0.0 2017-05-11T12:59:24 Collection of useful functions and classes for C++.
Catalyst-Model-Data-MuForm JJNAPIORK 0.001 2017-05-26T00:32:10 Model adaptor for Data::MuForm
Catalyst-Plugin-BootstrapAlert CAGAO 0.20 2017-05-03T22:21:33 Replacement for Catalyst::Plugin::StatusMessage inline with Bootstrap alert names (success, info, warning, and danger).
CloudCron JLMARTIN 0.02 2017-05-18T14:50:56 A simple distributed cloud friendly cron for the masses
CloudCron-Worker JLMARTIN 0.02 2017-05-18T14:51:08 A worker process for CloudCron
CloudDeploy JLMARTIN 1.05 2017-05-18T09:34:52 A toolkit for building and managing AWS CloudFormation stacks
Data-Faker-Colour MGV 0.001 2017-05-31T15:09:54 Generate random colours
Date-Holidays-BY BESINT 0.2017.0 2017-05-30T09:39:01 Determine Belorussian official holidays and business days.
Devel-Jemallctl TVDW 0.01 2017-05-23T15:39:05 Insight into what Jemalloc is doing
Device-PaloAlto-Firewall PUGLET 0.01 2017-05-02T11:36:31 Interact with the Palo Alto firewall API
Dist-Zilla-Plugin-EnsureLatestPerl ETHER 0.001 2017-05-12T08:29:29 Ensure the author is releasing using the latest Perl
Dist-Zilla-Plugin-UseUnsafeInc ETHER 0.001 2017-05-13T12:11:54 Indicates the value of PERL_USE_UNSAFE_INC to use during installation
ElasticEmail ELASTICML 0.01 2017-05-17T11:35:18 The great new ElasticEmail!
File-PCAP MAMAWE v0.0.4 2017-05-04T08:53:16 a pure Perl library to read and write PCAP files
FileDirUtil MTW v0.01 2017-05-15T14:47:03 A Moose Role for basic File IO
Finance-BTCIndo PERLANCAR 0.001 2017-05-31T10:42:46 Trade with bitcoin.co.id (VIP) using Perl
GPSD-Parse STEVEB 0.01 2017-05-17T00:30:32 Parse, extract use the JSON output from GPS units
Geo-Coder-CA NHORNE 0.01 2017-05-09T21:44:45 Get data from http://geocoder.ca
Geo-Coder-XYZ NHORNE 0.01 2017-05-11T22:33:18 Provides a geocoding functionality using http://geocoder.xyz
Geo-OSM-Imager FIREDRAKE 0.02 2017-05-04T10:56:01 simplifies plotting onto OpenStreetMap tiles
GitHub-Crud PRBRENAN 2017.512 2017-05-13T00:29:04 Create, Read, Update, Delete files on GitHub
GraphQL ETJ 0.01 2017-05-18T14:48:06 The great new GraphQL!
Hash-Normalize VPIT 0.01 2017-05-26T14:58:51 Automatically normalize Unicode hash keys.
IO-ReadPreProcess ADDW 0.8 2017-05-22T18:51:47 Macro processing built into IO::File replacement
Image-Sane RATCLIFFE 0.06 2017-05-11T19:54:47 Perl extension for the SANE (Scanner Access Now Easy) Project
Imgur-API MLHOLLEN v0.0.1 2017-05-14T03:20:06
JSON-Assert SGREEN 0.04 2017-05-26T03:12:20 Tests JPaths into an JSON Data structure for correct values/matches
LWP-CurlLog JACOBG 0.01 2017-05-29T22:14:05 Log LWP requests as curl commands
LWP-Throttle NHORNE 0.01 2017-05-26T16:43:03 Throttle requests to a site
LWP-UserAgent-Throttled NHORNE 0.02 2017-05-26T18:22:28 Throttle requests to a site
Lingua-IN-TGC RAJ 1.01 2017-05-17T13:52:05 Perl extension for tailored grapheme clusters for indian languages
Log-Any-Plugin-Format JJATRIA 0.01 2017-05-06T20:25:14 Add a formatting subroutine to your Log::Any adapter
Log-Any-Plugin-History JJATRIA 0.01 2017-05-06T20:23:41 Add a message history to a Log::Any adapter
MARC-File-XML GMCHARLT v1.0.5 2017-05-24T01:18:18 convert a MARC file to XML
MarpaX-ESLIF-ECMA404 JDDPAUSE 0.001 2017-05-13T07:20:52 JSON Data Interchange Format following ECMA-404 specification
Mojo-CallFire SADAMS 0.01 2017-05-21T02:03:10 A simple interface to the CallFire API
Mojo-CloudCheckr SADAMS 0.01 2017-05-19T04:55:03 A simple interface to the CloudCheckr API
Mojo-IOLoop-DNSNative COFFEE 0.001 2017-05-25T12:57:23 Async native DNS lookup
Mojolicious-Plugin-Mailgun MRAMBERG 0.01 2017-05-15T13:45:56 Easy Email sending with mailgun
MooX-Locale-Passthrough REHSACK 0.001 2017-05-31T14:46:12 provide API used in translator modules without translating
MooX-PDL2 DJERIUS 0.01 2017-05-31T03:19:49 A Moo based PDL 2.X object
MooseX-DataModel JLMARTIN 1.00 2017-05-09T15:31:03 Create object models from datastructures
NOLookup TROHAU 1.01 2017-05-03T13:40:32 a set of lookup modules for various Norwegian data services.
Net-Async-Github TEAM 0.001 2017-05-11T14:48:48 support for https://github.com's REST API with IO::Async
Net-Async-Graphite CHOHAG 0.0_1 2017-05-26T09:32:28 Request data from graphite
Net-Async-OAuth TEAM 0.001 2017-05-11T14:53:03 Basic OAuth support for async code
Net-Async-TravisCI TEAM 0.001 2017-05-11T15:38:57 API support for travis-ci.com and travis-ci.org
Net-Async-Trello TEAM 0.001 2017-05-11T12:49:54 Interaction with the trello.com API
Net-DNS-Extlang JRLEVINE 0.1 2017-05-14T03:16:00 DNS extension language
Net-Etcd HEXFUSION 0.008 2017-05-29T18:38:28 Provide access to the etcd v3 API.
Net-FTP-Path-Iter DJERIUS 0.02 2017-05-09T17:11:29 Iterative, recursive, FTP file finder
Net-FTP-Rule DJERIUS 0.01 2017-05-03T21:38:28 Iterative, recursive, FTP file finder
Net-Frame-Layer-MPLS VINSWORLD 1.00 2017-05-25T02:24:10 Multiprotocol Label Switching layer object
Net-Hacky-Detect-IP DAEMON 0.01 2017-05-07T10:29:44 Hackily try different methods of attaining local system IPs
Pcore-Ext ZDM v0.1.0 2017-05-01T20:03:51 ExtJS raw js generator
Perinci-Examples-ResMeta-Table PERLANCAR 0.001 2017-05-24T02:24:25 Demonstrate the various table and table.* result metadata property/attributes
Perl2Package MLHOLLEN v0.1.0 2017-05-15T01:27:28 Wrapper scripts to generate RPM/DEB packages
Plack-App-Catmandu-Bag NICS 0.01 2017-05-12T15:24:46 Plack application that wraps a REST API around a Catmandu::Bag
Plack-Middleware-Memento NICS 0.01 2017-05-19T10:14:41 Base role and interface for Plack::Middleware::Memento handlers
Plack-Middleware-Memento-Handler-Catmandu-Bag NICS 0.01 2017-05-19T11:36:09 Connect Plack::App::Catmandu::Bag to Plack::Middleware::Memento
Plack-Middleware-Signposting VPEIL 0.01 2017-05-29T18:44:22 A Signposting implementation from JSON content
RPi-Pin STEVEB 2.3601 2017-05-27T21:39:23 Access and manipulate Raspberry Pi GPIO pins
RT-Extension-RightsInspector BPS 0.02 2017-05-11T19:26:54 RT-Extension-RightsInspector Extension
Redis-ClusterRider IPH 0.01 2017-05-19T21:42:08 Daring Redis Cluster client
Ref-Util-XS XSAWYERX 0.114 2017-05-11T16:03:20 Utility functions for checking references
Reply-Plugin-Autocomplete-ExportedSymbols AKIYM 0.01 2017-05-28T11:13:08 Tab completion for exported symbol names
Resque-Plugin-Retry MERU 0.01 2017-05-08T09:43:15 Retry the fail job
Router-XS DFARRELL 0.01 2017-05-24T14:24:53 Fast URI path to value lookup
SNS-Notification JLMARTIN 0.02 2017-05-15T12:38:14 An object for representing SNS Notifications
SQS-Worker JLMARTIN 0.05 2017-05-15T14:43:59 A light framework for processing messages from SQS queues
Swagger-Schema JLMARTIN 1.00 2017-05-09T15:31:15 Object model for Swagger schema files
Sys-Linux-Namespace SIMCOP 0.001 2017-05-04T06:04:28 Sets up linux kernel namespaces
TOML-Examples PERLANCAR 0.001 2017-05-09T12:49:37 Example TOML configuration files
Test-HTTP-LocalServer CORION 0.57 2017-05-08T16:38:00 spawn a local HTTP server for testing
Test2-Plugin-IOSync EXODIST 0.000001 2017-05-02T03:10:14 Load IOEvents and IOMuxer so that they work together.
Text-ReadConditionally ADDW 0.72 2017-05-12T17:48:22 Macro processing built into IO::File replacement
Time-Moment-Epoch HEANEY 0.001 2017-05-20T13:47:59 Convert various epoch times to Time::Moment times.
URI-amqp JASEI 0.1.0 2017-05-23T08:04:18 AMQP (RabbitMQ) URI
VMOMI STUMPR 0.01 2017-05-02T23:21:41 VMware vSphere API Perl Bindings
VMware-vCloudDirector NIGELM 0.004 2017-05-04T14:26:03 Interface to VMWare vCloud Directory REST API
Verilog-VCD-Writer JVS 0.001 2017-05-23T22:35:53 VCD waveform File creation module.
WWW-Leech-Parser JAREDSPB 0.01 2017-05-07T10:46:35 HTML Page parser used by WWW::Leech::Walker
WWW-Leech-Walker JAREDSPB 0.01 2017-05-08T12:22:08 small web content grabbing framework
WWW-Mechanize-Plugin-Selector CORION 0.16 2017-05-12T13:04:28 CSS selector method for WWW::Mechanize
WebService-Naver-TTS AANOAA v0.0.1 2017-05-05T03:22:40 Perl interface to Naver TTS API
XS-Tutorial DFARRELL 0.01 2017-05-03T02:16:43 documentation with examples for learning Perl XS
Yandex-Dictionary BRNFLK 0.001 2017-05-15T14:53:55 a simple API for Yandex.Dictionary
Yandex-Translate BRNFLK 0.01 2017-05-02T17:05:45 Perl extension for using Yandex API
Zimbra-Expect OETIKER v0.1.0 2017-05-16T07:23:56 Execute multiple commands with a single instance of zmprov or zmmailbox and thus run magnitudes faster
chi-driver-elasticache-memcache ZEBARDY 0.01 2017-05-11T20:17:56 This is a CHI Driver for AWS's Elasticache memcache implementation
fastQ_brew HALLORAN 1.0.2 2017-05-09T20:18:31 Provides methods for fastQ file manipulation
fastQ_brew_ver HALLORAN 1.0.2 2017-05-12T17:36:44 Provides methods for fastQ file manipulation
geoip DANNYT 0 2017-05-11T14:44:44
graphcompare SCASTILLO v0.6.1 2017-05-25T12:49:23 A command-line tool to compare graph files in DOT or tabular format.
net-async-graphite CHOHAG 0.0 2017-05-25T19:26:47 Request data from graphite.

You can also browse new distributions on MetaCPAN. See also What’s new on CPAN – May 2017 (a curated list by David Farrell).


Dave's Free Press: Journal: Module pre-requisites analyser

Dave's Free Press: Journal: CPANdeps

Dave's Free Press: Journal: Perl isn't dieing

Dave's Free Press: Journal: YAPC::Europe 2007 report: day 3

Dave's Free Press: Journal: Devel::CheckLib can now check libraries' contents

Ocean of Awareness: Top-down parsing is guessing

Top-down parsing is guessing. Literally. Bottom-up parsing is looking.

The way you'll often hear that phrased is that top-down parsing is looking, starting at the top, and bottom-up parsing is looking, starting at the bottom. But that is misleading, because the input is at the bottom -- at the top there is nothing to look at. A usable top-down parser must have a bottom-up component, even if that component is just lookahead.

A more generous, but still accurate, way to describe the top-down component of parsers is "prediction". And prediction is, indeed, a very useful component of a parser, when used in combination with other techniques.

Of course, if a parser does nothing but predict, it can predict only one input. Top-down parsing must always be combined with a bottom-up component. This bottom-up component may be as modest as lookahead, but it must be there or else top-down parsing is really not parsing at all.

So why is top-down parsing used so much?

Top-down parsing may be unusable in its pure form, but from one point of view that is irrelevant. Top-down parsing's biggest advantage is that it is highly flexible -- there's no reason to stick to its "pure" form.

A top-down parser can be written as a series of subroutine calls -- a technique called recursive descent. Recursive descent allows you to hook in custom-written bottom-up logic at every top-down choice point, and it is a technique which is completely understandable to programmers with little or no training in parsing theory. When dealing with recursive descent parsers, it is more useful to be a seasoned, far-thinking programmer than it is to be a mathematician. This makes recursive descent very appealing to seasoned, far-thinking programmers, and they are the audience that counts.
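As a concrete illustration (my own sketch, not from the original post), here is a tiny recursive descent parser in Perl for a toy grammar of nested lists of 'a'; each nonterminal becomes a subroutine, and the only bottom-up component is a one-character peek used at each choice point:

    use strict;
    use warnings;

    # Toy grammar:  List ::= '(' Item* ')'
    #               Item ::= 'a' | List

    my ( $input, $pos );

    sub peek { $pos < length($input) ? substr( $input, $pos, 1 ) : '' }
    sub take { my ($char) = @_; peek() eq $char or die "expected '$char' at $pos\n"; $pos++ }

    sub parse_list {
        take('(');
        my @items;
        push @items, parse_item() while peek() eq 'a' || peek() eq '(';
        take(')');
        return \@items;
    }

    sub parse_item {
        return peek() eq '(' ? parse_list() : do { take('a'); 'a' };
    }

    for my $try ( '(aa(a))', '(a))' ) {
        ( $input, $pos ) = ( $try, 0 );
        my $ok = eval { parse_list(); $pos == length $input or die "trailing input\n"; 1 };
        print "$try: ", $ok ? "parses\n" : "error: $@";
    }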

Switching techniques

You can even use the flexibility of top-down to switch away from top-down parsing. For example, you could claim that a top-down parser could do anything my own parser (Marpa) could do, because a recursive descent parser can call a Marpa parser.

A less dramatic switchoff, and one that still leaves the parser with a good claim to be basically top-down, is very common. Arithmetic expressions are essential for a computer language. But they are also among the many things top-down parsing cannot handle, even with ordinary lookahead. Even so, most computer languages these days are parsed top-down -- by recursive descent. These recursive descent parsers deal with expressions by temporarily handing control over to a bottom-up operator precedence parser. Neither of these parsers is extremely smart about the hand-over and hand-back -- it is up to the programmer to make sure the two play together nicely. But used with caution, this approach works.

Top-down parsing and language-oriented programming

But what about taking top-down methods into the future of language-oriented programming, extensible languages, and grammars which write grammars? Here we are forced to confront the reality -- that the effectiveness of top-down parsing comes entirely from the foreign elements that are added to it. Starting from a basis of top-down parsing is literally starting with nothing. As I have shown in more detail elsewhere, top-down techniques simply do not have enough horsepower to deal with grammar-driven programming.

Perl 6 grammars are top-down -- PEG with lots of extensions. These extensions include backtracking, backtracking control, a new style of tie-breaking and lots of opportunity for the programmer to intervene and customize everything. But behind it all is a top-down parse engine.

One aspect of Perl 6 grammars might be seen as breaking out of the top-down trap. That trick of switching over to a bottom-up operator precedence parser for expressions, which I mentioned above, is built into Perl 6 and semi-automated. (I say semi-automated because making sure the two parsers "play nice" with each other is not automated -- that's still up to the programmer.)

As far as I know, this semi-automation of expression handling is new with Perl 6 grammars, and it may prove handy for duplicating what is done in recursive descent parsers. But it adds no new technique to those already in use. And features like

  • multiple types of expression, which can be told apart based on their context,
  • n-ary expressions for arbitrary n, and
  • the autogeneration of multiple rules, each allowing a different precedence scheme, for expressions of arbitrary arity and associativity,

all of which are available and in current use in Marpa, are impossible for the technology behind Perl 6 grammars.

I am a fan of the Perl 6 effort. Obviously, I have doubts about one specific set of hopes for Perl 6 grammars. But these hopes have not been central to the Perl 6 effort, and I will be an eager student of the Perl 6 team's work over the coming months.

Comments

To learn more about Marpa, there's the official web site maintained by Ron Savage. I also have a Marpa web site. Comments on this post can be made in Marpa's Google group, or on our IRC channel: #marpa at freenode.net.

Dave's Free Press: Journal: I Love Github

Dave's Free Press: Journal: Palm Treo call db module

Dave's Free Press: Journal: Travelling in time: the CP2000AN

Dave's Free Press: Journal: Graphing tool

Dave's Free Press: Journal: XML::Tiny released

Dave's Free Press: Journal: YAPC::Europe 2007 report: day 1

Ocean of Awareness: Parsing: an expanded timeline

The fourth century BCE: In India, Panini creates a sophisticated description of the Sanskrit language, exact and complete, and including pronunciation. Sanskrit could be recreated using nothing but Panini's grammar. Panini's grammar is probably the first formal system of any kind, predating Euclid. Even today, nothing like it exists for any other natural language of comparable size or corpus. Panini is the object of serious study today. But in the 1940's and 1950's Panini is almost unknown in the West. His work has no direct effect on the other events in this timeline.

1943: Emil Post defines and studies a formal rewriting system using productions. With this, the process of reinventing Panini in the West begins.

1948: Claude Shannon publishes the foundation paper of information theory. Andrey Markov's finite state processes are used heavily.

1952: Grace Hopper writes a linker-loader and describes it as a "compiler". She seems to be the first person to use this term for a computer program. Hopper uses the term "compiler" in its original sense: "something or someone that brings other things together".

1954: At IBM, a team under John Backus begins working on the language which will be called FORTRAN. The term "compiler" is still being used in Hopper's looser sense, instead of its modern one. In particular, there is no implication that the output of a "compiler" is ready for execution by a computer. One 1954 "compiler", for example, produces relative addresses, which need to be translated by hand before a machine can execute them.

1955: Noam Chomsky is awarded a Ph.D. in linguistics and accepts a teaching post at MIT. MIT does not have a linguistics department and Chomsky, in his linguistics course, is free to teach his own approach, highly original and very mathematical.

1956: Chomsky publishes the paper which is usually considered the foundation of Western formal language theory. The paper advocates a natural language approach that involves

  • a bottom layer, using Markov's finite state processes;
  • a middle, syntactic layer, using context-free grammars and context-sensitive grammars; and
  • a top layer, which involves mappings or "transformations" of the output of the syntactic layer.

These layers resemble, and will inspire, the lexical, syntactic and AST transformation phases of modern parsers. For finite state processes, Chomsky acknowledges Markov. The other layers seem to be Chomsky's own formulations -- Chomsky does not cite Post's work.

1957: Stephen Kleene discovers regular expressions, a very handy notation for Markov's processes. Regular expressions turn out to describe exactly the mathematical objects being studied as finite state automata, as well as some of the objects being studied as neural nets.

1957: Noam Chomsky publishes Syntactic Structures, one of the most influential books of all time. The orthodoxy in 1957 is structural linguistics which argues, with Sherlock Holmes, that "it is a capital mistake to theorize in advance of the facts". Structuralists start with the utterances in a language, and build upward.

But Chomsky claims that without a theory there are no facts: there is only noise. The Chomskyan approach is to start with a grammar, and use the corpus of the language to check its accuracy. Chomsky's approach will soon come to dominate linguistics.

1957: Backus's team makes the first FORTRAN compiler available to IBM customers. FORTRAN is the first high-level language that will find widespread implementation. As of this writing, it is the oldest language that survives in practical use. FORTRAN is a line-by-line language and its parsing is primitive.

1958: John McCarthy's LISP appears. LISP goes beyond the line-by-line syntax -- it is recursively structured. But the LISP interpreter does not find the recursive structure: the programmer must explicitly indicate the structure herself, using parentheses.

1959: Backus invents a new notation to describe the IAL language (aka ALGOL). Backus's notation is influenced by his study of Post -- he seems not to have read Chomsky until later.

1960: Peter Naur improves the Backus notation and uses it to describe ALGOL 60. The improved notation will become known as Backus-Naur Form (BNF).

1960: The ALGOL 60 report specifies, for the first time, a block structured language. ALGOL 60 is recursively structured but the structure is implicit -- newlines are not semantically significant, and parentheses indicate syntax only in a few specific cases. The ALGOL compiler will have to find the structure. It is a case of 1960's optimism at its best. As the ALGOL committee is well aware, a parsing algorithm capable of handling ALGOL 60 does not yet exist. But the risk they are taking will soon pay off.

1960: A.E. Glennie publishes his description of a compiler-compiler. Glennie's "universal compiler" is more of a methodology than an implementation -- the compilers must be written by hand. Glennie credits both Chomsky and Backus, and observes that the two notations are "related". He also mentions Post's productions. Glennie may have been the first to use BNF as a description of a procedure instead of as the description of a Chomsky grammar. Glennie points out that the distinction is "important".

Chomskyan BNF and procedural BNF: BNF, when used as a Chomsky grammar, describes a set of strings, and does not describe how to parse strings according to the grammar. BNF notation, if used to describe a procedure, is a set of instructions, to be tried in some order, and used to process a string. Procedural BNF describes a procedure first, and a language only indirectly.

Both procedural and Chomskyan BNF describe languages, but usually not the same language. That is,

  • Suppose D is some BNF description.
  • Let P(D) be D interpreted as a procedure.
  • Let L(P(D)) be the language which the procedure P(D) parses.
  • Let G(D) be D interpreted as a Chomsky grammar.
  • Let L(G(D)) be the language which the grammar G(D) describes.
  • Then, usually, L(P(D)) != L(G(D)).
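A concrete (invented) illustration of that last point: let D be the description S ::= 'a' | 'a' 'a'. Read as a Chomsky grammar, L(G(D)) contains both "a" and "aa"; read as a procedure with ordered choice and no backtracking between alternatives, the first alternative always wins, so L(P(D)) contains only "a":

    use strict;
    use warnings;

    # D:  S ::= 'a' | 'a' 'a'
    # Procedural reading (ordered choice, no backtracking): try the
    # alternatives in order and commit to the first one that matches.
    sub S_as_procedure {
        my ($string) = @_;
        for my $alternative ( 'a', 'aa' ) {
            return $alternative if index( $string, $alternative ) == 0;
        }
        return undef;
    }

    for my $string ( 'a', 'aa' ) {
        my $match    = S_as_procedure($string);
        my $accepted = defined $match && $match eq $string;    # must consume all input
        printf "%-2s  procedural: %-8s  Chomskyan: accepted\n",
            $string, $accepted ? 'accepted' : 'rejected';
    }
    # Both strings are in L(G(D)), but only "a" is in L(P(D)).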

The pre-Chomskyan approach, using procedural BNF, is far more natural to someone trained as a computer programmer. The parsing problem appears to the programmer in the form of strings to be parsed, exactly the starting point of procedural BNF and pre-Chomsky parsing.

Even when the Chomskyan approach is pointed out, it does not at first seem very attractive. With the pre-Chomskyan approach, the examples of the language more or less naturally lead to a parser. In the Chomskyan approach the programmer has to search for an algorithm to parse strings according to his grammar -- and the search for good algorithms to parse Chomskyan grammars has proved surprisingly long and difficult. Handling semantics is more natural with a Chomskyan approach. But, using captures, semantics can be added to a pre-Chomskyan parser and, with practice, this seems natural enough.

Despite the naturalness of the pre-Chomskyan approach to parsing, we will find that the first fully-described automated parsers are Chomskyan. This is a testimony to Chomsky's influence at the time. We will also see that Chomskyan parsers have been dominant ever since.

1961: In January, Ned Irons publishes a paper describing his ALGOL 60 parser. It is the first paper to fully describe any parser. The Irons algorithm is Chomskyan and top-down with a "left corner" element. The Irons algorithm is general, meaning that it can parse anything written in BNF. It is syntax-driven (aka declarative), meaning that the parser is actually created from the BNF -- the parser does not need to be hand-written.

1961: Peter Lucas publishes the first description of a purely top-down parser. This can be considered to be recursive descent, though in Lucas's paper the algorithm has a syntax-driven implementation, useable only for a restricted class of grammars. Today we think of recursive descent as a methodology for writing parsers by hand. Hand-coded approaches became more popular in the 1960's due to three factors:

  • Memory and CPU were both extremely limited. Hand-coding paid off, even when the gains were small.
  • Non-hand coded top-down parsing, of the kind Lucas's syntax-driven approach allowed, is a very weak parsing technique. It was (and still is) often necessary to go beyond its limits.
  • Top-down parsing is intuitive -- it essentially means calling subroutines. It therefore requires little or no knowledge of parsing theory. This makes it a good fit for hand-coding.

1963: L. Schmidt, Howard Metcalf, and Val Schorre present papers on syntax-directed compilers at a Denver conference.

1964: Schorre publishes a paper on the Meta II "compiler writing language", summarizing the papers of the 1963 conference. Schorre cites both Backus and Chomsky as sources for Meta II's notation. Schorre notes that his parser is "entirely different" from that of Irons 1961 -- in fact it is pre-Chomskyan. Meta II is a template, rather than something that readers can use, but in principle it can be turned into a fully automated compiler-compiler.

1965: Don Knuth invents LR parsing. The LR algorithm is deterministic, Chomskyan and bottom-up, but it is not thought to be practical. Knuth is primarily interested in the mathematics.

1968: Jay Earley invents the algorithm named after him. Like the Irons algorithm, Earley's algorithm is Chomskyan, syntax-driven and fully general. Unlike the Irons algorithm, it does not backtrack. Earley's algorithm is both top-down and bottom-up at once -- it uses dynamic programming and keeps track of the parse in tables. Earley's approach makes a lot of sense and looks very promising indeed, but there are three serious issues:

  • First, there is a bug in the handling of zero-length rules.
  • Second, it is quadratic for right recursions.
  • Third, the bookkeeping required to set up the tables is, by the standards of 1968 hardware, daunting.

1969: Frank DeRemer describes a new variant of Knuth's LR parsing. DeRemer's LALR algorithm requires only a stack and a state table of quite manageable size. LALR looks practical.

1969: Ken Thompson writes the "ed" editor as one of the first components of UNIX. At this point, regular expressions are an esoteric mathematical formalism. Through the "ed" editor and its descendants, regular expressions will become an everyday part of the working programmer's toolkit.

Recognizers: In comparing algorithms, it can be important to keep in mind whether they are recognizers or parsers. A recognizer is a program which takes a string and produces a "yes" or "no" according to whether the string is in the language. Regular expressions are typically used as recognizers. A parser is a program which takes a string and produces a tree reflecting its structure according to a grammar. The algorithm for a compiler clearly must be a parser, not a recognizer. Recognizers can be, to some extent, used as parsers by introducing captures.

1972: Alfred Aho and Jeffrey Ullman publish a two volume textbook summarizing the theory of parsing. This book is still important. It is also distressingly up-to-date -- progress in parsing theory slowed dramatically after 1972. Aho and Ullman describe a straightforward fix to the zero-length rule bug in Earley's original algorithm. Unfortunately, this fix involves adding even more bookkeeping to Earley's.

1972: Under the names TDPL and GTDPL, Aho and Ullman investigate the non-Chomskyan parsers in the Schorre lineage. They note that "it can be quite difficult to determine what language is defined by a TDPL parser". That is, GTDPL parsers do whatever they do, and that whatever is something the programmer in general will not be able to describe. The best a programmer can usually do is to create a test suite and fiddle with the GTDPL description until it passes. Correctness cannot be established in any stronger sense. GTDPL is an extreme form of the old joke that "the code is the documentation" -- with GTDPL nothing documents the language of the parser, not even the code.

GTDPL's obscurity buys nothing in the way of additional parsing power. Like all non-Chomskyan parsers, GTDPL is basically an extremely powerful recognizer. Pressed into service as a parser, it is comparatively weak. As a parser, GTDPL is essentially equivalent to Lucas's 1961 syntax-driven algorithm, which was in turn a restricted form of recursive descent.

At or around this time, rumor has it that the main line of development for GTDPL parsers is classified secret by the US government. GTDPL parsers have the property that even small changes in GTDPL parsers can be very labor-intensive. For some government contractors, GTDPL parsing provides steady work for years to come. Public interest in GTDPL fades.

1975: Bell Labs converts its C compiler from hand-written recursive descent to DeRemer's LALR algorithm.

1977: The first "Dragon book" comes out. This soon-to-be classic textbook is nicknamed after the drawing on the front cover, in which a knight takes on a dragon. Emblazoned on the knight's lance are the letters "LALR". From here on out, to speak lightly of LALR will be to besmirch the escutcheon of parsing theory.

1979: Bell Laboratories releases Version 7 UNIX. V7 includes what is, by far, the most comprehensive, useable and easily available compiler writing toolkit yet developed.

1979: Part of the V7 toolkit is Yet Another Compiler Compiler (YACC). YACC is LALR-powered. Despite its name, YACC is the first compiler-compiler in the modern sense. For some useful languages, the process of going from Chomskyan specification to executable is fully automated. Most practical languages, including the C language and YACC's own input language, still require manual hackery. Nonetheless, after two decades of research, it seems that the parsing problem is solved.

1987: Larry Wall introduces Perl 1. Perl embraces complexity like no previous language. Larry uses YACC and LALR very aggressively -- to my knowledge more aggressively than anyone before or since.

1991: Joop Leo discovers a way of speeding up right recursions in Earley's algorithm. Leo's algorithm is linear for just about every unambiguous grammar of practical interest, and many ambiguous ones as well. In 1991 hardware is six orders of magnitude faster than 1968 hardware, so that the issue of bookkeeping overhead had receded in importance. This is a major discovery. When it comes to speed, the game has changed in favor of the Earley algorithm.

But Earley parsing is almost forgotten. Twenty years will pass before anyone writes a practical implementation of Leo's algorithm.

1990's: Earley's is forgotten. So everyone in LALR-land is content, right? Wrong. Far from it, in fact. Users of LALR are making unpleasant discoveries. While LALR automatically generates their parsers, debugging them is so hard they could just as easily write the parser by hand. Once debugged, their LALR parsers are fast for correct inputs. But almost all they tell the users about incorrect inputs is that they are incorrect. In Larry's words, LALR is "fast but stupid".

2000: Larry Wall decides on a radical reimplementation of Perl -- Perl 6. Larry does not even consider using LALR again.

2002: John Aycock and R. Nigel Horspool publish their attempt at a fast, practical Earley's parser. Missing from it is Joop Leo's improvement -- they seem not to be aware of it. Their own speedup is limited in what it achieves and the complications it introduces can be counter-productive at evaluation time. But buried in their paper is a solution to the zero-length rule bug. And this time the solution requires no additional bookkeeping.

2004: Bryan Ford publishes his paper on PEG. Implementers by now are avoiding YACC, and it seems as if there might soon be no syntax-driven algorithms in practical use. Ford fills this gap by repackaging the nearly-forgotten GTDPL. Ford adds packratting, so that PEG is always linear, and provides PEG with an attractive new syntax. But nothing has been done to change the problematic behaviors of GTDPL.

2006: GNU announces that the GCC compiler's parser has been rewritten. For three decades, the industry's flagship C compilers have used LALR as their parser -- proof of the claim that LALR and serious parsing are equivalent. Now, GNU replaces LALR with the technology that it replaced a quarter century earlier: recursive descent.

Today: After five decades of parsing theory, the state of the art seems to be back where it started. We can imagine someone taking Ned Irons's original 1961 algorithm from the first paper ever published describing a parser, and republishing it today. True, he would have to translate its code from the mix of assembler and ALGOL into something more fashionable, say Haskell. But with that change, it might look like a breath of fresh air.

Marpa: an afterword

The recollections of my teachers cover most of this timeline. My own begin around 1970. Very early on, as a graduate student, I became unhappy with the way the field was developing. Earley's algorithm looked interesting, and it was something I returned to on and off.

The original vision of the 1960's was a parser that was

  • efficient,
  • practical,
  • general, and
  • syntax-driven.

By 2010 this vision seemed to have gone the same way as many other 1960's dreams. The rhetoric stayed upbeat, but parsing practice had become a series of increasingly desperate compromises.

But, while nobody was looking for them, the solutions to the problems encountered in the 1960's had appeared in the literature. Aycock and Horspool had solved the zero-length rule bug. Joop Leo had found the speedup for right recursion. And the issue of bookkeeping overhead had pretty much evaporated on its own. Machine operations are now a billion times faster than in 1968, and are probably no longer relevant in any case -- cache misses are now the bottleneck.

The programmers of the 1960's would have been prepared to trust a fully declarative Chomskyan parser. With the experience with LALR in their collective consciousness, modern programmers might be more guarded. As Mark Twain said, "Once a cat's been burned, he won't even sit on a cold stove." But I found it straightforward to rearrange the Earley parse engine to allow efficient event-driven handovers between procedural and syntax-driven logic. And Earley tables provide the procedural logic with full knowledge of the state of the parse so far, so that Earley's algorithm is a better platform for hand-written procedural logic than recursive descent.

References, comments, etc.

My implementation of Earley's algorithm is called Marpa. For more about Marpa, there is the semi-official web site, maintained by Ron Savage. The official, but more limited, Marpa website is my personal one. Comments on this post can be made in Marpa's Google group, or on our IRC channel: #marpa at freenode.net.

Dave's Free Press: Journal: Thanks, Yahoo!

Dave's Free Press: Journal: YAPC::Europe 2007 report: day 2

Ocean of Awareness: What are the reasonable computer languages?

"You see things; and you say 'Why?' But I dream things that never were; and I say 'Why not?'" -- George Bernard Shaw

In the 1960's and 1970's computer languages were evolving rapidly. It was not clear which way they were headed. Would most programming be done with general-purpose languages? Or would programmers create a language for every task domain? Or even for every project? And, if lots of languages were going to be created, what kinds of languages would be needed?

It was in that context that Čulik and Cohen, in a 1973 paper, outlined what they thought programmers would want and should have. In keeping with the spirit of the time, it was quite a lot:

  • Programmers would want to extend their grammars with new syntax, including new kinds of expressions.
  • Programmers would also want to use tools that automatically generated new syntax.
  • Programmers would not want to, and especially in the case of auto-generated syntax would usually not be able to, massage the syntax into very restricted forms. Instead, programmers would create grammars and languages which required unlimited lookahead to disambiguate, and they would require parsers which could handle these grammars.
  • Finally, programmers would need to be able to rely on all of this parsing being done in linear time.

Today, we think we know that Čulik and Cohen's vision was naive, because we think we know that parsing technology cannot support it. We think we know that parsing is much harder than they thought.

The eyeball grammars

As a thought problem, consider the "eyeball" class of grammars. The "eyeball" class of grammars contains all the grammars that a human can parse at a glance. If a grammar is in the eyeball class, but a computer cannot parse it, it presents an interesting choice. Either,

  • your computer is not using the strongest practical algorithm; or
  • your mind is using some power which cannot be reduced to a machine computation.

There are some people out there (I am one of them) who don't believe that everything the mind can do reduces to a machine computation. But even those people will tend to go for the choice in this case: There must be some practical computer parsing algorithm which can do at least as well at parsing as a human can do by "eyeball". In other words, the class of "reasonable grammars" should contain the eyeball class.

Čulik and Cohen's candidate for the class of "reasonable grammars" were the grammars that a deterministic parse engine could parse if it had a lookahead that was infinite, but restricted to distinguishing between regular expressions. They called these the LR-regular, or LRR, grammars. And the LRR grammars do in fact seem to be a good first approximation to the eyeball class. They do not allow lookahead that contains things that you have to count, like palindromes. And, while I'd be hard put to eyeball every possible string for every possible regular expression, intuitively the concept of scanning for a regular expression does seem close to capturing the idea of glancing through a text looking for a telltale pattern.

So what happened?

Alas, the algorithm in the Čulik and Cohen paper turned out to be impractical. But in 1991, Joop Leo discovered a way to adapt Earley's algorithm to parse the LRR grammars in linear time, without doing the lookahead. And Leo's algorithm does have a practical implementation: Marpa.

References, comments, etc.

To learn more about Marpa, there's the official web site maintained by Ron Savage. I also have a Marpa web site. Comments on this post can be made in Marpa's Google group, or on our IRC channel: #marpa at freenode.net.

Dave's Free Press: Journal: YAPC::Europe 2007 travel plans

Dave's Free Press: Journal: Wikipedia handheld proxy

Dave's Free Press: Journal: Bryar security hole

Dave's Free Press: Journal: POD includes

Dave's Free Press: Journal: cgit syntax highlighting

Dave's Free Press: Journal: CPAN Testers' CPAN author FAQ

Ocean of Awareness: What parser do birds use?

"Here we provide, to our knowledge, the first unambiguous experimental evidence for compositional syntax in a non-human vocal system." -- "Experimental evidence for compositional syntax in bird calls", Toshitaka N. Suzuki, David Wheatcroft & Michael Griesser Nature Communications 7, Article number: 10986

In this post I look at a subset of the language of the Japanese great tit, also known as Parus major. The above cited article presents evidence that bird brains can parse this language. What about standard modern computer parsing methods? Here is the subset -- probably a tiny one -- of the language actually used by Parus major.

      S ::= ABC
      S ::= D
      S ::= ABC D
      S ::= D ABC
    

Classifying the Parus major grammar

Grammophone is a very handy new tool for classifying grammars. Its own parser is somewhat limited, so that it requires a period to mark the end of a rule. The above grammar is in Marpa's SLIF format, which is smart enough to use the "::=" operator to spot the beginning and end of rules, just as the human eye does. Here's the same grammar converted into a form acceptable to Grammophone:

      S -> ABC .
      S -> D .
      S -> ABC D .
      S -> D ABC .
    

Grammophone tells us that the Parus major grammar is not LL(1), but that it is LALR(1).

What does this mean?

LL(1) is the class of grammar parseable by top-down methods: it's the best class for characterizing most parsers in current use, including recursive descent, PEG, and Perl 6 grammars. All of these parsers fall short of dealing with the Parus major language.

LALR(1) is probably most well-known from its implementations in bison and yacc. While able to handle this subset of Parus's language, LALR(1) has its own, very strict, limits. Whether LALR(1) could handle the full complexity of Parus language is a serious question. But it's a question that in practice would probably not arise. LALR(1) has horrible error handling properties.

When the input is correct and within its limits, an LALR-driven parser is fast and works well. But if the input is not perfectly correct, LALR parsers produce no useful analysis of what went wrong. If Parus hears "d abc d", a parser like Marpa, on the other hand, can produce something like this:

# * String before error: abc d\s
# * The error was at line 1, column 7, and at character 0x0064 'd', ...
# * here: d
    

Parus uses its language in predatory contexts, and one can assume that a Parus with a preference for parsers whose error handling is on an LALR(1) level will not be keeping its alleles in the gene pool for very long.
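For readers who want to try this at home, here is a minimal sketch (my own, and only an approximation of the test script in the gist mentioned below) of the Parus subset as a Marpa::R2 SLIF grammar:

    use strict;
    use warnings;
    use Marpa::R2;

    my $dsl = q{
        :default ::= action => ::array
        lexeme default = latm => 1
        :discard ~ ws

        S   ::= ABC
        S   ::= D
        S   ::= ABC D
        S   ::= D ABC
        ABC ::= 'abc'
        D   ::= 'd'

        ws ~ [\s]+
    };

    my $grammar = Marpa::R2::Scanless::G->new( { source => \$dsl } );
    for my $call ( 'abc d', 'd abc', 'd abc d' ) {
        my $ok = eval { $grammar->parse( \$call ); 1 };
        print "'$call': ", $ok ? "parses\n" : "is rejected\n";
    }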

References, comments, etc.

Those readers content with sub-Parus parsing methods may stop reading here. Those with greater parsing ambitions, however, may wish to learn more about Marpa. A Marpa test script for parsing the Parus subset is in a Github gist. Marpa has a semi-official web site, maintained by Ron Savage. The official, but more limited, Marpa website is my personal one. Comments on this post can be made in Marpa's Google group, or on our IRC channel: #marpa at freenode.net.

Dave's Free Press: Journal: Thankyou, Anonymous Benefactor!

Dave's Free Press: Journal: Number::Phone release

Dave's Free Press: Journal: Ill

Dave's Free Press: Journal: CPANdeps upgrade

Dave's Free Press: Journal: YAPC::Europe 2006 report: day 3

Ocean of Awareness: Introduction to Marpa Book in progress

What follows is a summary of the features of the Marpa algorithm, followed by a discussion of potential applications. It refers to itself as a "monograph", because it is a draft of part of the introduction to a technical monograph on the Marpa algorithm. I hope the entire monograph will appear in a few weeks.

The Marpa project

The Marpa project was intended to create a practical and highly available tool to generate and use general context-free parsers. Tools of this kind had long existed for LALR and regular expressions. But, despite an encouraging academic literature, no such tool had existed for context-free parsing. The first stable version of Marpa was uploaded to a public archive on Solstice Day 2011. This monograph describes the algorithm used in the most recent version of Marpa, Marpa::R2. It is a simplification of the algorithm presented in my earlier paper.

A proven algorithm

While the presentation in this monograph is theoretical, the approach is practical. The Marpa::R2 implementation has been widely available for some time, and has seen considerable use, including in production environments. Many of the ideas in the parsing literature satisfy theoretical criteria, but in practice turn out to face significant obstacles. An algorithm may be as fast as reported, but may turn out not to allow adequate error reporting. Or a modification may speed up the recognizer, but require additional processing at evaluation time, leaving no advantage to compensate for the additional complexity.

In this monograph, I describe the Marpa algorithm as it was implemented for Marpa::R2. In many cases, I believe there are better approaches than those I have described. But I treat these techniques, however solid their theory, as conjectures. Whenever I mention a technique that was not actually implemented in Marpa::R2, I will always explicitly state that that technique is not in Marpa as implemented.

Features

General context-free parsing

As implemented, Marpa parses all "proper" context-free grammars. The proper context-free grammars are those which are free of cycles, unproductive symbols, and inaccessible symbols. Worst case time bounds are never worse than those of Earley's algorithm, and therefore never worse than O(n**3).

Linear time for practical grammars

Currently, the grammars suitable for practical use are thought to be a subset of the deterministic context-free grammars. Using a technique discovered by Joop Leo, Marpa parses all of these in linear time. Leo's modification of Earley's algorithm is O(n) for LR-regular grammars. Leo's modification also parses many ambiguous grammars in linear time.

Left-eidetic

The original Earley algorithm kept full information about the parse --- including partial and fully recognized rule instances --- in its tables. At every parse location, before any symbols are scanned, Marpa's parse engine makes available its information about the state of the parse so far. This information is in useful form, and can be accessed efficiently.

Recoverable from read errors

When Marpa reads a token which it cannot accept, the error is fully recoverable. An application can try to read another token. The application can do this repeatedly as long as none of the tokens are accepted. Once the application provides a token that is accepted by the parser, parsing will continue as if the unsuccessful read attempts had never been made.

Ambiguous tokens

Marpa allows ambiguous tokens. These are often useful in natural language processing where, for example, the same word might be a verb or a noun. Use of ambiguous tokens can be combined with recovery from rejected tokens so that, for example, an application could react to the rejection of a token by reading two others.

Using the features

Error reporting

An obvious application of left-eideticism is error reporting. Marpa's abilities in this respect are ground-breaking. For example, users typically regard an ambiguity as an error in the grammar. Marpa, as currently implemented, can detect an ambiguity and report specifically where it occurred and what the alternatives were.

Event driven parsing

As implemented, Marpa::R2 allows the user to define "events". Events can be defined that trigger when a specified rule is complete, when a specified rule is predicted, when a specified symbol is nulled, when a user-specified lexeme has been scanned, or when a user-specified lexeme is about to be scanned. A mid-rule event can be defined by adding a nulling symbol at the desired point in the rule, and defining an event which triggers when the symbol is nulled.
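Here is a rough sketch of declaring and watching two such events (not from the monograph; the grammar, symbol names and input are invented, and the pause/resume details should be checked against the Marpa::R2::Scanless documentation):

    use strict;
    use warnings;
    use Marpa::R2;

    my $dsl = q{
        :default ::= action => ::array
        lexeme default = latm => 1
        :discard ~ ws

        pair  ::= key ('=') value
        key   ::= word
        value ::= word

        word ~ [\w]+
        ws   ~ [\s]+

        event 'value_expected' = predicted <value>
        event 'pair_done'      = completed <pair>
    };

    my $grammar = Marpa::R2::Scanless::G->new( { source => \$dsl } );
    my $recce   = Marpa::R2::Scanless::R->new( { grammar => $grammar } );

    my $input = 'colour = blue';
    my $pos   = $recce->read( \$input );
    while (1) {
        # read() and resume() return whenever a declared event fires.
        for my $event ( @{ $recce->events() } ) {
            my ($name) = @{$event};
            print "event '$name' at position $pos\n";
        }
        last if $pos >= length $input;
        $pos = $recce->resume();
    }
    print defined $recce->value() ? "parse ok\n" : "no parse\n";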

Ruby slippers parsing

Left-eideticism, efficient error recovery, and the event mechanism can be combined to allow the application to change the input in response to feedback from the parser. In traditional parser practice, error detection is an act of desperation. In contrast, Marpa's error detection is so painless that it can be used as the foundation of new parsing techniques.

For example, if a token is rejected, the lexer is free to create a new token in the light of the parser's expectations. This approach can be seen as making the parser's "wishes" come true, and I have called it "Ruby Slippers Parsing".

One use of the Ruby Slippers technique is to parse with a clean but oversimplified grammar, programming the lexical analyzer to make up for the grammar's short-comings on the fly. As part of Marpa::R2, the author has implemented an HTML parser, based on a grammar that assumes that all start and end tags are present. Such an HTML grammar is too simple even to describe perfectly standard-conformant HTML, but the lexical analyzer is programmed to supply start and end tags as requested by the parser. The result is a simple and cleanly designed parser that parses very liberal HTML and accepts all input files, in the worst case treating them as highly defective HTML.

Ambiguity as a language design technique

In current practice, ambiguity is avoided in language design. This is very different from the practice in the languages humans choose when communicating with each other. Human languages exploit ambiguity in order to design highly flexible, powerfully expressive languages. For example, the language of this monograph, English, is notoriously ambiguous.

Ambiguity of course can present a problem. A sentence in an ambiguous language may have undesired meanings. But note that this is not a reason to ban potential ambiguity --- it is only a problem with actual ambiguity.

Syntax errors, for example, are undesired, but nobody tries to design languages to make syntax errors impossible. A language in which every input was well-formed and meaningful would be cumbersome and even dangerous: all typos in such a language would be meaningful, and the parser would never warn the user about errors, because there would be no such thing.

With Marpa, ambiguity can be dealt with in the same way that syntax errors are dealt with in current practice. The language can be designed to be ambiguous, but any actual ambiguity can be detected and reported at parse time. This exploits Marpa's ability to report exactly where and what the ambiguity is. Marpa::R2's own parser description language, the SLIF, uses ambiguity in this way.

Auto-generated languages

In 1973, Čulik and Cohen pointed out that the ability to efficiently parse LR-regular languages opens the way to auto-generated languages. In particular, Čulik and Cohen note that a parser which can parse any LR-regular language will be able to parse a language generated using syntax macros.

Second order languages

In the literature, the term "second order language" is usually used to describe languages with features which are useful for second-order programming. True second-order languages --- languages which are auto-generated from other languages --- have not been seen as practical, since there was no guarantee that the auto-generated language could be efficiently parsed.

With Marpa, this barrier is raised. As an example, Marpa::R2's own parser description language, the SLIF, allows "precedenced rules". Precedenced rules are specified in an extended BNF. The BNF extensions allow precedence and associativity to be specified for each RHS.

Marpa::R2's precedenced rules are implemented as a true second order language. The SLIF representation of the precedenced rule is parsed to create a BNF grammar which is equivalent, and which has the desired precedence. Essentially, the SLIF does a standard textbook transformation. The transformation starts with a set of rules, each of which has a precedence and an associativity specified. The result of the transformation is a set of rules in pure BNF. The SLIF's advantage is that it is powered by Marpa, and therefore the SLIF can be certain that the grammar that it auto-generates will parse in linear time.

Notationally, Marpa's precedenced rules are an improvement over similar features in LALR-based parser generators like yacc or bison. In the SLIF, there are two important differences. First, in the SLIF's precedenced rules, precedence is generalized, so that it does not depend on the operators: there is no need to identify operators, much less class them as binary, unary, etc. This more powerful and flexible precedence notation allows the definition of multiple ternary operators, and multiple operators with arity above three.

Second, and more important, a SLIF user is guaranteed to get exactly the language that the precedenced rule specifies. The user of the yacc equivalent must hope their syntax falls within the limits of LALR.
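For concreteness, here is a trimmed-down version of the calculator grammar from the Marpa::R2 documentation, showing how a precedenced rule looks in the SLIF; the '||' separates precedence levels (tightest first) and associativity is declared per alternative:

    use strict;
    use warnings;
    use Data::Dumper;
    use Marpa::R2;

    my $dsl = q{
        :default ::= action => ::array
        lexeme default = latm => 1
        :discard ~ ws

        Expression ::=
              Number
            | '(' Expression ')'          assoc => group
           || Expression '**' Expression  assoc => right
           || Expression '*' Expression
            | Expression '/' Expression
           || Expression '+' Expression
            | Expression '-' Expression

        Number ~ [\d]+
        ws     ~ [\s]+
    };

    my $grammar   = Marpa::R2::Scanless::G->new( { source => \$dsl } );
    my $value_ref = $grammar->parse( \'4 + 5 * 3' );
    print Dumper($value_ref);    # '5 * 3' ends up grouped under the '+', as the precedences dictate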

References, comments, etc.

Marpa has a semi-official web site, maintained by Ron Savage. The official, but more limited, Marpa website is my personal one. Comments on this post can be made in Marpa's Google group, or on our IRC channel: #marpa at freenode.net.
