Ricardo Signes: The Slack IRC gateway drops your messages.

So, imagine the following exchange in private message with one of your team members:

  <alice> So, how have I been doing?
  <bob> Frankly, I don't think it's working out.
  * bob is joking!  You're doing great.

Maybe Bob shouldn't be such a joker here, but sometimes Bob can't help himself. Unfortunately, Bob has just caused Alice an incredible amount of stress. Or, more to the point, Slack has gotten in the way of team communication, leading to a terrible misunderstanding.

The problem is that when you use /me on the IRC gateway while in private chat with someone, Slack just drops the message on the floor and doesn't deliver it! Alice never got that final message.

Read that again, then ask yourself how often you may have miscommunicated with your team because of this. Remember that it goes both ways: maybe you never used /me in privmsg, but did your coworkers? Who knows! I reported this bug to Slack feedback in August 2016 and again in September. When I reported it the second time, I went back in my IRC logs to compare them to Slack logs. If I found a time when an emote showed up in both, I'd know when the problem started.

The answer is that it started around January or February 2016. In other words, this silent data loss bug has been in place for a year and known about for, at the very least, five months. It hasn't been fixed. More than a few times in this period, I've realized that I missed an invitation to a meeting or other important communication because it was /me-ed at me in privmsg. This is garbage.

I can't fix Slack from my desk, but I can fix my IRC client to prevent me from making this error. I wrote this irssi plugin:

use warnings;
use strict;

use Irssi ();

our $VERSION = '0.001';
our %IRSSI = (
  authors => 'rjbs',
  name    => 'slack-privmsg-me',
);

Irssi::signal_add('message irc own_action'  => sub {
  my ($server, $message, $target) = @_;

  # only stop /me on Slack
  return unless $server->{address} =~ /\.slack\.com\z/i;

  return if $target =~ /^#/; # allow /me in channels

  Irssi::print("Sorry, Slack drops /me silently in private chat.");
  Irssi::signal_stop();
});

This intercepts every "about to send an action" event and, if it's to a Slack chatnet and in a private message, reports an error to the user and aborts. I could've made it turn the message into a normal text message, but I thought I'd keep it inconvenient to keep me angry.

Please use this plugin. If you port it to WeeChat, I'll add a link here.

When people tell you, "Of course your free software project should use Slack. There's even an IRC gateway!" remember that Slack doesn't seem to give a darn about the IRC gateway losing messages. It's saying that gateway users are second-class users.

Garbage.

Sawyer X: Perl 5 Porters Mailing List Summary: January 16th-22nd

Hey everyone,

Following is the p5p (Perl 5 Porters) mailing list summary for the past week.

Enjoy!

January 16th-22nd

News and updates

Perl 5.25.9 is now available!

Abigail has implemented the previously discussed deprecations, with a few changes.

A new API function, op_class, is now available.

CPAN.pm was updated to version 2.16.

Grant reports

Issues

New issues

Jan Dubois reports (Building Perl on Cygwin: /proc/curproc/exe) 5.24.0 and 5.24.1 hang in the Configure phase on Cygwin.

Resolved issues

Suggested patches

Christian Millour provided (Re: [PATCH] Re: no '"my $b" used in sort comparison' warning when $b is dereferenced ?) a patch to the documentation regarding the nature of $a and $b in sort().

Neil Bowers suggested updating the NAME documentation of Getopt::Std.

Andy Lester provided several patches:

  • Perl #130591: [PATCH] Make arguments to S_invlist_is_iterating and S_invlist_max be const.
  • Perl #130592: [PATCH] Add prototypes for 6 mathoms to satisfy -Wmissing-prototypes.
  • Perl #130596: Perl_utf8_to_uvchr_buf has no prototype in any .h file.

Discussion

Dave Mitchell asked several questions about the slowness of two tests: ext/XS-APItest/t/handy.t and ext/XS-APItest/t/utf8.t.

Todd Rinaldo sent an email to cpan-workers regarding the toolchain and removal of . from @INC. Discussion continued.

Dave Mitchell has made the desired changes to the output format of op_dump() discussed previously.

Karl Williamson notes that perl still attempts to compile regular expressions, even if there are parsing errors, and wonders if there's a valid reason for it.

Perl Foundation News: Grant Report : RPerl User Documentation #2 - Dec 2016

Will Braswell reports that he has completed the deliverables for RPerl Docs #2:

"Lots of big news for RPerl! First, Christmas saw the release of the new Perl-powered platform CloudForFree.org ;v1.0, codename Nimbostratus. Secondly, on New Years Day we released RPerl v2.4, codename Aurora. And last but not least, we are proud to announce the publication of Learning RPerl chapter 4, thereby completing part 2 of the TPF grant! Over 160 pages of brand new material has been written under this grant for chapters 2, 3, and 4 of Learning RPerl.

CHAPTER 4: ORGANIZING BY SUBROUTINES

What's next for RPerl? What other exciting new advancements are coming down the pike? Stay tuned to RPerl.org and find out!"

MAJ

Gabor Szabo: Open Source Evenings in Modiin

This is the description of the Open Source meetings in Modiin, Israel. For dates and to RSVP, visit our Meetup page.

For the full article visit Open Source Evenings in Modiin

Perlgeek.de : Perl 6 By Example: Perl 6 Review

This blog post is part of my ongoing project to write a book about Perl 6.

If you're interested, please sign up for the mailing list at the bottom of the article, or here. It will be low volume (less than an email per month, on average).


In the previous "Perl 6 by Example" blog posts we've discussed some examples interleaved with the Perl 6 mechanics that make them work. Here I want to summarize and deepen the Perl 6 knowledge that we've touched on so far, removed from the original examples.

Variables and Scoping

In Perl 6, variable names are made of a sigil, $, @, % or &, followed by an identifier. The sigil implies a type constraint, where $ is the most general one (no restriction by default), @ is for arrays, % for hashes (associative arrays/maps), and & for code objects.

Identifiers can contain - and ' characters, as long as the character after them is a letter. Identifiers must start with a letter or underscore.
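
To make this concrete, here is a small sketch of my own (the variable names are just for illustration):

my $answer = 42;                   # scalar: no type restriction by default
my @primes = 2, 3, 5, 7;           # array
my %age    = alice => 39;          # hash (associative array)
my &twice  = sub ($x) { $x * 2 };  # code object

my $is-valid    = True;            # hyphen allowed: it is followed by a letter
my $rock'n'roll = 'yes';           # and so is the apostrophe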

Subroutines and variables declared with my are lexically scoped. They are visible from the point of the declaration to the end of the current {}-enclosed block (or the current file, in case the declaration is outside a block). Subroutine parameters are visible in the signature and block of the subroutine.
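
For instance (an illustrative sketch):

{
    my $greeting = 'Hello';
    say $greeting;      # fine: we are still inside the block
}
# say $greeting;        # error: $greeting is not declared in this outer scope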

An optional twigil between the sigil and identifier can influence the scoping. The * twigil marks a dynamically scoped variable, so lookup is performed in the current call stack. ! marks attributes, that is, per-instance variables attached to an object.
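
Here is a short sketch of my own showing both twigils; the names $*log-level and Counter are made up for the example:

my $*log-level = 'debug';       # dynamic variable, found via the call stack

sub log-message($text) {
    say "[$*log-level] $text";  # looked up dynamically in the caller's scope
}
log-message('starting up');     # [debug] starting up

class Counter {
    has $!count = 0;            # ! twigil: a private, per-instance attribute
    method bump() { ++$!count }
}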

Subroutines

A subroutine, or short sub, is a piece of code with its own scope and usually also a name. It has a signature which specifies what kind of values you have to pass in when you call it:

sub chunks(Str $s, Int $chars) {
#         ^^^^^^^^^^^^^^^^^^^^ signature
#   ^^^^^^ name
    gather for 0 .. $s.chars / $chars - 1 -> $idx {
        take substr($s, $idx * $chars, $chars);
    }
}

The variables used in the signature are called parameters, whereas the values that you pass in are called arguments.

To refer to a subroutine without calling it, put an ampersand character in front of it, for example

say &chunks.^name;      # Sub

to call it, simply use its name, followed by the list of arguments, which can optionally be in parentheses:

say chunks 'abcd', 2;   # (ab cd)
say chunks('abcd', 2);  # (ab cd)

You only need the parentheses if some other construct would otherwise interfere with the subroutine call. For example if you intend to write

say chunks(join('x', 'ab', 'c'), 2);

and you leave out the inner pair of parentheses:

say chunks(join 'x', 'ab', 'c', 2);

then all the arguments go to the join function, leaving only one argument to the chunks function. On the other hand it is fine to leave out the outer pair of parentheses and write

say chunks join('x', 'ab', 'c'), 2;

because there's no ambiguity here.

One case worth noting is that if you call a subroutine without arguments in the condition of an if statement or a for loop (or similar constructs), you have to include the parentheses, because otherwise the block is parsed as an argument to the function.

sub random-choice() {
    Bool.pick;
}

# right way:
if random-choice() {
    say 'You were lucky.';
}

# wrong way:
if random-choice {
    say 'You were lucky.';
}

If you do happen to make this mistake, the Perl 6 compiler tries very hard to detect it. In the example above, it says

Function 'random-choice' needs parens to avoid gobbling block

and when it tries to parse the block for the if-statement, it doesn't find one:

Missing block (apparently claimed by 'random-choice')

When you have a sub called MAIN, Perl 6 uses its signature to parse the command line arguments and pass those command line arguments to MAIN.
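
For example, a hypothetical script with this MAIN:

sub MAIN(Str $filename, Int :$limit = 10, Bool :$verbose) {
    say "Reading up to $limit lines from $filename";
    say "Verbose output enabled" if $verbose;
}

Positional parameters become positional command line arguments and named parameters become options, so you could call this as script.p6 data.txt --limit=5 --verbose; an invocation that doesn't match the signature prints a usage message derived from it.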

multi subs are several subroutines with the same name but different signatures. Which candidate is called is decided at run time, based on the best match between arguments and parameters.
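
A small illustration:

multi sub describe(Int $n) { say "an integer: $n" }
multi sub describe(Str $s) { say "a string: $s"   }

describe(42);        # an integer: 42
describe('hello');   # a string: hello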

Classes and Objects

Class declarations follow the same syntactic schema as subroutine declarations: the keyword class, followed by the name, followed by the body in curly braces:

class OutputCapture {
    has @!lines;
    method print(\s) {
        @!lines.push(s);
    }
    method captured() {
        @!lines.join;
    }
}

By default, type names are scoped to the current namespace; however, you can make a class lexically scoped by adding a my in front of class:

my class OutputCapture { ... }

Creating a new instance generally works by calling the new method on the type object. The new method is inherited from the implicit parent class Any that all types get:

my $c = OutputCapture.new;

Per-instance state is stored in attributes, which are declared with the has keyword, as seen above in has @!lines. Attributes are always private, as indicated by the ! twigil. If you use the dot . twigil in the declaration instead, you have both the private attribute @!lines and a public, read-only accessor method:

my class OutputCapture {
    has @.lines;
    method print(\s) {
        # the private name with ! still works
        @!lines.push(s);
    }
    method captured() {
        @!lines.join;
    }
}
my $c = OutputCapture.new;
$c.print('42');
# use the `lines` accessor method:
say $c.lines;       # [42]

When you declare attributes with the dot twigil, you can also initialize the attributes from the constructor through named arguments, as in OutputCapture.new( lines => [42] ).

Private methods start with a ! and can only be called from inside the class body as self!private-method.
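
A brief sketch (the Stack class is my own example, not from the earlier posts):

class Stack {
    has @!items;
    method !ensure-not-empty() {
        die 'stack is empty' unless @!items;
    }
    method push($item) { @!items.push($item) }
    method pop() {
        self!ensure-not-empty();   # note the ! instead of . in the call
        @!items.pop;
    }
}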

Methods are basically just subroutines, with two differences. The first is that they get an implicit parameter called self, which contains the object the method is called on (which we call the invocant). The second is that if you call a subroutine, the compiler searches for it in the current lexical scope and in outer scopes. Methods, on the other hand, are looked up in the class of the object and in its superclasses.

Concurrency

Perl 6 provides high-level primitives for concurrency and parallel execution. Instead of explicitly spawning new threads, you are encouraged to run a computation with start, which returns a Promise. This is an object that promises that in the future the computation will yield a result. The status can thus be Planned, Kept or Broken. You can chain promises, combine them, and wait for them.
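
Here is a minimal sketch; the exact status you see depends on how quickly the computation finishes:

my $promise = start {
    (1..10_000).grep(*.is-prime).elems;   # some potentially slow work
};
say $promise.status;    # most likely Planned, while the work is still running
say await $promise;     # 1229
say $promise.status;    # Kept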

In the background, a scheduler distributes such computations to operating system level threads. The default scheduler is a thread pool scheduler with an upper limit to the number of threads to use.

Communication between parallel computations should happen through thread-safe data structures. Foremost among them are the Channel, a thread-safe queue, and Supply, Perl 6's implementation of the Observer Pattern. Supplies are very powerful, because you can transform them with methods such as map, grep, throttle or delayed, and use their actor semantics to ensure that a consumer runs in only one thread at a time.
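
As a small sketch of the Channel part of this (the job values are made up):

my $channel = Channel.new;

my $worker = start {
    # .list iterates values until the channel is closed and drained
    for $channel.list -> $job {
        say "processing $job";
    }
};

$channel.send($_) for 1..5;
$channel.close;          # signal that no more values are coming
await $worker;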

Subscribe to the Perl 6 book mailing list


Ovid: I hope to see you at FOSDEM in Brussels

I've been rather quiet lately because between building Tau Station and working with some clients, I'm running around faster than a long-tailed cat in a room full of rocking chairs.

I'll be in Brussels for FOSDEM on February 4-5, 2017, talking about Building the Tau Station Universe in Perl. I was planning on giving a talk about testing, but I was specifically asked if I'd talk about Tau Station. While I love the project, I tried to think of a way it wouldn't sound like a 40-minute infomercial at an open source conference.

Relaxing in a high-tech hotel room in a space station.

I'm pretty sure I've succeeded and I have to say, I think people are going to be really impressed with some of the code examples. In fact, it's gotten me to the point where I'm having more serious doubts about how object-oriented programming is currently taught. For example, what does the single responsibility principle mean for a combat suit that can pass the Turing test? It serves as armor, and might serve as a weapon, and might serve as a medkit, and even give the soldier a pep-talk.

Single-responsibility my ass.

The Tau Station talk will show a very simple way out of that conundrum.

We have an awesome team working on Tau Station. We've used the same hiring strategy that we use hiring for our clients at All Around the World and it's paid off in spades. You'll be impressed with their work. Fortunately, since this is FOSDEM, even if you can't make it to Brussels, the video will later be online for free and I'll post a link for you.

Hope to see you there!

Perl Foundation News: Grant Proposal: Standardization, Test Coverage, and Documentation of Perl 6 I/O Routines

The Grants Committee has received one grant proposal Standardization, Test Coverage, and Documentation of Perl 6 I/O Routines for the January round. Before the Committee members vote, we would like to solicit feedback from the Perl community on the proposal.

Review the proposal below and please comment here by January 23rd, 2017. The Committee members will start the voting process following that and the conclusion will be announced approximately in one week.

Standardization, Test Coverage, and Documentation of Perl 6 I/O Routines

  • Name:

    Zoffix Znet (legal name: Pete Evstratov)

  • Amount Requested:

    USD 999

Synopsis

Many of Perl 6's I/O routines currently have inconsistencies in behaviour between method and subroutine forms, along with inconsistencies in calling forms and failure modes compared to the rest of the language. Also, some of them, despite being speculated in the synopses and implemented in Rakudo, have no test coverage in the Perl 6 Specification, and as such they remain undocumented and not part of the Perl 6 Language. Some of the routines that do have a presence in the Specification have sparse test coverage, leaving some of their functionality entirely untested.

The work funded by this grant seeks to resolve those inconsistencies, as well as provide full test coverage and documentation.

Benefits to the Perl Community

Inconsistent interfaces are difficult to master and are bug-prone due to programmers forgetting about the varying details of individual routines. By making I/O routines take arguments and indicate failures consistently with the way it's done in the rest of the language, they will be much easier for programmers to learn and use. Along with a more pleasant interface, the Perl Community will also benefit by needing to provide less support for people trying to learn the language.

The largest benefit will come from the full test coverage that will no doubt reveal bugs that can be fixed before they're encountered by users in production code. The coverage will also protect from future bugs being introduced inadvertently. Also, the currently unspecced routines will be officially part of the language, and so will be documented and available for use.

As bonus benefits:

The timing of this work seeks to align with the production of the "Learning Perl 6" book, in order for it to contain valid information on I/O routines and describe only routines that are actually part of the language, which will avoid confusion for its readers and for folks helping those readers learn Perl 6.

Lastly, the bonus deliverables (described below) will benefit further development of Rakudo/Perl 6 by elucidating unspecced, unused, or unwanted routines. The data can also be used to produce a teaching/reference aid (e.g. flash cards with routine names or IRC bot for routine reference).

Deliverables

Scope: by I/O routines this grant proposal means subroutines and methods of the Rakudo implementation found in src/core/io_operators.pm, src/core/IO.pm, src/core/IO/ArgFiles.pm, src/core/IO/Handle.pm, src/core/IO/Notification.pm, src/core/IO/Path.pm, src/core/IO/Pipe.pm, src/core/IO/Special.pm, and src/core/IO/Spec.pm as well as its subclasses found in src/core/IO/Spec/ directory. Note that IO::Socket and its subclasses are NOT in the scope of this grant proposal.

Deliverables:

  • I/O Action Plan Report
  • Commits in rakudo implementing the Action Plan.
  • Commits in Perl 6 Specification providing full coverage for I/O routines.
  • Commits in Perl 6 Documentation providing full coverage for undocumented I/O routines, as well as any corrections for existing I/O documentation.
  • Bonus deliverable: The "Map" of Rakudo Routines
  • Bonus objective: whenever possible, I intend to fix any of the bugs found by the new test coverage. Should a bug prove to be hard to fix, the test exposing it will be fudged and a ticket for the bug will be filed.

Project Details

  • (Bonus deliverable): The "Map" of Rakudo Routines This will be a semi-automatically generated (by introspection) list of all subroutines and publicly accessible object methods provided by the Rakudo implementation, together with information on what calling convention they use and how they fail (e.g. returning some object, throwing, or returning a Failure object). This data will guide the decision on how the calling convention/failure mode for I/O routines should be standardized. I intend to work on this deliverable while this grant proposal is deliberated. As such, it's a bonus deliverable and will be completed regardless of whether this grant is approved.
  • I/O Action Plan Report This will be a Markdown document detailing how the existing routines will change, what the new routines will do, if any are to be added, and whether any currently unspecced routines are to be removed. Parts of this document will eventually be re-used as documentation for currently undocumented routines.
    • The scope of all of the changes will be limited by the 6.c Perl 6 language specification, so none of the changes would have to be deferred until 6.d language release.
    • This document will be placed into the rakudo repository when ready, and other core developers will be asked to review it and comment on the proposed changes.
  • Commits The commits to rakudo/roast/doc repositories will implement changes in the Action Plan, and test and document any I/O routines that are still left untested/undocumented after those changes.

Inch-stones

These are the inch-stones I intend to follow:

Routine Map

  • Use Perl 6's introspection facilities to locate core subroutines and classes in CORE:: lexical scope. Further introspect those classes to obtain Method objects. Further introspect the candidates and signatures of all of these routines to obtain the calling convention list (e.g. we pass arguments as adverbs or as positionals, etc)
  • Further introspect the arguments and programmatically invoke the routines with some probably-acceptable values to obtain the output range, in particular failure modes. Some heuristics or manual massaging will be needed here.
  • Part of this work already exists in a repo. If the grant is approved, this repository will be transferred to the Perl 6 GitHub organization.

I/O Action Plan Report

  • Using the routine map, figure out a better interface for I/O routines and then see what can be changed without breaking the 6.c language spec. Even without that map, I can already see some targets for change: the routines that take a $test named arg as a predefined string of test letters in a particular order, as well as the inconsistency that most methods fail() while most subroutines throw(). Based on a brief conversation with Larry Wall, the likely change would be to make all of them consistently fail().
  • Inspect the code of NewIO branch for any ideas that can be salvaged for 6.c Language. The work in this branch was created by lizmat++ in 2014-2015, but due to unfortunate circumstances never got merged to master prior to 6.c release. My understanding is it offers a lot of improvements to our current IO, but based on brief perusal of the code, I suspect it would have a lot of conflict with 6.c Language Specification tests. Under this grant, I would like to see if any of the improvements that work offers can be applied to current 6.c language.
  • Using the MoarVM coverage report tool (and by grepping the roast repo), find routines that are entirely unspecced and decide whether they should stay or be removed. For example, the currently unspecced indir() routine is a likely candidate for staying, while I've seen some calls on IRC to remove the unspecced tmpdir().
  • I foresee the largest part of this grant work being the writing of the tests. Here are some examples of the rough current coverage state of I/O routines: src/core/io_operators.pm, IO::ArgFiles, IO::Special, IO::Spec::Win32, and IO::Handle. The routines whose names are on red lines are untested in roast (this report was generated on Oct 7, and as I recall only IO::ArgFiles.lines and IO::Handle.seek have received any further tests since then).
  • A lot of the decisions made by the Action Plan will be able to close many of the tickets opened by brian d foy while gathering info on our I/O routines that is going to be included in Learning Perl 6 book and the concerns raised in the tickets will be used as input for the Action Plan as well. (Some of them are: RT#130460, RT#130456, RT#130454, RT#130455, RT#130489, RT#130490).
  • Write down the action plan in Markdown format, make it available in Rakudo repository, and invite other core members in #perl6-dev IRC channel to comment on it.
  • After a 1 week review period, update the Action Plan to reflect any received feedback, and proceed to implement it.

Commits

The commits will implement the Action Plan. Roast commits will be based on the changes to routines as well as the report generated by the coverage tool. And the doc commits will be done by manually searching and reading existing documentation and amending it as needed.

(N.B.: I'm aware the coverage tool is currently busted by a commit that changed filenames in .file method for core routines; however MasterDuke++ promised to fix it, and if they won't be able to find time to do so, the fix should be simple enough that I'd fix the tool myself).

Project Schedule

I have already begun work on the routine map generator and will complete it by the time the decision on this grant proposal is available. After that, I expect to spend 2 weeks preparing the I/O Action Plan Report, 1 week for its review by other core members, and 2 weeks for its full implementation (including tests and docs). I also allow for an extra 2 weeks for any unforeseen delays in any of the steps.

If my understanding of the date when the decision on this grant would be ready is correct, I intend to finish the work before the end of March, 2017.

Completeness Criteria

  • rakudo repository will contain the IO Action Plan document and it will be fully implemented.
  • All of the I/O routines will have tests in roast and be documented on docs.perl6.org. If any of the currently implemented but unspecced routines are decided against being included in the Perl 6 Language, their implementation will no longer be available in Rakudo.
  • The test coverage tool will report all I/O routines as covered, and the information will be visible on the perl6.wtf (Perl 6's Wonderful Test Files) website. Note: due to the current experimental status of the coverage tool, its report may still show some lines or conditionals as untested despite them actually being tested; however, it will show the lines where routines' names are specified as covered.

Sidenote Comments

The grant amount requested may be low compared to the described amount of work because I'd like to still view part of my time on this work as donated to Perl 6. Those who see me on IRC may notice I bounced around the ideas in this proposal before.

The grant will let me finish this work much sooner and in a more complete state than I would be able to otherwise.

Bio

I'm a 30-year-old Canadian who has lived near Toronto for the past 14 years. I also spent some years living in the USA (New Jersey) and, before that, in Russia (Siberia), where I was born.

I started with Perl 5 about 12 years ago and have since released over 200 Perl 5 modules on CPAN. For the last 10 years I have held a single job, a large part of which is web development with Perl 5.

In the fall of 2015, I switched my focus to Perl 6 and to date released 34 Perl 6 modules and delivered a couple of Perl 6 presentations at the Toronto Perl Mongers meetings.

Around July, 2016 I joined the Rakudo Perl 6 Core Development Team. I also have been Rakudo's release manager every month since the 2016.06 release.

My notable deliverables to the Perl 6 Community involve the creation of the web app driving modules.perl6.org; nearly total automation of the Rakudo's release process, including the development of perl6.fail web app for RT interfacing and release status tracking; and writing all of the tutorials on perl6.party website.

To date, I have authored 461 commits to Rakudo and 1,823 commits to repositories in Perl 6 GitHub organization (523 of which have been to the Perl 6 Specification repository).

Perl Foundation News: Maintaining the Perl 5 Core: December 2016 report

Dave Mitchell writes:

I spent December:

1) looking for some quick wins on speeding up perl compile-time, using
'perl -MCPAN -e1' as a typical example of loading and compiling several
.pm files.  I tweaked Perl_yyparse() and shaved ~2% off the compile time;
then tweaked Perl_sv_gets() and shaved another ~2% off, and also got a
pleasing run-time boost for line reading, with reading a big list of words
now ~8% faster on my system:

    perl -e'$i++ while (<>)' /usr/share/dict/words

2) investigating possible ways of making the regex engine faster by (in
some cases) processing 8 bytes at a time on 64-bit platforms. This
culminated in this email thread:

    http://nntp.perl.org/group/perl.perl5.porters/241891

3) as usual, working on a few miscellaneous tickets.

SUMMARY:
      1:30 RT #130385 Bleadperl breaks DNS-LDNS
      1:00 RT #130385 Bleadperl breaks List-Pairwise
      1:00 RT #130398 Bleadperl breaks Method-Signatures
      4:26 [perl #129199] Difficult-to-trigger panic
      1:01 [perl #130247] Perl_rpeep(OP *): Assertion `oldop' failed
      0:55 [perl #130307] Bug: Regex matches when it shouldn't
      0:33 [perl #130311] heap-buffer-overflow Perl_yyparse
      0:55 fix build warnings and smoke failures
      1:40 fix some CPAN modules under PERL_OP_PARENT
     19:00 investigate making find_by_class() faster
      2:37 optimise Perl_sv_gets()
      4:45 optimise Perl_yyparse()
     12:20 process p5p mailbox
      3:22 review RT tickets
      3:04 review security tickets
    ------
     58:08 TOTAL (HH::MM)

 170.1 weeks
2338.8 total hours
  13.7 average hours per week

There are 61 hours left on the grant

Sawyer X: Perl 5 Porters Mailing List Summary: January 9th-15th

Hey everyone,

Following is the p5p (Perl 5 Porters) mailing list summary for the past week.

Enjoy!

January 9th-15th

News and updates

Perl 5.22.3 is now available!

Perl 5.24.1 is now available!

Grant reports

  • Dave Mitchell TPF Grant 2 reports #156 and #157.

Issues

New issues

  • Perl #130543: Configure: Cross/to-scp duplicates targetdir when rerun.
  • Perl #130546: Better-written-as-warning missing on delete.
  • Perl #130550: Out of bounds read when calling perl from C.

Resolved issues

Discussion

Curtis (Ovid) Poe asked about rephrasing an error message regarding subroutine names.

Karl Williamson asked (What to do about qr/(?i-i:foo)/) whether we should warn when a regular expression has flags that negate and cancel each other.

Dave Mitchell suggests (changing the format of op tree dumping (-Dx / Perl_op_dump())) changing the output of op tree dumping on the C-side to a more useful output format.

Ricardo Signes: I will make friends by programming.

Every once in a while I randomly think of some friend I haven't talked to in a while and I drop them an email. Half the time (probably more), I never hear back, but sometimes I do, and that's pretty great. This week, I read an article about Eagle Scouts and it made me realize I hadn't talked to my high school friend Bill Campbell for a while, so I dropped him an email and he wrote right back, and I was happy to have sent the email.

Today, I decided it was foolish to wait for random thoughts to suggest I should write to people, so I went into my macOS Contacts application and made a "Correspondents" group with all the people whom it might be fun to email out of the blue once in a while.

Next, I wrote a program to pick somebody for me to email. Right now, it's an incredibly stupid program, but it does the job. Later, I can make it smarter. I figure I'll run it once every few days and see how that goes.

I wrote the program in JavaScript. It's the sort of thing you used to have to write in AppleScript (or Objective-C), but JavaScript now works well for scripting OS X applications, which is pretty great. This was my first time writing any JavaScript for OSA scripting, and I'm definitely looking forward to writing more. Probably my next step will be to rewrite some of the things I once wrote in Perl, using Mac::Glue, which stopped working years ago when Carbon was retired.

Here's the JavaScript program in its entirety:

  #!/usr/bin/osascript
  Contacts = Application("Contacts");

  var people = Contacts.groups.byName("Correspondents").people;
  var target = people[ Math.floor( Math.random() * people.length ) ];

  var first = target.firstName.get();
  var last  = target.lastName.get();

  first + " " + last;

Perlgeek.de : Perl 6 By Example: Stateful Silent Cron

This blog post is part of my ongoing project to write a book about Perl 6.

If you're interested, please sign up for the mailing list at the bottom of the article, or here. It will be low volume (less than an email per month, on average).


In the last two installments we've looked at silent-cron, a wrapper around external programs that silences them in case their exit status is zero. But to make it really practical, it should also silence occasional failures.

External APIs fail, networks become congested, and other things happen that prevent a job from succeeding, so some kind of retry mechanism is desirable. In case of a cron job, cron already takes care of retrying a job on a regular basis, so silent-cron should just suppress occasional errors. On the other hand, if a job fails consistently, this is usually something that an operator or developer should look into, so it's a problem worth reporting.

To implement this functionality, silent-cron needs to store persistent state between separate runs. It needs to record the results from the current run and then analyze if the failure history qualifies as "occasional".

Persistent Storage

The storage backend needs to write and retrieve structured data, and protect concurrent access to the state file with locking. A good library for such a storage backend is SQLite, a zero-maintenance SQL engine that's available as a C library. It's public domain software and in use in most major browsers, operating systems and even some airliners.

Perl 6 gives you access to SQLite's functionality through DBIish, a generic database interface with backend drivers for SQLite, MySQL, PostgreSQL and Oracle DB. To use it, first make sure that SQLite3 is installed, including its header files. On a Debian-based Linux system, for example, you can achieve this with apt-get install libsqlite3-dev. If you are using the Rakudo Star distribution, DBIish is already available. If not, you can use one of the module installers to retrieve and install it: panda install DBIish or zef install DBIish.

To use DBIish's SQLite backend, you first have to create a database handle by selecting the backend and supplying connection information:

use DBIish;
my $dbh = DBIish.connect('SQLite', :database('database-file.sqlite3'));

Connecting to a database file that does not yet exist creates that file.

One-off SQL statements can be executed directly on the database handle:

$dbh.do('INSERT INTO player (name) VALUES (?)', 'John');

The ? in the SQL is a placeholder that is passed out-of-band as a separate argument to the do method, which avoids potential errors such as SQL injection vulnerabilities.

Queries tend to work by first preparing a statement which returns a statement handle. You can execute a statement once or multiple times, and retrieve result rows after each execute call:

my $sth = $dbh.prepare('SELECT id FROM player WHERE name = ?');

my %ids;
for <John Jack> -> $name {
    $sth.execute($name);
    %ids{ $name } = $sth.row[0];
}
$sth.finish;

Developing the Storage Backend

We shouldn't just stuff all the storage handling code into sub MAIN, we should instead carefully consider the creation of a useful API for the storage backend. At first, we need only two pieces of functionality: insert the result of a job execution; and retrieve the most recent results.

Since silent-cron can be used to guard multiple cron jobs on the same machine, we might need something to distinguish the different jobs so that one of them succeeding doesn't prevent error reporting for one that is constantly failing. For that we introduce a job name, which can default to the command (including arguments) being executed but which can be set explicitly on the command line.

The API for the storage backend could look something like this:

my $repo = ExecutionResultRepository.new(
    jobname   => 'refresh cache',
    statefile => 'silent-cron.sqlite3',
);
$repo.insert($result);
my @last-results = $repo.tail(5);

This API isn't specific to the SQLite backend at all; a storage backend that works with plain text files could have the exact same API.

Let's implement this API. First we need the class and the two attributes that should be obvious from the usage example above:

class ExecutionResultRepository {
    has $.jobname   is required;
    has $.statefile is required;
    # ... more code

To implement the insert method, we need to connect to the database and create the relevant table if it doesn't exist yet.

has $!db;
method !db() {
    return $!db if $!db;
    $!db = DBIish.connect('SQLite', :database($.statefile));
    self!create-schema();
    return $!db;
}

This code uses a private attribute $!db to cache the database handle and a private method !db to create the handle if it doesn't exist yet.

Private methods are declared like ordinary methods, except that the name starts with an exclamation mark. To call one, substitute the method call dot for the exclamation mark, in other words, use self!db() instead of self.db().

The !db method also calls the next private method, !create-schema, which creates the storage table and some indexes:

method !create-schema() {
    $!db.do(qq:to/SCHEMA/);
        CREATE TABLE IF NOT EXISTS $table (
            id          INTEGER PRIMARY KEY,
            jobname     VARCHAR NOT NULL,
            exitcode    INTEGER NOT NULL,
            timed_out   INTEGER NOT NULL,
            output      VARCHAR NOT NULL,
            executed    TIMESTAMP NOT NULL DEFAULT (DATETIME('NOW'))
        );
    SCHEMA
    $!db.do(qq:to/INDEX/);
        CREATE INDEX IF NOT EXISTS {$table}_jobname_exitcode ON $table ( jobname, exitcode );
    INDEX
    $!db.do(qq:to/INDEX/);
        CREATE INDEX IF NOT EXISTS {$table}_jobname_executed ON $table ( jobname, executed );
    INDEX
}

Multi-line string literals are best written with the heredoc syntax. qq:to/DELIMITER/ tells Perl 6 to finish parsing the current statement so that you can still close the method call parenthesis and add the statement-ending semicolon. The next line starts the string literal, which goes on until Perl 6 finds the delimiter on a line on its own. Leading whitespace is stripped from each line of the string literal by as much as the closing delimiter is indented.

For example

print q:to/EOS/;
    Not indented
        Indented four spaces
    EOS

Produces the output

Not indented
    Indented four spaces

Now that we have a working database connection and know that the database table exists, inserting a new record becomes easy:

method insert(ExecutionResult $r) {
    self!db.do(qq:to/INSERT/, $.jobname, $r.exitcode, $r.timed-out, $r.output);
        INSERT INTO $table (jobname, exitcode, timed_out, output)
        VALUES(?, ?, ?, ?)
    INSERT
}

Selecting the most recent records is a bit more work, partially because we need to convert the table rows into objects:

method tail(Int $count) {
    my $sth = self!db.prepare(qq:to/SELECT/);
        SELECT exitcode, timed_out, output
          FROM $table
         WHERE jobname = ?
      ORDER BY executed DESC
         LIMIT $count
    SELECT
    $sth.execute($.jobname);
    $sth.allrows(:array-of-hash).map: -> %h {
        ExecutionResult.new(
            exitcode  => %h<exitcode>,
            timed-out => ?%h<timed_out>,
            output    => %h<output>,
        );
    }
}

The last statement in the tail method deserves a bit of extra attention. $sth.allrows(:array-of-hash) produces the database rows as a list of hashes. This list is lazy, that is, it's generated on-demand. Lazy lists are a very convenient feature because they allow you to use iterators and lists with the same API. For example when reading lines from a file, you can write for $handle.lines -> $line { ... }, and the lines method doesn't have to load the whole file into memory; instead it can read a line whenever it is accessed.

$sth.allrows(...) is lazy, and so is the .map call that comes after it. map transforms a list one element at a time by calling the code object that's passed to it. And that is done lazily as well. So SQLite only retrieves rows from the database file when elements of the resulting list are actually accessed.
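
As a tiny illustration of the same laziness outside of the database code:

my @squares = (1..Inf).map({ $_ ** 2 });  # an infinite, lazy list
say @squares[3];                          # 16 -- only as many elements as needed are computed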

Using the Storage Backend

With the storage API in place, it's time to use it:

multi sub MAIN(*@cmd, :$timeout, :$jobname is copy,
               :$statefile='silent-cron.sqlite3', Int :$tries = 3) {
    $jobname //= @cmd.Str;
    my $result = run-with-timeout(@cmd, :$timeout);
    my $repo = ExecutionResultRepository.new(:$jobname, :$statefile);
    $repo.insert($result);

    my @runs = $repo.tail($tries);

    unless $result.is-success or @runs.grep({.is-success}) {
        say "The last @runs.elems() runs of @cmd[] all failed, the last execution ",
            $result.timed-out ?? "ran into a timeout"
                              !! "exited with code $result.exitcode()";

        print "Output:\n", $result.output if $result.output;
    }
    exit $result.exitcode // 2;
}

Now a job that succeeds a few times and then fails up to two times in a row doesn't produce any error output; only the third failed execution in a row produces output. You can override that on the command line with --tries=5.

Summary

We've discussed DBIish, a database API with pluggable backend, and explored using it with SQLite to store persistent data. In the process we also came across lazy lists and a new form of string literals called heredocs.

Subscribe to the Perl 6 book mailing list


Dave's Free Press: Journal: Module pre-requisites analyser

Dave's Free Press: Journal: CPANdeps

Dave's Free Press: Journal: Perl isn't dieing

Dave's Free Press: Journal: YAPC::Europe 2007 report: day 3

Dave's Free Press: Journal: Devel::CheckLib can now check libraries' contents

Ocean of Awareness: Top-down parsing is guessing

Top-down parsing is guessing. Literally. Bottom-up parsing is looking.

The way you'll often hear that phrased is that top-down parsing is looking, starting at the top, and bottom-up parsing is looking, starting at the bottom. But that is misleading, because the input is at the bottom -- at the top there is nothing to look at. A usable top-down parser must have a bottom-up component, even if that component is just lookahead.

A more generous, but still accurate, way to describe the top-down component of parsers is "prediction". And prediction is, indeed, a very useful component of a parser, when used in combination with other techniques.

Of course, if a parser does nothing but predict, it can predict only one input. Top-down parsing must always be combined with a bottom-up component. This bottom-up component may be as modest as lookahead, but it must be there or else top-down parsing is really not parsing at all.

So why is top-down parsing used so much?

Top-down parsing may be unusable in its pure form, but from one point of view that is irrelevant. Top-down parsing's biggest advantage is that it is highly flexible -- there's no reason to stick to its "pure" form.

A top-down parser can be written as a series of subroutine calls -- a technique called recursive descent. Recursive descent allows you to hook in custom-written bottom-up logic at every top-down choice point, and it is a technique which is completely understandable to programmers with little or no training in parsing theory. When dealing with recursive descent parsers, it is more useful to be a seasoned, far-thinking programmer than it is to be a mathematician. This makes recursive descent very appealing to seasoned, far-thinking programmers, and they are the audience that counts.

Switching techniques

You can even use the flexibility of top-down to switch away from top-down parsing. For example, you could claim that a top-down parser could do anything my own parser (Marpa) could do, because a recursive descent parser can call a Marpa parser.

A less dramatic switchoff, and one that still leaves the parser with a good claim to be basically top-down, is very common. Arithmetic expressions are essential for a computer language. But they are also among the many things top-down parsing cannot handle, even with ordinary lookahead. Even so, most computer languages these days are parsed top-down -- by recursive descent. These recursive descent parsers deal with expressions by temporarily handing control over to a bottom-up operator precedence parser. Neither of these parsers is extremely smart about the hand-over and hand-back -- it is up to the programmer to make sure the two play together nicely. But used with caution, this approach works.

Top-down parsing and language-oriented programming

But what about taking top-down methods into the future of language-oriented programming, extensible languages, and grammars which write grammars? Here we are forced to confront the reality -- that the effectiveness of top-down parsing comes entirely from the foreign elements that are added to it. Starting from a basis of top-down parsing is literally starting with nothing. As I have shown in more detail elsewhere, top-down techniques simply do not have enough horsepower to deal with grammar-driven programming.

Perl 6 grammars are top-down -- PEG with lots of extensions. These extensions include backtracking, backtracking control, a new style of tie-breaking and lots of opportunity for the programmer to intervene and customize everything. But behind it all is a top-down parse engine.

One aspect of Perl 6 grammars might be seen as breaking out of the top-down trap. That trick of switching over to a bottom-up operator precedence parser for expressions, which I mentioned above, is built into Perl 6 and semi-automated. (I say semi-automated because making sure the two parsers "play nice" with each other is not automated -- that's still up to the programmer.)

As far as I know, this semi-automation of expression handling is new with Perl 6 grammars, and it may prove handy for duplicating what is done in recursive descent parsers. But it adds no new technique to those already in use. And features like

  • multiple types of expression, which can be told apart based on their context,
  • n-ary expressions for arbitrary n, and
  • the autogeneration of multiple rules, each allowing a different precedence scheme, for expressions of arbitrary arity and associativity,

all of which are available and in current use in Marpa, are impossible for the technology behind Perl 6 grammars.

I am a fan of the Perl 6 effort. Obviously, I have doubts about one specific set of hopes for Perl 6 grammars. But these hopes have not been central to the Perl 6 effort, and I will be an eager student of the Perl 6 team's work over the coming months.

Comments

To learn more about Marpa, there's the official web site maintained by Ron Savage. I also have a Marpa web site. Comments on this post can be made in Marpa's Google group, or on our IRC channel: #marpa at freenode.net.

Dave's Free Press: Journal: I Love Github

Dave's Free Press: Journal: Palm Treo call db module

Dave's Free Press: Journal: Graphing tool

Dave's Free Press: Journal: Travelling in time: the CP2000AN

Dave's Free Press: Journal: XML::Tiny released

Dave's Free Press: Journal: YAPC::Europe 2007 report: day 1

Ocean of Awareness: Parsing: an expanded timeline

The fourth century BCE: In India, Panini creates a sophisticated description of the Sanskrit language, exact and complete, and including pronunciation. Sanskrit could be recreated using nothing but Panini's grammar. Panini's grammar is probably the first formal system of any kind, predating Euclid. Even today, nothing like it exists for any other natural language of comparable size or corpus. Panini is the object of serious study today. But in the 1940's and 1950's Panini is almost unknown in the West. His work has no direct effect on the other events in this timeline.

1943: Emil Post defines and studies a formal rewriting system using productions. With this, the process of reinventing Panini in the West begins.

1948: Claude Shannon publishes the foundation paper of information theory. Andrey Markov's finite state processes are used heavily.

1952: Grace Hopper writes a linker-loader and describes it as a "compiler". She seems to be the first person to use this term for a computer program. Hopper uses the term "compiler" in its original sense: "something or someone that brings other things together".

1954: At IBM, a team under John Backus begins working on the language which will be called FORTRAN. The term "compiler" is still being used in Hopper's looser sense, instead of its modern one. In particular, there is no implication that the output of a "compiler" is ready for execution by a computer. The output of one 1954 "compiler", for example, produces relative addresses, which need to be translated by hand before a machine can execute them.

1955: Noam Chomsky is awarded a Ph.D. in linguistics and accepts a teaching post at MIT. MIT does not have a linguistics department and Chomsky, in his linguistics course, is free to teach his own approach, highly original and very mathematical.

1956: Chomsky publishes the paper which is usually considered the foundation of Western formal language theory. The paper advocates a natural language approach that involves

  • a bottom layer, using Markov's finite state processes;
  • a middle, syntactic layer, using context-free grammars and context-sensitive grammars; and
  • a top layer, which involves mappings or "transformations" of the output of the syntactic layer.

These layers resemble, and will inspire, the lexical, syntactic and AST transformation phases of modern parsers. For finite state processes, Chomsky acknowledges Markov. The other layers seem to be Chomsky's own formulations -- Chomsky does not cite Post's work.

1957: Stephen Kleene discovers regular expressions, a very handy notation for Markov's processes. Regular expressions turn out to describe exactly the mathematical objects being studied as finite state automata, as well as some of the objects being studied as neural nets.

1957: Noam Chomsky publishes Syntactic Structures, one of the most influential books of all time. The orthodoxy in 1957 is structural linguistics which argues, with Sherlock Holmes, that "it is a capital mistake to theorize in advance of the facts". Structuralists start with the utterances in a language, and build upward.

But Chomsky claims that without a theory there are no facts: there is only noise. The Chomskyan approach is to start with a grammar, and use the corpus of the language to check its accuracy. Chomsky's approach will soon come to dominate linguistics.

1957: Backus's team makes the first FORTRAN compiler available to IBM customers. FORTRAN is the first high-level language that will find widespread implementation. As of this writing, it is the oldest language that survives in practical use. FORTRAN is a line-by-line language and its parsing is primitive.

1958: John McCarthy's LISP appears. LISP goes beyond the line-by-line syntax -- it is recursively structured. But the LISP interpreter does not find the recursive structure: the programmer must explicitly indicate the structure herself, using parentheses.

1959: Backus invents a new notation to describe the IAL language (aka ALGOL). Backus's notation is influenced by his study of Post -- he seems not to have read Chomsky until later.

1960: Peter Naur improves the Backus notation and uses it to describe ALGOL 60. The improved notation will become known as Backus-Naur Form (BNF).

1960: The ALGOL 60 report specifies, for the first time, a block structured language. ALGOL 60 is recursively structured but the structure is implicit -- newlines are not semantically significant, and parentheses indicate syntax only in a few specific cases. The ALGOL compiler will have to find the structure. It is a case of 1960's optimism at its best. As the ALGOL committee is well aware, a parsing algorithm capable of handling ALGOL 60 does not yet exist. But the risk they are taking will soon pay off.

1960: A.E. Glennie publishes his description of a compiler-compiler. Glennie's "universal compiler" is more of a methodology than an implementation -- the compilers must be written by hand. Glennie credits both Chomsky and Backus, and observes that the two notations are "related". He also mentions Post's productions. Glennie may have been the first to use BNF as a description of a procedure instead of as the description of a Chomsky grammar. Glennie points out that the distinction is "important".

Chomskyan BNF and procedural BNF: BNF, when used as a Chomsky grammar, describes a set of strings, and does not describe how to parse strings according to the grammar. BNF notation, if used to describe a procedure, is a set of instructions, to be tried in some order, and used to process a string. Procedural BNF describes a procedure first, and a language only indirectly.

Both procedural and Chomskyan BNF describe languages, but usually not the same language. That is,

  • Suppose D is some BNF description.
  • Let P(D) be D interpreted as a procedure,
  • Let L(P(D)) be the language which the procedure P(D) parses.
  • Let G(D) be D interpreted as a Chomsky grammar.
  • Let L(G(D)) be the language which the grammar G(D) describes.
  • Then, usually, L(P(D)) != L(G(D)).

The pre-Chomskyan approach, using procedural BNF, is far more natural to someone trained as a computer programmer. The parsing problem appears to the programmer in the form of strings to be parsed, exactly the starting point of procedural BNF and pre-Chomsky parsing.

Even when the Chomskyan approach is pointed out, it does not at first seem very attractive. With the pre-Chomskyan approach, the examples of the language more or less naturally lead to a parser. In the Chomskyan approach the programmer has to search for an algorithm to parse strings according to his grammar -- and the search for good algorithms to parse Chomskyan grammars has proved surprisingly long and difficult. Handling semantics is more natural with a Chomskyan approach. But, using captures, semantics can be added to a pre-Chomskyan parser and, with practice, this seems natural enough.

Despite the naturalness of the pre-Chomskyan approach to parsing, we will find that the first fully-described automated parsers are Chomskyan. This is a testimony to Chomsky's influence at the time. We will also see that Chomskyan parsers have been dominant ever since.

1961: In January, Ned Irons publishes a paper describing his ALGOL 60 parser. It is the first paper to fully describe any parser. The Irons algorithm is Chomskyan and top-down with a "left corner" element. The Irons algorithm is general, meaning that it can parse anything written in BNF. It is syntax-driven (aka declarative), meaning that the parser is actually created from the BNF -- the parser does not need to be hand-written.

1961: Peter Lucas publishes the first description of a purely top-down parser. This can be considered to be recursive descent, though in Lucas's paper the algorithm has a syntax-driven implementation, useable only for a restricted class of grammars. Today we think of recursive descent as a methodology for writing parsers by hand. Hand-coded approaches became more popular in the 1960's due to three factors:

  • Memory and CPU were both extremely limited. Hand-coding paid off, even when the gains were small.
  • Non-hand coded top-down parsing, of the kind Lucas's syntax-driven approach allowed, is a very weak parsing technique. It was (and still is) often necessary to go beyond its limits.
  • Top-down parsing is intuitive -- it essentially means calling subroutines. It therefore requires little or no knowledge of parsing theory. This makes it a good fit for hand-coding.

1963: L. Schmidt, Howard Metcalf, and Val Schorre present papers on syntax-directed compilers at a Denver conference.

1964: Schorre publishes a paper on the Meta II "compiler writing language", summarizing the papers of the 1963 conference. Schorre cites both Backus and Chomsky as sources for Meta II's notation. Schorre notes that his parser is "entirely different" from that of Irons 1961 -- in fact it is pre-Chomskyan. Meta II is a template, rather than something that readers can use, but in principle it can be turned into a fully automated compiler-compiler.

1965: Don Knuth invents LR parsing. The LR algorithm is deterministic, Chomskyan and bottom-up, but it is not thought to be practical. Knuth is primarily interested in the mathematics.

1968: Jay Earley invents the algorithm named after him. Like the Irons algorithm, Earley's algorithm is Chomskyan, syntax-driven and fully general. Unlike the Irons algorithm, it does not backtrack. Earley's algorithm is both top-down and bottom-up at once -- it uses dynamic programming and keeps track of the parse in tables. Earley's approach makes a lot of sense and looks very promising indeed, but there are three serious issues:

  • First, there is a bug in the handling of zero-length rules.
  • Second, it is quadratic for right recursions.
  • Third, the bookkeeping required to set up the tables is, by the standards of 1968 hardware, daunting.

1969: Frank DeRemer describes a new variant of Knuth's LR parsing. DeRemer's LALR algorithm requires only a stack and a state table of quite manageable size. LALR looks practical.

1969: Ken Thompson writes the "ed" editor as one of the first components of UNIX. At this point, regular expressions are an esoteric mathematical formalism. Through the "ed" editor and its descendants, regular expressions will become an everyday part of the working programmer's toolkit.

Recognizers: In comparing algorithms, it can be important to keep in mind whether they are recognizers or parsers. A recognizer is a program which takes a string and produces a "yes" or "no" according to whether the string is in the language. Regular expressions are typically used as recognizers. A parser is a program which takes a string and produces a tree reflecting its structure according to a grammar. The algorithm for a compiler clearly must be a parser, not a recognizer. Recognizers can be, to some extent, used as parsers by introducing captures.
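
To make the distinction concrete, here is a minimal Perl sketch of my own: the same regular expression serves first as a recognizer and then, with captures, as a crude parser.

use strict;
use warnings;

# A recognizer: the regex answers only "yes" or "no" -- is the string
# a (very naive) assignment of a sum?
my $input = 'x = 40 + 2';
print "recognized\n" if $input =~ /\A\w+ = \w+ \+ \w+\z/;

# The same regex pressed into service as a crude parser: captures
# recover some of the structure.
if ( my ( $lhs, $addend1, $addend2 ) = $input =~ /\A(\w+) = (\w+) \+ (\w+)\z/ ) {
    print "assign to $lhs: sum of $addend1 and $addend2\n";
}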

1972: Alfred Aho and Jeffrey Ullman publish a two volume textbook summarizing the theory of parsing. This book is still important. It is also distressingly up-to-date -- progress in parsing theory slowed dramatically after 1972. Aho and Ullman describe a straightforward fix to the zero-length rule bug in Earley's original algorithm. Unfortunately, this fix involves adding even more bookkeeping to Earley's.

1972: Under the names TDPL and GTDPL, Aho and Ullman investigate the non-Chomskyan parsers in the Schorre lineage. They note that "it can be quite difficult to determine what language is defined by a TDPL parser". That is, GTDPL parsers do whatever they do, and that whatever is something the programmer in general will not be able to describe. The best a programmer can usually do is to create a test suite and fiddle with the GTDPL description until it passes. Correctness cannot be established in any stronger sense. GTDPL is an extreme form of the old joke that "the code is the documentation" -- with GTDPL, nothing documents the language of the parser, not even the code.

GTDPL's obscurity buys nothing in the way of additional parsing power. Like all non-Chomskyan parsers, GTDPL is basically an extremely powerful recognizer. Pressed into service as a parser, it is comparatively weak. As a parser, GTDPL is essentially equivalent to Lucas's 1961 syntax-driven algorithm, which was in turn a restricted form of recursive descent.

At or around this time, rumor has it that the main line of development for GTDPL parsers is classified secret by the US government. GTDPL parsers have the property that even small changes can be very labor-intensive, so for some government contractors, GTDPL parsing provides steady work for years to come. Public interest in GTDPL fades.

1975: Bell Labs converts its C compiler from hand-written recursive descent to DeRemer's LALR algorithm.

1977: The first "Dragon book" comes out. This soon-to-be classic textbook is nicknamed after the drawing on the front cover, in which a knight takes on a dragon. Emblazoned on the knight's lance are the letters "LALR". From here on out, to speak lightly of LALR will be to besmirch the escutcheon of parsing theory.

1979: Bell Laboratories releases Version 7 UNIX. V7 includes what is, by far, the most comprehensive, useable and easily available compiler writing toolkit yet developed.

1979: Part of the V7 toolkit is Yet Another Compiler Compiler (YACC). YACC is LALR-powered. Despite its name, YACC is the first compiler-compiler in the modern sense. For some useful languages, the process of going from Chomskyan specification to executable is fully automated. Most practical languages, including the C language and YACC's own input language, still require manual hackery. Nonetheless, after two decades of research, it seems that the parsing problem is solved.

1987: Larry Wall introduces Perl 1. Perl embraces complexity like no previous language. Larry uses YACC and LALR very aggressively -- to my knowledge more aggressively than anyone before or since.

1991: Joop Leo discovers a way of speeding up right recursions in Earley's algorithm. Leo's algorithm is linear for just about every unambiguous grammar of practical interest, and for many ambiguous ones as well. By 1991 hardware is six orders of magnitude faster than 1968 hardware, so the issue of bookkeeping overhead has receded in importance. This is a major discovery. When it comes to speed, the game has changed in favor of the Earley algorithm.

But Earley parsing is almost forgotten. Twenty years will pass before anyone writes a practical implementation of Leo's algorithm.

1990's: Earley's is forgotten. So everyone in LALR-land is content, right? Wrong. Far from it, in fact. Users of LALR are making unpleasant discoveries. While LALR automatically generates their parsers, debugging them is so hard that they could just as easily write the parser by hand. Once debugged, their LALR parsers are fast for correct inputs. But almost all that they tell users about incorrect inputs is that the inputs are incorrect. In Larry's words, LALR is "fast but stupid".

2000: Larry Wall decides on a radical reimplementation of Perl -- Perl 6. Larry does not even consider using LALR again.

2002: John Aycock and R. Nigel Horspool publish their attempt at a fast, practical Earley's parser. Missing from it is Joop Leo's improvement -- they seem not to be aware of it. Their own speedup is limited in what it achieves and the complications it introduces can be counter-productive at evaluation time. But buried in their paper is a solution to the zero-length rule bug. And this time the solution requires no additional bookkeeping.

2004: Bryan Ford publishes his paper on PEG. Implementers by now are avoiding YACC, and it seems as if there might soon be no syntax-driven algorithms in practical use. Ford fills this gap by repackaging the nearly-forgotten GTDPL. Ford adds packratting, so that PEG is always linear, and provides PEG with an attractive new syntax. But nothing has been done to change the problematic behaviors of GTDPL.

2006: GNU announces that the GCC compiler's parser has been rewritten. For three decades, the industry's flagship C compilers have used LALR as their parser -- proof of the claim that LALR and serious parsing are equivalent. Now, GNU replaces LALR with the technology that it replaced a quarter century earlier: recursive descent.

Today: After five decades of parsing theory, the state of the art seems to be back where it started. We can imagine someone taking Ned Irons's original 1961 algorithm from the first paper ever published describing a parser, and republishing it today. True, he would have to translate its code from the mix of assembler and ALGOL into something more fashionable, say Haskell. But with that change, it might look like a breath of fresh air.

Marpa: an afterword

The recollections of my teachers cover most of this timeline. My own begin around 1970. Very early on, as a graduate student, I became unhappy with the way the field was developing. Earley's algorithm looked interesting, and it was something I returned to on and off.

The original vision of the 1960's was a parser that was

  • efficient,
  • practical,
  • general, and
  • syntax-driven.

By 2010 this vision seemed to have gone the same way as many other 1960's dreams. The rhetoric stayed upbeat, but parsing practice had become a series of increasingly desperate compromises.

But, while nobody was looking for them, the solutions to the problems encountered in the 1960's had appeared in the literature. Aycock and Horspool had solved the zero-length rule bug. Joop Leo had found the speedup for right recursion. And the issue of bookkeeping overhead had pretty much evaporated on its own. Machine operations are now a billion times faster than in 1968, and are probably no longer relevant in any case -- cache misses are now the bottleneck.

The programmers of the 1960's would have been prepared to trust a fully declarative Chomskyan parser. With the experience of LALR in their collective consciousness, modern programmers might be more guarded. As Mark Twain said, "Once a cat's been burned, he won't even sit on a cold stove." But I found it straightforward to rearrange the Earley parse engine to allow efficient event-driven handovers between procedural and syntax-driven logic. And Earley tables provide the procedural logic with full knowledge of the state of the parse so far, so that Earley's algorithm is a better platform for hand-written procedural logic than recursive descent.

References, comments, etc.

My implementation of Earley's algorithm is called Marpa. For more about Marpa, there is the semi-official web site, maintained by Ron Savage. The official, but more limited, Marpa website is my personal one. Comments on this post can be made in Marpa's Google group, or on our IRC channel: #marpa at freenode.net.

Dave's Free Press: Journal: Thanks, Yahoo!

Dave's Free Press: Journal: YAPC::Europe 2007 report: day 2

Ocean of Awareness: What are the reasonable computer languages?

"You see things; and you say 'Why?' But I dream things that never were; and I say 'Why not?'" -- George Bernard Shaw

In the 1960's and 1970's computer languages were evolving rapidly. It was not clear which way they were headed. Would most programming be done with general-purpose languages? Or would programmers create a language for every task domain? Or even for every project? And, if lots of languages were going to be created, what kinds of languages would be needed?

It was in that context that Čulik and Cohen, in a 1973 paper, outlined what they thought programmers would want and should have. In keeping with the spirit of the time, it was quite a lot:

  • Programmers would want to extend their grammars with new syntax, including new kinds of expressions.
  • Programmers would also want to use tools that automatically generated new syntax.
  • Programmers would not want to, and especially in the case of auto-generated syntax would usually not be able to, massage the syntax into very restricted forms. Instead, programmers would create grammars and languages which required unlimited lookahead to disambiguate, and they would require parsers which could handle these grammars.
  • Finally, programmers would need to be able to rely on all of this parsing being done in linear time.

Today, we think we know that Čulik and Cohen's vision was naive, because we think we know that parsing technology cannot support it. We think we know that parsing is much harder than they thought.

The eyeball grammars

As a thought problem, consider the "eyeball" class of grammars. The "eyeball" class of grammars contains all the grammars that a human can parse at a glance. If a grammar is in the eyeball class, but a computer cannot parse it, it presents an interesting choice. Either,

  • your computer is not using the strongest practical algorithm; or
  • your mind is using some power which cannot be reduced to a machine computation.

There are some people out there (I am one of them) who don't believe that everything the mind can do reduces to a machine computation. But even those people will tend to go for the first choice in this case: there must be some practical computer parsing algorithm which can do at least as well at parsing as a human can do by "eyeball". In other words, the class of "reasonable grammars" should contain the eyeball class.

Čulik and Cohen's candidate for the class of "reasonable grammars" were the grammars that a deterministic parse engine could parse if it had a lookahead that was infinite, but restricted to distinguishing between regular expressions. They called these the LR-regular, or LRR, grammars. And the LRR grammars do in fact seem to be a good first approximation to the eyeball class. They do not allow lookahead that contains things that you have to count, like palindromes. And, while I'd be hard put to eyeball every possible string for every possible regular expression, intuitively the concept of scanning for a regular expression does seem close to capturing the idea of glancing through a text looking for a telltale pattern.

So what happened?

Alas, the algorithm in the Čulik and Cohen paper turned out to be impractical. But in 1991, Joop Leo discovered a way to adapt Earley's algorithm to parse the LRR grammars in linear time, without doing the lookahead. And Leo's algorithm does have a practical implementation: Marpa.

References, comments, etc.

To learn more about Marpa, there's the official web site maintained by Ron Savage. I also have a Marpa web site. Comments on this post can be made in Marpa's Google group, or on our IRC channel: #marpa at freenode.net.

Dave's Free Press: Journal: YAPC::Europe 2007 travel plans

Dave's Free Press: Journal: Wikipedia handheld proxy

Dave's Free Press: Journal: Bryar security hole

Dave's Free Press: Journal: POD includes

Dave's Free Press: Journal: cgit syntax highlighting

Dave's Free Press: Journal: CPAN Testers' CPAN author FAQ

Ocean of Awareness: What parser do birds use?

"Here we provide, to our knowledge, the first unambiguous experimental evidence for compositional syntax in a non-human vocal system." -- "Experimental evidence for compositional syntax in bird calls", Toshitaka N. Suzuki, David Wheatcroft & Michael Griesser Nature Communications 7, Article number: 10986

In this post I look at a subset of the language of the Japanese great tit, also known as Parus major. The above cited article presents evidence that bird brains can parse this language. What about standard modern computer parsing methods? Here is the subset -- probably a tiny one -- of the language actually used by Parus major.

      S ::= ABC
      S ::= D
      S ::= ABC D
      S ::= D ABC
    

Classifying the Parus major grammar

Grammophone is a very handy new tool for classifying grammars. Its own parser is somewhat limited, so that it requires a period to mark the end of a rule. The above grammar is in Marpa's SLIF format, which is smart enough to use the "::=" operator to spot the beginning and end of rules, just as the human eye does. Here's the same grammar converted into a form acceptable to Grammophone:

      S -> ABC .
      S -> D .
      S -> ABC D .
      S -> D ABC .
    

Grammophone tells us that the Parus major grammar is not LL(1), but that it is LALR(1).

What does this mean?

LL(1) is the class of grammar parseable by top-down methods: it's the best class for characterizing most parsers in current use, including recursive descent, PEG, and Perl 6 grammars. All of these parsers fall short of dealing with the Parus major language.

LALR(1) is probably most well-known from its implementations in bison and yacc. While able to handle this subset of Parus's language, LALR(1) has its own, very strict, limits. Whether LALR(1) could handle the full complexity of Parus language is a serious question. But it's a question that in practice would probably not arise. LALR(1) has horrible error handling properties.

When the input is correct and within its limits, an LALR-driven parser is fast and works well. But if the input is not perfectly correct, LALR parsers produce no useful analysis of what went wrong. If Parus hears "d abc d", a parser like Marpa, on the other hand, can produce something like this:

# * String before error: abc d\s
# * The error was at line 1, column 7, and at character 0x0064 'd', ...
# * here: d
    

Parus uses its language in predatory contexts, and one can assume that a Parus with a preference for parsers whose error handling is on an LALR(1) level will not be keeping its alleles in the gene pool for very long.
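
For readers who want to experiment before reaching for the gist mentioned below, here is a minimal sketch of how the Parus subset might be fed to Marpa::R2's SLIF. The lexeme spellings ('abc', 'd') and the toy driver are my own assumptions, not the official test script.

use strict;
use warnings;
use Marpa::R2;

# The Parus major subset in SLIF form.  Spelling the lexemes as the
# literal strings 'abc' and 'd' is an assumption made for illustration.
my $dsl = <<'END_OF_DSL';
:default ::= action => [name,values]
lexeme default = latm => 1
S ::= ABC
S ::= D
S ::= ABC D
S ::= D ABC
ABC ~ 'abc'
D   ~ 'd'
:discard ~ whitespace
whitespace ~ [\s]+
END_OF_DSL

my $grammar = Marpa::R2::Scanless::G->new( { source => \$dsl } );
my $recce   = Marpa::R2::Scanless::R->new( { grammar => $grammar } );
$recce->read( \'abc d' );                 # a legal Parus call
my $value_ref = $recce->value();
die "No parse\n" if not defined $value_ref;
print "Parsed\n";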

References, comments, etc.

Those readers content with sub-Parus parsing methods may stop reading here. Those with greater parsing ambitions, however, may wish to learn more about Marpa. A Marpa test script for parsing the Parus subset is in a Github gist. Marpa has a semi-official web site, maintained by Ron Savage. The official, but more limited, Marpa website is my personal one. Comments on this post can be made in Marpa's Google group, or on our IRC channel: #marpa at freenode.net.

Dave's Free Press: Journal: Thankyou, Anonymous Benefactor!

Dave's Free Press: Journal: Number::Phone release

Dave's Free Press: Journal: Ill

Dave's Free Press: Journal: CPANdeps upgrade

Ocean of Awareness: Introduction to Marpa Book in progress

What follows is a summary of the features of the Marpa algorithm, followed by a discussion of potential applications. It refers to itself as a "monograph", because it is a draft of part of the introduction to a technical monograph on the Marpa algorithm. I hope the entire monograph will appear in a few weeks.

The Marpa project

The Marpa project was intended to create a practical and highly available tool to generate and use general context-free parsers. Tools of this kind had long existed for LALR and regular expressions. But, despite an encouraging academic literature, no such tool had existed for context-free parsing. The first stable version of Marpa was uploaded to a public archive on Solstice Day 2011. This monograph describes the algorithm used in the most recent version of Marpa, Marpa::R2. It is a simplification of the algorithm presented in my earlier paper.

A proven algorithm

While the presentation in this monograph is theoretical, the approach is practical. The Marpa::R2 implementation has been widely available for some time, and has seen considerable use, including in production environments. Many of the ideas in the parsing literature satisfy theoretical criteria, but in practice turn out to face significant obstacles. An algorithm may be as fast as reported, but may turn out not to allow adequate error reporting. Or a modification may speed up the recognizer, but require additional processing at evaluation time, leaving no advantage to compensate for the additional complexity.

In this monograph, I describe the Marpa algorithm as it was implemented for Marpa::R2. In many cases, I believe there are better approaches than those I have described. But I treat these techniques, however solid their theory, as conjectures. Whenever I mention a technique that was not actually implemented in Marpa::R2, I will always explicitly state that that technique is not in Marpa as implemented.

Features

General context-free parsing

As implemented, Marpa parses all "proper" context-free grammars. The proper context-free grammars are those which are free of cycles, unproductive symbols, and inaccessible symbols. Worst case time bounds are never worse than those of Earley's algorithm, and therefore never worse than O(n**3).

Linear time for practical grammars

Currently, the grammars suitable for practical use are thought to be a subset of the deterministic context-free grammars. Using a technique discovered by Joop Leo, Marpa parses all of these in linear time. Leo's modification of Earley's algorithm is O(n) for LR-regular grammars. Leo's modification also parses many ambiguous grammars in linear time.

Left-eidetic

The original Earley algorithm kept full information about the parse --- including partial and fully recognized rule instances --- in its tables. At every parse location, before any symbols are scanned, Marpa's parse engine makes available its information about the state of the parse so far. This information is in useful form, and can be accessed efficiently.
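
As a hedged sketch of what left-eideticism looks like at the API level, here is a toy example of my own that stops mid-parse and asks the recognizer what it knows. I believe show_progress() and terminals_expected() are the Scanless::R calls that expose this information; check the Marpa::R2 documentation for the details.

use strict;
use warnings;
use Marpa::R2;

# A toy grammar of my own, read one character at a time so that we can
# stop mid-parse and interrogate the recognizer.
my $dsl = <<'END_OF_DSL';
:default ::= action => [name,values]
lexeme default = latm => 1
Pair  ::= Left Right
Left  ~ 'a'
Right ~ 'b'
END_OF_DSL

my $grammar = Marpa::R2::Scanless::G->new( { source => \$dsl } );
my $recce   = Marpa::R2::Scanless::R->new( { grammar => $grammar } );

my $input = 'ab';
$recce->read( \$input, 0, 1 );    # read only the first character

# Left-eidetic: the recognizer can report which rule instances are
# predicted, in progress, or complete at the current location ...
print $recce->show_progress();

# ... and which tokens it would accept next.
print "Expecting: @{ $recce->terminals_expected() }\n";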

Recoverable from read errors

When Marpa reads a token which it cannot accept, the error is fully recoverable. An application can try to read another token. The application can do this repeatedly as long as none of the tokens are accepted. Once the application provides a token that is accepted by the parser, parsing will continue as if the unsuccessful read attempts had never been made.

Ambiguous tokens

Marpa allows ambiguous tokens. These are often useful in natural language processing where, for example, the same word might be a verb or a noun. Use of ambiguous tokens can be combined with recovery from rejected tokens so that, for example, an application could react to the rejection of a token by reading two others.
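
Below is a hedged sketch, with a toy grammar and symbol names of my own, of how an ambiguous token might be fed to Marpa::R2 through the Scanless interface's external-scanning calls, lexeme_alternative() and lexeme_complete(). I am assuming that lexeme_alternative() returns a false value for a rejected token, which is what would let an application retry with a different one; the Marpa::R2 documentation is authoritative on the exact behavior.

use strict;
use warnings;
use Marpa::R2;

# A toy grammar of my own.  The L0 patterns for Noun and Verb are
# placeholders: the tokens will be supplied by external scanning, not
# found by Marpa's internal lexer.
my $dsl = <<'END_OF_DSL';
:default ::= action => [name,values]
lexeme default = latm => 1
Sentence ::= Noun | Verb
Noun ~ 'unused placeholder n'
Verb ~ 'unused placeholder v'
END_OF_DSL

my $grammar = Marpa::R2::Scanless::G->new( { source => \$dsl } );
my $recce   = Marpa::R2::Scanless::R->new( { grammar => $grammar } );

my $input = 'fish';              # "fish" could be a noun or a verb
$recce->read( \$input, 0, 0 );   # start the parse, scanning nothing

# Offer the same stretch of input as two alternative tokens.  I am
# assuming a false return value means the token was rejected, in which
# case the application could try a different one.
for my $token_name (qw(Noun Verb)) {
    $recce->lexeme_alternative( $token_name, $input )
        or warn "$token_name rejected here; try another token\n";
}
$recce->lexeme_complete( 0, length $input );   # finish the (ambiguous) read

my $value_ref = $recce->value();
print defined $value_ref ? "Parsed\n" : "No parse\n";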

Using the features

Error reporting

An obvious application of left-eideticism is error reporting. Marpa's abilities in this respect are ground-breaking. For example, users typically regard an ambiguity as an error in the grammar. Marpa, as currently implemented, can detect an ambiguity and report specifically where it occurred and what the alternatives were.
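
As a hedged sketch, here is a deliberately ambiguous toy grammar of my own and the calls I believe report the ambiguity, ambiguity_metric() and ambiguous(); consult the Marpa::R2::Scanless::R documentation before relying on the exact behavior.

use strict;
use warnings;
use Marpa::R2;

# A deliberately ambiguous toy grammar of my own: 'xxx' can be
# bracketed as ((x x) x) or as (x (x x)).
my $dsl = <<'END_OF_DSL';
:default ::= action => [name,values]
lexeme default = latm => 1
Pair ::= Item Item | Pair Item | Item Pair
Item ~ 'x'
END_OF_DSL

my $grammar = Marpa::R2::Scanless::G->new( { source => \$dsl } );
my $recce   = Marpa::R2::Scanless::R->new( { grammar => $grammar } );
$recce->read( \'xxx' );

# ambiguity_metric() is 1 for an unambiguous parse, 2 or more otherwise.
if ( $recce->ambiguity_metric() > 1 ) {

    # As I recall the API, ambiguous() returns a human-readable
    # description of where the parse forked and what the alternatives
    # were.
    print $recce->ambiguous();
}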

Event driven parsing

As implemented, Marpa::R2 allows the user to define "events". Events can be defined that trigger when a specified rule is complete, when a specified rule is predicted, when a specified symbol is nulled, when a user-specified lexeme has been scanned, or when a user-specified lexeme is about to be scanned. A mid-rule event can be defined by adding a nulling symbol at the desired point in the rule, and defining an event which triggers when the symbol is nulled.
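
Here is a hedged sketch of a completion event, again with a toy grammar of my own: the SLIF declares an event that fires whenever the rule Pair is complete, and the read loop collects the events as they trigger. The loop follows the read()/resume() idiom I recall from the Marpa::R2 documentation.

use strict;
use warnings;
use Marpa::R2;

# A toy grammar of my own with one completion event: 'pair_done'
# fires every time a Pair is recognized.
my $dsl = <<'END_OF_DSL';
:default ::= action => [name,values]
lexeme default = latm => 1
Pairs ::= Pair+
Pair  ::= 'a' 'b'
event 'pair_done' = completed <Pair>
:discard ~ whitespace
whitespace ~ [\s]+
END_OF_DSL

my $grammar = Marpa::R2::Scanless::G->new( { source => \$dsl } );
my $recce   = Marpa::R2::Scanless::R->new( { grammar => $grammar } );

my $input = 'ab ab';
my $pos   = $recce->read( \$input );   # pauses when an event triggers
while (1) {
    for my $event ( @{ $recce->events() } ) {
        my ($name) = @{$event};
        print "Event '$name' triggered at position $pos\n";
    }
    last if $pos >= length $input;
    $pos = $recce->resume();           # continue reading after the pause
}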

Ruby slippers parsing

Left-eideticism, efficient error recovery, and the event mechanism can be combined to allow the application to change the input in response to feedback from the parser. In traditional parser practice, error detection is an act of desperation. In contrast, Marpa's error detection is so painless that it can be used as the foundation of new parsing techniques.

For example, if a token is rejected, the lexer is free to create a new token in the light of the parser's expectations. This approach can be seen as making the parser's "wishes" come true, and I have called it "Ruby Slippers Parsing".

One use of the Ruby Slippers technique is to parse with a clean but oversimplified grammar, programming the lexical analyzer to make up for the grammar's shortcomings on the fly. As part of Marpa::R2, the author has implemented an HTML parser, based on a grammar that assumes that all start and end tags are present. Such an HTML grammar is too simple even to describe perfectly standard-conformant HTML, but the lexical analyzer is programmed to supply start and end tags as requested by the parser. The result is a simple and cleanly designed parser that parses very liberal HTML and accepts all input files, in the worst case treating them as highly defective HTML.
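
The following is a hedged sketch of the key Ruby Slippers move only, using a toy grammar of my own in which a statement must end in a semicolon that the input omits. The full-scale, authoritative example is the HTML parser bundled with Marpa::R2.

use strict;
use warnings;
use Marpa::R2;

# A toy grammar of my own: a statement must end in a semicolon,
# but the input omits it.  This is NOT the Marpa::R2 HTML parser;
# it only shows the key Ruby Slippers move.
my $dsl = <<'END_OF_DSL';
:default ::= action => [name,values]
lexeme default = latm => 1
Statement ::= Word Semicolon
Word      ~ [\w]+
Semicolon ~ ';'
END_OF_DSL

my $grammar = Marpa::R2::Scanless::G->new( { source => \$dsl } );
my $recce   = Marpa::R2::Scanless::R->new( { grammar => $grammar } );

my $input = 'hello';      # no semicolon: the parse cannot complete
$recce->read( \$input );

# The read itself succeeds (a Word was seen), but value() would return
# undef because the Statement is incomplete.  A Ruby Slippers lexer
# first asks the parser which tokens it is wishing for ...
my @wishes = @{ $recce->terminals_expected() };
print "Parser is expecting: @wishes\n";    # should include 'Semicolon'

# ... and would then invent the wished-for token and hand it to the
# parser (I believe lexeme_read() can be used for this; see the
# Marpa::R2 documentation and its HTML parser for the real thing).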

Ambiguity as a language design technique

In current practice, ambiguity is avoided in language design. This is very different from the practice in the languages humans choose when communicating with each other. Human languages exploit ambiguity in order to design highly flexible, powerfully expressive languages. For example, the language of this monograph, English, is notoriously ambiguous.

Ambiguity of course can present a problem. A sentence in an ambiguous language may have undesired meanings. But note that this is not a reason to ban potential ambiguity --- it is only a problem with actual ambiguity.

Syntax errors, for example, are undesired, but nobody tries to design languages to make syntax errors impossible. A language in which every input were well-formed and meaningful would be cumbersome and even dangerous: all typos in such a language would be meaningful, and the parser would never warn the user about errors, because there would be no such thing.

With Marpa, ambiguity can be dealt with in the same way that syntax errors are dealt with in current practice. The language can be designed to be ambiguous, but any actual ambiguity can be detected and reported at parse time. This exploits Marpa's ability to report exactly where and what the ambiguity is. Marpa::R2's own parser description language, the SLIF, uses ambiguity in this way.

Auto-generated languages

In 1973, Čulik and Cohen pointed out that the ability to efficiently parse LR-regular languages opens the way to auto-generated languages. In particular, Čulik and Cohen note that a parser which can parse any LR-regular language will be able to parse a language generated using syntax macros.

Second order languages

In the literature, the term "second order language" is usually used to describe languages with features which are useful for second-order programming. True second-order languages --- languages which are auto-generated from other languages --- have not been seen as practical, since there was no guarantee that the auto-generated language could be efficiently parsed.

With Marpa, this barrier is lifted. As an example, Marpa::R2's own parser description language, the SLIF, allows "precedenced rules". Precedenced rules are specified in an extended BNF. The BNF extensions allow precedence and associativity to be specified for each RHS.

Marpa::R2's precedenced rules are implemented as a true second order language. The SLIF representation of the precedenced rule is parsed to create a BNF grammar which is equivalent, and which has the desired precedence. Essentially, the SLIF does a standard textbook transformation. The transformation starts with a set of rules, each of which has a precedence and an associativity specified. The result of the transformation is a set of rules in pure BNF. The SLIF's advantage is that it is powered by Marpa, and therefore the SLIF can be certain that the grammar that it auto-generates will parse in linear time.
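
For illustration, here is a precedenced rule along the lines of the calculator examples in the Marpa::R2 documentation; treat it as a sketch rather than a verbatim excerpt.

use strict;
use warnings;
use Marpa::R2;

# A precedenced rule in the SLIF.  Tighter-binding tiers come first;
# '||' separates precedence tiers, '|' separates alternatives within a
# tier, and assoc => sets the associativity of an alternative.
my $dsl = <<'END_OF_DSL';
:default ::= action => [name,values]
lexeme default = latm => 1
Expression ::=
      Number
    | '(' Expression ')' assoc => group
   || Expression '**' Expression assoc => right
   || Expression '*' Expression
    | Expression '/' Expression
   || Expression '+' Expression
    | Expression '-' Expression
Number ~ [\d]+
:discard ~ whitespace
whitespace ~ [\s]+
END_OF_DSL

my $grammar = Marpa::R2::Scanless::G->new( { source => \$dsl } );
my $recce   = Marpa::R2::Scanless::R->new( { grammar => $grammar } );
$recce->read( \'2 + 7 * 3 ** 2' );
print defined $recce->value() ? "Parsed\n" : "No parse\n";

Note that nothing in the rule singles out '**' or '+' as operators; the tiers and assoc adverbs alone determine how the auto-generated BNF nests.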

Notationally, Marpa's precedenced rules are an improvement over similar features in LALR-based parser generators like yacc or bison. In the SLIF, there are two important differences. First, in the SLIF's precedenced rules, precedence is generalized, so that it does not depend on the operators: there is no need to identify operators, much less class them as binary, unary, etc. This more powerful and flexible precedence notation allows the definition of multiple ternary operators, and multiple operators with arity above three.

Second, and more important, a SLIF user is guaranteed to get exactly the language that the precedenced rule specifies. The user of the yacc equivalent must hope their syntax falls within the limits of LALR.

References, comments, etc.

Marpa has a semi-official web site, maintained by Ron Savage. The official, but more limited, Marpa website is my personal one. Comments on this post can be made in Marpa's Google group, or on our IRC channel: #marpa at freenode.net.

Dave's Free Press: Journal: YAPC::Europe 2006 report: day 3
