Gabor Szabo: The phases of a crowdfunding campaign: first week

A week has passed since I launched the crowdfunding campaign for the Bailador book. I did not know what to expect, so I am not sure how to evaluate the results so far. The project is supported by 56 people giving $3,050, which is slightly more than 9% of the goal. I'd love to have the backing of 1,000 people. That would show that there is substantial interest in this subject.

For the full article visit The phases of a crowdfunding campaign: first week

Perl Hacks: Amsterdam Training Questionnaire

It was back in the middle of March that I first raised the question of running some training in conjunction with the Perl Conference in Amsterdam this August. I didn’t mean to leave it so long before following up, but I’ve had a lot of real life to deal with over the last couple of months, and I’m afraid a lot of my digital life got shoved to one side.

But I’m back now and we should really get something organised for Amsterdam.

So here’s a Google Form for you to fill out, to tell me what training course you’d like to see me run in Amsterdam. I’ll leave it running for a couple of weeks before making a decision.

The post Amsterdam Training Questionnaire appeared first on Perl Hacks.

Perl Foundation News: The Perl Conference 2017 (formerly known as YAPC::NA) is rapidly approaching, and we believe it will be great.

The Perl Conference 2017 will be held this year in Washington, DC, at the US Patent and Trademark Office, from June 18th through June 23rd. This is the conference that many of us have affectionately known as YAPC::NA::17.

If you haven't registered yet, please do so as soon as possible. We want to make sure we're providing the best possible experience for our participants, and to that end accurate registration counts are helpful. There is also still time to get the early-bird rate.

The conference website is: http://www.perlconference.us/tpc-2017-dc/

We have talks scheduled from many of the best speakers known to the Perl community: Damian Conway, Sawyer X, Randal Schwartz, Mark Jason Dominus, Ricardo Signes, and so many other strong speakers that I feel silly having mentioned only the few that I did.

For those seeking additional enlightenment there are tutorials and master classes offered (by additional registration) on topics such as:

  • Perl in a Day (John Anderson)
  • Introduction to Moose (Dave Rolsky)
  • Perl Second Best Practices (Randal Schwartz)
  • Unicode and Associated Punishments (Ricardo Signes)

The conference is being held in the amazing US Patent and Trademark Office, and will feature an event in the National Inventors Hall of Fame Museum.

Early registration costs $250, and late registration (starting 14 days before the main event) will be $350, so there is still time to get your ticket, but you'll want to act sooner rather than later.

From the Perl Foundation Conferences Committee I would like to thank all of the organizers who have been working for many months on this, and who are currently neck deep in work tying up loose ends and caring for the many details. It will be a great conference because of everyone in the Perl community who attends and participates, but it couldn't be a great conference without those organizers who have devoted so much of their time and energy laying the foundation for the rest of us to build upon.

I am excited and can't wait to see everyone there.

perlbuzz.com: Perl::Critic releases its first new developer release in 21 months

I’ve just released a new developer release of Perl::Critic, the static code analysis tool for Perl, as we work toward its first new release in 21 months. This version of Perl::Critic fixes a few bugs and relies on a new release of the underlying Perl parsing library PPI, which also has had its first new […]

Mike Cardwell, Online: Investigating a JavaScript/React Performance Issue

I noticed a performance issue with a piece of work a colleague committed the other day. It’s a React component to display a table of data. When scrolling, once the header of the table hits the top of the viewport, the header becomes fixed in place, so it doesn’t scroll off the top of the page. When the table contained a lot of data, switching from fixed to not fixed, or vice versa, caused noticeable lag.

The implementation attached “scroll” and “resize” event listeners to the window on componentDidMount, and removed them on componentWillUnmount. It looked something like this:

// Table, TableHeader and TableBody are the author's own components.
import React, { Component } from 'react';

class MyFancyComponent extends Component {

  constructor() {
    super();
    this.headerChecks = this.headerChecks.bind(this);
    this.state = {
      fixed: this.shouldBeFixed(),
    };
  }

  componentDidMount() {
    window.addEventListener('scroll', this.headerChecks);
    window.addEventListener('resize', this.headerChecks);
  }

  componentWillUnmount() {
    window.removeEventListener('scroll', this.headerChecks);
    window.removeEventListener('resize', this.headerChecks);
  }

  render() {
    return (
      <Table>
        <TableHeader fixed = { this.state.fixed }/>
        <TableBody/>
      </Table>
    );
  }

  headerChecks() {
    this.setState({ fixed: this.shouldBeFixed() });
  }

  shouldBeFixed() {
    // Does some calculations and then returns a boolean
    ...
  }

}

My first suggestion, without really investigating the code, was to make the event handlers passive. It turns out this was completely ineffective as it doesn’t work for scroll/resize events, but it’s still worth knowing about regardless. If you pass a third option to addEventListener/removeEventListener as an Object with “passive” set to “true”, then you’re telling the browser that you will definitely not be calling “preventDefault” on the event. That means that the browser doesn’t need to wait for your handler to finish before completing the action. This is useful for touch events, particularly on mobile, because you often want scrolling via touch to occur immediately, rather than after your handler has run:

element.addEventListener('touchstart',    fn, { passive: true });
element.removeEventListener('touchstart', fn, { passive: true });

My second suggestion, was to just use requestAnimationFrame to smooth out the scroll handler so it doesn’t run more frequently than the page is painted:

componentDidMount() {
  window.addEventListener('scroll', this.smoothHeaderChecks);
}
componentWillUnmount() {
  window.removeEventListener('scroll', this.smoothHeaderChecks);
  cancelAnimationFrame(this.animationFrame);
}
smoothHeaderChecks() {
  cancelAnimationFrame(this.animationFrame);
  this.animationFrame = requestAnimationFrame(this.headerChecks);
}

This didn’t fix the problem either. It probably helped performance a little, but the main bottleneck was still there. This also introduced a short delay between the header hitting the top of the viewport and it becoming fixed, meaning it jumped a bit when the threshold was passed. Nonetheless, it is a useful technique to have in your box of tricks: there is no point rendering a component more frequently than the rate at which the page is painted.

After looking at the code a little, I noticed that setState was being called every time the event handler ran. This is a bad idea. If you call “setState”, the entire component, including children, will be re-rendered (unless you prevent that with the shouldComponentUpdate lifecycle method). You only want to call “setState” when the state actually needs changing:

headerChecks() {
  const fixed = this.shouldBeFixed();
  if (fixed !== this.state.fixed) this.setState({ fixed });
}

Now the table would only be re-rendered when the fixed state actually changed, rather than every time we scrolled slightly: potentially many times a second.
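For completeness, here is a minimal sketch of the shouldComponentUpdate escape hatch mentioned above, assuming a hypothetical TableBody that receives its data via a rows prop (this is not code from the component in question):

class TableBody extends Component {

  shouldComponentUpdate(nextProps) {
    // Skip re-rendering the (potentially huge) body unless its data changed.
    return nextProps.rows !== this.props.rows;
  }

  render() {
    return (
      <tbody>
        { this.props.rows.map(row => <tr key={ row.id }>{ /* cells */ }</tr>) }
      </tbody>
    );
  }

}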

Only calling setState when needed helped. But there was still a moment of lag when switching between fixed and not fixed, or vice versa, when the table contained a lot of data. Then it occurred to me: why are we re-rendering the entire table, instead of just the header? The ultimate fix is to move the whole process, including the state and the event listeners, inside the “TableHeader” component, so it is only that component which is re-rendered when the fixed status changes. Alternatively, we could introduce a wrapper component around the TableHeader component which performs that task. The “shouldBeFixed” function did actually require some information about the Table itself in order to make its decision, but that information could just be passed down as props to the TableHeader component. Although my colleague hasn’t done the relevant refactoring yet, I am pretty confident this will fix the problem.
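As a rough sketch of that wrapper approach (the component name, the tableInfo prop and the standalone shouldBeFixed function are hypothetical, since the refactoring hasn’t been written yet):

class FixedHeaderContainer extends Component {

  constructor() {
    super();
    this.headerChecks = this.headerChecks.bind(this);
    this.state = { fixed: false };
  }

  componentDidMount() {
    window.addEventListener('scroll', this.headerChecks);
    window.addEventListener('resize', this.headerChecks);
  }

  componentWillUnmount() {
    window.removeEventListener('scroll', this.headerChecks);
    window.removeEventListener('resize', this.headerChecks);
  }

  headerChecks() {
    // tableInfo is whatever data the fixed/not-fixed calculation needs,
    // passed down as a prop instead of being read from the Table itself.
    const fixed = shouldBeFixed(this.props.tableInfo);
    if (fixed !== this.state.fixed) this.setState({ fixed });
  }

  render() {
    // Only this small subtree re-renders when the fixed state flips.
    return <TableHeader fixed = { this.state.fixed }/>;
  }

}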

After doing some more investigation, I came across an API I’d never heard of before called IntersectionObserver. What this API allows us to do is detect when an element intersects with the viewport, or with another element. This is probably a much better API to use than listening for resize/scroll events, as it will only trigger our handler when the threshold is actually crossed, rather than whenever we scroll or resize.

“IntersectionObserver” is still a draft at the moment, but it is supported by Chrome, Edge, Opera and Android, is in development for Firefox, and there exists a polyfill for other browsers too:

componentDidMount() {
  this.headerObserver = new IntersectionObserver(this.headerChecks);
  this.headerObserver.observe(this.headerEl);
}
componentWillUnmount() {
  this.headerObserver.disconnect();
}
render() {
  return (
    <Table>
      <div ref = { el => (this.headerEl = el) }>
        <TableHeader fixed = { this.state.fixed }/>
      </div>
      <TableBody/>
    </Table>
  );
}


Gabor Szabo: The phases of a crowdfunding campaign: launching

I am running a crowdfunding campaign to finance the writing of my book: Web Application Development in Perl 6 In this article I'll try to follow what's going on in the campaign and in my mind.

For the full article visit The phases of a crowdfunding campaign: launching

Sawyer X: My Perl Toolchain Summit

perl-toolchain-summit.jpg (Source: Flickr)

This past May was my second attendance at the Perl Toolchain Summit, and I hope I aptly justified my presence this time around.

I had a long list of topics to work on, but as usual, I gravitated towards a few focused ones and discussions I found valuable.

During the first day, I tackled two major issues: 1. Ref::Util PP and XS version, and 2. Perl 5.26.0-RC1.

Since I wrote Ref::Util, there had been a need for a Pure-Perl version of it. It would allow developers to use the general interface without forcing XS on users. Since the original version was XS, introducing compatibility for both PP and XS was not simple. Luckily, I've had partners in crime in this, mainly Aaron Crane (ARC), who did most of the work, Karen Etheridge (ETHER), who handled all the Dist::Zilla issues, and Graham Knop (HAARG), who helped us smooth over the XS-to-PP transition without breaking things.

Read more about what we accomplished.

While Aaron was focusing on Ref::Util, I continued working on the first release candidate (RC) of Perl v5.26.0. I made various mistakes (luckily caught by vigilant core members) but no one ended up in the hospital, so I consider it a relative success. The presence of Chris Williams (BinGOs) helped me to no end in understanding how Module::CoreList works in the release cycle and in smoothing over my mistakes.

I spent the rest of the evening and the following days cleaning up my mess and discussing issues which determined that we would need another release candidate. Leon Timmermans (LEONT), H. Merijn Brand (HMBRAND, a.k.a. Tux), and Lee Johnson (LEEJO) were very helpful.

After seeing the high quality of Lee's pictures, I thought of opening an Instagram account. Lee and I started PerlEvents, in which we will collect photos from various Perl events. Lee has written about this as well.

I tried to participate in many conversations, which included the concept of a distribution and how to determine the distribution name in PAUSE, the dot-in-@INC issue (read this post by Todd Rinaldo (TODDR) on it), CPAN indexing, CPAN distribution quality, PAUSE ownership issues, how to run future summits, and the security disclosure policy. Throughout the summit, people branched off into additional, finer discussions, and I participated in several of those as well.

For the last two days I tried to focus on another challenge and picked the release manager's guide maker. This script is what we use to generate the checklist of actions to perform during a release. We create this list from instructions available in the Porting directory of the Perl repository.

The generator script has been bugging me for a while because it includes steps that one must not perform. I have written an alternative one, which I will merge soon enough. I prefer it since it allows providing the version number of the Perl you're releasing instead of the type of the release. This work also helped me discover that the current checklist generator apparently skips several steps in the release.

On a personal note, considering the modest number of attendees, it is not surprising to find yourself continuously attempting to justify your attendance. For now, I am grateful for the access this event gives me to the other attendees, who helped me with all of these topics. If you find any of the above items valuable (let's hope at least Perl 5.26 falls under that definition), then it was made possible by all the attendees other than myself, and they have my sincere appreciation and thanks.

The attendance of all of us and the existence of this event are only possible with the help of our sponsors. Thank you.

Booking.com, ActiveState, cPanel, FastMail, MaxMind, Perl Careers, MongoDB, SureVoIP, Campus Explorer, Bytemark, CAPSiDE, Charlie Gonzalez, Elastic, OpusVL, Perl Services, Procura, XS4ALL, Oetiker+Partner.

Perl Foundation News: Call for Grant Proposals (May 2017 Round)

The Grants Committee is accepting grant proposals all the time. We evaluate them every two months and another evaluation period has come.

If you have an idea for doing some Perl work that will benefit the Perl community, consider sending a grant application. The application deadline for this round is 23:59 June 2nd UTC. We will publish the received applications, get community feedback and conclude acceptance by June 12th.

To apply, please read How to Write a Proposal. Rules of Operation and Running Grants List will also help you understand how the grant process works. We also got some grant ideas from the community. The format is the same as the previous rounds in 2014-2016.

We will confirm receipt of your application within 24 hours.

If you have further questions, please contact me at tpf-grants-secretary at perl-foundation.org.

Mike Cardwell, Online: Private Location Tracking/Sharing

For a while now, on and off, I’ve been looking for a way to live-track my location and hook it into my home automation setup. This is a difficult task to achieve when you care about your privacy. I am not interested in giving another third party a history of my movements; it’s bad enough that my cell provider gets this information due to my need to connect to cell towers. I don’t care if your privacy policy tells me you won’t share my information and your CEO tells me you “take security very seriously”. That won’t help me when you get hacked or decide to change your privacy policy.

Anyway, I’ve finally found a solution which fits my needs perfectly. In fact, it comfortably exceeds my requirements by virtue of being open source. The application I’m talking about is OwnTracks (source). It comes in both Android and iOS flavours, and it allows you (in fact, encourages you) to use your own server.

The server can be either an HTTP service or, preferably, an MQTT service. Despite my initial desire to go the HTTP route, I heeded the advice in the OwnTracks documentation and looked into MQTT (not something I’d heard of before), and I’m glad I did. With a quick “apt-get install mosquitto mosquitto-clients” and a few minutes of reading documentation, I had an MQTT server installed and configured.

I believe the behaviour is different for iOS devices, but on Android devices at least, OwnTracks connects to your MQTT server, and publishes location updates as blobs of JSON. You can connect up other MQTT clients to that server and listen out for those events. I use the excellent NodeJS “mqtt” package to do exactly this. You can also configure authentication and ACLs to control which users have access to which locations, meaning that you can optionally share location data with other friends who connect to your MQTT server. The OwnTracks app displays the locations of everyone you can see, on a map. And if your friends don’t want to connect to your MQTT server? They can connect to their own, and you can bridge them together!
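As a minimal sketch of that listener (the broker URL, credentials and topic layout are placeholders, and the payload fields follow the OwnTracks location JSON as I understand it):

// listen.js - log OwnTracks location updates from the MQTT broker
const mqtt = require('mqtt');

const client = mqtt.connect('mqtt://localhost:1883', {
  username: 'YourUsername',
  password: 'YourPassword',
});

client.on('connect', () => {
  // The app publishes to owntracks/<user>/<device>
  client.subscribe('owntracks/+/+');
});

client.on('message', (topic, payload) => {
  const data = JSON.parse(payload.toString());
  if (data._type !== 'location') return;
  console.log(`${topic}: ${data.lat},${data.lon} at ${new Date(data.tst * 1000)}`);
});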

Battery life? OwnTracks seems to work hard to not work hard. OwnTracks asks your device to notify it only when it’s in motion, or when you’ve entered or left a user defined waypoint. When you’re in motion, it will publish location updates to the server according to user defined time periods and distances travelled.

There was one issue with the above algorithm. If your phone is stationary for a long period of time, you don’t get updates (I believe the iOS version can be set to send on an interval). So when your last published location was 12 hours ago, you don’t know if you haven’t received any updates because you haven’t moved, or because your phone ran out of battery/network access. To deal with this, I enabled “Remote commands” in the OwnTracks interface, and set up a job on my server to run the following command every 30 minutes:

mosquitto_pub -u YourUsername -P YourPassword -t 'owntracks/mike/phone/cmd' -m '{"_type":"cmd","action":"reportLocation"}'

OwnTracks not only publishes to the MQTT server, but subscribes to “topics” on it too, so when it sees a “reportLocation” request, it knows that it should immediately report its location. This means I get updates at least every half an hour regardless of whether or not I’ve moved.

And if there is no network access? OwnTracks batches up the location changes and publishes them once network access is available again.

Don’t have a static IP to stick an MQTT server on? You have a permanently running machine, but it’s behind a weird NAT setup that you can’t get around? If you have an Android phone, fear not: you can run your MQTT server behind a Tor Hidden Service, install Orbot on your device, configure Orbot to force all traffic from OwnTracks through Tor, and configure OwnTracks to talk to your hidden service hostname. How do I know that this works? Because that’s exactly what I’m doing. This has the added benefit that anybody monitoring your server’s network (your ISP or hosting provider) won’t be able to track your movements from IP to IP via your regular MQTT traffic.
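For reference, the server side of that hidden service is only a couple of torrc lines (a sketch assuming mosquitto is listening on the default port 1883 on localhost):

# /etc/tor/torrc on the machine running mosquitto
HiddenServiceDir /var/lib/tor/mqtt/
HiddenServicePort 1883 127.0.0.1:1883

Tor writes the generated .onion hostname into the hostname file inside that directory; that is the host you point OwnTracks at, with Orbot carrying the connection.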

And to top it all, the documentation is excellent: detailed, but not dumbed down; and the developers are quick to respond to issues on Github.


Sawyer X: Perl 5 Porters Mailing List Summary: May 17th-22nd

Hey everyone,

Following is the p5p (Perl 5 Porters) mailing list summary for the past week.

Enjoy!

May 17th-22nd

Issues

New issues

Discussion

Karl Williamson suggested adding an optional parameter to Unicode::UCD's num function, relating to the Unicode script runs idea.

Karl also suggests making vec() croak on code points above 0xFF.

Joel Berger: Perl Toolchain Summit 2017

For the second year, I have had the great privilege of attending the Perl Toolchain Summit (PTS, formerly called the QA Hackathon, or QAH). This year it was held in Lyon, France, and three cheers for the organizers; it was an amazing event!

Last year I unexpectedly became involved in the Meta::CPAN project, even to the point of hosting the first (annual?) Meta::Hack a few months ago. This year I continued to work with them; however, the need was greater in the CPAN Testers realm, and so there I went.

CPAN Test Reporter

On the first day I answered a challenge from Breno G. de Oliveira (garu). He is the author of several test reporting tools, including the client for the popular cpanminus install tool. Currently the reporter must parse the output log from cpanm to understand what happened during the tests. Indeed that is only possible if it makes a log, which it does not do during verbose mode (the output simply goes to the console). Given my history (exaggerated to be sure :-P) of involvement with the new Test2 testing architecture, which is based on events, he was wondering if it was possible to get a dump of all the events generated during an installation.

I am a web developer and a Mojolicious user, so, as that is my hammer, this problem looked like it needed a socket for each test to report to. Using some truly horrific hacks, by the end of the day I had a mechanism by which cpanminus would collect a dump of all events produced, serialized to JSON. The worst of these hacks was that I had to overload much of the external process control so that rather than simply waiting for the child process, it waits non-blockingly, allowing the server to accept connections.

On the second day, and after a good night’s rest, I realized that the approach I had used was terribly over-engineered. I spent the majority of the day paring down the hacks and replacing bits until now each test file opens a new self-named file in a temporary directory and dumps its events there as JSON. The outer process then slurps all the files and aggregates them into a single JSON document as before. Importantly, by doing it this way, I only need to inject what is essentially an “around test” hook into cpanminus. All of the run overloading became unnecessary.

With this proof of concept complete, all that is needed is the ability to enable the dump of the events and the “around test” hook in cpanm. I reached out to Test2 author Chad Granum, who told me that such a feature is essentially in the works anyway, but I hope to coordinate with him to ensure that the end result is useful for this reporter. Chad has shown spectacular devotion to providing features that the community needs, so I have no doubt that this should be easily accomplished.

The cpanm hook is perhaps a little more tricky. The application used to provide hooks, but they have since been removed. Further, cpanm development effort has essentially been directed to its upcoming successor, Menlo. Menlo promises to be more extensible, but it isn’t ready for prime time yet. All the same, Shoichi Kaji graciously offered to investigate the prospects of adding such a hook. For now, however, it seems that this project might have to rely on the hack (as pseudo-sanctioned in the Menlo documentation) of a (fairly innocuous) monkey patch.

CPAN Testers Backend

Thanks to the new leadership of Doug Bell (preaction), the CPAN Testers service is getting a thorough cleaning on both the front and back end. No slight intended to Barbie, but the age of the infrastructure was starting to show. The service was originally backed by e-mail, and was later improved to use Amazon SimpleDB, a choice that was made to support throughput. In the intervening years other technologies have advanced greatly, and SimpleDB is now an expensive bottleneck, in both time and money. To save on read costs (yes, reads from SimpleDB cost money) the data is copied into MySQL for use locally.

The decision was reached (while I was working on the above) to remove SimpleDB; however, it isn’t that easy. It isn’t known whether users consume that data directly or, if there are such users, who they are and why. Therefore, in the short to medium term, the data can be replicated. As I mentioned before, reads were literally expensive, so end users haven’t had direct access to SimpleDB anyway; their access was via the MySQL-backed cache. This means removing SimpleDB is an exercise in creating services that mimic the cache’s current behavior. Later, once we have metrics on usage, these services may (and likely eventually will) be removed as users move to newer/friendlier APIs.

This task was handed to me, as I surfaced from cpanm hacking right at this point. For the remainder of the summit I completed two of the three boxes I needed to check, while the third is waiting on me finding a few free hours to complete it. Hopefully end users won’t see any effect from this work, but it will help simplify the backend and open the door for more interesting stuff. If you are interested, read more at preaction’s PTS 2017 reaction post.

CPAN Cover Job Queue

Finally, while working on that, Paul Johnson (pjcj), of Devel::Cover fame, asked if I could help build a job queue to back the generation of the test coverage reports that he provides on http://cpancover.com. His intention is to quickly detect a release and have coverage reports run within minutes rather than the hours that the process currently takes. As I dug into his current process, it seems that a stumbling block is how he gets his list of distributions to test. It isn’t (as I had hoped) a progressive list of recent uploads, but instead a list of all the current modules on CPAN. I now suspect (but can’t quite confirm) that the reason it takes so long to update is that it is re-running the test coverage for the current state of CPAN on each iteration. That seems like it would need even more time than he gives it, though, so perhaps I am wrong.

In any case I hope to be able to help him with such a queue and make his process more efficient. I’m only just starting in on this however, so watch this space.

Additionally

I can’t completely express how valuable this summit is to the Perl toolchain. I’m sure nearly everyone will agree that having 4 days blocked out of our busy schedules to work on these important infrastructure pieces would be reason enough. The fact that we have so many skilled developers, holders of arcane (and not so arcane) knowledge, commit bit/push access holders, PAUSE admins, etc. in the same room means that development which might take months of asynchronous work can instead take days, hours or even minutes.

As I am not key to any one toolchain project, I am humbled to have been invited. Yet, thanks to that free agency, I am exposed to many different projects: last year Meta::CPAN, this year CPAN Testers. Were I not in the room with the people who can get me from zero to contributing, I might not have gotten involved, or would have done so much more slowly. I am deeply indebted to this event for letting me take part, allowing me to help in this way.

Thanks

Of course there are many, many people to thank. I want to thank my employer, ServerCentral, for allowing me the paid time to attend. I’d like to thank the organizers again: Neil Bowers, Philippe Bruhat, and Laurent Boivin! And of course this event couldn’t take place without the support of the sponsors, thank you so very much!

Ranguard's blog: PTS 2017 - perl.org / metacpan / grep / cpancover / depreciations

This was the second year I got invited to the Perl Toolchain Summit (PTS) - I think I got even more done than last year.

Having so many people all working on similar projects really does lead to an exponential amount of productivity, both across projects and between Perl 5 and Perl 6.

I could write essays on what I was involved in, but a brief summary:

Thursday

  • Launched new design for perl.org (babs did all the work)
  • Discussed and setup env for grep.metacpan.org with toddr and atoomic
  • Got Pete to start updating metacpan's vm
  • Helped Haarg test and deploy a shim so `cpanminus` clients now use the v1 metacpan API even though they hit the old v0 API; this is done using Fastly user-agent detection and a new back-compat site that haarg has written.
  • Discussion on metacpan v0 deprecation with haarg
  • Start of discussion on improving search

Friday

  • Discussed cpancover infrastructure with pjcj - metacpan is going to at least take a backup of the reports, but may do other things later.
  • Got Lee to help set up the metacpan VM fully; I also downloaded supporting files to USB so I can set up more people quickly
  • Discussed cpanratings with several people - Lee to work on something
  • Installed the metacpan VM on pjcj's and abeltje's laptops, and made a start on babs's
  • Refactored metacpan search to clarify exactly what is going on
  • Started on refactoring mc search to use new query (which will make improving the results easier... next couple of days project)
  • Clean up of a few puppet bits for dev boxes
  • Helped deploy various bits on metacpan
  • Discussions about Swagger interface for some metacpan end points

Saturday

  • Installed metacpan VM on more laptops
  • Lots of time on the metacpan search code refactor
  • Setting up grep.metacpan.org sysadmin stuff
  • Initial review of cpancover backup stuff with pjcj

Sunday

  • Got a pull request in for the search refactor, not quite as useful as I'd hoped (the new query doesn't give good enough results, so I haven't switched to it yet, but at least the code will be easy for the next person to follow).
  • Discussed Meta::Hack 2017

Many thanks to the organisers and sponsors, without whom this wouldn't be possible.

Thank you Perl Toolchain Summit 2017 Sponsors

:: Luca Ferrari ::: Arabic to Roman number converter in Perl 5

Puzzled by a post on Planet KDE about GCompris and roman numbers, and needing an easy way to explain roman numbers to my six-year-old son, I thought it would be an easy task to write a simple Perl program to convert an arabic number into its roman notation.
Not so easy, pal!
Well, it's not a problem with Perl, of course; rather, I found it required quite some brainpower to write down the conversion rules, and I did not search the web for a copy-and-paste algorithm. Please note: if you need a rock-solid way to handle conversions, have a look at CPAN, which is full of modules for this particular aim.
Here I'm going to discuss the solution I found and how I implemented it. It is not supposed to be the best one or the fastest one; it's just my solution from scratch.

The program

I split the problem of converting an arabic number into a roman one into three steps, with one dedicated subroutine for each step, so that the main loop reduces to something like the following:
say "$_ = " . $roman_string->( $reassemble->( $disassemble->( $_ ) ) )
for ( 1..30 );
that produces the following output:
1 = I
2 = II
3 = III
4 = IV
5 = V
6 = VI
7 = VII
8 = VIII
9 = IX
10 = X
11 = XI
12 = XII
13 = XIII
14 = XIV
15 = XV
16 = XVI
17 = XVII
18 = XVIII
19 = XIX
20 = XX
21 = XXI
22 = XXII
23 = XXIII
24 = XXIV
25 = XXV
26 = XXVI
27 = XXVII
28 = XXVIII
29 = XXIX
30 = XXX
The steps must be read from the inner subroutine to the outer, of course, and therefore we have:
  • disassemble translates an arabic number into its roman basis, that is, it computes how many units, tens, hundreds and thousands are required. In this phase there is no application of roman rules, so numbers are decomposed into a linear string of letters. As an example, the number 4 is translated into IIII, which is of course not a valid roman number.
  • reassemble applies the roman rules, in particular promoting letters so that groups are translated, when needed, into higher-order letters. For instance, IIII is promoted into two groups: I and V.
  • roman_string composes the promoted groups into the final string. The main difficulty of this part is understanding when a letter has to be placed on the right (addition) or on the left (subtraction) of another letter. For instance, given the groups I and V, the function must decide whether the output has to be VI (6) or IV (4).
To speed up the writing of the code, I placed the main roman letters and their correspondence with arabic numbers into a global hash:
my $roman = {
    1    => 'I',
    5    => 'V',
    10   => 'X',
    50   => 'L',
    100  => 'C',
    500  => 'D',
    1000 => 'M',
};
Each method references $roman when it needs to convert from an arabic number to its roman letter. In order to allow the methods to cooperate, they accept and return a hash keyed by roman letter, with the value being the number of times such letter must appear in the final string. The following is an example of the hash for a few numbers:
# 4 (IV)
{ 'I' => 1, 'V' => 1 }
# 19 (XIX)
{ 'I' => 1, 'X' => 2 }
# 5 (V)
{ 'V' => 1 }
# 17 (XVII)
{ 'X' => 1, 'V' => 1, 'I' => 2 }

The disassemble function

The following is the code for the disassemble function, which accepts the arabic number as its only input.
# Accepts the arabic number and provides a hash
# keyed by each letter, with the value of how many times
# such letter should be summed in order to obtain the
# starting number.
my $disassemble = sub {
    my ( $number ) = @_;
    my $items = {};

    # sort the keys, that are arabic thresholds, from
    # the greater to the smaller one
    for my $current_value ( sort { $b <=> $a } keys $roman->%* ){

        my $how_many = int( $number / $current_value );
        next unless ( $how_many );

        my $letter = $roman->{ $current_value };
        $items->{ $letter } = $how_many;

        $number -= $current_value * $how_many;
    }

    return $items;
};
The first thing the method does is create the hash $items, which is what it will return for the other methods to consume. The keys of the $roman hash are processed from the largest to the smallest (please note that sort has $b first!). In this way we can decompose the number into thousands, hundreds, tens, and units, in this exact order. The $how_many variable contains the integer count for each letter. For example, the number 29 is processed as follows:
  1. 29 / 10 drives $how_many to be 2 and leaves a remainder of 9;
  2. 9 / 5 makes $how_many 1 and leaves a remainder of 4;
  3. 4 / 1 makes $how_many 4 and there's nothing more to do.
At each step the roman letter and the $how_many value are inserted into the $items hash, which in the above example becomes:
# 29 (XIX)
{ 'X' => 2,
  'V' => 1,
  'I' => 4
}

The reassemble method

The reassemble method takes as input the hash produced by disassemble and checks if any letter requires a promotion. Here is the code:
# Accepts a hash with keys the letters and values the number
# of times each letter should appear.
# Traverses the hash from the smaller to the greater letter
# in order to "promote" smaller aggregates. For instance
# 'IIII' (4) is aggregated and therefore the hash is modified
# so there's only one 'I' and one more 'V'; in such a case
# the quantity of the promoted letter is made negative to indicate
# it has been promoted.
my $reassemble = sub {
    my ( $items ) = @_;
    my @sorted_thresholds = sort { $a <=> $b } keys $roman->%*;

    for ( my $i = 0; $i < @sorted_thresholds; $i++ ){
        my $current_value = $sorted_thresholds[ $i ];
        my $key           = $roman->{ $current_value };
        my $how_many      = $items->{ $key };

        next unless ( $how_many );

        # the greatest letter (M) has nothing to be promoted into
        last if ( $i + 1 >= @sorted_thresholds );

        my $greater_value = $sorted_thresholds[ $i + 1 ];
        my $greater_key   = $roman->{ $greater_value };

        my $need_to_promote = $how_many == 4
            || ( $greater_value / $current_value == $how_many );

        if ( $need_to_promote ){
            $items->{ $greater_key }++;
            $how_many = $greater_value - $how_many * $current_value;
            $items->{ $key } = $how_many * -1;
        }
    }

    return $items;
};
The promotion must be done from the smaller letter to the greater one, so this time the letters are walked in ascending order (i.e., sort has $a first!). Since promoting a letter requires access to the following one, I need a C-style for loop.
A letter needs to be promoted if its quantity is 4, or if it is 2 and the next bigger value is exactly double the current one, that is when $greater_value / $current_value == $how_many. This makes, for instance, IIII promoted (the quantity is 4), and VV promoted into X (because the quantity is 2 and X is exactly double V). The promotion manipulates the hash, increasing the next bigger letter by one and leaving a single current letter. In order to flag the promoted letter, I decided to use a negative quantity (where the absolute value is the exact one).
So, for instance, the 29 hash of the previous paragraph is transformed as follows:
# input to the method
{ 'X' => 2,
  'V' => 1,
  'I' => 4
}

# first for step (I)
{ 'X' => 2,
  'V' => 2,
  'I' => -1 # promoted, keep 1 and increase 'V'
}

# second step (V)
{ 'X' => 3,
  'V' => 0, # promoted, increase X by one
  'I' => -1
}
At the end of the method we know the final string will be made of three Xs and one I; the point now is to understand how to render them in the correct order. This is the aim of the roman_string method.

The roman_string method

The method accepts the normalized hash (i.e., the groups are already formed) and composes the final string, placing letters to the left or right of one another depending on their quantity. The following is the code of the method:
# Does the hard work of composing
# each group of letters into the final roman string.
my $roman_string = sub {
    my ( $items ) = @_;
    my @chars;

    for my $current_value ( sort { $b <=> $a } keys $roman->%* ){
        my $letter   = $roman->{ $current_value };
        my $how_many = $items->{ $letter };

        next unless ( $how_many );

        if ( $how_many > 0 ){
            push @chars, $letter for ( 1 .. $how_many );
        }
        else {
            # this letter was promoted, so it has to be inserted as last-to-last
            # in the previous chain
            # example: @chars = ( 'X', 'X' ) and here I've got 'I' to make XIX (19)
            push @chars, ( $letter, pop @chars );
        }
    }

    return join "", @chars;
};
In order to easily manipulate the final string, moving letters from left to right and vice versa, I decided to place each single letter into the @chars array, which is then join-ed into a single string.
Let's suppose we just need to add letters: in this case we need to write letters from the greater to the smaller, left to right, and this is the order in which I traverse the letters of $roman (again, note that sort has $b first!). If the quantity of a letter is positive, the letter has not been promoted and therefore will not be placed to the left of another letter, so we just insert $letter into @chars $how_many times. On the other hand, if $how_many is negative, the letter has been promoted and therefore has to be printed to the left of the last printed letter. This is as easy as doing:
push @chars, ( $letter, pop @chars );
which inserts into @chars the $letter followed by the previous last character, removed via pop.
With regard to the previous example of 29, we have:
# method input
{ 'X' => 3,
  'I' => -1
}

# first step: prints X
# with quantity 3 (not promoted)
@chars = ( 'X', 'X', 'X' );

# second step: prints I
# that has been promoted
# and must be inserted ascending
# as last-to-last
@chars = ( 'X', 'X',
           ( 'I', # $letter
             'X'  # pop @chars
           ) );

Conclusions

Well, it turned out to be more code than I expected to write. Using object notation, instead of plain hashes, could surely make the program more robust. I'm pretty sure there's a way to shrink the code down and to avoid that ugly C-style for loop, and the promotion part could be simplified by keeping in mind that it often reduces to -1 for the current letter and +1 for the greater one. Anyway, it does what I need and seems correct!

NEILB: My time at the toolchain summit

Last week I attended the toolchain summit in Lyon, along with 38 other members of our community. For four days we were all working to make the CPAN and Perl toolchain better. This is my log of what I did.

Dave's Free Press: Journal: Module pre-requisites analyser

Dave's Free Press: Journal: CPANdeps

Dave's Free Press: Journal: Perl isn't dieing

Dave's Free Press: Journal: YAPC::Europe 2007 report: day 3

Dave's Free Press: Journal: Devel::CheckLib can now check libraries' contents

Ocean of Awareness: Top-down parsing is guessing

Top-down parsing is guessing. Literally. Bottom-up parsing is looking.

The way you'll often hear that phrased is that top-down parsing is looking, starting at the top, and bottom-up parsing is looking, starting at the bottom. But that is misleading, because the input is at the bottom -- at the top there is nothing to look at. A usable top-down parser must have a bottom-up component, even if that component is just lookahead.

A more generous, but still accurate, way to describe the top-down component of parsers is "prediction". And prediction is, indeed, a very useful component of a parser, when used in combination with other techniques.

Of course, if a parser does nothing but predict, it can predict only one input. Top-down parsing must always be combined with a bottom-up component. This bottom-up component may be as modest as lookahead, but it must be there or else top-down parsing is really not parsing at all.

So why is top-down parsing used so much?

Top-down parsing may be unusable in its pure form, but from one point of view that is irrelevant. Top-down parsing's biggest advantage is that it is highly flexible -- there's no reason to stick to its "pure" form.

A top-down parser can be written as a series of subroutine calls -- a technique called recursive descent. Recursive descent allows you to hook in custom-written bottom-up logic at every top-down choice point, and it is a technique which is completely understandable to programmers with little or no training in parsing theory. When dealing with recursive descent parsers, it is more useful to be a seasoned, far-thinking programmer than it is to be a mathematician. This makes recursive descent very appealing to seasoned, far-thinking programmers, and they are the audience that counts.
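To make "a series of subroutine calls" concrete, here is a tiny recursive descent sketch of my own (in JavaScript, not from the post, and nothing to do with Marpa) for a toy grammar of sums and parentheses. The one token of lookahead in peek() is the modest bottom-up component discussed above:

// expr := term (('+' | '-') term)*
// term := NUMBER | '(' expr ')'
function parse(tokens) {
  let pos = 0;
  const peek = () => tokens[pos];
  const next = () => tokens[pos++];

  function expr() {                      // one subroutine per grammar rule
    let node = term();
    while (peek() === '+' || peek() === '-') {
      node = { op: next(), left: node, right: term() };
    }
    return node;
  }

  function term() {
    if (peek() === '(') {
      next();                            // consume '('
      const node = expr();
      if (next() !== ')') throw new Error('expected )');
      return node;
    }
    return { num: Number(next()) };      // assume a numeric token
  }

  const tree = expr();
  if (pos !== tokens.length) throw new Error('trailing input');
  return tree;
}

// e.g. parse(['1', '+', '(', '2', '-', '3', ')']) yields a small parse tree.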

Switching techniques

You can even use the flexibility of top-down to switch away from top-down parsing. For example, you could claim that a top-down parser could do anything my own parser (Marpa) could do, because a recursive descent parser can call a Marpa parser.

A less dramatic switchoff, and one that still leaves the parser with a good claim to be basically top-down, is very common. Arithmetic expressions are essential for a computer language. But they are also among the many things top-down parsing cannot handle, even with ordinary lookahead. Even so, most computer languages these days are parsed top-down -- by recursive descent. These recursive descent parsers deal with expressions by temporarily handing control over to a bottom-up operator precedence parser. Neither of these parsers is extremely smart about the hand-over and hand-back -- it is up to the programmer to make sure the two play together nicely. But used with caution, this approach works.

Top-down parsing and language-oriented programming

But what about taking top-down methods into the future of language-oriented programming, extensible languages, and grammars which write grammars? Here we are forced to confront the reality -- that the effectiveness of top-down parsing comes entirely from the foreign elements that are added to it. Starting from a basis of top-down parsing is literally starting with nothing. As I have shown in more detail elsewhere, top-down techniques simply do not have enough horsepower to deal with grammar-driven programming.

Perl 6 grammars are top-down -- PEG with lots of extensions. These extensions include backtracking, backtracking control, a new style of tie-breaking and lots of opportunity for the programmer to intervene and customize everything. But behind it all is a top-down parse engine.

One aspect of Perl 6 grammars might be seen as breaking out of the top-down trap. That trick of switching over to a bottom-up operator precedence parser for expressions, which I mentioned above, is built into Perl 6 and semi-automated. (I say semi-automated because making sure the two parsers "play nice" with each other is not automated -- that's still up to the programmer.)

As far as I know, this semi-automation of expression handling is new with Perl 6 grammars, and it may prove handy for duplicating what is done in recursive descent parsers. But it adds no new technique to those already in use. And features like

  • multiple types of expression, which can be told apart based on their context,
  • n-ary expressions for arbitrary n, and
  • the autogeneration of multiple rules, each allowing a different precedence scheme, for expressions of arbitrary arity and associativity,

all of which are available and in current use in Marpa, are impossible for the technology behind Perl 6 grammars.

I am a fan of the Perl 6 effort. Obviously, I have doubts about one specific set of hopes for Perl 6 grammars. But these hopes have not been central to the Perl 6 effort, and I will be an eager student of the Perl 6 team's work over the coming months.

Comments

To learn more about Marpa, there's the official web site maintained by Ron Savage. I also have a Marpa web site. Comments on this post can be made in Marpa's Google group, or on our IRC channel: #marpa at freenode.net.

Dave's Free Press: Journal: I Love Github

Dave's Free Press: Journal: Palm Treo call db module

Dave's Free Press: Journal: Graphing tool

Dave's Free Press: Journal: Travelling in time: the CP2000AN

Dave's Free Press: Journal: XML::Tiny released

Dave's Free Press: Journal: YAPC::Europe 2007 report: day 1

Ocean of Awareness: Parsing: an expanded timeline

The fourth century BCE: In India, Panini creates a sophisticated description of the Sanskrit language, exact and complete, and including pronunciation. Sanskrit could be recreated using nothing but Panini's grammar. Panini's grammar is probably the first formal system of any kind, predating Euclid. Even today, nothing like it exists for any other natural language of comparable size or corpus. Panini is the object of serious study today. But in the 1940's and 1950's Panini is almost unknown in the West. His work has no direct effect on the other events in this timeline.

1943: Emil Post defines and studies a formal rewriting system using productions. With this, the process of reinventing Panini in the West begins.

1948: Claude Shannon publishes the foundation paper of information theory. Andrey Markov's finite state processes are used heavily.

1952: Grace Hopper writes a linker-loader and describes it as a "compiler". She seems to be the first person to use this term for a computer program. Hopper uses the term "compiler" in its original sense: "something or someone that brings other things together".

1954: At IBM, a team under John Backus begins working on the language which will be called FORTRAN. The term "compiler" is still being used in Hopper's looser sense, instead of its modern one. In particular, there is no implication that the output of a "compiler" is ready for execution by a computer. The output of one 1954 "compiler", for example, produces relative addresses, which need to be translated by hand before a machine can execute them.

1955: Noam Chomsky is awarded a Ph.D. in linguistics and accepts a teaching post at MIT. MIT does not have a linguistics department and Chomsky, in his linguistics course, is free to teach his own approach, highly original and very mathematical.

1956: Chomsky publishes the paper which is usually considered the foundation of Western formal language theory. The paper advocates a natural language approach that involves

  • a bottom layer, using Markov's finite state processes;
  • a middle, syntactic layer, using context-free grammars and context-sensitive grammars; and
  • a top layer, which involves mappings or "transformations" of the output of the syntactic layer.

These layers resemble, and will inspire, the lexical, syntactic and AST transformation phases of modern parsers. For finite state processes, Chomsky acknowledges Markov. The other layers seem to be Chomsky's own formulations -- Chomsky does not cite Post's work.

1957: Stephen Kleene discovers regular expressions, a very handy notation for Markov's processes. Regular expressions turn out to describe exactly the mathematical objects being studied as finite state automata, as well as some of the objects being studied as neural nets.

1957: Noam Chomsky publishes Syntactic Structures, one of the most influential books of all time. The orthodoxy in 1957 is structural linguistics which argues, with Sherlock Holmes, that "it is a capital mistake to theorize in advance of the facts". Structuralists start with the utterances in a language, and build upward.

But Chomsky claims that without a theory there are no facts: there is only noise. The Chomskyan approach is to start with a grammar, and use the corpus of the language to check its accuracy. Chomsky's approach will soon come to dominate linguistics.

1957: Backus's team makes the first FORTRAN compiler available to IBM customers. FORTRAN is the first high-level language that will find widespread implementation. As of this writing, it is the oldest language that survives in practical use. FORTRAN is a line-by-line language and its parsing is primitive.

1958: John McCarthy's LISP appears. LISP goes beyond the line-by-line syntax -- it is recursively structured. But the LISP interpreter does not find the recursive structure: the programmer must explicitly indicate the structure herself, using parentheses.

1959: Backus invents a new notation to describe the IAL language (aka ALGOL). Backus's notation is influenced by his study of Post -- he seems not to have read Chomsky until later.

1960: Peter Naur improves the Backus notation and uses it to describe ALGOL 60. The improved notation will become known as Backus-Naur Form (BNF).

1960: The ALGOL 60 report specifies, for the first time, a block structured language. ALGOL 60 is recursively structured but the structure is implicit -- newlines are not semantically significant, and parentheses indicate syntax only in a few specific cases. The ALGOL compiler will have to find the structure. It is a case of 1960's optimism at its best. As the ALGOL committee is well aware, a parsing algorithm capable of handling ALGOL 60 does not yet exist. But the risk they are taking will soon pay off.

1960: A.E. Glennie publishes his description of a compiler-compiler. Glennie's "universal compiler" is more of a methodology than an implementation -- the compilers must be written by hand. Glennie credits both Chomsky and Backus, and observes that the two notations are "related". He also mentions Post's productions. Glennie may have been the first to use BNF as a description of a procedure instead of as the description of a Chomsky grammar. Glennie points out that the distinction is "important".

Chomskyan BNF and procedural BNF: BNF, when used as a Chomsky grammar, describes a set of strings, and does not describe how to parse strings according to the grammar. BNF notation, if used to describe a procedure, is a set of instructions, to be tried in some order, and used to process a string. Procedural BNF describes a procedure first, and a language only indirectly.

Both procedural and Chomskyan BNF describe languages, but usually not the same language. That is,

  • Suppose D is some BNF description.
  • Let P(D) be D interpreted as a procedure,
  • Let L(P(D)) be the language which the procedure P(D) parses.
  • Let G(D) be D interpreted as a Chomsky grammar.
  • Let L(G(D)) be the language which the grammar G(D) describes.
  • Then, usually, L(P(D)) != L(G(D)).
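As a concrete illustration (my own example, not from the post): take the description S ::= 'a' | 'a' 'b'. Read as a Chomsky grammar, G(D) describes the two-string language consisting of "a" and "ab". Read as a procedure that tries the alternatives in order, P(D) matches 'a', succeeds, and never attempts the second alternative, so it fails to recognize "ab" and L(P(D)) contains only "a".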

The pre-Chomskyan approach, using procedural BNF, is far more natural to someone trained as a computer programmer. The parsing problem appears to the programmer in the form of strings to be parsed, exactly the starting point of procedural BNF and pre-Chomsky parsing.

Even when the Chomskyan approach is pointed out, it does not at first seem very attractive. With the pre-Chomskyan approach, the examples of the language more or less naturally lead to a parser. In the Chomskyan approach the programmer has to search for an algorithm to parse strings according to his grammar -- and the search for good algorithms to parse Chomskyan grammars has proved surprisingly long and difficult. Handling semantics is more natural with a Chomskyan approach. But, using captures, semantics can be added to a pre-Chomskyan parser and, with practice, this seems natural enough.

Despite the naturalness of the pre-Chomskyan approach to parsing, we will find that the first fully-described automated parsers are Chomskyan. This is a testimony to Chomsky's influence at the time. We will also see that Chomskyan parsers have been dominant ever since.

1961: In January, Ned Irons publishes a paper describing his ALGOL 60 parser. It is the first paper to fully describe any parser. The Irons algorithm is Chomskyan and top-down with a "left corner" element. The Irons algorithm is general, meaning that it can parse anything written in BNF. It is syntax-driven (aka declarative), meaning that the parser is actually created from the BNF -- the parser does not need to be hand-written.

1961: Peter Lucas publishes the first description of a purely top-down parser. This can be considered to be recursive descent, though in Lucas's paper the algorithm has a syntax-driven implementation, useable only for a restricted class of grammars. Today we think of recursive descent as a methodology for writing parsers by hand. Hand-coded approaches became more popular in the 1960's due to three factors:

  • Memory and CPU were both extremely limited. Hand-coding paid off, even when the gains were small.
  • Non-hand coded top-down parsing, of the kind Lucas's syntax-driven approach allowed, is a very weak parsing technique. It was (and still is) often necessary to go beyond its limits.
  • Top-down parsing is intuitive -- it essentially means calling subroutines. It therefore requires little or no knowledge of parsing theory. This makes it a good fit for hand-coding.

1963: L. Schmidt, Howard Metcalf, and Val Schorre present papers on syntax-directed compilers at a Denver conference.

1964: Schorre publishes a paper on the Meta II "compiler writing language", summarizing the papers of the 1963 conference. Schorre cites both Backus and Chomsky as sources for Meta II's notation. Schorre notes that his parser is "entirely different" from that of Irons 1961 -- in fact it is pre-Chomskyan. Meta II is a template, rather than something that readers can use, but in principle it can be turned into a fully automated compiler-compiler.

1965: Don Knuth invents LR parsing. The LR algorithm is deterministic, Chomskyan and bottom-up, but it is not thought to be practical. Knuth is primarily interested in the mathematics.

1968: Jay Earley invents the algorithm named after him. Like the Irons algorithm, Earley's algorithm is Chomskyan, syntax-driven and fully general. Unlike the Irons algorithm, it does not backtrack. Earley's algorithm is both top-down and bottom-up at once -- it uses dynamic programming and keeps track of the parse in tables. Earley's approach makes a lot of sense and looks very promising indeed, but there are three serious issues:

  • First, there is a bug in the handling of zero-length rules.
  • Second, it is quadratic for right recursions.
  • Third, the bookkeeping required to set up the tables is, by the standards of 1968 hardware, daunting.

1969: Frank DeRemer describes a new variant of Knuth's LR parsing. DeRemer's LALR algorithm requires only a stack and a state table of quite manageable size. LALR looks practical.

1969: Ken Thompson writes the "ed" editor as one of the first components of UNIX. At this point, regular expressions are an esoteric mathematical formalism. Through the "ed" editor and its descendants, regular expressions will become an everyday part of the working programmer's toolkit.

Recognizers: In comparing algorithms, it can be important to keep in mind whether they are recognizers or parsers. A recognizer is a program which takes a string and produces a "yes" or "no" according to whether a string is in part of a language. Regular expressions are typically used as recognizers. A parser is a program which takes a string and produces a tree reflecting its structure according to a grammar. The algorithm for a compiler clearly must be a parser, not a recognizer. Recognizers can be, to some extent, used as parsers by introducing captures.

1972: Alfred Aho and Jeffrey Ullman publish a two volume textbook summarizing the theory of parsing. This book is still important. It is also distressingly up-to-date -- progress in parsing theory slowed dramatically after 1972. Aho and Ullman describe a straightforward fix to the zero-length rule bug in Earley's original algorithm. Unfortunately, this fix involves adding even more bookkeeping to Earley's.

1972: Under the names TDPL and GTDPL, Aho and Ullman investigate the non-Chomskyan parsers in the Schorre lineage. They note that "it can be quite difficult to determine what language is defined by a TDPL parser". That is, GTDPL parsers do whatever they do, and that whatever is something the programmer in general will not be able to describe. The best a programmer can usually do is to create a test suite and fiddle with the GTDPL description until it passes. Correctness cannot be established in any stronger sense. GTDPL is an extreme form of the old joke that "the code is the documentation" -- with GTDPL nothing documents the language of the parser, not even the code.

GTDPL's obscurity buys nothing in the way of additional parsing power. Like all non-Chomskyan parsers, GTDPL is basically an extremely powerful recognizer. Pressed into service as a parser, it is comparatively weak. As a parser, GTDPL is essentially equivalent to Lucas's 1961 syntax-driven algorithm, which was in turn a restricted form of recursive descent.

At or around this time, rumor has it that the main line of development for GTDPL parsers is classified secret by the US government. GTDPL parsers have the property that even small changes to them can be very labor-intensive. For some government contractors, GTDPL parsing provides steady work for years to come. Public interest in GTDPL fades.

1975: Bell Labs converts its C compiler from hand-written recursive descent to DeRemer's LALR algorithm.

1977: The first "Dragon book" comes out. This soon-to-be classic textbook is nicknamed after the drawing on the front cover, in which a knight takes on a dragon. Emblazoned on the knight's lance are the letters "LALR". From here on out, to speak lightly of LALR will be to besmirch the escutcheon of parsing theory.

1979: Bell Laboratories releases Version 7 UNIX. V7 includes what is, by far, the most comprehensive, useable and easily available compiler writing toolkit yet developed.

1979: Part of the V7 toolkit is Yet Another Compiler Compiler (YACC). YACC is LALR-powered. Despite its name, YACC is the first compiler-compiler in the modern sense. For some useful languages, the process of going from Chomskyan specification to executable is fully automated. Most practical languages, including the C language and YACC's own input language, still require manual hackery. Nonetheless, after two decades of research, it seems that the parsing problem is solved.

1987: Larry Wall introduces Perl 1. Perl embraces complexity like no previous language. Larry uses YACC and LALR very aggressively -- to my knowledge more aggressively than anyone before or since.

1991: Joop Leo discovers a way of speeding up right recursions in Earley's algorithm. Leo's algorithm is linear for just about every unambiguous grammar of practical interest, and many ambiguous ones as well. In 1991 hardware is six orders of magnitude faster than 1968 hardware, so that the issue of bookkeeping overhead has receded in importance. This is a major discovery. When it comes to speed, the game has changed in favor of the Earley algorithm.

But Earley parsing is almost forgotten. Twenty years will pass before anyone writes a practical implementation of Leo's algorithm.

1990's: Earley's is forgotten. So everyone in LALR-land is content, right? Wrong. Far from it, in fact. Users of LALR are making unpleasant discoveries. While LALR automatically generates their parsers, debugging them is so hard they could just as easily write the parser by hand. Once debugged, their LALR parsers are fast for correct inputs. But almost all they tell the users about incorrect inputs is that they are incorrect. In Larry's words, LALR is "fast but stupid".

2000: Larry Wall decides on a radical reimplementation of Perl -- Perl 6. Larry does not even consider using LALR again.

2002: John Aycock and R. Nigel Horspool publish their attempt at a fast, practical Earley's parser. Missing from it is Joop Leo's improvement -- they seem not to be aware of it. Their own speedup is limited in what it achieves and the complications it introduces can be counter-productive at evaluation time. But buried in their paper is a solution to the zero-length rule bug. And this time the solution requires no additional bookkeeping.

2004: Bryan Ford publishes his paper on PEG. Implementers by now are avoiding YACC, and it seems as if there might soon be no syntax-driven algorithms in practical use. Ford fills this gap by repackaging the nearly-forgotten GTDPL. Ford adds packratting, so that PEG is always linear, and provides PEG with an attractive new syntax. But nothing has been done to change the problematic behaviors of GTDPL.

2006: GNU announces that the GCC compiler's parser has been rewritten. For three decades, the industry's flagship C compilers have used LALR as their parser -- proof of the claim that LALR and serious parsing are equivalent. Now, GNU replaces LALR with the technology that it replaced a quarter century earlier: recursive descent.

Today: After five decades of parsing theory, the state of the art seems to be back where it started. We can imagine someone taking Ned Irons's original 1961 algorithm from the first paper ever published describing a parser, and republishing it today. True, he would have to translate its code from the mix of assembler and ALGOL into something more fashionable, say Haskell. But with that change, it might look like a breath of fresh air.

Marpa: an afterword

The recollections of my teachers cover most of this timeline. My own begin around 1970. Very early on, as a graduate student, I became unhappy with the way the field was developing. Earley's algorithm looked interesting, and it was something I returned to on and off.

The original vision of the 1960's was a parser that was

  • efficient,
  • practical,
  • general, and
  • syntax-driven.

By 2010 this vision seemed to have gone the same way as many other 1960's dreams. The rhetoric stayed upbeat, but parsing practice had become a series of increasingly desperate compromises.

But, while nobody was looking for them, the solutions to the problems encountered in the 1960's had appeared in the literature. Aycock and Horspool had solved the zero-length rule bug. Joop Leo had found the speedup for right recursion. And the issue of bookkeeping overhead had pretty much evaporated on its own. Machine operations are now a billion times faster than in 1968, and are probably no longer relevant in any case -- cache misses are now the bottleneck.

The programmers of the 1960's would have been prepared to trust a fully declarative Chomskyan parser. With the experience of LALR in their collective consciousness, modern programmers might be more guarded. As Mark Twain said, "Once a cat's been burned, he won't even sit on a cold stove." But I found it straightforward to rearrange the Earley parse engine to allow efficient event-driven handovers between procedural and syntax-driven logic. And Earley tables provide the procedural logic with full knowledge of the state of the parse so far, so that Earley's algorithm is a better platform for hand-written procedural logic than recursive descent.

References, comments, etc.

My implementation of Earley's algorithm is called Marpa. For more about Marpa, there is the semi-official web site, maintained by Ron Savage. The official, but more limited, Marpa website is my personal one. Comments on this post can be made in Marpa's Google group, or on our IRC channel: #marpa at freenode.net.

Dave's Free Press: Journal: Thanks, Yahoo!

Dave's Free Press: Journal: YAPC::Europe 2007 report: day 2

Ocean of Awareness: What are the reasonable computer languages?

"You see things; and you say 'Why?' But I dream things that never were; and I say 'Why not?'" -- George Bernard Shaw

In the 1960's and 1970's computer languages were evolving rapidly. It was not clear which way they were headed. Would most programming be done with general-purpose languages? Or would programmers create a language for every task domain? Or even for every project? And, if lots of languages were going to be created, what kinds of languages would be needed?

It was in that context that Čulik and Cohen, in a 1973 paper, outlined what they thought programmers would want and should have. In keeping with the spirit of the time, it was quite a lot:

  • Programmers would want to extend their grammars with new syntax, including new kinds of expressions.
  • Programmers would also want to use tools that automatically generated new syntax.
  • Programmers would not want to, and especially in the case of auto-generated syntax would usually not be able to, massage the syntax into very restricted forms. Instead, programmers would create grammars and languages which required unlimited lookahead to disambiguate, and they would require parsers which could handle these grammars.
  • Finally, programmers would need to be able to rely on all of this parsing being done in linear time.

Today, we think we know that Čulik and Cohen's vision was naive, because we think we know that parsing technology cannot support it. We think we know that parsing is much harder than they thought.

The eyeball grammars

As a thought problem, consider the "eyeball" class of grammars. The "eyeball" class of grammars contains all the grammars that a human can parse at a glance. If a grammar is in the eyeball class, but a computer cannot parse it, it presents an interesting choice. Either,

  • your computer is not using the strongest practical algorithm; or
  • your mind is using some power which cannot be reduced to a machine computation.

There are some people out there (I am one of them) who don't believe that everything the mind can do reduces to a machine computation. But even those people will tend to go for the first choice in this case: there must be some practical computer parsing algorithm which can do at least as well at parsing as a human can do by "eyeball". In other words, the class of "reasonable grammars" should contain the eyeball class.

Čulik and Cohen's candidate for the class of "reasonable grammars" was the class of grammars that a deterministic parse engine could parse if it had a lookahead that was infinite, but restricted to distinguishing between regular expressions. They called these the LR-regular, or LRR, grammars. And the LRR grammars do in fact seem to be a good first approximation to the eyeball class. They do not allow lookahead that contains things that you have to count, like palindromes. And, while I'd be hard put to eyeball every possible string for every possible regular expression, intuitively the concept of scanning for a regular expression does seem close to capturing the idea of glancing through a text looking for a telltale pattern.

So what happened?

Alas, the algorithm in the Čulik and Cohen paper turned out to be impractical. But in 1991, Joop Leo discovered a way to adapt Earley's algorithm to parse the LRR grammars in linear time, without doing the lookahead. And Leo's algorithm does have a practical implementation: Marpa.

References, comments, etc.

To learn more about Marpa, there's the official web site maintained by Ron Savage. I also have a Marpa web site. Comments on this post can be made in Marpa's Google group, or on our IRC channel: #marpa at freenode.net.

Dave's Free Press: Journal: YAPC::Europe 2007 travel plans

Dave's Free Press: Journal: Wikipedia handheld proxy

Dave's Free Press: Journal: Bryar security hole

Dave's Free Press: Journal: POD includes

Dave's Free Press: Journal: cgit syntax highlighting

Dave's Free Press: Journal: CPAN Testers' CPAN author FAQ

Ocean of Awareness: What parser do birds use?

"Here we provide, to our knowledge, the first unambiguous experimental evidence for compositional syntax in a non-human vocal system." -- "Experimental evidence for compositional syntax in bird calls", Toshitaka N. Suzuki, David Wheatcroft & Michael Griesser Nature Communications 7, Article number: 10986

In this post I look at a subset of the language of the Japanese great tit, also known as Parus major. The above cited article presents evidence that bird brains can parse this language. What about standard modern computer parsing methods? Here is the subset -- probably a tiny one -- of the language actually used by Parus major.

      S ::= ABC
      S ::= D
      S ::= ABC D
      S ::= D ABC
    

Classifying the Parus major grammar

Grammophone is a very handy new tool for classifying grammars. Its own parser is somewhat limited, so that it requires a period to mark the end of a rule. The above grammar is in Marpa's SLIF format, which is smart enough to use the "::=" operator to spot the beginning and end of rules, just as the human eye does. Here's the same grammar converted into a form acceptable to Grammophone:

      S -> ABC .
      S -> D .
      S -> ABC D .
      S -> D ABC .
    

Grammophone tells us that the Parus major grammar is not LL(1), but that it is LALR(1).

What does this mean?

LL(1) is the class of grammars parseable by top-down methods: it's the best class for characterizing most parsers in current use, including recursive descent, PEG, and Perl 6 grammars. All of these parsers fall short of dealing with the Parus major language.

LALR(1) is probably most well-known from its implementations in bison and yacc. While able to handle this subset of Parus's language, LALR(1) has its own, very strict, limits. Whether LALR(1) could handle the full complexity of Parus language is a serious question. But it's a question that in practice would probably not arise. LALR(1) has horrible error handling properties.

When the input is correct and within its limits, an LALR-driven parser is fast and works well. But if the input is not perfectly correct, LALR parsers produce no useful analysis of what went wrong. If Parus hears "d abc d", a parser like Marpa, on the other hand, can produce something like this:

# * String before error: abc d\s
# * The error was at line 1, column 7, and at character 0x0064 'd', ...
# * here: d
    

Parus uses its language in predatory contexts, and one can assume that a Parus with a preference for parsers whose error handling is on an LALR(1) level will not be keeping its alleles in the gene pool for very long.

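For readers who want to experiment before looking up the gist, here is a rough sketch of what such a test script might look like, using the Marpa::R2 SLIF. Spelling the ABC and D calls as the literal strings "abc" and "d", and discarding whitespace, are my own assumptions, made only to keep the sketch self-contained; the exact error text Marpa prints for "d abc d" will differ in detail from the excerpt above.

    #!/usr/bin/env perl
    use strict;
    use warnings;
    use Marpa::R2;

    # The Parus major subset as a Marpa::R2 SLIF grammar.  The lexeme
    # spellings below are assumptions made for this sketch.
    my $dsl = q{
        :default ::= action => ::array
        lexeme default = latm => 1
        S          ::= ABC
        S          ::= D
        S          ::= ABC D
        S          ::= D ABC
        ABC          ~ 'abc'
        D            ~ 'd'
        :discard     ~ whitespace
        whitespace   ~ [\s]+
    };

    my $grammar = Marpa::R2::Scanless::G->new( { source => \$dsl } );

    for my $call ( 'abc d', 'd abc', 'd abc d' ) {
        my $recce = Marpa::R2::Scanless::R->new( { grammar => $grammar } );
        my $ok = eval { $recce->read( \$call ); defined $recce->value() };
        if ($ok) { print "'$call' parses\n" }
        else     { print "'$call' is rejected: ", $@ || "no parse\n" }
    }
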
References, comments, etc.

Those readers content with sub-Parus parsing methods may stop reading here. Those with greater parsing ambitions, however, may wish to learn more about Marpa. A Marpa test script for parsing the Parus subset is in a Github gist. Marpa has a semi-official web site, maintained by Ron Savage. The official, but more limited, Marpa website is my personal one. Comments on this post can be made in Marpa's Google group, or on our IRC channel: #marpa at freenode.net.

Dave's Free Press: Journal: Thankyou, Anonymous Benefactor!

Dave's Free Press: Journal: Number::Phone release

Dave's Free Press: Journal: Ill

Dave's Free Press: Journal: CPANdeps upgrade

Ocean of Awareness: Introduction to Marpa Book in progress

What follows is a summary of the features of the Marpa algorithm, followed by a discussion of potential applications. It refers to itself as a "monograph", because it is a draft of part of the introduction to a technical monograph on the Marpa algorithm. I hope the entire monograph will appear in a few weeks.

The Marpa project

The Marpa project was intended to create a practical and highly available tool to generate and use general context-free parsers. Tools of this kind had long existed for LALR and regular expressions. But, despite an encouraging academic literature, no such tool had existed for context-free parsing. The first stable version of Marpa was uploaded to a public archive on Solstice Day 2011. This monograph describes the algorithm used in the most recent version of Marpa, Marpa::R2. It is a simplification of the algorithm presented in my earlier paper.

A proven algorithm

While the presentation in this monograph is theoretical, the approach is practical. The Marpa::R2 implementation has been widely available for some time, and has seen considerable use, including in production environments. Many of the ideas in the parsing literature satisfy theoretical criteria, but in practice turn out to face significant obstacles. An algorithm may be as fast as reported, but may turn out not to allow adequate error reporting. Or a modification may speed up the recognizer, but require additional processing at evaluation time, leaving no advantage to compensate for the additional complexity.

In this monograph, I describe the Marpa algorithm as it was implemented for Marpa::R2. In many cases, I believe there are better approaches than those I have described. But I treat these techniques, however solid their theory, as conjectures. Whenever I mention a technique that was not actually implemented in Marpa::R2, I will always explicitly state that that technique is not in Marpa as implemented.

Features

General context-free parsing

As implemented, Marpa parses all "proper" context-free grammars. The proper context-free grammars are those which are free of cycles, unproductive symbols, and inaccessible symbols. Worst case time bounds are never worse than those of Earley's algorithm, and therefore never worse than O(n**3).

Linear time for practical grammars

Currently, the grammars suitable for practical use are thought to be a subset of the deterministic context-free grammars. Using a technique discovered by Joop Leo, Marpa parses all of these in linear time. Leo's modification of Earley's algorithm is O(n) for LR-regular grammars. Leo's modification also parses many ambiguous grammars in linear time.

Left-eidetic

The original Earley algorithm kept full information about the parse --- including partial and fully recognized rule instances --- in its tables. At every parse location, before any symbols are scanned, Marpa's parse engine makes available its information about the state of the parse so far. This information is in useful form, and can be accessed efficiently.

Recoverable from read errors

When Marpa reads a token which it cannot accept, the error is fully recoverable. An application can try to read another token. The application can do this repeatedly as long as none of the tokens are accepted. Once the application provides a token that is accepted by the parser, parsing will continue as if the unsuccessful read attempts had never been made.

Ambiguous tokens

Marpa allows ambiguous tokens. These are often useful in natural language processing where, for example, the same word might be a verb or a noun. Use of ambiguous tokens can be combined with recovery from rejected tokens so that, for example, an application could react to the rejection of a token by reading two others.

Using the features

Error reporting

An obvious application of left-eideticism is error reporting. Marpa's abilities in this respect are ground-breaking. For example, users typically regard an ambiguity as an error in the grammar. Marpa, as currently implemented, can detect an ambiguity and report specifically where it occurred and what the alternatives were.

Event driven parsing

As implemented, Marpa::R2 allows the user to define "events". Events can be defined that trigger when a specified rule is complete, when a specified rule is predicted, when a specified symbol is nulled, when a user-specified lexeme has been scanned, or when a user-specified lexeme is about to be scanned. A mid-rule event can be defined by adding a nulling symbol at the desired point in the rule, and defining an event which triggers when the symbol is nulled.

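As a rough sketch of the event mechanism, the fragment below defines a completion event on a rule and drives the parse with the read()/resume() loop used for event handling in the Marpa::R2 documentation. The grammar, the event name 'item_done', and the input string are mine, invented only for illustration.

    #!/usr/bin/env perl
    use strict;
    use warnings;
    use Marpa::R2;

    # Illustrative grammar: a sentence is one or more items, and we ask
    # to be told each time an <item> is completed.
    my $dsl = q{
        :default ::= action => ::array
        lexeme default = latm => 1
        sentence   ::= item
        sentence   ::= sentence item
        item       ::= word
        event 'item_done' = completed <item>
        word         ~ [\w]+
        :discard     ~ whitespace
        whitespace   ~ [\s]+
    };

    my $grammar = Marpa::R2::Scanless::G->new( { source => \$dsl } );
    my $recce   = Marpa::R2::Scanless::R->new( { grammar => $grammar } );

    my $input  = 'abc d abc';
    my $length = length $input;

    # read() returns whenever an event fires; resume() continues the scan.
    for (
        my $pos = $recce->read( \$input );
        $pos < $length;
        $pos = $recce->resume()
        )
    {
        for my $event ( @{ $recce->events() } ) {
            my ($name) = @{$event};
            print "event '$name' at position $pos\n";
        }
    }

    # An event may also be pending at the end of the input.
    print "event '$_->[0]' at end of input\n" for @{ $recce->events() };
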
Ruby slippers parsing

Left-eideticism, efficient error recovery, and the event mechanism can be combined to allow the application to change the input in response to feedback from the parser. In traditional parser practice, error detection is an act of desperation. In contrast, Marpa's error detection is so painless that it can be used as the foundation of new parsing techniques.

For example, if a token is rejected, the lexer is free to create a new token in the light of the parser's expectations. This approach can be seen as making the parser's "wishes" come true, and I have called it "Ruby Slippers Parsing".

One use of the Ruby Slippers technique is to parse with a clean but oversimplified grammar, programming the lexical analyzer to make up for the grammar's shortcomings on the fly. As part of Marpa::R2, the author has implemented an HTML parser, based on a grammar that assumes that all start and end tags are present. Such an HTML grammar is too simple even to describe perfectly standard-conformant HTML, but the lexical analyzer is programmed to supply start and end tags as requested by the parser. The result is a simple and cleanly designed parser that parses very liberal HTML and accepts all input files, in the worst case treating them as highly defective HTML.

Ambiguity as a language design technique

In current practice, ambiguity is avoided in language design. This is very different from the practice in the languages humans choose when communicating with each other. Human languages exploit ambiguity in order to design highly flexible, powerfully expressive languages. For example, the language of this monograph, English, is notoriously ambiguous.

Ambiguity of course can present a problem. A sentence in an ambiguous language may have undesired meanings. But note that this is not a reason to ban potential ambiguity --- it is only a problem with actual ambiguity.

Syntax errors, for example, are undesired, but nobody tries to design languages to make syntax errors impossible. A language in which every input was well-formed and meaningful would be cumbersome and even dangerous: all typos in such a language would be meaningful, and the parser would never warn the user about errors, because there would be no such thing.

With Marpa, ambiguity can be dealt with in the same way that syntax errors are dealt with in current practice. The language can be designed to be ambiguous, but any actual ambiguity can be detected and reported at parse time. This exploits Marpa's ability to report exactly where and what the ambiguity is. Marpa::R2's own parser description language, the SLIF, uses ambiguity in this way.

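Here is a rough sketch of what detecting an actual ambiguity can look like in practice. The grammar below is deliberately ambiguous, and the script simply counts how many parse values Marpa::R2 returns; more than one means this particular input was actually ambiguous. The grammar and input are mine, for illustration; a real application would use Marpa's ambiguity reporting rather than a bare count.

    #!/usr/bin/env perl
    use strict;
    use warnings;
    use Marpa::R2;

    # A deliberately ambiguous grammar: '1 + 2 + 3' can group either way.
    my $dsl = q{
        :default ::= action => ::array
        lexeme default = latm => 1
        E          ::= E '+' E
        E          ::= number
        number       ~ [\d]+
        :discard     ~ whitespace
        whitespace   ~ [\s]+
    };

    my $grammar = Marpa::R2::Scanless::G->new( { source => \$dsl } );
    my $recce   = Marpa::R2::Scanless::R->new( { grammar => $grammar } );
    my $input   = '1 + 2 + 3';
    $recce->read( \$input );

    # Each call to value() returns the next parse value, or undef when
    # the parse values are exhausted.
    my $parse_count = 0;
    $parse_count++ while defined $recce->value();
    print "parse count: $parse_count\n";    # 2 here: an actual ambiguity
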
Auto-generated languages

In 1973, Čulik and Cohen pointed out that the ability to efficiently parse LR-regular languages opens the way to auto-generated languages. In particular, Čulik and Cohen note that a parser which can parse any LR-regular language will be able to parse a language generated using syntax macros.

Second order languages

In the literature, the term "second order language" is usually used to describe languages with features which are useful for second-order programming. True second-order languages --- languages which are auto-generated from other languages --- have not been seen as practical, since there was no guarantee that the auto-generated language could be efficiently parsed.

With Marpa, this barrier is removed. As an example, Marpa::R2's own parser description language, the SLIF, allows "precedenced rules". Precedenced rules are specified in an extended BNF. The BNF extensions allow precedence and associativity to be specified for each RHS.

Marpa::R2's precedenced rules are implemented as a true second order language. The SLIF representation of the precedenced rule is parsed to create a BNF grammar which is equivalent, and which has the desired precedence. Essentially, the SLIF does a standard textbook transformation. The transformation starts with a set of rules, each of which has a precedence and an associativity specified. The result of the transformation is a set of rules in pure BNF. The SLIF's advantage is that it is powered by Marpa, and therefore the SLIF can be certain that the grammar that it auto-generates will parse in linear time.

Notationally, Marpa's precedenced rules are an improvement over similar features in LALR-based parser generators like yacc or bison. In the SLIF, there are two important differences. First, in the SLIF's precedenced rules, precedence is generalized, so that it does not depend on the operators: there is no need to identify operators, much less class them as binary, unary, etc. This more powerful and flexible precedence notation allows the definition of multiple ternary operators, and multiple operators with arity above three.

Second, and more important, a SLIF user is guaranteed to get exactly the language that the precedenced rule specifies. The user of the yacc equivalent must hope their syntax falls within the limits of LALR.

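As an illustration of what a precedenced rule looks like in the SLIF, here is a sketch loosely based on the calculator examples in the Marpa::R2 documentation. Alternatives separated by '|' share a precedence level; '||' starts a new, looser level; and 'assoc' sets the associativity of an individual alternative. The input string and the choice of a bare array-shaped parse value are my own, made only to keep the sketch short.

    #!/usr/bin/env perl
    use strict;
    use warnings;
    use Marpa::R2;

    my $dsl = q{
        :default ::= action => ::array
        lexeme default = latm => 1
        Expression ::=
              Number
            | '(' Expression ')' assoc => group
           || Expression '**' Expression assoc => right
           || Expression '*' Expression
            | Expression '/' Expression
           || Expression '+' Expression
            | Expression '-' Expression
        Number       ~ [\d]+
        :discard     ~ whitespace
        whitespace   ~ [\s]+
    };

    my $grammar = Marpa::R2::Scanless::G->new( { source => \$dsl } );
    my $recce   = Marpa::R2::Scanless::R->new( { grammar => $grammar } );
    my $input   = '2 + 3 * 4 ** 2';
    $recce->read( \$input );

    # The parse value is a nested array that already reflects the declared
    # precedence and associativity: '4 ** 2' binds tightest, then '*', then '+'.
    my $value_ref = $recce->value();
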
References, comments, etc.

Marpa has a semi-official web site, maintained by Ron Savage. The official, but more limited, Marpa website is my personal one. Comments on this post can be made in Marpa's Google group, or on our IRC channel: #marpa at freenode.net.

Dave's Free Press: Journal: YAPC::Europe 2006 report: day 3
